How to use GPT-2 for text summarization

Created on December 30, 2022 at 10:54 am

Category: Programming


GPT-2 (Generative Pre-trained Transformer 2) is a large language model developed by OpenAI that can generate human-like text. It is not designed specifically for text summarization, but you can use it to generate summaries by fine-tuning it on a dataset of summaries. Here’s a general outline of how you can do this:

  1. Collect a dataset of documents and corresponding summaries. This could be a set of news articles with their associated headlines, for example.
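  2. Preprocess the text by tokenizing it and padding or truncating the sequences to a fixed length. GPT-2's tokenizer is case-sensitive and the model was pre-trained on cased text, so lowercasing is usually unnecessary.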
  3. Split the dataset into training and test sets.
  4. Fine-tune GPT-2 on the training set using the standard language-modeling objective, which is a cross-entropy loss over the next token. You can do this using the Hugging Face transformers library in Python.
  5. Evaluate the model on the test set using metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or BLEU (Bilingual Evaluation Understudy).
  6. Use the fine-tuned model to generate summaries for new documents by providing it with the document as input and setting the length of the summary to a desired value.
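
The data-handling parts of the outline above (steps 1, 3, and 5) can be sketched in plain Python without loading a model. This is a minimal illustration, not a definitive recipe: the " TL;DR: " separator, the helper names, and the 80/20 split are assumptions chosen for the example, and `rouge1_f1` is a simplified unigram-overlap score rather than the full ROUGE implementation (no stemming or special tokenization). The `<|endoftext|>` string is GPT-2's end-of-text token.

```python
import random
from collections import Counter

# Assumed delimiter between document and summary; any consistent marker works.
SEPARATOR = " TL;DR: "


def format_example(document, summary, eos_token="<|endoftext|>"):
    """Join one document/summary pair into a single training string."""
    return document.strip() + SEPARATOR + summary.strip() + eos_token


def train_test_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle the pairs reproducibly and split them into (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1: F1 over lowercased, whitespace-split unigrams."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Each string returned by `format_example` would be tokenized and fed to the model during fine-tuning. At inference time the same separator is appended to a new document so that the model continues with a summary. For real evaluations, a maintained ROUGE package is a safer choice than this simplified score.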

It’s also worth noting that there are other methods for text summarization that do not involve fine-tuning a language model, such as extractive summarization, which involves selecting the most important sentences from the original text to form the summary.
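
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words in the whole text and keeps the top-scoring sentences in their original order. The function name, the tiny stopword list, and the regex-based sentence splitter are all assumptions made for the illustration; frequency scoring is only one of several possible heuristics.

```python
import re
from collections import Counter

# A deliberately small, illustrative stopword list (an assumption, not a standard).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "that", "on"}


def _content_words(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(_content_words(text))

    def score(sentence):
        tokens = _content_words(sentence)
        if not tokens:
            return 0.0
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the output is built only from sentences already in the source, this approach cannot invent facts, but it also cannot rephrase or merge ideas the way a fine-tuned generative model can.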

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the GPT-2 model and tokenizer.
# Note: the base 'gpt2' checkpoint only continues the text; point these calls
# at your fine-tuned checkpoint to get actual summaries.
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model.eval()

# Tokenize the text and truncate it to at most max_length tokens.
# GPT-2 has no padding token and token id 0 is an ordinary token, so the
# attention mask comes from the tokenizer rather than from comparing ids to 0.
def preprocess_text(text, max_length):
    encoded = tokenizer(text, return_tensors='pt', truncation=True, max_length=max_length)
    return encoded['input_ids'], encoded['attention_mask']

# Generate a summary for a given document
def generate_summary(document, max_length=128, summary_length=60):
    input_ids, attention_mask = preprocess_text(document, max_length)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=summary_length,        # length of the generated summary
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
        )
    # generate() returns the prompt followed by the new tokens; keep only the new ones
    summary_ids = output_ids[0, input_ids.shape[1]:]
    return tokenizer.decode(summary_ids, skip_special_tokens=True).strip()

document = "This is a document about text summarization. Text summarization is the process of generating a concise and coherent summary of a longer document. There are many methods for text summarization, including extractive summarization and abstractive summarization. Extractive summarization involves selecting the most important sentences from the original text to form the summary, while abstractive summarization involves generating a new summary that is based on the meaning of the original text."

summary = generate_summary(document)
print(summary)
