Fine-tuning BERT for the Sentence Pair Classification Task

This tutorial will teach you how to fine-tune BERT for the Sentence Pair Classification task. In this task, each dataset sample contains two sentences and a target variable (label). Sentence pair classification can be used for many tasks, such as information retrieval, question answering, and reranking.

Given the success of BERT and other Transformer models in many NLP tasks, it should be no surprise that they also succeed at sentence pair classification. This tutorial explains how to use HuggingFace to fine-tune Transformer models (BERT/RoBERTa) for sentence pair classification.

Environment setup

Before we start our tutorial, we need to set up the environment for our project.

First, it’s recommended to install conda for environment management, e.g. Miniconda or Anaconda. You can download and install it from here.
Create a new environment for the project:

$ conda create --prefix ./venv python=3.9
$ conda activate ./venv
$ pip install numpy transformers datasets pandas scikit-learn torch matplotlib

Here is the requirements.txt from our experiment, which ran on an NVIDIA A40 GPU. You can also train on Google Colab if you reduce the batch size for training and inference.

datasets==2.7.1
numpy==1.23.5
pandas==1.5.2
scikit-learn==1.1.3
scipy==1.9.3
torch==1.13.0
transformers==4.24.0
matplotlib

Now, we import the packages we will use in this project.

import transformers
import datasets
import json
import numpy as np
import matplotlib

Dataset preparation

With the environment set up, we need a dataset to train and test our model. To keep things simple, we use the SQuAD dataset to fine-tune and evaluate BERT for sentence pair classification. The datasets library downloads it automatically, so getting the data is simple.

# Load squad dataset
squad = datasets.load_dataset("squad")
print(squad)
print(squad["train"][0])

Output:

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

Each sample in the SQuAD dataset consists of id, title, context, question, and answers. This is a question-answering dataset, so why do we use it for sentence pair classification? Look at the picture below.

In a retrieval-based question-answering system, before you can find the correct answer to a question, you need to find the document relevant to that question. Finding the relevant document can be treated as sentence pair classification, where each sample consists of:

question
context
label (related or not)

Create negative sample

In this tutorial, we create a dataset to train the model for the document retrieval task. Specifically, the sentence pair classification dataset is generated from SQuAD as follows:

Each pair of (question, context) is labeled 1 if relevant, else 0.
We already have the relevant pairs because every question in SQuAD comes with its correct context.
So, we only need to create the negative samples: pairs in which the question and the context are unrelated.

First, we only need the questions and contexts from SQuAD.

# Extract question and context from each item in the squad
questions = [item["question"] for item in squad["train"]]
contexts = [item["context"] for item in squad["train"]]
questions.extend([item["question"] for item in squad["validation"]])
contexts.extend([item["context"] for item in squad["validation"]])
print(len(questions), len(contexts))

98169 98169

Because the full dataset is quite large for a tutorial, we use only part of it to reduce the training time.

We pick the first 10000 samples as the positive samples (context related to the question).
To create the negatives, for each question we pick 10 other contexts that are not related to the question. The create_non_relevants() function does this work.
After preparing the dataset, we split it into train and test sets with an 80/20 ratio.
The train and test sets are then saved to separate files in JSON Lines format.

pair_data = []
n_samples = 10000
questions = questions[:n_samples]
contexts = contexts[:n_samples]

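def create_non_relevants(index, n=10):
    # SQuAD reuses one context for several questions, so skip every index
    # that shares this question's context, not only the question's own index
    other_indexes = [i for i in range(len(questions)) if contexts[i] != contexts[index]]
    # replace=False keeps the n sampled indexes distinct
    return [i for i in np.random.choice(other_indexes, n, replace=False)]

# For each question, create 10 non-relevant contexts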
for index, question in enumerate(questions):
    pair_data.append({"question": question, "context": contexts[index], "label": 1})
    non_relevants_indexs = create_non_relevants(index)
    for non_relevant_index in non_relevants_indexs:
        pair_data.append({"question": question, "context": contexts[non_relevant_index], "label": 0})

# Shuffle and split train and test
test_size = 0.2
np.random.shuffle(pair_data)
train_size = int(len(pair_data) * (1 - test_size))
train_data = pair_data[:train_size]
test_data = pair_data[train_size:]

# Save to jsonl
with open("train.jsonl", "w") as f:
    for item in train_data:
        f.write(json.dumps(item) + "\n")

with open("test.jsonl", "w") as f:
    for item in test_data:
        f.write(json.dumps(item) + "\n")

print('Train size: ', len(train_data))
print('Test size: ', len(test_data))

Train size:  88000
Test size:  22000
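With 1 positive and 10 negatives per question, roughly 1 in 11 pairs carries label 1, so the classes are imbalanced. The sketch below is one way to confirm this on a JSON Lines file. The helper name label_counts and the toy file are assumptions made for illustration; they are not part of the original code.

```python
import json
import os
import tempfile
from collections import Counter

def label_counts(path):
    """Count how many samples carry each label in a JSON Lines file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line)["label"]] += 1
    return counts

# Toy file standing in for train.jsonl: 1 positive and 10 negatives per question
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.jsonl")
    with open(toy_path, "w") as f:
        for q in range(3):
            f.write(json.dumps({"question": f"q{q}", "context": f"c{q}", "label": 1}) + "\n")
            for k in range(10):
                f.write(json.dumps({"question": f"q{q}", "context": f"other{k}", "label": 0}) + "\n")
    counts = label_counts(toy_path)

# Share of positive pairs: 3 out of 33 in the toy file
positive_share = counts[1] / sum(counts.values())
print(counts, round(positive_share, 3))
```

On the real data, calling label_counts("train.jsonl") should report a similar share of positives, since the shuffle before the split does not change the overall ratio much.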

Dataset visualization

Before using the prepared dataset to fine-tune BERT for the sentence pair classification task, we should verify that each sample fits within the input limit of the pre-trained model. Most pre-trained BERT models accept at most 512 tokens (max_length = 512).

dataset = datasets.load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})

question_and_context = []

for item in dataset["train"]:
    question_and_context.append(item["question"] + " " + item["context"])

# Visualize the distribution of the length (number of words) of the questions and contexts
import matplotlib.pyplot as plt
plt.hist([len(item.split()) for item in question_and_context], bins=100)
plt.show()

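Everything looks fine, so we move on to the next section. Keep in mind that the histogram counts whitespace-separated words, and a subword tokenizer usually produces more tokens than words; the truncation=True argument used in the preprocessing step later cuts any pair that exceeds the model limit.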
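Since the model limit is expressed in tokens rather than words, a numeric summary of token counts can complement the histogram. The following is a minimal sketch, assuming you already have a list of token counts per pair; the helper name length_report and the toy numbers are hypothetical.

```python
def length_report(lengths, max_length=512):
    """Summarize sequence lengths against a model limit."""
    ordered = sorted(lengths)
    over = sum(1 for n in ordered if n > max_length)
    return {
        "max": ordered[-1],
        # Upper median for even-sized lists
        "median": ordered[len(ordered) // 2],
        "share_over_limit": over / len(ordered),
    }

# Hypothetical token counts for five pairs. With the real data, each count could
# come from len(tokenizer(question, context)["input_ids"]) once a tokenizer is loaded.
toy_lengths = [120, 180, 240, 530, 610]
report = length_report(toy_lengths)
print(report)
```

A non-zero share_over_limit means some pairs will be truncated, which may cut off the part of the context that makes the pair relevant.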

Train the sentence pair classification model

First, we need to load the tokenizer and the model. In this tutorial, we use RoBERTa, a variant of BERT, but you can use another pre-trained model as well.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

Before training

Also, you may want to see how the tokenizer encodes your data.

encoded = tokenizer("Hello, my dog is cute", "Hello, my cat is amazing", return_tensors="pt")
decoded = tokenizer.decode(encoded["input_ids"][0])
print(decoded)

Output:

<s>Hello, my dog is cute</s></s>Hello, my cat is amazing</s>

RoBERTa inserts two separator tokens, </s></s>, between the two sentences. This may differ when you use another pre-trained model.

A tokenizer is required to split the text into tokens, and padding and truncation are needed to handle varied sequence lengths. To process your dataset in a single step, apply a preprocessing function to the entire dataset using the Datasets’ map method:
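To make the difference between models concrete, the sketch below mimics the pair layout of RoBERTa and of the original BERT using plain strings. It only illustrates where the special tokens go; the function names are hypothetical, and real decoded strings also depend on the tokenizer's vocabulary and casing.

```python
def roberta_pair(first, second):
    """Mimic RoBERTa's pair layout: <s> A </s></s> B </s>."""
    return f"<s>{first}</s></s>{second}</s>"

def bert_pair(first, second):
    """Mimic BERT's pair layout: [CLS] A [SEP] B [SEP]."""
    return f"[CLS] {first} [SEP] {second} [SEP]"

first = "Hello, my dog is cute"
second = "Hello, my cat is amazing"
print(roberta_pair(first, second))
print(bert_pair(first, second))
```

BERT additionally marks the two segments with token_type_ids, while the RoBERTa tokenizer does not return them by default, which is why only input_ids and attention_mask appear after tokenization in this tutorial.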

def preprocess_function(batch):
    return tokenizer(batch["question"], batch["context"], truncation=True, padding="max_length")

dataset = datasets.load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
tokenized_data = dataset.map(preprocess_function, batched=True)
print(tokenized_data)

Output:

DatasetDict({
    train: Dataset({
        features: ['question', 'context', 'label', 'input_ids', 'attention_mask'],
        num_rows: 88000
    })
    test: Dataset({
        features: ['question', 'context', 'label', 'input_ids', 'attention_mask'],
        num_rows: 22000
    })
})
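Padding with padding="max_length" stretches every sample to the model's maximum length, even when the pair is short. A common alternative is dynamic padding, where each batch is padded only to its own longest sequence; in transformers, DataCollatorWithPadding serves this purpose. The sketch below illustrates the idea in plain Python and is not what the run in this tutorial used. The function name pad_batch, the token ids, and the padding id are illustrative assumptions.

```python
def pad_batch(batch_ids, pad_id):
    """Pad every sequence in a batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

# Hypothetical token ids for two sequences of different length
batch = [[0, 31414, 2], [0, 31414, 6, 127, 2]]
input_ids, attention_mask = pad_batch(batch, pad_id=1)
print(input_ids)
print(attention_mask)
```

Here the batch is padded to 5 tokens instead of 512, which reduces the amount of computation spent on padding.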

Model configuration

The code below shows our model configuration for fine-tuning BERT for sentence pair classification. We use the F1 score as the evaluation metric.

Because this sentence pair classification task is a binary classification task, we pass num_labels=2.

Make sure to reduce the batch size if you get an out-of-memory error.

from transformers import Trainer, TrainingArguments

# Load the metric once instead of on every evaluation call
f1_metric = datasets.load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Pick the class with the highest logit for each sample
    predictions = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=predictions, references=labels)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",  # output directory
    num_train_epochs=3,  # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,  # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    learning_rate=2e-5,  # learning rate
    save_total_limit=2,  # limit the total amount of checkpoints, delete the older checkpoints
    logging_dir="./logs",  # directory for storing logs
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
)

trainer = Trainer(
    model=model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=tokenized_data["train"],  # training dataset
    eval_dataset=tokenized_data["test"],  # evaluation dataset
    compute_metrics=compute_metrics,  # the callback that computes metrics of interest
)
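Because only about 1 in 11 pairs is positive, accuracy can look good even for a model that never predicts "related", which is one reason to prefer F1 here. The sketch below computes binary F1 by hand to show this; the function name binary_f1 is hypothetical, and it assumes label 1 is the positive class, as in this dataset.

```python
def binary_f1(predictions, references, positive_label=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive_label and r == positive_label)
    fp = sum(1 for p, r in pairs if p == positive_label and r != positive_label)
    fn = sum(1 for p, r in pairs if p != positive_label and r == positive_label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1 positive and 10 negatives, as for a single question in this dataset
references = [1] + [0] * 10

# A model that always answers "not related"
always_negative = [0] * 11
accuracy = sum(p == r for p, r in zip(always_negative, references)) / len(references)
print(round(accuracy, 3), binary_f1(always_negative, references))

# A model that finds the positive but also raises one false alarm
one_false_alarm = [1, 1] + [0] * 9
print(round(binary_f1(one_false_alarm, references), 3))
```

The always-negative model reaches about 91% accuracy but an F1 of 0, while the second model is penalized for its false alarm.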

Model training

Finally, we can start training our sentence pair classification model:

trainer.train()
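It can help to estimate how many optimization steps this run takes, since logging, evaluation, and checkpointing are all scheduled in steps. The arithmetic below is a sketch that assumes a single device and no gradient accumulation; the numbers come from the train size and the training arguments above.

```python
import math

train_samples = 88000  # train size printed earlier
batch_size = 32        # per_device_train_batch_size
epochs = 3             # num_train_epochs
save_steps = 100       # save_steps

# Assumes a single device and no gradient accumulation
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Last checkpoint written by the periodic "every save_steps" schedule
last_periodic_checkpoint = (total_steps // save_steps) * save_steps
print(steps_per_epoch, total_steps, last_periodic_checkpoint)  # prints 2750 8250 8200
```

Under these assumptions the run takes 8250 steps, the 500 warmup steps cover about 6% of training, and the last periodic checkpoint lands at step 8200.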

To evaluate the sentence pair classification model after training, you can add this line at the end.

trainer.evaluate()

Alternatively, load the model from a saved checkpoint and evaluate it.

model = AutoModelForSequenceClassification.from_pretrained(
    "results/checkpoint-8200", num_labels=2
)
...
trainer.evaluate()

Output:

{'eval_loss': 0.016812607645988464, 'eval_f1': 0.9798155993022676, 'eval_runtime': 167.2537, 'eval_samples_per_second': 131.537, 'eval_steps_per_second': 2.057}
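For the retrieval use case described earlier, the classifier's two logits per pair can be turned into a relevance probability and used to rank candidate contexts for a question. The sketch below shows this step in plain Python. The logits and context names are made up for illustration, and it assumes index 1 corresponds to the "related" label, as in this dataset.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate contexts, ordered as [not related, related]
candidate_logits = {
    "context_a": [2.1, -1.9],
    "context_b": [-3.0, 3.2],
    "context_c": [0.4, 0.1],
}

# Probability of the "related" class for each candidate
relevance = {name: softmax(logits)[1] for name, logits in candidate_logits.items()}

# Most relevant context first
ranking = sorted(relevance, key=relevance.get, reverse=True)
print(ranking)
```

In practice the logits would come from running the fine-tuned model on each (question, context) pair.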

Conclusion

The source code for this sentence pair classification tutorial is available in this repo, and you can also open it in Google Colab via the link in the README.
