In this project, we use the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to classify tweets as either disaster-related or not. We start by loading and preprocessing the data, then fine-tune the pre-trained BERT model on our dataset using PyTorch. The model is trained with cross-entropy loss (a mixup_alpha hyperparameter is defined later, but mixup is not applied in the final training loop), and early stopping is used to prevent overfitting. Finally, we evaluate the model on a held-out test split of tweets and report its accuracy. Overall, this project demonstrates transfer learning and fine-tuning with BERT for natural language processing tasks, specifically tweet classification.
import pandas as pd
import numpy as np
tweets = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
tweets.head()
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test_df.head()
| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
display(len(tweets))
len(test_df)
7613
3263
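Before any cleaning, it is worth checking how balanced the two classes are. A quick look (not part of the original run, output not shown):
# Label distribution: target == 1 marks disaster tweets, target == 0 everything else
tweets['target'].value_counts(normalize=True)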
This cell imports torch, along with utilities from scikit-learn and transformers, and then preprocesses the tweet text: it converts the text to lowercase, removes all punctuation, removes all numbers, and strips any leading/trailing whitespace.
The purpose of this preprocessing is to clean and normalize the text before it is tokenized and fed to the model; the cleaned text is used in all of the following cells.
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
# Preprocess train data
tweets['text'] = tweets['text'].str.lower() # Convert text to lowercase
tweets['text'] = tweets['text'].str.replace(r'[^\w\s]', '', regex=True) # Remove punctuation
tweets['text'] = tweets['text'].str.replace(r'\d+', '', regex=True) # Remove numbers
tweets['text'] = tweets['text'].str.strip() # Remove leading/trailing white space
tweets.head()
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | our deeds are the reason of this earthquake ma... | 1 |
| 1 | 4 | NaN | NaN | forest fire near la ronge sask canada | 1 |
| 2 | 5 | NaN | NaN | all residents asked to shelter in place are be... | 1 |
| 3 | 6 | NaN | NaN | people receive wildfires evacuation orders in ... | 1 |
| 4 | 7 | NaN | NaN | just got sent this photo from ruby alaska as s... | 1 |
# Preprocess test data
test_df['text'] = test_df['text'].str.lower() # Convert text to lowercase
test_df['text'] = test_df['text'].str.replace(r'[^\w\s]', '', regex=True) # Remove punctuation
test_df['text'] = test_df['text'].str.replace(r'\d+', '', regex=True) # Remove numbers
test_df['text'] = test_df['text'].str.strip() # Remove leading/trailing white space
test_df.head()
| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | just happened a terrible car crash |
| 1 | 2 | NaN | NaN | heard about earthquake is different cities sta... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond geese are ... |
| 3 | 9 | NaN | NaN | apocalypse lighting spokane wildfires |
| 4 | 11 | NaN | NaN | typhoon soudelor kills in china and taiwan |
tweets.isna().sum()
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
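About a third of the tweets have no location, and the column is not used anywhere below, so it is dropped from both dataframes in the next cells. A quick check (not part of the original run):
# Fraction of missing locations: 2533 / 7613, roughly 0.33
tweets['location'].isna().mean()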
tweets.drop(['location'], axis=1, inplace=True)
tweets.head()
| | id | keyword | text | target |
|---|---|---|---|---|
| 0 | 1 | NaN | our deeds are the reason of this earthquake ma... | 1 |
| 1 | 4 | NaN | forest fire near la ronge sask canada | 1 |
| 2 | 5 | NaN | all residents asked to shelter in place are be... | 1 |
| 3 | 6 | NaN | people receive wildfires evacuation orders in ... | 1 |
| 4 | 7 | NaN | just got sent this photo from ruby alaska as s... | 1 |
test_df.drop(['location'], axis=1, inplace=True)
test_df.head()
| | id | keyword | text |
|---|---|---|---|
| 0 | 0 | NaN | just happened a terrible car crash |
| 1 | 2 | NaN | heard about earthquake is different cities sta... |
| 2 | 3 | NaN | there is a forest fire at spot pond geese are ... |
| 3 | 9 | NaN | apocalypse lighting spokane wildfires |
| 4 | 11 | NaN | typhoon soudelor kills in china and taiwan |
This code splits the tweets dataset into three sets: training, validation, and testing. The train_test_split function from sklearn is applied twice with test_size=0.2: first, 20% of the data is held out for testing, and then 20% of the remaining data is held out for validation, giving roughly a 64/16/20 split. The random_state parameter is set to 42 so that the same split is generated every time the code is run.
Next, the code loads a pre-trained BERT tokenizer and model using the BertTokenizer and BertForSequenceClassification classes from the transformers library. The model is initialized with num_labels=2, since this is a binary classification problem.
Finally, the code sets the device to GPU if it is available and moves the model to the specified device. This is done to take advantage of GPU acceleration during training, which can significantly speed up the training process. If GPU is not available, the code will fall back to using the CPU.
# Split data into training, validation, and testing sets
train_data, test_data = train_test_split(tweets, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BertForSequenceClassification( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 768, padding_idx=0) (position_embeddings): Embedding(512, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): 
Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): 
BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) (dropout): Dropout(p=0.1, inplace=False) (classifier): Linear(in_features=768, out_features=2, bias=True) )
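Before tokenizing, it can be useful to confirm the sizes of the three splits (an optional check, not part of the original run):
# The two 80/20 splits leave roughly 64% of the 7,613 tweets for training,
# 16% for validation, and 20% for the held-out test split.
print(len(train_data), len(val_data), len(test_data))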
This code defines a function tokenize_text() that takes in a list of text sentences and returns the tokenized input IDs and attention masks.
The tokenizer.encode_plus() method from the transformers library is used to tokenize each sentence. It adds the special tokens [CLS] and [SEP] to the beginning and end of the sentence, pads or truncates each sentence to a maximum length of 64 tokens, and returns an attention mask that distinguishes real tokens from padding. The resulting input IDs and attention mask are appended to their respective lists.
The input IDs and attention masks are then concatenated along the first dimension with torch.cat() to form tensors that can be fed to the model. tokenize_text() is applied to the training, validation, and testing sets.
Overall, this code tokenizes the text data for input into the pre-trained BERT model; the resulting tensors are used for training, validation, and testing.
def tokenize_text(text):
input_ids = []
attention_masks = []
for sentence in text:
encoded_dict = tokenizer.encode_plus(
sentence, # Text to encode
add_special_tokens = True, # Add '[CLS]' and '[SEP]'
max_length = 64, # Pad & truncate all sentences
padding = 'max_length', # Replaces the deprecated pad_to_max_length
truncation = True,
return_attention_mask = True, # Construct attention masks
return_tensors = 'pt', # Return pytorch tensors.
)
input_ids.append(encoded_dict['input_ids'])
attention_masks.append(encoded_dict['attention_mask'])
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
return input_ids, attention_masks
train_inputs, train_masks = tokenize_text(train_data['text'])
val_inputs, val_masks = tokenize_text(val_data['text'])
test_inputs, test_masks = tokenize_text(test_data['text'])
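To make the encoding concrete, here is a small illustrative check (not part of the original run; the token split shown in the comment is only an example) of what the tokenizer produces for a single tweet:
# Inspect how one tweet is encoded; a shorter max_length keeps the output small.
sample = "forest fire near la ronge sask canada"
encoded = tokenizer.encode_plus(
    sample,
    add_special_tokens=True,
    max_length=16,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# e.g. ['[CLS]', 'forest', 'fire', 'near', ..., '[SEP]', '[PAD]', '[PAD]', ...]
print(encoded['attention_mask']) # 1 for real tokens (including [CLS] and [SEP]), 0 for padding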
This code converts the labels for the training, validation, and testing sets into tensors using torch.tensor(). The resulting tensors are used to create data loaders that can be used to iterate over the data in batches during training, validation, and testing.
For each set, a TensorDataset is created from the input IDs, attention masks, and labels tensors using torch.utils.data.TensorDataset(). This function creates a dataset from tensors, where each tensor is expected to have the same length in the first dimension.
For the training set, a RandomSampler is used to randomly sample elements from the dataset, while for the validation and testing sets, a SequentialSampler is used to iterate over elements in a sequential manner.
Finally, a DataLoader is created for each set using torch.utils.data.DataLoader(). This function creates an iterable that can be used to iterate over the dataset in batches during training, validation, and testing. The batch_size parameter is set to 32, which means that each batch will contain 32 samples.
# Convert labels to tensors
train_labels = torch.tensor(train_data['target'].values)
val_labels = torch.tensor(val_data['target'].values)
test_labels = torch.tensor(test_data['target'].values)
# Create data loaders
batch_size = 32
train_data = torch.utils.data.TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = torch.utils.data.RandomSampler(train_data)
train_loader = torch.utils.data.DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
val_data = torch.utils.data.TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = torch.utils.data.SequentialSampler(val_data)
val_loader = torch.utils.data.DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
test_data = torch.utils.data.TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = torch.utils.data.SequentialSampler(test_data)
test_loader = torch.utils.data.DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
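An optional check (not in the original notebook): peeking at one batch confirms the tensor shapes the model will receive.
# Each training batch is a tuple of input IDs and attention masks of shape
# [batch_size, 64] plus labels of shape [batch_size].
sample_batch = next(iter(train_loader))
print([t.shape for t in sample_batch])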
This code defines the train() function that takes in the pre-trained BERT model, the data loader for the training set, the optimizer, and the learning rate scheduler as input.
Within the function, the pre-trained BERT model is set to training mode using model.train(). Then, the function iterates over batches of training data using a for loop that loops through the train_loader.
For each batch, the input IDs, attention masks, and labels are unpacked from the batch and moved to the GPU, if one is available, with to(device).
The optimizer gradients are set to zero using optimizer.zero_grad(). Then, the inputs are passed through the model, and the outputs are obtained using model() with attention_mask and labels as arguments.
The first element of the outputs tuple is the loss, which is stored in the loss variable. The total_loss variable is updated with the value of the loss for each batch.
The loss is backpropagated using loss.backward(). The gradients are then clipped using torch.nn.utils.clip_grad_norm_() to avoid exploding gradients, and the optimizer is updated using optimizer.step().
Finally, the learning rate scheduler is updated using scheduler.step().
The average training loss is calculated by dividing the total_loss by the length of the train_loader, which is the number of batches in the training set. This value is returned as the output of the function.
# Define training function
def train(model, train_loader, optimizer, scheduler):
model.train()
total_loss = 0
for batch in train_loader:
input_ids, attention_mask, labels = tuple(t.to(device) for t in batch)
optimizer.zero_grad()
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs[0]
total_loss += loss.item()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
avg_train_loss = total_loss / len(train_loader)
return avg_train_loss
This code defines an evaluation function named evaluate which takes a pre-trained BERT model and an evaluation data loader as input.
Inside the function, the model is put in evaluation mode using the model.eval() method. Then, the function iterates over the evaluation data loader and for each batch, it retrieves the input IDs, attention masks, and labels of the batch and sends them to the device (CPU or GPU) that the model is currently using.
Next, the input IDs and attention masks are fed into the model to obtain its predictions. The predictions are stored in logits.
The batch loss, which is the first element of the model outputs, is accumulated into total_loss via loss.item().
The logits are detached from the computation graph, moved to the CPU as NumPy arrays, and appended to the total_preds list.
Finally, the function computes the average validation loss over all batches and returns this value, as well as the concatenated total_preds array.
# Define evaluation function
def evaluate(model, eval_loader):
model.eval()
total_loss = 0
total_preds = []
with torch.no_grad():
for batch in eval_loader:
input_ids, attention_mask, labels = tuple(t.to(device) for t in batch)
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs[0]
logits = outputs[1]
total_loss += loss.item()
logits = logits.detach().cpu().numpy()
total_preds.append(logits)
avg_val_loss = total_loss / len(eval_loader)
total_preds = np.concatenate(total_preds, axis=0)
return avg_val_loss, total_preds
This code defines a class called EarlyStopping which implements early stopping during model training. The constructor takes two arguments, patience and delta, which specify the number of epochs to wait without improvement before stopping training and the minimum change in validation loss to count as an improvement, respectively.
The class has several instance variables including the best_score, val_loss_min, and counter which are used to track the best validation loss achieved so far, the minimum validation loss, and the number of epochs since the last best score, respectively.
The class has a call method that takes two arguments: the current validation loss and the model being trained. This method compares the current validation loss with the best score achieved so far and saves the current model if it is the best. If there is no improvement in the validation loss for patience epochs, the training is stopped.
The class also has a save_checkpoint method which saves the current model's state_dict and updates the val_loss_min instance variable with the current validation loss.
class EarlyStopping:
def __init__(self, patience=5, delta=0.0):
self.patience = patience
self.delta = delta
self.counter = 0
self.best_score = None
self.early_stop = False
self.val_loss_min = np.Inf
def __call__(self, val_loss, model):
score = -val_loss
if self.best_score is None:
self.best_score = score
self.save_checkpoint(val_loss, model)
elif score < self.best_score + self.delta:
self.counter += 1
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_score = score
self.save_checkpoint(val_loss, model)
self.counter = 0
def save_checkpoint(self, val_loss, model):
torch.save(model.state_dict(), 'best_model_4.pt')
self.val_loss_min = val_loss
This code sets the hyperparameters, optimizer, scheduler, loss function, and device for training and evaluating the model, and creates an EarlyStopping instance to stop training when the validation loss stops improving.
It then loops over the epochs: each epoch trains the model with the train function using the optimizer and scheduler defined above, evaluates it on the validation set with the evaluate function, and prints the training and validation loss. The EarlyStopping instance saves the model's state dictionary whenever the validation loss improves, and the loop also saves the model whenever val_loss beats the best value seen so far (both writes go to the same file, best_model_4.pt). If the validation loss has not improved for the EarlyStopping patience (5 epochs here), the early_stop flag is set to True and training stops.
# Set hyperparameters
learning_rate = 1e-5
epochs = 10
batch_size = 32
gradient_accumulation_steps = 1
mixup_alpha = 0.1 # Defined but not used: mixup is never applied in the training loop below
early_stopping_patience = 2 # Defined but not used: the EarlyStopping instance below uses patience=5
# Set optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)
total_steps = len(train_loader) * epochs // gradient_accumulation_steps
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
# Define loss function and device
import torch.nn as nn
criterion = nn.CrossEntropyLoss() # Not used directly: BertForSequenceClassification computes cross-entropy internally when labels are passed
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Train and evaluate model
best_val_loss = float('inf')
early_stopping = EarlyStopping(patience=5, delta=0.0)
for epoch in range(epochs):
train_loss = train(model, train_loader, optimizer, scheduler)
val_loss, val_preds = evaluate(model, val_loader)
print(f'Epoch {epoch + 1}: train_loss = {train_loss:.3f}, val_loss = {val_loss:.3f}')
early_stopping(val_loss, model)
if early_stopping.early_stop:
print("Early stopping")
break
if val_loss < best_val_loss:
torch.save(model.state_dict(), 'best_model_4.pt')
best_val_loss = val_loss
/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning FutureWarning,
Epoch 1: train_loss = 0.506, val_loss = 0.382
Epoch 2: train_loss = 0.364, val_loss = 0.386
Epoch 3: train_loss = 0.295, val_loss = 0.408
Epoch 4: train_loss = 0.246, val_loss = 0.436
Epoch 5: train_loss = 0.207, val_loss = 0.495
Epoch 6: train_loss = 0.167, val_loss = 0.566
Early stopping
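The introduction mentions mixup regularization and a mixup_alpha hyperparameter is set above, but mixup is never actually applied in train(). For reference, the sketch below shows one common way input-embedding mixup could be added to a training step; it is an illustrative assumption, not what was run here, and the helper name mixup_train_step is hypothetical.
import numpy as np
import torch

def mixup_train_step(model, batch, optimizer, alpha=0.1):
    # Hypothetical mixup step: mix the word embeddings of paired examples and their losses.
    input_ids, attention_mask, labels = (t.to(device) for t in batch)
    lam = np.random.beta(alpha, alpha)                       # mixing coefficient
    perm = torch.randperm(input_ids.size(0), device=device)  # random pairing within the batch

    # Mix the word embeddings; position/segment embeddings are added inside the model.
    emb = model.bert.embeddings.word_embeddings(input_ids)
    mixed_emb = lam * emb + (1 - lam) * emb[perm]

    # Keeping the original attention mask is a simplification (the paired tweet may be shorter).
    outputs = model(inputs_embeds=mixed_emb, attention_mask=attention_mask)
    logits = outputs[0]  # no labels passed, so the first element is the logits

    loss_fn = torch.nn.CrossEntropyLoss()
    loss = lam * loss_fn(logits, labels) + (1 - lam) * loss_fn(logits, labels[perm])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()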
# Loading the model
best_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
best_model.load_state_dict(torch.load('best_model_4.pt'))
best_model.eval()
best_model.to(device)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BertForSequenceClassification( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 768, padding_idx=0) (position_embeddings): Embedding(512, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): 
Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): 
BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) (dropout): Dropout(p=0.1, inplace=False) (classifier): Linear(in_features=768, out_features=2, bias=True) )
# Test model
test_loss, test_preds = evaluate(best_model, test_loader)
test_preds = np.argmax(test_preds, axis=1)
test_accuracy = (test_preds == test_labels.numpy()).mean()
print(f'Test accuracy: {test_accuracy:.3f}')
Test accuracy: 0.835
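Accuracy is reported above, but the competition leaderboard is scored on F1, so it is worth computing as well; the snippet below (not part of the original run) reuses the predictions from the previous cell.
from sklearn.metrics import f1_score, classification_report

# test_preds holds the argmaxed labels from the cell above; test_labels holds the true labels.
print(f'Test F1: {f1_score(test_labels.numpy(), test_preds):.3f}')
print(classification_report(test_labels.numpy(), test_preds, target_names=['not disaster', 'disaster']))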
This code makes predictions on the competition test set (test_df) using the trained model. Here's a brief overview of what it does:

- The test data is tokenized with the tokenize_text function defined earlier, so it is processed consistently with the training and validation data, producing the input IDs and attention masks the model expects.
- A data loader is created for the test data from a TensorDataset and a SequentialSampler.
- The best model (the one with the lowest validation loss during training) is put in evaluation mode with eval().
- Predictions are made inside a torch.no_grad() context, which disables gradient computation to reduce memory usage and speed up inference; the per-batch logits are collected in the all_preds list.
- The batch predictions are concatenated into a single NumPy array, and the predicted labels are obtained by taking the argmax along axis 1 (the class dimension); these are stored in pred_labels.

Overall, this is a straightforward way to generate predictions on the test data with the trained model and write them to a submission file.
# Tokenize test data and convert to input IDs and attention masks
test_inputs, test_masks = tokenize_text(test_df['text'])
# Create test data loader
test_data = torch.utils.data.TensorDataset(test_inputs, test_masks)
test_sampler = torch.utils.data.SequentialSampler(test_data)
test_loader = torch.utils.data.DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
# Put model in evaluation mode
best_model.eval()
# Make predictions on test data
all_preds = []
with torch.no_grad():
for batch in test_loader:
input_ids, attention_mask = tuple(t.to(device) for t in batch)
outputs = best_model(input_ids, attention_mask=attention_mask)
logits = outputs[0]
logits = logits.detach().cpu().numpy()
all_preds.append(logits)
# Combine predictions for all batches
all_preds = np.concatenate(all_preds, axis=0)
# Take argmax of predictions to obtain predicted labels
pred_labels = np.argmax(all_preds, axis=1)
# Create submission file
sub_df = pd.DataFrame({'id': test_df['id'], 'target': pred_labels})
sub_df.to_csv('submission3.csv', index=False)
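As a final sanity check (not in the original notebook), one can confirm the submission has one row per test tweet and only the two expected columns:
# The file should contain exactly the columns 'id' and 'target', one row per test tweet.
print(sub_df.shape) # expected: (3263, 2)
sub_df.head()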