Predicting Disaster Tweets with NLP

machine learning
tutorial
fastai
NLP
kaggle
Author

Aaron James

Published

April 21, 2025

NLP, Kaggle, and Disaster Tweets

nlp_tweets.png

This lesson focused on applying NLP using Hugging Face's library; in the book we used the fastai library. I decided to apply what I learned to the Kaggle NLP disaster tweets competition. I'll be referencing the two course notebooks for this lesson: Chapter 10 and getting-started-with-NLP.

The goal of this competition is to take a set of tweets and determine, based on their text and metadata, whether or not they refer to real disasters. We use Natural Language Processing (NLP) to fit a model to the tweets and make our predictions.

The first thing I did was make sure I could download the dataset.

Downloading the dataset

I wrote a short script to download the dataset. Actually, I just used the code from the getting-started notebook.

Code
from pathlib import Path
import os

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
competition = 'nlp-getting-started'
if iskaggle:
    !pip install -Uqq fastai
else:
    import zipfile,kaggle
    path = Path(competition)
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
nlp-getting-started.zip: Skipping, found more recently modified local copy (use --force to force download)

Loading the data and EDA

Now that the dataset is downloaded, we want to load it into memory to start playing around with various features.

Code
# import relevant frameworks
from fastai.imports import *
if iskaggle: path = Path('../input/' + competition)
df = pd.read_csv(path/'train.csv')
df_test = pd.read_csv(path/'test.csv')
df.describe(include='object')
keyword location text
count 7552 5080 7613
unique 221 3341 7503
top fatalities USA 11-Year-Old Boy Charged With Manslaughter of Toddler: Report: An 11-year-old boy has been charged with manslaughter over the fatal sh...
freq 45 104 10

Analyzing Data Distributions

Since this is my first time looking at the dataset, I want to see what category breakdowns seem to be significant. Basically, I want to get a sense of what variables it might be helpful to look at more closely.

Code
df.keyword.value_counts()
keyword
fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: count, Length: 221, dtype: int64
Code
df.location.value_counts()
location
USA                    104
New York                71
United States           50
London                  45
Canada                  29
                      ... 
Montr̩al, Qu̩bec       1
Montreal                 1
ÌÏT: 6.4682,3.18287      1
Live4Heed??              1
Lincoln                  1
Name: count, Length: 3341, dtype: int64

Creating another baseline

I spent a good amount of time looking into novel ways to break the data down to give the model an optimal input. I considered:

  • removing the %20 and replacing with a space for the keywords
  • removing the location to see what impact that has
  • collapsing duplicate locations into one (theoretically, USA == United States; see the sketch just after this list)
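
For concreteness, here's a rough pandas sketch of what those candidate clean-ups might look like (hypothetical, and not applied to the baseline below; the location mapping is just an illustrative hand-picked example):

Code
# Hypothetical clean-ups considered above; none of these are used for the baseline
loc_map = {'United States': 'USA', 'US': 'USA'}          # example duplicate locations

df_clean = df.copy()
df_clean['keyword'] = df_clean['keyword'].str.replace('%20', ' ', regex=False)   # 'forest%20fire' -> 'forest fire'
df_clean['location'] = df_clean['location'].replace(loc_map)                     # collapse duplicate locations
df_no_loc = df_clean.drop(columns=['location'])                                  # variant with location removed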

Then I realized, before I can determine how those edits would affect my results… I need results. I was reminded that for our last lesson we started by creating a baseline. I decided to do this in the simplest way I could think of and iterate from there. To me, that was squishing all the default features of the data into a single string, and training the model on that input.

Code
input_col = 'inputs'
na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.text.fillna(na_fill)
df.head()
id keyword location text target inputs
0 1 NaN NaN Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all 1 LOC: ; KW: ; TEXT: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1 LOC: ; KW: ; TEXT: Forest fire near La Ronge Sask. Canada
2 5 NaN NaN All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected 1 LOC: ; KW: ; TEXT: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
3 6 NaN NaN 13,000 people receive #wildfires evacuation orders in California 1 LOC: ; KW: ; TEXT: 13,000 people receive #wildfires evacuation orders in California
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 1 LOC: ; KW: ; TEXT: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
Code
# Suppress warnings to make the output look cleaner
from transformers.utils import logging
import warnings

warnings.filterwarnings("ignore")
Code
from datasets import Dataset,DatasetDict
from transformers import AutoModelForSequenceClassification,AutoTokenizer

# convert dataframe into a huggingface dataset
ds = Dataset.from_pandas(df)

model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x): return tokz(x[input_col])

# tokenize the input
tok_ds = ds.map(tok_func, batched=True)
tok_ds = tok_ds.rename_columns({'target':'labels'})
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

I briefly considered trying to engineer a perfect validation dataset before remembering that the goal was to create a simple baseline first.

Code
# Splitting up the validation set now
dds = tok_ds.train_test_split(0.25, seed=1337)
dds
DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'inputs', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5709
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'inputs', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1904
    })
})

Defining the F1 Metric

The Kaggle competition scores submissions with the F1 metric between the predicted and expected values, so we have to define it in a way that Hugging Face understands in order to use it during training.

Code
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "f1": f1_score(labels, preds),
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds)
    }
df_eval = dds['test']
df_eval
Dataset({
    features: ['id', 'keyword', 'location', 'text', 'labels', 'inputs', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1904
})

Now we can run our training loop. I'm not deeply concerned with every hyperparameter here, because I'm not there yet. I just want to focus on the number of epochs, the batch size bs, and the learning rate lr.

Code
from transformers import TrainingArguments,Trainer


epochs = 4
bs = 128
lr = 8e-5

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=compute_metrics)

trainer.train();
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[180/180 00:27, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.524064 0.628931 0.752101 0.950119 0.470035
2 No log 0.456287 0.751911 0.812500 0.920068 0.635723
3 No log 0.462908 0.783172 0.824055 0.871758 0.710928
4 No log 0.476750 0.787008 0.820903 0.840000 0.740306

Iterating from the Baseline Results

Alright, ok, thank God. It took a while, but we got a baseline. I tried a few random seeds for the test split and got F1 values of .7949, .7948, and .8049. This is good because it tells me that the metric doesn't change much across a few random splits. Now we want to see which tweets are hardest to classify. We started with the baseline to get a sense of what data gives the model trouble; we didn't curate a validation set because we were stumped on which aspects of the data to sort into a validation set. Analyzing the tweets that are hardest to classify hopefully gives us a sense of which data features lead to trouble for the model.
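
The seed check itself was nothing fancy; a minimal sketch of the kind of loop I ran (re-splitting and retraining from scratch for each seed; the seed values here are just examples) would look like this:

Code
# Rough sketch of the split-stability check: new split, fresh model, same training args
for seed in (42, 1337, 2024):                       # example seeds
    dds_seed = tok_ds.train_test_split(0.25, seed=seed)
    model_seed = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
    trainer_seed = Trainer(model_seed, args, train_dataset=dds_seed['train'], eval_dataset=dds_seed['test'],
                           tokenizer=tokz, compute_metrics=compute_metrics)
    trainer_seed.train()
    print(seed, trainer_seed.evaluate()['eval_f1'])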

Analyzing the Model’s “Trouble” Tweets (or Error Analysis)

First I'll store the main outputs of the training run for further analysis.

Code
import torch
import torch.nn.functional as F
from fastai.vision.all import *

# save our validation set under a new name
df_eval = dds['test']

# Get predictions and probabilities
outputs = trainer.predict(df_eval)
logits = outputs.predictions
labels = outputs.label_ids
probs = F.softmax(tensor(logits), dim=1).numpy()
preds = probs.argmax(axis=1)

Now I want to create a dataframe with additional relevant features. We already have the training output, so these new features will be derived from it. Once I have this dataframe I can start visualizing it in different ways.

Code
# Per-example cross-entropy computed against the predicted class
# (so it acts more like an inverse-confidence score than a true loss)
losses = [F.cross_entropy(tensor(logit), tensor(pred)) for logit, pred in zip(logits, preds)]

def count_cap(text):
    words = text.split()
    count = 0
    for word in words:
        if word and word[0].isupper():
            count += 1
    return count

# Set up a new dataframe with the results of the first pass
df_valid_results = pd.DataFrame({
    "text": [df_eval[i]["text"] for i in range(len(df_eval))],
    "keyword": [df_eval[i].get("keyword") or "n/a" for i in range(len(labels))],
    "location": [df_eval[i].get("location") or "n/a" for i in range(len(labels))],
    "label": labels,
    "cap_ratio": [count_cap(twt)/float(len(twt)) for twt in df_eval['text']],
    "pred": preds,
    "prob_1": probs[:, 1],
    "contains_link": [twt.count("http") > 0 for twt in df_eval['text']],
    "tweet_len": [len(twt) for twt in df_eval['text']],
    "hashtags": [twt.count("#") for twt in df_eval['text']],
    "is_location": [bool(loc) for loc in df_eval['location']],
    "loss": losses
})

# Tag confusion type
def label_type(row):
    if row.label == 1 and row.pred == 1: return "TP"
    elif row.label == 0 and row.pred == 0: return "TN"
    elif row.label == 1 and row.pred == 0: return "FN"
    else: return "FP"

df_valid_results["type"] = df_valid_results.apply(label_type, axis=1)

EDA on Baseline Results

You can see from the construction of this dataframe some of my ideas about relevant features. Frankly, I spent too much time going down the feature engineering rabbit hole. The whole point of our baseline is that we can see which tweets the model finds hardest to classify. Theoretically this will give us some insight into how to structure a more optimal input string.

The first method I tried was sorting the tweets by their cross-entropy loss.

Code
df_valid_results.sort_values(by="loss", ascending=False).head()
text keyword location label cap_ratio pred prob_1 contains_link tweet_len hashtags is_location loss type
264 'Since1970the 2 biggest depreciations in CAD:USD in yr b4federal election coincide w/landslide win for opposition' http://t.co/wgqKXmby3B landslide n/a 0 0.007299 1 0.500412 True 137 0 False tensor(0.6923) FP
1831 Firepower in the lab [electronic resource] : automation in the fight against infectious diseases and bioterrorism /‰Û_ http://t.co/KvpbybglSR bioterrorism n/a 0 0.007092 1 0.501099 True 141 0 False tensor(0.6910) FP
654 @JakeGint the mass murder got her hot and bothered but at heart she was always a traditionalist. mass%20murder n/a 1 0.000000 0 0.498703 False 96 0 False tensor(0.6906) FN
399 Of what use exactly is the national Assembly? Honestly they are worthless. We are derailed. derailed Kwara, Nigeria 0 0.043956 0 0.496643 False 91 0 True tensor(0.6865) TN
960 Hat #russian soviet army kgb military #cossack #ushanka LINK:\nhttp://t.co/bla42Rdt1O http://t.co/EInSQS8tFq military n/a 0 0.018349 1 0.503387 True 109 3 False tensor(0.6864) FP

I can immediately see a problem with this: we don't only get incorrect predictions! Some of the highest loss values in the validation set belong to correct predictions (there's a true negative in the top five above). This means we have to sort our data by different values to determine which tweets caused the most trouble. My strategy was to sort by a new “confidence” feature and only look at the rows that were predicted incorrectly.

Code
df_valid_results["prob_0"] = 1 - df_valid_results["prob_1"]
df_valid_results["is_wrong"] = (df_valid_results["label"] != df_valid_results["pred"])
df_valid_results["confidence"] = df_valid_results[["prob_1", "prob_0"]].max(axis=1)

conf_sorted = df_valid_results.sort_values(by="confidence", ascending=False)
filtered = conf_sorted[conf_sorted["is_wrong"] == True]


filtered.head()
text keyword location label cap_ratio pred prob_1 contains_link tweet_len hashtags is_location loss type prob_0 is_wrong confidence
770 Over half of poll respondents worry nuclear disaster fading from public consciousness http://t.co/YtnnnD631z ##fukushima nuclear%20disaster Fukushima city Fukushima.pref 0 0.008333 1 0.998309 True 120 2 True tensor(0.0017) FP 0.001691 True 0.998309
1877 Angry Woman Openly Accuses NEMA Of Stealing Relief Materials Meant For IDPs: An angry Internally Displaced wom... http://t.co/6ySbCSSzYS displaced Nigeria 0 0.110294 1 0.998276 True 136 0 True tensor(0.0017) FP 0.001724 True 0.998276
305 #hot C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980 http://t.co/zY3hpdJNwg #prebreak #best hostages china 0 0.015873 1 0.998269 True 126 3 True tensor(0.0017) FP 0.001731 True 0.998269
89 Satellite Spies Super Typhoon Soudelor from Space (Photo) http://t.co/VBhu2t8wgB typhoon Evergreen Colorado 0 0.075000 1 0.998245 True 80 0 True tensor(0.0018) FP 0.001755 True 0.998245
378 Angry Woman Openly Accuses NEMA Of Stealing Relief Materials Meant For IDPs: An angry Internally Displaced wom... http://t.co/Khd99oZ7u3 displaced Ojodu,Lagos 0 0.110294 1 0.998168 True 136 0 True tensor(0.0018) FP 0.001832 True 0.998168

These are the tweets that the model was most confident about yet got wrong. Unfortunately, I was never really able to extract more meaningful insight from them. When I compare features like keyword, location, contains_link, or hashtags, there just isn't much difference between the wrong predictions and the full validation set.

Code
def compare_preds(feature):
    full_set_data, wrongs_only_data = conf_sorted, filtered
    max_total_perc = full_set_data[feature].value_counts().max()/len(conf_sorted)
    max_wrong_perc = wrongs_only_data[feature].value_counts().max()/len(filtered)
    print(feature, max_total_perc, max_wrong_perc)


compare_preds('keyword')
compare_preds('location')
compare_preds('contains_link')
compare_preds('hashtags')
filtered['keyword'].value_counts()
keyword 0.007878151260504201 0.017595307917888565
location 0.3382352941176471 0.31671554252199413
contains_link 0.5341386554621849 0.5043988269794721
hashtags 0.7657563025210085 0.7771260997067448
keyword
bioterror              6
hellfire               5
pandemonium            5
burning%20buildings    4
fire                   4
                      ..
sirens                 1
wild%20fires           1
bombed                 1
injured                1
landslide              1
Name: count, Length: 168, dtype: int64

We see that the keyword distribution is significantly different in the incorrect portion of the dataset. But again, the only meaningful way to exploit this would be to break up each of these problem keywords into a train/valid proportion. We would hope that the model would generalize from the first few keywords and be able to predict the others too, but that doesn't seem likely. Unfortunately, it seems like the bins for each keyword are too small for the model to make meaningful generalizations.
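
If I did want to pursue that approach anyway, a stratified split on keyword would be the straightforward way to spread each keyword proportionally across train and validation. A minimal sketch (assuming sklearn; this isn't what I ended up using):

Code
from sklearn.model_selection import train_test_split

# Sketch: stratify the split on keyword so each keyword appears in both sets
# in roughly the same proportion. Missing keywords get their own bucket.
kw = df['keyword'].fillna('none')
train_idx, valid_idx = train_test_split(df.index, test_size=0.25, random_state=1337, stratify=kw)
df_train_kw, df_valid_kw = df.loc[train_idx], df.loc[valid_idx]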

Model Tweaking

I think I just want to try some small tweaks now to see if we can optimize the results. I thought about using another model to get sentiment analysis for the tweets, but I’m not sure if I want to do that right now.
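
For what it's worth, prototyping that idea would only take a few lines with the transformers pipeline API; a rough sketch (untested, using whatever default checkpoint the pipeline picks):

Code
from transformers import pipeline

# Sketch: score a small sample of tweets for sentiment and attach it as a column
sentiment = pipeline('sentiment-analysis')
sample = df['text'].head(100).tolist()                      # keep the sample small for a quick look
df.loc[df.index[:100], 'sentiment'] = [s['label'] for s in sentiment(sample)]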

Code
# Some functions that expedite turning new inputs into training data
epochs = 4
bs = 128
lr = 8e-5
wd = 0.01

def get_dds(df):
    inps = "location", "keyword", "text"
    ds = Dataset.from_pandas(df).rename_column('target', 'label')
    tok_ds = ds.map(tok_func, batched=True, remove_columns=inps)
    dds = tok_ds.train_test_split(0.25, seed=52)
    return dds

def get_model(): return AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)

def get_trainer(dds, model=None):
    if model is None: model = get_model()
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
        evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
        num_train_epochs=epochs, weight_decay=wd, report_to='none')
    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=compute_metrics)

    

Now I can just run some crazy experiments!

Code
# Testing the effect of switching the fill string
na_fill = 'none'
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.text.fillna(na_fill)

print(df.inputs.head())
dds = get_dds(df)
get_trainer(dds).train()
0                                                                    LOC: none; KW: none; TEXT: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
1                                                                                                   LOC: none; KW: none; TEXT: Forest fire near La Ronge Sask. Canada
2    LOC: none; KW: none; TEXT: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
3                                                                        LOC: none; KW: none; TEXT: 13,000 people receive #wildfires evacuation orders in California 
4                                                 LOC: none; KW: none; TEXT: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
Name: inputs, dtype: object
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[180/180 00:28, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.435222 0.743494 0.818803 0.899281 0.633714
2 No log 0.412216 0.797673 0.835609 0.813984 0.782003
3 No log 0.490506 0.782878 0.816176 0.766707 0.799747
4 No log 0.489541 0.789371 0.829307 0.807692 0.771863

TrainOutput(global_step=180, training_loss=0.35910987854003906, metrics={'train_runtime': 28.2531, 'train_samples_per_second': 808.264, 'train_steps_per_second': 6.371, 'total_flos': 464396789823708.0, 'train_loss': 0.35910987854003906, 'epoch': 4.0})

Doesn’t seem to be a large effect. Does lowercasing the strings help?

Code
na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.text.fillna(na_fill)
df[input_col] = df.inputs.str.lower()


dds = get_dds(df)
get_trainer(dds).train()
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[180/180 00:27, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.440292 0.733683 0.813550 0.898897 0.619772
2 No log 0.412970 0.790576 0.831933 0.817321 0.765526
3 No log 0.454561 0.795322 0.834559 0.816000 0.775665
4 No log 0.493377 0.793103 0.829832 0.799228 0.787072

TrainOutput(global_step=180, training_loss=0.3674246470133464, metrics={'train_runtime': 27.6703, 'train_samples_per_second': 825.29, 'train_steps_per_second': 6.505, 'total_flos': 449771816332608.0, 'train_loss': 0.3674246470133464, 'epoch': 4.0})

What about removing the “%20”s from the keywords?

Code
na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill).str.replace("%20", ' ') + '; TEXT: ' + df.text.fillna(na_fill)

dds = get_dds(df)
get_trainer(dds).train()
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[180/180 00:28, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.443140 0.737075 0.813025 0.883186 0.632446
2 No log 0.403884 0.785333 0.830882 0.828411 0.746515
3 No log 0.468701 0.784762 0.821954 0.786260 0.783270
4 No log 0.497474 0.789809 0.826681 0.793854 0.785805

TrainOutput(global_step=180, training_loss=0.35947723388671876, metrics={'train_runtime': 28.3494, 'train_samples_per_second': 805.518, 'train_steps_per_second': 6.349, 'total_flos': 461217459009756.0, 'train_loss': 0.35947723388671876, 'epoch': 4.0})

None of these transformations seem to be helping us become more accurate. There are several more experiments that we could run, but I had another idea that I wanted to try first…

Special Tokens?

I noticed that the high-confidence incorrect predictions above all contained links. I wonder what happens if I turn some subset of the links, mentions, and hashtags into special tokens. I think it makes the most sense to do this with mentions and links, because their text doesn't have much to do with the meaning of the tweet; a hashtag, however, does contain useful information. But I'll run some experiments and see what happens.

Code
# create new special tokens
new_toks = ["[L]", "[A]", "[X]"]
tokz.add_special_tokens({'additional_special_tokens': new_toks})

# remove links and replace with [L]
df.loc[:, 'mod_text'] = (
    df['text']
    .astype(str) 
    .str.replace(r'https?://\S+', '[L]', regex=True))   # Match http or https links


na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.mod_text.fillna(na_fill)

dds = get_dds(df)
get_trainer(dds, model).train()
[180/180 00:24, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.357041 0.801756 0.857668 0.948097 0.694550
2 No log 0.344914 0.837116 0.871849 0.884344 0.794677
3 No log 0.431433 0.828608 0.860294 0.842726 0.814956
4 No log 0.467705 0.826142 0.856092 0.827192 0.825095

TrainOutput(global_step=180, training_loss=0.22589988708496095, metrics={'train_runtime': 24.4876, 'train_samples_per_second': 932.552, 'train_steps_per_second': 7.351, 'total_flos': 374670733971756.0, 'train_loss': 0.22589988708496095, 'epoch': 4.0})

Awesome, huge jump! This makes sense because there's no great way to tokenize links and get real meaning from them; the text of a shortened link doesn't really tell much about what's in it.
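
You can see the difference directly by tokenizing one of the shortened links from the tweets above versus the new special token (output not shown; the exact subword pieces depend on the tokenizer):

Code
# A shortened URL gets chopped into many near-meaningless subword pieces...
print(tokz.tokenize('http://t.co/VBhu2t8wgB'))
# ...while the added special token survives as a single token with its own id
print(tokz.tokenize('[L]'), tokz.convert_tokens_to_ids('[L]'))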

Code
# keep previous formatting and remove mentions replace with [A]
df.loc[:, 'mod_text'] = (
    df['text']
    .astype(str)
    .str.replace(r'https?://\S+', '[L]', regex=True)
    .str.replace(r'@\w+', '[A]', regex=True))

na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.mod_text.fillna(na_fill)

dds = get_dds(df)
get_trainer(dds, model).train()
[180/180 00:23, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.532192 0.827950 0.862920 0.862637 0.795944
2 No log 0.574797 0.800705 0.821954 0.746711 0.863118
3 No log 0.538084 0.822023 0.853992 0.830530 0.813688
4 No log 0.640214 0.824639 0.853466 0.817955 0.831432

TrainOutput(global_step=180, training_loss=0.12367356618245443, metrics={'train_runtime': 23.7139, 'train_samples_per_second': 962.979, 'train_steps_per_second': 7.59, 'total_flos': 362423531491416.0, 'train_loss': 0.12367356618245443, 'epoch': 4.0})
Code
# keep previous formatting and remove hashtags replace with [X]
df.loc[:, 'mod_text'] = (
    df['text']
    .astype(str)
    .str.replace(r'https?://\S+', '[L]', regex=True)
    .str.replace(r'@\w+', '[A]', regex=True)
    .str.replace(r'#\w+', '[X]', regex=True)
)

na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.mod_text.fillna(na_fill)

dds = get_dds(df)
get_trainer(dds, model).train()
[180/180 00:23, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.682632 0.799759 0.825630 0.762946 0.840304
2 No log 0.551201 0.802696 0.830882 0.776987 0.830165
3 No log 0.612302 0.800490 0.828782 0.773964 0.828897
4 No log 0.698242 0.804938 0.834034 0.784597 0.826362

TrainOutput(global_step=180, training_loss=0.08939082887437609, metrics={'train_runtime': 23.3512, 'train_samples_per_second': 977.936, 'train_steps_per_second': 7.708, 'total_flos': 352560827121540.0, 'train_loss': 0.08939082887437609, 'epoch': 4.0})

Ok, so as I predicted, the best combination is masking the links and mentions, each with its own new special token. This makes sense because hashtags are often real words, whereas mentions only rarely are, and links never are. It's all about tokenization: if there are pieces of data that aren't easy to tokenize, the model spends resources trying to do so. Or worse, it sees correlations where there aren't any.

Submitting Results

I'm deciding to submit pretty early. I'm using what I know how to do so far and I want to see how far that takes me. I'll leave some examples of further analysis that I'd consider in order to make this a more competitive result.

Code
# Rewrote dataframe transformation as a function
def transform_df(df_x):
    df_x.loc[:, 'mod_text'] = (
        df_x['text']
        .astype(str)
        .str.replace(r'https?://\S+', '[L]', regex=True)
        .str.replace(r'@\w+', '[A]', regex=True))
    na_fill = ''
    df_x[input_col] = 'LOC: ' + df_x.location.fillna(na_fill) + '; KW: ' + df_x.keyword.fillna(na_fill) + '; TEXT: ' + df_x.mod_text.fillna(na_fill)
    return df_x
    
df_trans = transform_df(df)
dds = get_dds(df_trans)
trainer = get_trainer(dds, model)
trainer.train()
[180/180 00:23, Epoch 4/4]
Epoch Training Loss Validation Loss F1 Accuracy Precision Recall
1 No log 0.948050 0.793727 0.820378 0.757192 0.833967
2 No log 0.806260 0.805412 0.841387 0.819135 0.792142
3 No log 1.134959 0.780175 0.801996 0.722462 0.847909
4 No log 0.833781 0.795699 0.830357 0.794192 0.797212

TrainOutput(global_step=180, training_loss=0.035915311177571616, metrics={'train_runtime': 23.6941, 'train_samples_per_second': 963.784, 'train_steps_per_second': 7.597, 'total_flos': 362423531491416.0, 'train_loss': 0.035915311177571616, 'epoch': 4.0})

I rewrote the transformation as a function so we can be sure we do exactly the same thing to our test dataset.

Code
df_test.head()
id keyword location text
0 0 NaN NaN Just happened a terrible car crash
1 2 NaN NaN Heard about #earthquake is different cities, stay safe everyone.
2 3 NaN NaN there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all
3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires
4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan
Code
# Transform test dataset based on work we just did

df_test_trans = transform_df(df_test)
test_ds = Dataset.from_pandas(df_test).map(tok_func, batched=True)
outputs = trainer.predict(test_ds)
print(len(test_ds), len(df))
3263 7613
Code
outputs = trainer.predict(test_ds)
logits = outputs.predictions
probs = F.softmax(tensor(logits), dim=1).numpy()
preds = probs.argmax(axis=1)
df_test_results = pd.DataFrame({
    "pred": preds,
})
df_test.id
0           0
1           2
2           3
3           9
4          11
        ...  
3258    10861
3259    10865
3260    10868
3261    10874
3262    10875
Name: id, Length: 3263, dtype: int64

Now we can finally submit to Kaggle.

Code
import datasets
submission = datasets.Dataset.from_dict({
    'id': df_test['id'],
    'target': df_test_results["pred"]
})
print(len(df_test.id), len(df_test_results.pred))
submission.to_csv('nlp_tweets_submission.csv', index=False)
3263 3263
22746
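
The CSV can be uploaded on the competition page, or submitted straight from the notebook with the Kaggle CLI (using the same API credentials as the download step; the submission message is arbitrary):

Code
# Submit via the Kaggle CLI
!kaggle competitions submit -c nlp-getting-started -f nlp_tweets_submission.csv -m "deberta-v3-small + [L]/[A] special tokens"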

Reflection

So, I performed right in the middle of the pack (rank 367/784), with a leaderboard score of .79895 on the competition's F1 metric. This makes sense; I didn't really do a mega-deep dive into the data, I mostly just wanted to get some predictions and submit them. If I have time I'll go deeper into it and see if I can get a higher score. Here's what I would try:

  • I would probably try to use a different model. We used DeBERTa for prototyping, but a stronger DeBERTa or another model would likely do better, maybe something specifically trained on tweets (a rough sketch of that swap follows below).
  • I would add sentiment analysis to see if I could use that feature to make a more robust validation set. We never quite found a feature that reliably marked the harder-to-classify tweets, so we couldn't build a meaningful validation set. I would want to research this further and see if a better validation set could be created.
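
For the first idea, the helper functions above make the swap almost a one-liner. A rough sketch (hypothetical; I haven't run these checkpoints here):

Code
# Sketch: swap in a larger or tweet-specific checkpoint and retrain from scratch.
# 'vinai/bertweet-base' is an example of a model pretrained on tweets.
model_nm = 'microsoft/deberta-v3-base'
tokz = AutoTokenizer.from_pretrained(model_nm)
tokz.add_special_tokens({'additional_special_tokens': new_toks})

model = get_model()
model.resize_token_embeddings(len(tokz))    # make room for the [L]/[A]/[X] tokens
get_trainer(get_dds(transform_df(df)), model).train()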