This lesson focused on applying NLP using Hugging Face's Transformers library; in the book we used the fastai library. I decided to apply what I learned to the Kaggle NLP disaster tweets competition. I'll be referencing the two course notebooks for this lesson: Chapter 10 and getting-started-with-NLP.
The goal of this competition is to take a set of tweets and determine, based on their text and metadata, whether or not they refer to real disasters. We use Natural Language Processing (NLP) to fit a model to the tweets and make our predictions.
The first thing I did was make sure I could download the dataset.
Downloading the Dataset
I wrote a short script to download the dataset. Actually, I just used the code from the getting-started notebook.
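For reference, here's a minimal sketch of that download step, assuming the Kaggle API credentials are already configured and that the competition slug is nlp-getting-started (the slug and local paths are assumptions, not something fixed by this post):
Code
# minimal sketch: grab and unpack the competition data with the kaggle package
# (assumes ~/.kaggle/kaggle.json credentials are already set up)
import zipfile
from pathlib import Path
import pandas as pd
from kaggle import api

path = Path('nlp-getting-started')
if not path.exists():
    api.competition_download_cli(str(path))           # downloads nlp-getting-started.zip
    zipfile.ZipFile(f'{path}.zip').extractall(path)   # unzip into ./nlp-getting-started

df = pd.read_csv(path/'train.csv')
df_test = pd.read_csv(path/'test.csv')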
(Output: a describe()-style summary of the dataframe's keyword, location, and text columns, showing each column's most frequent value and its count; truncated here.)
Analyzing Data Distributions
Since this is my first time looking at the dataset, I want to see what category breakdowns seem to be significant. Basically, I want to get a sense of what variables it might be helpful to look at more closely.
I spent a good amount of time looking into novel ways to break the data down to give the model an optimal input. I considered:
replacing the %20 in the keywords with a space
removing the location to see what impact that has
collapsing duplicate locations into one (theoretically, USA == United States)
Then I realized that before I can determine how those edits would affect my results… I need results. I was reminded that in our last lesson we started by creating a baseline, so I decided to do that in the simplest way I could think of and iterate from there. To me, that meant squishing all the default features of the data into a single string and training the model on that input.
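The cell that builds that string isn't shown here, but it amounts to the following sketch (the input_col name and the empty-string fill are the same ones the later experiments in this post use):
Code
# baseline input: squash location, keyword, and text into one string per tweet
input_col = 'inputs'
na_fill = ''   # missing keyword/location just become empty fields
df[input_col] = ('LOC: ' + df.location.fillna(na_fill)
                 + '; KW: ' + df.keyword.fillna(na_fill)
                 + '; TEXT: ' + df.text.fillna(na_fill))
df.head()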
|   | id | keyword | location | text | target | inputs |
|---|----|---------|----------|------|--------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1 | LOC: ; KW: ; TEXT: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 | LOC: ; KW: ; TEXT: Forest fire near La Ronge Sask. Canada |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected | 1 | LOC: ; KW: ; TEXT: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation orders in California | 1 | LOC: ; KW: ; TEXT: 13,000 people receive #wildfires evacuation orders in California |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school | 1 | LOC: ; KW: ; TEXT: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school |
Code
# Suppress warnings to make the output look cleaner
from transformers.utils import logging
import warnings

warnings.filterwarnings("ignore")
Code
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# convert dataframe into a huggingface dataset
ds = Dataset.from_pandas(df)

model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)

def tok_func(x): return tokz(x[input_col])

# tokenize the input
tok_ds = ds.map(tok_func, batched=True)
tok_ds = tok_ds.rename_columns({'target': 'labels'})
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I briefly considered trying to engineer a perfect validation dataset before remembering that the goal was to create a simple baseline first.
Code
# Splitting up the validation set now
dds = tok_ds.train_test_split(0.25, seed=1337)
dds
The Kaggle competition scores submissions with the F1 metric between the predicted and expected values, so we have to define it in a form that Hugging Face's Trainer understands in order to use it during training.
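The metric function itself isn't reproduced above; a minimal version that yields the F1, accuracy, precision, and recall columns reported below could look like this (using scikit-learn; the exact implementation in the notebook may differ):
Code
import numpy as np
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # the Trainer hands us (logits, labels); turn the logits into class predictions
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "f1": f1_score(labels, preds),
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
    }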
Now we can run our training loop. I'm not deeply concerned with every hyperparameter here, because I'm not there yet. I just want to focus on the number of epochs, the batch size (bs), and the learning rate (lr).
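The training cell isn't reproduced here either; a sketch consistent with the helper functions defined later in this post (same TrainingArguments, same model) would be:
Code
from transformers import TrainingArguments, Trainer

epochs, bs, lr, wd = 4, 128, 8e-5, 0.01
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                         lr_scheduler_type='cosine', fp16=True,
                         evaluation_strategy='epoch',
                         per_device_train_batch_size=bs,
                         per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs, weight_decay=wd,
                         report_to='none')
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=compute_metrics)
trainer.train()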
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[180/180 00:27, Epoch 4/4]
| Epoch | Training Loss | Validation Loss | F1 | Accuracy | Precision | Recall |
|-------|---------------|-----------------|----|----------|-----------|--------|
| 1 | No log | 0.524064 | 0.628931 | 0.752101 | 0.950119 | 0.470035 |
| 2 | No log | 0.456287 | 0.751911 | 0.812500 | 0.920068 | 0.635723 |
| 3 | No log | 0.462908 | 0.783172 | 0.824055 | 0.871758 | 0.710928 |
| 4 | No log | 0.476750 | 0.787008 | 0.820903 | 0.840000 | 0.740306 |
Iterating from the Baseline Results
Alright, thank God. It took a while, but we have a baseline. I tried a few random seeds for the test split and got F1 values of .7949, .7948, and .8049, which is reassuring: across a few random splits the metric doesn't change much. Now we want to see which tweets are hardest to classify. We started with a baseline precisely to get a sense of what data gives the model trouble; we didn't curate a validation set because we were stumped on which aspects of the data to sort into one. Analyzing the tweets that are hardest to classify should give us a sense of which data features lead to trouble for the model.
Analyzing the Model’s “Trouble” Tweets (or Error Analysis)
First I'll store the main outputs of the training run for further analysis.
Code
import torch
import torch.nn.functional as F
from fastai.vision.all import *

# save our validation set under a new name
df_eval = dds['test']

# Get predictions and probabilities
outputs = trainer.predict(df_eval)
logits = outputs.predictions
labels = outputs.label_ids
probs = F.softmax(tensor(logits), dim=1).numpy()
preds = probs.argmax(axis=1)
Now I want to create a dataframe that has additional, potentially relevant features. We already have the training output, so these new features will be derived from it. Once I have this dataframe I can start visualizing it in different ways.
Code
# Compute per-example cross-entropy loss against the predicted class
losses = [F.cross_entropy(tensor(logit), tensor(pred)) for logit, pred in zip(logits, preds)]

def count_cap(text):
    words = text.split()
    count = 0
    for word in words:
        if word and word[0].isupper(): count += 1
    return count

# Set up a new dataframe with the results of the first pass
df_valid_results = pd.DataFrame({
    "text": [df_eval[i]["text"] for i in range(len(df_eval))],
    "keyword": [df_eval[i].get("keyword") or "n/a" for i in range(len(labels))],
    "location": [df_eval[i].get("location") or "n/a" for i in range(len(labels))],
    "label": labels,
    "cap_ratio": [count_cap(twt)/float(len(twt)) for twt in df_eval['text']],
    "pred": preds,
    "prob_1": probs[:, 1],
    "contains_link": [twt.count("http") > 0 for twt in df_eval['text']],
    "tweet_len": [len(twt) for twt in df_eval['text']],
    "hashtags": [twt.count("#") for twt in df_eval['text']],
    "is_location": [bool(loc) for loc in df_eval['location']],
    "loss": losses
})

# Tag confusion type
def label_type(row):
    if row.label == 1 and row.pred == 1: return "TP"
    elif row.label == 0 and row.pred == 0: return "TN"
    elif row.label == 1 and row.pred == 0: return "FN"
    else: return "FP"

df_valid_results["type"] = df_valid_results.apply(label_type, axis=1)
EDA on Baseline Results
You can see from the creation of this dataframe some of my ideas about relevant features. Frankly, I spent too much time going down the feature-engineering rabbit hole. The whole point of the baseline is that we can see which tweets the model finds hardest to classify. Theoretically, this will give us some insight into how to structure a more optimal input string.
The first method I tried was sorting the tweets by their per-example cross-entropy loss.
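That's a one-liner on the results dataframe (the key argument just converts the stored loss tensors to floats for sorting):
Code
# highest per-example loss first
df_valid_results.sort_values('loss', ascending=False, key=lambda s: s.map(float)).head()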
|  | text | keyword | location | label | cap_ratio | pred | prob_1 | contains_link | tweet_len | hashtags | is_location | loss | type |
|---|------|---------|----------|-------|-----------|------|--------|---------------|-----------|----------|-------------|------|------|
| … | 'Since1970the 2 biggest depreciations in CAD:USD in yr b4federal election coincide w/landslide win for opposition' http://t.co/wgqKXmby3B | landslide | n/a | 0 | 0.007299 | 1 | 0.500412 | True | 137 | 0 | False | tensor(0.6923) | FP |
| 1831 | Firepower in the lab [electronic resource] : automation in the fight against infectious diseases and bioterrorism /‰Û_ http://t.co/KvpbybglSR | bioterrorism | n/a | 0 | 0.007092 | 1 | 0.501099 | True | 141 | 0 | False | tensor(0.6910) | FP |
| 654 | @JakeGint the mass murder got her hot and bothered but at heart she was always a traditionalist. | mass%20murder | n/a | 1 | 0.000000 | 0 | 0.498703 | False | 96 | 0 | False | tensor(0.6906) | FN |
| 399 | Of what use exactly is the national Assembly? Honestly they are worthless. We are derailed. | derailed | Kwara, Nigeria | 0 | 0.043956 | 0 | 0.496643 | False | 91 | 0 | True | tensor(0.6865) | TN |
| 960 | Hat #russian soviet army kgb military #cossack #ushanka LINK:\nhttp://t.co/bla42Rdt1O http://t.co/EInSQS8tFq | military | n/a | 0 | 0.018349 | 1 | 0.503387 | True | 109 | 3 | False | tensor(0.6864) | FP |
I can immediately see a problem with this: we don't only get incorrect predictions! In fact, the highest loss value across our entire set of results belongs to a correct prediction. That means we have to sort the data by different values to find the tweets that caused the most trouble. My strategy was to sort by a new "confidence" feature and only look at the rows that were predicted incorrectly.
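A sketch of that step, assuming "confidence" is just the probability the model assigned to its predicted class; the extra columns added here (prob_true_label, incorrect, confidence) correspond to the three right-most columns of the output below:
Code
import numpy as np

# probability assigned to the true label, an incorrect flag, and the model's confidence
df_valid_results['prob_true_label'] = probs[np.arange(len(labels)), labels]
df_valid_results['incorrect'] = df_valid_results.label != df_valid_results.pred
df_valid_results['confidence'] = probs.max(axis=1)

# keep only the rows the model got wrong, most confident mistakes first
wrong = df_valid_results[df_valid_results.incorrect]
wrong.sort_values('confidence', ascending=False).head()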
|  | text | keyword | location | label | cap_ratio | pred | prob_1 | contains_link | tweet_len | hashtags | is_location | loss | type | prob_true_label | incorrect | confidence |
|---|------|---------|----------|-------|-----------|------|--------|---------------|-----------|----------|-------------|------|------|-----------------|-----------|------------|
| … | Over half of poll respondents worry nuclear disaster fading from public consciousness http://t.co/YtnnnD631z ##fukushima | nuclear%20disaster | Fukushima city Fukushima.pref | 0 | 0.008333 | 1 | 0.998309 | True | 120 | 2 | True | tensor(0.0017) | FP | 0.001691 | True | 0.998309 |
| 1877 | Angry Woman Openly Accuses NEMA Of Stealing Relief Materials Meant For IDPs: An angry Internally Displaced wom... http://t.co/6ySbCSSzYS | displaced | Nigeria | 0 | 0.110294 | 1 | 0.998276 | True | 136 | 0 | True | tensor(0.0017) | FP | 0.001724 | True | 0.998276 |
| 305 | #hot C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980 http://t.co/zY3hpdJNwg #prebreak #best | hostages | china | 0 | 0.015873 | 1 | 0.998269 | True | 126 | 3 | True | tensor(0.0017) | FP | 0.001731 | True | 0.998269 |
| 89 | Satellite Spies Super Typhoon Soudelor from Space (Photo) http://t.co/VBhu2t8wgB | typhoon | Evergreen Colorado | 0 | 0.075000 | 1 | 0.998245 | True | 80 | 0 | True | tensor(0.0018) | FP | 0.001755 | True | 0.998245 |
| 378 | Angry Woman Openly Accuses NEMA Of Stealing Relief Materials Meant For IDPs: An angry Internally Displaced wom... http://t.co/Khd99oZ7u3 | displaced | Ojodu,Lagos | 0 | 0.110294 | 1 | 0.998168 | True | 136 | 0 | True | tensor(0.0018) | FP | 0.001832 | True | 0.998168 |
These are the tweets that the model got wrong with the most confidence. Unfortunately, I never really was able to extract more meaningful patterns from them. When I compare features like keyword, location, contains_link, or hashtags, there just isn't much difference between the wrong predictions and the full dataset.
That said, the distribution of keyword values does look noticeably different in the incorrect portion of the dataset. But the only meaningful way to exploit that would be to split each of these problem keywords across the train/validation sets in some proportion. We would hope that the model could generalize from the first few examples of a keyword and predict the rest, but that doesn't seem likely. Unfortunately, the bins for each keyword are too small for the model to make meaningful generalizations.
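One rough way to eyeball that difference is to compare keyword frequencies in the wrong predictions against the whole validation set (a sketch reusing the wrong dataframe from above, not the exact comparison I ran):
Code
# share of each keyword among incorrect predictions vs. the whole validation set
wrong_kw = wrong['keyword'].value_counts(normalize=True).head(10)
all_kw = df_valid_results['keyword'].value_counts(normalize=True)
print(pd.concat([wrong_kw, all_kw.reindex(wrong_kw.index)], axis=1, keys=['wrong', 'all']))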
Model Tweaking
I think I just want to try some small tweaks now to see if we can optimize the results. I thought about using another model to get sentiment analysis for the tweets, but I’m not sure if I want to do that right now.
Code
# Some functions that expedite turning new inputs into training data
epochs = 4
bs = 128
lr = 8e-5
wd = 0.01

def get_dds(df):
    inps = "location", "keyword", "text"
    ds = Dataset.from_pandas(df).rename_column('target', 'label')
    tok_ds = ds.map(tok_func, batched=True, remove_columns=inps)
    dds = tok_ds.train_test_split(0.25, seed=52)
    return dds

def get_model():
    return AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)

def get_trainer(dds, model=None):
    if model is None: model = get_model()
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                             lr_scheduler_type='cosine', fp16=True,
                             evaluation_strategy="epoch",
                             per_device_train_batch_size=bs,
                             per_device_eval_batch_size=bs*2,
                             num_train_epochs=epochs, weight_decay=wd,
                             report_to='none')
    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=compute_metrics)
Now I can just run some crazy experiments!
Code
# Testing the effect of switching the fill string
na_fill = 'none'
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.text.fillna(na_fill)
print(df.inputs.head())

dds = get_dds(df)
get_trainer(dds).train()
0 LOC: none; KW: none; TEXT: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
1 LOC: none; KW: none; TEXT: Forest fire near La Ronge Sask. Canada
2 LOC: none; KW: none; TEXT: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
3 LOC: none; KW: none; TEXT: 13,000 people receive #wildfires evacuation orders in California
4 LOC: none; KW: none; TEXT: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
Name: inputs, dtype: object
None of these transformations seem to be helping us become more accurate. There are several more experiments that we could run, but I had another idea that I wanted to try first…
Special Tokens?
I noticed that the high-confidence incorrect predictions above all had links. I wonder what happens if I turn some subset of links, mentions, and hashtags into special tokens. I think it makes more sense to do it with mentions and links, because the text of each of those doesn't have much to do with the meaning of the tweet. A hashtag, however, does contain useful information. But I'll run some experiments and see what happens.
Code
# create new special tokens
new_toks = ["[L]", "[A]", "[X]"]
tokz.add_special_tokens({'additional_special_tokens': new_toks})
# the model's embedding matrix has to grow to cover the new token ids
model.resize_token_embeddings(len(tokz))

# remove links and replace with [L]
df.loc[:, 'mod_text'] = (
    df['text']
    .astype(str)
    .str.replace(r'https?://\S+', '[L]', regex=True)  # Match http or https links
)

na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.mod_text.fillna(na_fill)

dds = get_dds(df)
get_trainer(dds, model).train()
Awesome, huge jump! This makes sense, because there's no great way to tokenize links and extract real meaning from them; the text of a shortened link doesn't tell you much about what's behind it.
OK, so as I predicted, the best combination is masking the links and mentions with a single new special token each. This makes sense: hashtags are often real words, mentions only rarely are, and links never are. It's all about tokenization. If there are pieces of data that aren't easy to tokenize, the model spends capacity trying to tokenize them anyway, or worse, it sees correlations where there aren't any.
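The cell above only shows the link replacement; the mention masking follows the same pattern (a sketch, and my regex for mentions here is an approximation):
Code
# replace @mentions with the [A] special token, mirroring the link masking above
df.loc[:, 'mod_text'] = df['mod_text'].str.replace(r'@\w+', '[A]', regex=True)
df[input_col] = ('LOC: ' + df.location.fillna(na_fill)
                 + '; KW: ' + df.keyword.fillna(na_fill)
                 + '; TEXT: ' + df.mod_text.fillna(na_fill))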
Submitting Results
I'm deciding to submit pretty early. I'm using what I know how to do so far, and I want to see how far that takes me. At the end I leave some examples of further analysis I'd consider in order to make this a more competitive result.
I rewrote the transformation as a function so we can be sure we do exactly the same thing to our test dataset.
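That function isn't shown below, but a reconstruction consistent with the best-performing input above would look roughly like this (the name transform_df comes from the cell further down; the details are assumptions):
Code
def transform_df(df, na_fill=''):
    # apply the same masking and input-string construction used on the training data
    out = df.copy()
    out['mod_text'] = (out['text'].astype(str)
                       .str.replace(r'https?://\S+', '[L]', regex=True)   # mask links
                       .str.replace(r'@\w+', '[A]', regex=True))          # mask mentions
    out[input_col] = ('LOC: ' + out.location.fillna(na_fill)
                      + '; KW: ' + out.keyword.fillna(na_fill)
                      + '; TEXT: ' + out.mod_text.fillna(na_fill))
    return out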
Code
df_test.head()
|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, stay safe everyone. |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
Code
# Transform the test dataset the same way we transformed the training data
df_test_trans = transform_df(df_test)
test_ds = Dataset.from_pandas(df_test_trans).map(tok_func, batched=True)
outputs = trainer.predict(test_ds)
print(len(test_ds), len(df))
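From there, building the submission file is just a matter of taking the argmax of the predicted logits and pairing it with the test ids (a sketch; the competition's sample_submission format expects id and target columns):
Code
import numpy as np

# turn logits into 0/1 predictions and write the submission file
test_preds = outputs.predictions.argmax(axis=1)
submission = pd.DataFrame({'id': df_test['id'], 'target': test_preds})
submission.to_csv('submission.csv', index=False)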
So, I performed right in the middle of the pack (rank 367/784), with a leaderboard score of .79895. This makes sense: I didn't really do a deep dive into the data; I mostly just wanted to get some predictions and submit them. If I have time I'll go deeper and see if I can get a higher score. Here's what I would try:
I would probably try a different model. We used deberta-v3-small for prototyping, but a larger DeBERTa, or another model entirely (maybe one pretrained specifically on tweets), would likely do better.
I would add sentiment analysis to see if that feature could help build a more robust validation set. We never quite found a feature of the data that reliably marked hard-to-classify tweets, so we couldn't construct a meaningful validation set. I'd want to research this further and see whether a better one could be created.