This lesson focused on applying NLP using Hugging Face's Transformers library; in the book we used the fastai library. I decided to apply what I learned to the Kaggle NLP disaster tweets competition. I'll be referencing the two course notebooks for this lesson: Chapter 10 and getting-started-with-NLP.
The goal of this competition is to take a set of tweets and determine, based on their text and metadata, whether or not they refer to real disasters. We use Natural Language Processing (NLP) to fit a model to the tweets and make our predictions.
The first thing I did was make sure I could download the dataset.
Downloading the Dataset
I wrote a short script to download the dataset. Actually, I just used the code from the getting-started notebook.
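For reference, here's a minimal sketch of that download step, assuming the Kaggle API credentials are already configured and that the competition slug is nlp-getting-started (the slug and local paths are assumptions, not something fixed by this post):
Code
# minimal sketch: grab and unpack the competition data with the kaggle package
# (assumes ~/.kaggle/kaggle.json credentials are already set up)
import zipfile
from pathlib import Path
import pandas as pd
from kaggle import api

path = Path('nlp-getting-started')
if not path.exists():
    api.competition_download_cli(str(path))           # downloads nlp-getting-started.zip
    zipfile.ZipFile(f'{path}.zip').extractall(path)   # unzip into ./nlp-getting-started

df = pd.read_csv(path/'train.csv')
df_test = pd.read_csv(path/'test.csv')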
(Output: a describe()-style summary of the dataframe's keyword, location, and text columns, showing each column's most frequent value and its count; truncated here.)
Analyzing Data Distributions
Since this is my first time looking at the dataset, I want to see what category breakdowns seem to be significant. Basically, I want to get a sense of what variables it might be helpful to look at more closely.
I spent a good amount of time looking into novel ways to break the data down to give the model an optimal input. I considered:
replacing the %20 in the keywords with a space
removing the location to see what impact that has
collapsing duplicate locations into one (theoretically, USA == United States)
Then I realized that before I can determine how those edits would affect my results… I need results. I was reminded that in our last lesson we started by creating a baseline, so I decided to do that in the simplest way I could think of and iterate from there. To me, that meant squishing all the default features of the data into a single string and training the model on that input.
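The cell that builds that string isn't shown here, but it amounts to the following sketch (the input_col name and the empty-string fill are the same ones the later experiments in this post use):
Code
# baseline input: squash location, keyword, and text into one string per tweet
input_col = 'inputs'
na_fill = ''   # missing keyword/location just become empty fields
df[input_col] = ('LOC: ' + df.location.fillna(na_fill)
                 + '; KW: ' + df.keyword.fillna(na_fill)
                 + '; TEXT: ' + df.text.fillna(na_fill))
df.head()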
|   | id | keyword | location | text | target | inputs |
|---|----|---------|----------|------|--------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1 | LOC: ; KW: ; TEXT: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 | LOC: ; KW: ; TEXT: Forest fire near La Ronge Sask. Canada |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected | 1 | LOC: ; KW: ; TEXT: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation orders in California | 1 | LOC: ; KW: ; TEXT: 13,000 people receive #wildfires evacuation orders in California |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school | 1 | LOC: ; KW: ; TEXT: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school |
Code
# Suppress warnings to make the output look cleaner
from transformers.utils import logging
import warnings

warnings.filterwarnings("ignore")
Code
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# convert dataframe into a huggingface dataset
ds = Dataset.from_pandas(df)

model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)

def tok_func(x): return tokz(x[input_col])

# tokenize the input
tok_ds = ds.map(tok_func, batched=True)
tok_ds = tok_ds.rename_columns({'target': 'labels'})
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I briefly considered trying to engineer a perfect validation dataset before remembering that the goal was to create a simple baseline first.
Code
# Splitting up the validation set now
dds = tok_ds.train_test_split(0.25, seed=1337)
dds
The Kaggle competition scores submissions with the F1 metric between the predicted and expected values, so we have to define it in a form that Hugging Face's Trainer understands in order to use it during training.
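The metric function itself isn't reproduced above; a minimal version that yields the F1, accuracy, precision, and recall columns reported below could look like this (using scikit-learn; the exact implementation in the notebook may differ):
Code
import numpy as np
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # the Trainer hands us (logits, labels); turn the logits into class predictions
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "f1": f1_score(labels, preds),
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
    }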
Now we can run our training loop. I'm not deeply concerned with every hyperparameter here, because I'm not there yet. I just want to focus on the number of epochs, the batch size (bs), and the learning rate (lr).
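The training cell isn't reproduced here either; a sketch consistent with the helper functions defined later in this post (same TrainingArguments, same model) would be:
Code
from transformers import TrainingArguments, Trainer

epochs, bs, lr, wd = 4, 128, 8e-5, 0.01
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                         lr_scheduler_type='cosine', fp16=True,
                         evaluation_strategy='epoch',
                         per_device_train_batch_size=bs,
                         per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs, weight_decay=wd,
                         report_to='none')
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=compute_metrics)
trainer.train()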
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[180/180 00:27, Epoch 4/4]
| Epoch | Training Loss | Validation Loss | F1 | Accuracy | Precision | Recall |
|-------|---------------|-----------------|----|----------|-----------|--------|
| 1 | No log | 0.524064 | 0.628931 | 0.752101 | 0.950119 | 0.470035 |
| 2 | No log | 0.456287 | 0.751911 | 0.812500 | 0.920068 | 0.635723 |
| 3 | No log | 0.462908 | 0.783172 | 0.824055 | 0.871758 | 0.710928 |
| 4 | No log | 0.476750 | 0.787008 | 0.820903 | 0.840000 | 0.740306 |
Iterating from the Baseline Results
Alright, thank God. It took a while, but we have a baseline. I tried a few random seeds for the test split and got F1 values of .7949, .7948, and .8049, which is reassuring: across a few random splits the metric doesn't change much. Now we want to see which tweets are hardest to classify. We started with a baseline precisely to get a sense of what data gives the model trouble; we didn't curate a validation set because we were stumped on which aspects of the data to sort into one. Analyzing the tweets that are hardest to classify should give us a sense of which data features lead to trouble for the model.
Analyzing the Model’s “Trouble” Tweets (or Error Analysis)
First I'll store the main outputs of the training run for further analysis.
Code
import torch
import torch.nn.functional as F
from fastai.vision.all import *

# save our validation set under a new name
df_eval = dds['test']

# Get predictions and probabilities
outputs = trainer.predict(df_eval)
logits = outputs.predictions
labels = outputs.label_ids
probs = F.softmax(tensor(logits), dim=1).numpy()
preds = probs.argmax(axis=1)
Now I want to create a dataframe that has additional, potentially relevant features. We already have the training output, so these new features will be derived from it. Once I have this dataframe I can start visualizing it in different ways.
Code
# Compute per-example cross-entropy loss against the predicted class
losses = [F.cross_entropy(tensor(logit), tensor(pred)) for logit, pred in zip(logits, preds)]

def count_cap(text):
    words = text.split()
    count = 0
    for word in words:
        if word and word[0].isupper(): count += 1
    return count

# Set up a new dataframe with the results of the first pass
df_valid_results = pd.DataFrame({
    "text": [df_eval[i]["text"] for i in range(len(df_eval))],
    "keyword": [df_eval[i].get("keyword") or "n/a" for i in range(len(labels))],
    "location": [df_eval[i].get("location") or "n/a" for i in range(len(labels))],
    "label": labels,
    "cap_ratio": [count_cap(twt)/float(len(twt)) for twt in df_eval['text']],
    "pred": preds,
    "prob_1": probs[:, 1],
    "contains_link": [twt.count("http") > 0 for twt in df_eval['text']],
    "tweet_len": [len(twt) for twt in df_eval['text']],
    "hashtags": [twt.count("#") for twt in df_eval['text']],
    "is_location": [bool(loc) for loc in df_eval['location']],
    "loss": losses
})

# Tag confusion type
def label_type(row):
    if row.label == 1 and row.pred == 1: return "TP"
    elif row.label == 0 and row.pred == 0: return "TN"
    elif row.label == 1 and row.pred == 0: return "FN"
    else: return "FP"

df_valid_results["type"] = df_valid_results.apply(label_type, axis=1)
EDA on Baseline Results
You can see from the creation of this dataframe some of my ideas about relevant features. Frankly, I spent too much time going down the feature-engineering rabbit hole. The whole point of the baseline is that we can see which tweets the model finds hardest to classify. Theoretically, this will give us some insight into how to structure a more optimal input string.
The first method I tried was sorting the tweets by their per-example cross-entropy loss.
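That's a one-liner on the results dataframe (the key argument just converts the stored loss tensors to floats for sorting):
Code
# highest per-example loss first
df_valid_results.sort_values('loss', ascending=False, key=lambda s: s.map(float)).head()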
|  | text | keyword | location | label | cap_ratio | pred | prob_1 | contains_link | tweet_len | hashtags | is_location | loss | type |
|---|------|---------|----------|-------|-----------|------|--------|---------------|-----------|----------|-------------|------|------|
| … | 'Since1970the 2 biggest depreciations in CAD:USD in yr b4federal election coincide w/landslide win for opposition' http://t.co/wgqKXmby3B | landslide | n/a | 0 | 0.007299 | 1 | 0.500412 | True | 137 | 0 | False | tensor(0.6923) | FP |
| 1831 | Firepower in the lab [electronic resource] : automation in the fight against infectious diseases and bioterrorism /‰Û_ http://t.co/KvpbybglSR | bioterrorism | n/a | 0 | 0.007092 | 1 | 0.501099 | True | 141 | 0 | False | tensor(0.6910) | FP |
| 654 | @JakeGint the mass murder got her hot and bothered but at heart she was always a traditionalist. | mass%20murder | n/a | 1 | 0.000000 | 0 | 0.498703 | False | 96 | 0 | False | tensor(0.6906) | FN |
| 399 | Of what use exactly is the national Assembly? Honestly they are worthless. We are derailed. | derailed | Kwara, Nigeria | 0 | 0.043956 | 0 | 0.496643 | False | 91 | 0 | True | tensor(0.6865) | TN |
| 960 | Hat #russian soviet army kgb military #cossack #ushanka LINK:\nhttp://t.co/bla42Rdt1O http://t.co/EInSQS8tFq | military | n/a | 0 | 0.018349 | 1 | 0.503387 | True | 109 | 3 | False | tensor(0.6864) | FP |
I can immediately see a problem with this: we don't only get incorrect predictions! In fact, the highest loss value across our entire set of results belongs to a correct prediction. That means we have to sort the data by different values to find the tweets that caused the most trouble. My strategy was to sort by a new "confidence" feature and only look at the rows that were predicted incorrectly.
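A sketch of that step, assuming "confidence" is just the probability the model assigned to its predicted class; the extra columns added here (prob_true_label, incorrect, confidence) correspond to the three right-most columns of the output below:
Code
import numpy as np

# probability assigned to the true label, an incorrect flag, and the model's confidence
df_valid_results['prob_true_label'] = probs[np.arange(len(labels)), labels]
df_valid_results['incorrect'] = df_valid_results.label != df_valid_results.pred
df_valid_results['confidence'] = probs.max(axis=1)

# keep only the rows the model got wrong, most confident mistakes first
wrong = df_valid_results[df_valid_results.incorrect]
wrong.sort_values('confidence', ascending=False).head()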
|  | text | keyword | location | label | cap_ratio | pred | prob_1 | contains_link | tweet_len | hashtags | is_location | loss | type | prob_true_label | incorrect | confidence |
|---|------|---------|----------|-------|-----------|------|--------|---------------|-----------|----------|-------------|------|------|-----------------|-----------|------------|
| … | Over half of poll respondents worry nuclear disaster fading from public consciousness http://t.co/YtnnnD631z ##fukushima | nuclear%20disaster | Fukushima city Fukushima.pref | 0 | 0.008333 | 1 | 0.998309 | True | 120 | 2 | True | tensor(0.0017) | FP | 0.001691 | True | 0.998309 |
| 1877 | Angry Woman Openly Accuses NEMA Of Stealing Relief Materials Meant For IDPs: An angry Internally Displaced wom... http://t.co/6ySbCSSzYS | displaced | Nigeria | 0 | 0.110294 | 1 | 0.998276 | True | 136 | 0 | True | tensor(0.0017) | FP | 0.001724 | True | 0.998276 |
| 305 | #hot C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980 http://t.co/zY3hpdJNwg #prebreak #best | hostages | china | 0 | 0.015873 | 1 | 0.998269 | True | 126 | 3 | True | tensor(0.0017) | FP | 0.001731 | True | 0.998269 |
| 89 | Satellite Spies Super Typhoon Soudelor from Space (Photo) http://t.co/VBhu2t8wgB | typhoon | Evergreen Colorado | 0 | 0.075000 | 1 | 0.998245 | True | 80 | 0 | True | tensor(0.0018) | FP | 0.001755 | True | 0.998245 |
| 378 | Angry Woman Openly Accuses NEMA Of Stealing Relief Materials Meant For IDPs: An angry Internally Displaced wom... http://t.co/Khd99oZ7u3 | displaced | Ojodu,Lagos | 0 | 0.110294 | 1 | 0.998168 | True | 136 | 0 | True | tensor(0.0018) | FP | 0.001832 | True | 0.998168 |
These are the tweets that the model got wrong with the most confidence. Unfortunately, I never really was able to extract more meaningful patterns from them. When I compare features like keyword, location, contains_link, or hashtags, there just isn't much difference between the wrong predictions and the full dataset.
That said, the distribution of keyword values does look noticeably different in the incorrect portion of the dataset. But the only meaningful way to exploit that would be to split each of these problem keywords across the train/validation sets in some proportion. We would hope that the model could generalize from the first few examples of a keyword and predict the rest, but that doesn't seem likely. Unfortunately, the bins for each keyword are too small for the model to make meaningful generalizations.
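One rough way to eyeball that difference is to compare keyword frequencies in the wrong predictions against the whole validation set (a sketch reusing the wrong dataframe from above, not the exact comparison I ran):
Code
# share of each keyword among incorrect predictions vs. the whole validation set
wrong_kw = wrong['keyword'].value_counts(normalize=True).head(10)
all_kw = df_valid_results['keyword'].value_counts(normalize=True)
print(pd.concat([wrong_kw, all_kw.reindex(wrong_kw.index)], axis=1, keys=['wrong', 'all']))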
Model Tweaking
I think I just want to try some small tweaks now to see if we can optimize the results. I thought about using another model to get sentiment analysis for the tweets, but I’m not sure if I want to do that right now.
Code
# Some functions that expedite turning new inputs into training data
epochs = 4
bs = 128
lr = 8e-5
wd = 0.01

def get_dds(df):
    inps = "location", "keyword", "text"
    ds = Dataset.from_pandas(df).rename_column('target', 'label')
    tok_ds = ds.map(tok_func, batched=True, remove_columns=inps)
    dds = tok_ds.train_test_split(0.25, seed=52)
    return dds

def get_model():
    return AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)

def get_trainer(dds, model=None):
    if model is None: model = get_model()
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                             lr_scheduler_type='cosine', fp16=True,
                             evaluation_strategy="epoch",
                             per_device_train_batch_size=bs,
                             per_device_eval_batch_size=bs*2,
                             num_train_epochs=epochs, weight_decay=wd,
                             report_to='none')
    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=compute_metrics)
Now I can just run some crazy experiments!
Code
# Testing the effect of switching the fill string
na_fill = 'none'
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.text.fillna(na_fill)
print(df.inputs.head())

dds = get_dds(df)
get_trainer(dds).train()
0 LOC: none; KW: none; TEXT: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
1 LOC: none; KW: none; TEXT: Forest fire near La Ronge Sask. Canada
2 LOC: none; KW: none; TEXT: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
3 LOC: none; KW: none; TEXT: 13,000 people receive #wildfires evacuation orders in California
4 LOC: none; KW: none; TEXT: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
Name: inputs, dtype: object
None of these transformations seem to be helping us become more accurate. There are several more experiments that we could run, but I had another idea that I wanted to try first…
Special Tokens?
I noticed that the high-confidence incorrect predictions above all had links. I wonder what happens if I turn some subset of links, mentions, and hashtags into special tokens. I think it makes more sense to do it with mentions and links, because the text of each of those doesn't have much to do with the meaning of the tweet. A hashtag, however, does contain useful information. But I'll run some experiments and see what happens.
Code
# create new special tokens
new_toks = ["[L]", "[A]", "[X]"]
tokz.add_special_tokens({'additional_special_tokens': new_toks})
# the model's embedding matrix has to grow to cover the new token ids
model.resize_token_embeddings(len(tokz))

# remove links and replace with [L]
df.loc[:, 'mod_text'] = (
    df['text']
    .astype(str)
    .str.replace(r'https?://\S+', '[L]', regex=True)  # Match http or https links
)

na_fill = ''
df[input_col] = 'LOC: ' + df.location.fillna(na_fill) + '; KW: ' + df.keyword.fillna(na_fill) + '; TEXT: ' + df.mod_text.fillna(na_fill)

dds = get_dds(df)
get_trainer(dds, model).train()
Awesome, huge jump! This makes sense, because there's no great way to tokenize links and extract real meaning from them; the text of a shortened link doesn't tell you much about what's behind it.
OK, so as I predicted, the best combination is masking the links and mentions with a single new special token each. This makes sense: hashtags are often real words, mentions only rarely are, and links never are. It's all about tokenization. If there are pieces of data that aren't easy to tokenize, the model spends capacity trying to tokenize them anyway, or worse, it sees correlations where there aren't any.
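The cell above only shows the link replacement; the mention masking follows the same pattern (a sketch, and my regex for mentions here is an approximation):
Code
# replace @mentions with the [A] special token, mirroring the link masking above
df.loc[:, 'mod_text'] = df['mod_text'].str.replace(r'@\w+', '[A]', regex=True)
df[input_col] = ('LOC: ' + df.location.fillna(na_fill)
                 + '; KW: ' + df.keyword.fillna(na_fill)
                 + '; TEXT: ' + df.mod_text.fillna(na_fill))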
Submitting Results
I'm deciding to submit pretty early. I'm using what I know how to do so far, and I want to see how far that takes me. At the end I leave some examples of further analysis I'd consider in order to make this a more competitive result.
I rewrote the transformation as a function so we can be sure we do exactly the same thing to our test dataset.
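That function isn't shown below, but a reconstruction consistent with the best-performing input above would look roughly like this (the name transform_df comes from the cell further down; the details are assumptions):
Code
def transform_df(df, na_fill=''):
    # apply the same masking and input-string construction used on the training data
    out = df.copy()
    out['mod_text'] = (out['text'].astype(str)
                       .str.replace(r'https?://\S+', '[L]', regex=True)   # mask links
                       .str.replace(r'@\w+', '[A]', regex=True))          # mask mentions
    out[input_col] = ('LOC: ' + out.location.fillna(na_fill)
                      + '; KW: ' + out.keyword.fillna(na_fill)
                      + '; TEXT: ' + out.mod_text.fillna(na_fill))
    return out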
Code
df_test.head()
|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, stay safe everyone. |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
Code
# Transform the test dataset the same way we transformed the training data
df_test_trans = transform_df(df_test)
test_ds = Dataset.from_pandas(df_test_trans).map(tok_func, batched=True)
outputs = trainer.predict(test_ds)
print(len(test_ds), len(df))
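From there, building the submission file is just a matter of taking the argmax of the predicted logits and pairing it with the test ids (a sketch; the competition's sample_submission format expects id and target columns):
Code
import numpy as np

# turn logits into 0/1 predictions and write the submission file
test_preds = outputs.predictions.argmax(axis=1)
submission = pd.DataFrame({'id': df_test['id'], 'target': test_preds})
submission.to_csv('submission.csv', index=False)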
So, I performed right in the middle of the pack (rank 367/784), with a leaderboard score of .79895. This makes sense: I didn't really do a deep dive into the data; I mostly just wanted to get some predictions and submit them. If I have time I'll go deeper and see if I can get a higher score. Here's what I would try:
I would probably try a different model. We used deberta-v3-small for prototyping, but a larger DeBERTa, or another model entirely (maybe one pretrained specifically on tweets), would likely do better.
I would add sentiment analysis to see if that feature could help build a more robust validation set. We never quite found a feature of the data that reliably marked hard-to-classify tweets, so we couldn't construct a meaningful validation set. I'd want to research this further and see whether a better one could be created.