In part 2 of this post, I'm going to go over Hugging Face's PyTorch transformers library, located here.
This library provides over 30 pretrained, state-of-the-art transformer models for a variety of languages. Like convolutional neural networks, a transformer trained on one linguistic dataset can easily be retrained on another. Using a pretrained model rather than starting from scratch greatly reduces training time and can sometimes even improve accuracy.
In this tutorial, I'll be fine-tuning a DistilBert model to predict the sentiment of IMDB movie reviews. DistilBert is a smaller version of the BERT model that retains most of BERT's performance while requiring much less training. More details are in Hugging Face's blog post.
I'm using an IMDB movie reviews dataset, which contains a list of movie reviews, each labeled with either a "positive" or "negative" sentiment.
To use Hugging Face's pretrained models, we have to use their provided tokenizer. Because determining the sentiment of a review doesn't rely on stop words such as 'and', 'or', or 'at', we remove them.
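The stop-word filtering can be sketched roughly as follows. The stop-word set here is a small illustrative sample (the real list would be much longer, e.g. NLTK's English stop words), and in the actual pipeline the splitting is done by the pretrained model's tokenizer rather than `str.split`:

```python
# Toy stop-word set for illustration; a real run would use a full list.
STOP_WORDS = {"and", "or", "at", "the", "a", "an", "of", "to", "is"}

def preprocess(review):
    # Lowercase, split into tokens, and drop stop words.
    # (The actual code tokenizes with the Hugging Face tokenizer instead.)
    return [tok for tok in review.lower().split() if tok not in STOP_WORDS]

preprocess("The movie was a waste of time and money")
# -> ['movie', 'was', 'waste', 'time', 'money']
```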
Now, in the dataset class, we prepend the start token [CLS], insert padding, and convert the tokenized words to indices.
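A minimal sketch of that encoding step, using a toy vocabulary and hypothetical special-token ids; the real dataset class uses the Hugging Face tokenizer's vocabulary and id mapping instead:

```python
MAX_LEN = 8   # fixed sequence length for this sketch
PAD_ID = 0    # hypothetical [PAD] id
CLS_ID = 1    # hypothetical [CLS] id
UNK_ID = 2    # hypothetical [UNK] id

def encode(tokens, vocab):
    # Prepend [CLS], then map each token to its vocabulary index.
    ids = [CLS_ID] + [vocab.get(t, UNK_ID) for t in tokens]
    # Truncate to the maximum length, then right-pad with PAD_ID.
    ids = ids[:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

encode(["great", "movie"], {"great": 3, "movie": 4})
# -> [1, 3, 4, 0, 0, 0, 0, 0]
```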
Instantiating the DistilBert model is as simple as importing the class.
To train the model, all we have to do is pass the reviews and labels to the model and we get our losses back!
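The shape of that loop can be sketched with a stand-in module; with the actual Hugging Face model, passing `labels` to the forward call returns the loss directly, so no separate loss function is needed:

```python
import torch
from torch import nn

# Stand-in for the DistilBert classifier so the sketch is self-contained.
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()  # HF models compute this internally when labels are passed
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

batch = torch.randn(4, 8)            # stand-in for a batch of encoded reviews
labels = torch.tensor([0, 1, 1, 0])  # 0 = negative, 1 = positive

for epoch in range(3):
    optimizer.zero_grad()
    logits = model(batch)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
```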
After 3 epochs, the pretrained transformer reaches a validation loss of 0.262 and a validation accuracy of 0.9021.
I trained other common language models for comparison, including an LSTM and a transformer implemented with nn.Transformer. As shown in the graph, the Hugging Face transformer still edges out the others in accuracy.
DistilBert took by far the longest to train, at approximately — min, likely because it has the most parameters, followed by nn.Transformer and then the LSTM.
Some further improvements that could be made to this model:
- Improving the dataloader by allowing variable-length batches, to reduce the memory wasted on padding
- Optimizing the parameters further, such as by adding differential learning rates
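The variable-length-batch idea could look something like the collate function below, which pads each batch only to its own longest sequence instead of a global maximum (a hypothetical sketch; in PyTorch it would be passed as `collate_fn` to the DataLoader):

```python
PAD_ID = 0  # hypothetical [PAD] id

def collate(batch):
    # Pad every sequence in the batch to the length of the longest one,
    # rather than to a fixed global maximum length.
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

collate([[1, 3], [1, 3, 4, 5]])
# -> [[1, 3, 0, 0], [1, 3, 4, 5]]
```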
My code is located here.