In part 1 of my series on transformers, I'm going to go over implementing a neural machine translation model using PyTorch's new nn.Transformer module.
Transformers, introduced in the paper Attention is All You Need, have no inherent sense of time. They instead rely on positional encoding to encode the order of elements. This gives the transformer architecture an important advantage over other language models such as recurrent neural networks: it is parallelizable and easy to scale. This has allowed huge models such as the 1.5-billion-parameter GPT-2 to achieve state-of-the-art performance on language modelling.
Now, with the release of PyTorch 1.2, we can build transformers in PyTorch! We'll go over the basics of the transformer architecture and how to use nn.Transformer. In a transformer, the source sentence passes through a stack of encoders to become the memory. Then the target sentence and the memory pass through a stack of decoders, which output the translated sentence.
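That flow can be sketched in a few lines. This is just a shape check, with arbitrary toy hyperparameters, using nn.Transformer's encoder and decoder submodules directly:

```python
import torch
import torch.nn as nn

# Toy hyperparameters, purely for illustration.
transformer = nn.Transformer(d_model=32, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, 32)  # (source length, batch, d_model)
tgt = torch.rand(7, 1, 32)   # (target length, batch, d_model)

# src -> encoder stack -> memory; (tgt, memory) -> decoder stack -> output.
memory = transformer.encoder(src)
output = transformer.decoder(tgt, memory)

print(memory.shape)  # torch.Size([10, 1, 32])
print(output.shape)  # torch.Size([7, 1, 32])
```

Note that nn.Transformer expects sequence-first tensors, i.e. (seq_len, batch, d_model).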
First, we tokenize the input data, pad the array if necessary, and convert the tokens to embeddings.
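A minimal sketch of this preprocessing step, with a hypothetical toy vocabulary (any real pipeline would build the vocabulary from the training corpus):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary for illustration only.
vocab = {'<pad>': 0, '<sos>': 1, '<eos>': 2, 'i': 3, 'like': 4, 'cats': 5}
d_model = 32
max_len = 10

tokens = ['<sos>', 'i', 'like', 'cats', '<eos>']
ids = [vocab[t] for t in tokens]
ids = ids + [vocab['<pad>']] * (max_len - len(ids))  # pad to fixed length

# padding_idx keeps the pad embedding at zero.
embed = nn.Embedding(len(vocab), d_model, padding_idx=vocab['<pad>'])
x = embed(torch.tensor(ids).unsqueeze(0))  # (batch, seq, d_model)
print(x.shape)  # torch.Size([1, 10, 32])
```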
Now we add the positional encoding to the sentences in order to give some order to the words. In Attention is All You Need, they use sine and cosine functions so that the encoding generalizes to longer sequence lengths.
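A PositionalEncoding module along these lines (essentially the standard sinusoidal implementation from the PyTorch tutorials) is what the full model below assumes:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from Attention is All You Need.
    Expects input of shape (seq_len, batch, d_model)."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
        self.register_buffer('pe', pe.unsqueeze(1))   # (max_len, 1, d_model)

    def forward(self, x):
        # Add the encoding for the first x.size(0) positions.
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
```

Because the encodings are fixed functions of position rather than learned parameters, they are registered as a buffer, not a parameter.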
Masking in the encoder is required to make sure any padding doesn't contribute to the self-attention mechanism. In PyTorch, this is done by passing src_key_padding_mask to the transformer. For the example, this looks like [False, False, False, False, False, False, False, True, True, True] where the True positions should be masked. The output of the encoder is called memory.
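Building that mask is a one-liner once you know the pad token's id. A sketch, assuming pad id 0 and the hypothetical token ids below:

```python
import torch

pad_idx = 0
# Hypothetical batch of token ids: 7 real tokens, 3 pads. Shape (batch, seq).
src = torch.tensor([[5, 7, 2, 9, 4, 3, 8, 0, 0, 0]])

# True marks positions that attention should ignore.
src_key_padding_mask = (src == pad_idx)
print(src_key_padding_mask)  # -> True at the three padded positions
```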
Now we can move onto the decoder architecture. The initial steps are very similar to that of the encoder. We embed and pass all but the very last token of each sentence into the decoders.
We then pass these sequences through the stack of decoders. In each decoder, the sequences propagate through self-attention and then attention with the memory (from the encoder). So the decoder requires three masks:
- tgt_mask: Used in the self-attention, it ensures the decoder doesn't attend to future tokens in the sequence. This looks like [[0 -inf -inf ... ], [0 0 -inf ...] ... [0 0 0 ...]]
- tgt_key_padding_mask: Also used in the self-attention, it ensures that the padding in the target sequence isn't accounted for.
- memory_key_padding_mask: Used in the attention with the memory, it ensures the padding in the memory isn't used. This is the same as the src_key_padding_mask from the encoder.
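A sketch of building the two new masks (nn.Transformer also ships a generate_square_subsequent_mask helper that produces the same causal mask):

```python
import torch

pad_idx = 0
tgt = torch.tensor([[1, 3, 4, 5, 0]])  # (batch, seq), hypothetical token ids

# Causal mask: 0 on and below the diagonal, -inf above, so position i
# can only attend to positions <= i.
sz = tgt.size(1)
tgt_mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

# Padding mask: True marks positions attention should ignore.
tgt_key_padding_mask = (tgt == pad_idx)
# memory_key_padding_mask is just src_key_padding_mask reused.
```

Note the different conventions: the additive tgt_mask uses -inf for blocked positions, while the boolean key-padding masks use True.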
Afterwards, we pass each of the output sequences through a fully connected layer that outputs a score for each token in the vocabulary.
And here is the completed LanguageTransformer class!
```python
import math
import torch.nn as nn
from einops import rearrange

class LanguageTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_encoder_layers,
                 num_decoder_layers, dim_feedforward, max_seq_length,
                 pos_dropout, trans_dropout):
        super().__init__()
        self.d_model = d_model
        self.embed_src = nn.Embedding(vocab_size, d_model)
        self.embed_tgt = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model, pos_dropout, max_seq_length)
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers,
                                          num_decoder_layers, dim_feedforward,
                                          trans_dropout)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_key_padding_mask, tgt_key_padding_mask,
                memory_key_padding_mask, tgt_mask):
        # nn.Transformer expects (seq, batch), so swap the first two dims.
        src = rearrange(src, 'n s -> s n')
        tgt = rearrange(tgt, 'n t -> t n')
        # Scale embeddings by sqrt(d_model) as in the paper, then add
        # positional encodings.
        src = self.pos_enc(self.embed_src(src) * math.sqrt(self.d_model))
        tgt = self.pos_enc(self.embed_tgt(tgt) * math.sqrt(self.d_model))

        output = self.transformer(
            src, tgt, tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=memory_key_padding_mask)

        # Back to batch-first, then project to vocabulary logits.
        output = rearrange(output, 't n e -> n t e')
        return self.fc(output)
```
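As a sanity check on the same plumbing, here is a self-contained smoke test of embed → transformer → linear with hypothetical sizes (positional encoding and padding masks omitted for brevity):

```python
import math
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to make the forward pass cheap.
vocab_size, d_model, nhead = 100, 32, 4
batch, src_len, tgt_len = 2, 10, 7

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model, nhead, num_encoder_layers=2,
                             num_decoder_layers=2, dim_feedforward=64)
fc = nn.Linear(d_model, vocab_size)

src = torch.randint(1, vocab_size, (src_len, batch))  # (seq, batch)
tgt = torch.randint(1, vocab_size, (tgt_len, batch))
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')),
                      diagonal=1)

out = transformer(embed(src) * math.sqrt(d_model),
                  embed(tgt) * math.sqrt(d_model), tgt_mask=tgt_mask)
logits = fc(out)
print(logits.shape)  # torch.Size([7, 2, 100])
```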
I used the Tatoeba dataset, a small dataset with around 160,000 English-to-French sentence pairs, available here.
Here are the results of training for 20 epochs:
My model achieves a validation loss of 0.99. However, it starts overfitting around epoch 15, based on the validation loss rising above the training loss. And finally, some results of translating sentences:
I am giving you a gift.: Je vous donne un cadeau.
How did you find that?: Comment l'as-tu trouvée?
I'm going to run to your house.: Je vais courir à votre maison.
Some improvements that could be made:
- Using beam search to translate sentences
- Running the model on larger datasets
- Using torchtext instead of hacking my own dataset class to get more consistent batches
- Using a smoothed loss (e.g. label smoothing)
My code is located here.