We invite you to read the next article on the research and development work of the Data Science team at Literacka. The Transformer architecture, the main topic of this post, was a breakthrough: modern language-processing models such as BERT and GPT-2 were built on top of it. We have fine-tuned BERT for the task of extracting specific groups of words from text (so-called NER, Named Entity Recognition), and we plan to use GPT-2 to generate publishing notes for books.
The Transformer was first presented in the paper Attention Is All You Need [1]. It is a seq2seq model based on an encoder-decoder architecture that uses the multi-head attention mechanism. It was proposed by Google and was originally applied to translation between natural languages. On this task it achieved better results than traditional recurrent methods (RNN, LSTM, GRU), and, because its computations can be carried out in parallel, it required much less training time. Its appearance was a revolution in NLP at large: Transformer components made it possible to build modern language models such as BERT and GPT.
The figure below shows the model architecture:
Note that this is only a simplified diagram. The number of encoder layers stacked in series, as well as the number of decoder layers, is equal to N. In the original paper, N = 6.
We will now describe the individual elements of the Transformer.
EMBEDDING
As in most NLP tasks, the model requires a numerical representation of tokens, the so-called embedding. It consists of placing words in a vector space of a predetermined dimension. In Attention Is All You Need, 512-dimensional embeddings were used; in current language models, 768 or 1024 dimensions are the standard. However, before the embeddings of the individual tokens reach the encoder, a vector of the same dimension is added to each of them. Because the model under consideration is not recurrent, i.e. it does not consume the tokens of a sentence sequentially but all at once, we must inject some information about the position of the tokens in the sentence. For this purpose Positional Encoding is used, which encodes the position of each token and the relative distances between tokens in a sentence.
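As an illustration, here is a minimal sketch of the sinusoidal positional encoding proposed in [1] (NumPy only; the function name and the max_len / d_model parameter names are our own, illustrative choices):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings, as in [1]."""
    positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model / 2), the even indices 2i
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions
    return pe

# The encoding is simply added to the token embeddings of the same dimension:
# inputs = token_embeddings + sinusoidal_positional_encoding(max_len, 512)
```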
ENCODER
The input to the first encoder consists of the embedded sentence tokens with positional encoding added. The input sentence is thus represented by a matrix with dimensions (maximum_number_of_tokens_in_sentence, embedding_dimension). The maximum number of tokens in a sentence is a fixed hyperparameter, and shorter sentences are padded with empty tokens so that each of them has the same fixed length. The input to each subsequent encoder in the stack is the output of the previous one. The encoders preserve these dimensions, so the architecture of each of them can be identical.
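A minimal sketch of this padding step and the resulting input shape (the token ids, PAD_ID and MAX_TOKENS below are hypothetical; real models use the padding token defined by their tokenizer):

```python
import numpy as np

MAX_TOKENS = 8       # hypothetical fixed maximum number of tokens per sentence
EMBEDDING_DIM = 512  # embedding dimension used in the original paper
PAD_ID = 0           # hypothetical id of the empty (padding) token

def pad_sentence(token_ids):
    """Truncate or pad a list of token ids to exactly MAX_TOKENS entries."""
    token_ids = token_ids[:MAX_TOKENS]
    return token_ids + [PAD_ID] * (MAX_TOKENS - len(token_ids))

sentence = [17, 42, 5]                     # hypothetical token ids of a short sentence
padded = pad_sentence(sentence)            # [17, 42, 5, 0, 0, 0, 0, 0]

# After the embedding lookup, the sentence becomes a (MAX_TOKENS, EMBEDDING_DIM) matrix:
embedding_table = np.random.rand(1000, EMBEDDING_DIM)  # toy embedding table, vocabulary of 1000
encoder_input = embedding_table[padded]
print(encoder_input.shape)                 # (8, 512)
```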
The first stage of an encoder pass is the Multi-Head Attention layer. It is shown schematically in the figures below (Multi-Head Attention on the right, Scaled Dot-Product Attention on the left):
The model under consideration, unlike recurrent architectures, processes all positions of the input sentence in parallel, at the same time. The attention mechanism then allows it to find the dependencies of individual tokens on one another, which enables them to be encoded better.
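As an illustration, a minimal sketch of the Scaled Dot-Product Attention at the core of Multi-Head Attention (single head, no masking, NumPy only; the toy shapes are our own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (tokens, tokens) token-to-token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each token becomes a weighted mix of all values

# Toy example: 8 tokens, one 64-dimensional head
tokens, d_k = 8, 64
Q, K, V = (np.random.rand(tokens, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 64)
```

In Multi-Head Attention this operation is carried out h times in parallel on different learned linear projections of the input, and the results are concatenated and projected back to the model dimension.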
The second stage is a pass through a two-layer neural network with a ReLU activation between the first and second layer. Moreover, after both the attention block and this feed-forward block there is a residual connection followed by a normalization layer. The output obtained in this way is passed to the next encoder in the stack.
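A minimal sketch of this second stage, i.e. the position-wise feed-forward block with a residual connection and layer normalization (NumPy, random weights purely for illustration; the inner dimension of 2048 follows the original paper):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (learned scale/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise two-layer network with ReLU, applied to every token independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU between the two layers
    return hidden @ W2 + b2

d_model, d_ff, tokens = 512, 2048, 8        # 2048 is the inner size used in the original paper
x = np.random.rand(tokens, d_model)         # output of the attention sub-layer
W1, b1 = np.random.rand(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.rand(d_ff, d_model), np.zeros(d_model)

# Residual connection + normalization, as after each encoder sub-layer
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(out.shape)                            # (8, 512): the encoder preserves the dimensions
```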
DECODER
The architecture of the decoder is very similar to that of the encoder discussed above; however, the tokens are processed sequentially, so each output position may only attend to the positions before it. Each decoder is additionally fed the output of the last encoder. The flow through the entire Transformer is illustrated in the figure below:
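This sequential behaviour is enforced by a causal mask in the decoder's self-attention; a minimal sketch is given below (the full decoder additionally attends to the encoder output through a separate attention sub-layer):

```python
import numpy as np

def causal_mask(tokens):
    """Mask that blocks every position from attending to later positions."""
    # 0 where attention is allowed, -inf where it must be suppressed before the softmax
    return np.triu(np.full((tokens, tokens), -np.inf), k=1)

def masked_scores(Q, K, mask):
    """Raw decoder self-attention scores with the causal mask added before the softmax."""
    return Q @ K.T / np.sqrt(Q.shape[-1]) + mask

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```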
SUMMARY
Transformer components and the models based on them lead the way in modern NLP. However, due to the computational complexity of the Self-Attention mechanism, they are limited in the number of tokens they can accept as input; typically 512 or 1024 tokens is the standard. For tasks involving the analysis of long texts, such as generating summaries, this is definitely not enough.
In the next article, we will present methods of modifying the model architecture that make it possible to work with long texts, such as books.
Author: Sebastian Jankowski, Data Scientist and Python programmer at Literacka
BIBLIOGRAPHY
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. Attention Is All You Need, Advances in Neural Information Processing Systems, (2017), 6000-6010.
[2] Jay Alammar. The Illustrated Transformer, http://jalammar.github.io/illustrated-transformer/, (2018)