
Application of the BERT architecture to the NER extraction task

With this article, we begin a series of posts on the research methods and research-and-development work carried out by the Literacka Data Science team. Usually only the results of this work are visible, so it is worth also explaining how we arrive at them. We describe our processes, explain concepts, and present the successive steps of our work. Read on – we start with NER analysis.

NER (Named Entity Recognition) stands for the recognition of named entities. It is one of the tasks of information extraction from text: it locates named entities and classifies them. These named entities can be people, locations, organizations, dates and times, and so on.

Example:

The IOB (Inside-Outside-Beginning) format is most often used to mark named entities in text. The first token of a given entity receives the prefix "B", each subsequent token of that entity the prefix "I", and "O" is used when a token is not part of any named entity.

In the example above, there are three different entity types:
EMOSTATE – denoting emotional states;
PERSON – denoting persons;
GPE – denoting places that can be geolocated.
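
As a hand-made illustration of the IOB format with these labels (the sentence, tokenization, and tags below are invented for this sketch, not taken from the original example), an annotation could look like this:

```python
# Invented example: a sentence annotated in IOB format with the labels
# discussed above (PERSON, EMOSTATE, GPE). "B-" marks the first token of an
# entity, "I-" its continuation, and "O" marks tokens outside any entity.
tokens = ["Anna", "Kowalska", "felt", "great", "joy", "while", "visiting", "Olsztyn", "."]
labels = ["B-PERSON", "I-PERSON", "O", "B-EMOSTATE", "I-EMOSTATE", "O", "O", "B-GPE", "O"]

for token, label in zip(tokens, labels):
    print(f"{token:>10}  {label}")
```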

Approaches to NER extraction have changed over the years along with the development of NLP (Natural Language Processing) as a whole. As in NLP generally, there were first rule-based methods, then statistical methods based on machine learning, and finally neural approaches based on deep learning. Currently, methods based on neural networks are clearly in the lead.

Initially, RNNs (Recurrent Neural Networks) based on GRU (Gated Recurrent Unit) or LSTM (Long Short-Term Memory) cells were used for this purpose. Everything changed, however, with the publication of the article Attention Is All You Need [1] by a Google team and the Transformer architecture described in it, based on the Multi-Head Attention mechanism. It was a revolution in NLP at large. This architecture allows for parallelization, and therefore a significant acceleration of computation, while also significantly improving results. The Transformer was a seq2seq model based on the encoder-decoder architecture and was originally used for translation between natural languages. Its building blocks, however, made it possible to construct modern language models. The GPT model, for example, was built from Transformer decoder blocks.
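
As a rough sketch of the Multi-Head Attention block at the heart of the Transformer (not code from the article; the dimensions below are arbitrary), PyTorch's built-in module can be used as follows:

```python
import torch
import torch.nn as nn

# Minimal self-attention sketch using PyTorch's built-in Multi-Head Attention.
# The embedding size and number of heads are arbitrary, for illustration only.
embed_dim, num_heads = 768, 12
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 16, embed_dim)      # (batch, sequence length, embedding dim)
output, weights = attention(x, x, x)   # self-attention: query = key = value
print(output.shape)                    # torch.Size([2, 16, 768])
```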

The BERT model mentioned in the title consists of a stack of Transformer encoders connected in series. BERT was first described in the article BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2] by Google researchers. The BERT architecture made it possible to create pre-trained language models that learn linguistic dependencies from large corpora of text. An example of such a language model is PoLitBert – a model trained, among others, on a corpus composed of literature.

Having a ready, pre-trained language model with the BERT architecture at our disposal, we can fine-tune it to a specific task. The NER extraction mentioned in the title is one such task. In our case, fine-tuning consisted of adding extra layers on top of the pre-trained model: a fully connected layer ending with the ReLU (Rectified Linear Unit) activation function, followed by a classification layer whose size corresponds to the number of entity labels considered. The architecture modified in this way, after choosing an optimizer, a loss function, and hyperparameter values, is trained on a prepared dataset containing text annotated for NER extraction. The trained model is then capable of extracting named entities from a given text.
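
A minimal sketch of this setup, assuming the Hugging Face transformers library and a placeholder checkpoint name (not the actual PoLitBert identifier), might look as follows:

```python
import torch.nn as nn
from transformers import AutoModel

NUM_LABELS = 9  # e.g. B-/I- tags for PERSON, GPE, DATE, EMOSTATE plus "O"

class BertForNer(nn.Module):
    """Pre-trained BERT-style encoder + fully connected layer with ReLU
    + a classification layer sized to the number of NER labels."""

    def __init__(self, checkpoint="pretrained-polish-bert", hidden_size=768):
        # "pretrained-polish-bert" is a placeholder checkpoint name.
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, NUM_LABELS),  # one score per IOB label
        )

    def forward(self, input_ids, attention_mask):
        # Token-level representations from the last encoder layer
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)  # (batch, tokens, NUM_LABELS)
```

Such a model would then be trained with the chosen optimizer and a token-level loss (typically cross-entropy) on the annotated dataset.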

Below is an example of a pre-trained PoLitBert model fine-tuned to recognize NER labels such as PERSON, GPE, DATE and EMOSTATE:
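
A hedged sketch of how such a fine-tuned model could be queried, assuming the transformers token-classification pipeline; the checkpoint name is a placeholder and the commented output is invented only to show the shape of the result:

```python
from transformers import pipeline

# The checkpoint name below is a placeholder, not a real PoLitBert NER model.
ner = pipeline(
    "token-classification",
    model="fine-tuned-polish-ner",
    aggregation_strategy="simple",  # merge B-/I- pieces into whole entity spans
)

print(ner("Anna Kowalska poczuła radość, spacerując po Olsztynie."))
# Illustrative output shape (values invented):
# [{'entity_group': 'PERSON', 'word': 'Anna Kowalska', ...},
#  {'entity_group': 'EMOSTATE', 'word': 'radość', ...},
#  {'entity_group': 'GPE', 'word': 'Olsztynie', ...}]
```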

What do NERs give publishers? They make it possible to pre-determine the nature of a text, to estimate what a given work may be about, which readers it will interest, and even what emotions reading it will arouse. Importantly, the publisher receives this information before reading the text themselves. Thanks to NER, a publisher can quickly assess whether a given text fits their profile. This analysis helps to decide which works (in the flood of publishing offers) to deal with first. We write more about this in a separate article.

Author of the text: Sebastian Jankowski, Data Scientist and Python programmer at Literacka.

Bibliography:

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention is all you need, Advances in Neural Information Processing Systems, (2017), 6000-6010.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, North American Chapter of the Association for Computational Linguistics (NAACL), (2019).
