#nlp and #machinelearning for the needs of the book market

What have we achieved so far in our work on #nlp and #machinelearning for the book market? Here it is!

The joint work of language and literature specialists and data scientists at Literacka and Ermlab has resulted in the release of the Polish RoBERT model, built on BERT, a machine learning technique for NLP pre-training.

What is the essence of the BERT algorithm and of the Polish RoBERT model we developed? BERT is a groundbreaking natural language processing algorithm developed by Google. In downstream tasks it can, among other things, label semantic roles, classify text, and disambiguate polysemous (ambiguous) words in a text. Language analysis based on machine learning serves a goal that a few years ago seemed a complete abstraction: the search for the “bestseller code”, i.e. a comprehensive, multi-level analysis of a text in terms of its quality, style, literary form, and the many other features that give a work more than solid craftsmanship and make readers eager to buy it.

The RoBERT algorithm “learned” our beautiful and complex language from Polish Wikipedia and from a large, previously collected body of literary works. In total, during “training”, RoBERT processed over 2 billion words (!), amounting to 15 billion characters.
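As a rough sanity check on these figures, 15 billion characters spread over 2 billion words comes to about 7.5 characters per word, spaces and punctuation included. Below is a minimal sketch of how such corpus statistics could be counted, assuming plain whitespace tokenization. The function name `corpus_stats` and the two sample lines are illustrative and not taken from the project.

```python
def corpus_stats(lines):
    """Count whitespace-separated words and characters in an iterable of text lines."""
    words = 0
    chars = 0
    for line in lines:
        words += len(line.split())  # whitespace tokenization (an assumption)
        chars += len(line)          # every character counts, spaces included
    return words, chars


# Hypothetical two-line sample; the real corpus is far larger.
sample = ["Litwo, Ojczyzno moja!", "Ty jesteś jak zdrowie."]
words, chars = corpus_stats(sample)

# Figures quoted in the text: 15 billion characters, 2 billion words.
avg_chars_per_word = 15_000_000_000 / 2_000_000_000
print(f"sample: {words} words, {chars} characters")
print(f"quoted corpus: about {avg_chars_per_word} characters per word")
```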

Krzysztof Sopyla, the project's technical lead, describes the work on the RoBERT model and its conclusions:

The main aim of the work was to examine how text quality and the number of pre-training steps affect the results.
– good text matters: literary texts and a heuristically cleaned OSCAR corpus improved the results
– you don’t need many pre-training steps to get a good model, especially when fine-tuning follows
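The post does not say which heuristics were used to clean the corpus, so the following is only a sketch of what heuristic line filtering can look like in general. The thresholds, the rules, and the names `looks_like_good_sentence` and `filter_corpus` are all assumptions made for illustration, not the project's actual method.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 4
MIN_LETTER_RATIO = 0.7


def looks_like_good_sentence(line):
    """Return True if the line passes a few simple quality heuristics."""
    text = line.strip()
    if len(text.split()) < MIN_WORDS:
        return False  # too short to be a full sentence (also rejects empty lines)
    letters_and_spaces = sum(ch.isalpha() or ch.isspace() for ch in text)
    if letters_and_spaces / len(text) < MIN_LETTER_RATIO:
        return False  # mostly digits, symbols, or markup
    if text[-1] not in ".!?":
        return False  # does not end like a sentence
    return True


def filter_corpus(lines):
    """Keep only the lines that pass the heuristics above."""
    return [line for line in lines if looks_like_good_sentence(line)]


raw = ["Ala ma kota i psa.", "Menu", "Kup teraz >>> 50% -20% !!!"]
print(filter_corpus(raw))
```

Filters of this kind are cheap to run over billions of characters, which is one reason they are a common first step before pre-training.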

In addition, we are releasing a cleaned version of the OSCAR corpus and complete notebooks with the code for data preparation and training.

Models available from

Results on KLEJ

