The project entitled “Contextual understanding of written language technology for error correction and automatic evaluation of text intelligibility” is one of our most important areas of activity. Carrying it out has led us to many conclusions that are essential to our further research.
Based on an analysis of literary and non-literary texts against the project's design criteria, Literacka Sp. z o. o. collected the source material. The key step was selecting texts from the following areas: Polish and general literature, press texts (image, industry, lifestyle, popular science), scientific articles, school textbooks, utility texts, legal, official and legal-official texts, and Wikipedia.
The main challenge of the analysis was to keep the texts thematically diverse. The database could not contain several texts on the same topic or describing the same event. The texts also had to cover a range of difficulty levels as measured by the FOG index (a readability index that estimates how accessible a text is). While meeting all of these assumptions, the database also had to be divided proportionally.
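To make the difficulty criterion concrete, the FOG index is commonly computed as 0.4 × (average sentence length in words + percentage of complex words). The sketch below is only an illustration of that formula, not the project's implementation. The function names, the vowel-group syllable heuristic and the default threshold of three syllables (Gunning's original definition for English) are assumptions; adaptations for other languages, Polish included, may use a different threshold, and Gunning's exclusions (proper nouns, compounds, common suffixes) are omitted.

```python
import re

# Assumption: a syllable is approximated by a run of consecutive vowels.
VOWELS = "aeiouyąęó"


def count_syllables(word: str) -> int:
    """Rough syllable count: the number of vowel groups, at least 1."""
    groups = re.findall(f"[{VOWELS}]+", word.lower())
    return max(1, len(groups))


def fog_index(text: str, complex_threshold: int = 3) -> float:
    """Simplified FOG index: 0.4 * (words per sentence + % complex words)."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters only (digits and underscores are excluded).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    complex_words = [w for w in words if count_syllables(w) >= complex_threshold]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)
```

A score computed this way could be used to sort candidate texts into difficulty bands before checking that each band is proportionally represented; the band boundaries themselves are not given in the description above.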
The prepared source material was tokenized so that it could be used to build a cloze test that meets the project requirements. The guiding principle was that the gaps introduced into the texts should not significantly disturb the cause-and-effect order of the text and that the user should be able to fill them in. A list of methodological assumptions for introducing gaps was also prepared, comprising a list of non-convertible tokens and a list of rules that block the replacement of text with gaps.
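The gap-insertion procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual method: the regex tokenizer, the fixed gap interval `step`, the sample `NON_CONVERTIBLE` entries and the two blocking rules are all hypothetical, since the real lists are not given here.

```python
import re
from typing import Callable

GAP = "_____"

# Hypothetical examples of tokens that must never be replaced by a gap.
NON_CONVERTIBLE = {"the", "a", "not"}

# A blocking rule receives the token list and an index, and returns True
# when the token at that index must NOT be turned into a gap.
Rule = Callable[[list[str], int], bool]


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def is_not_word(tokens: list[str], i: int) -> bool:
    """Hypothetical rule: block numbers, punctuation and mixed tokens."""
    return not tokens[i].isalpha()


def is_capitalized(tokens: list[str], i: int) -> bool:
    """Hypothetical rule: block capitalized tokens (names, sentence starts)."""
    return tokens[i][0].isupper()


def make_cloze(
    text: str,
    step: int = 5,
    non_convertible: set[str] = NON_CONVERTIBLE,
    rules: tuple[Rule, ...] = (is_not_word, is_capitalized),
) -> tuple[list[str], list[str]]:
    """Replace roughly every `step`-th eligible token with a gap.

    Returns the token list with gaps and the list of removed answers.
    """
    tokens = tokenize(text)
    output = list(tokens)
    answers: list[str] = []
    since_last_gap = 0
    for i, tok in enumerate(tokens):
        since_last_gap += 1
        if since_last_gap < step:
            continue
        blocked = tok.lower() in non_convertible or any(r(tokens, i) for r in rules)
        if blocked:
            continue  # try the next token instead of forcing a gap here
        answers.append(tok)
        output[i] = GAP
        since_last_gap = 0
    return output, answers
```

For example, `make_cloze("The quick brown fox jumps over the lazy dog.", step=3)` replaces `brown`, `over` and `dog` and leaves every other token in place. When a token is blocked, the sketch moves on to the next eligible token rather than skipping the gap entirely, which keeps gap density roughly constant; whether the project made the same choice is not stated.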