Abstract—
The paper surveys the existing digital Russian-language thesauri and the methods of their automatic construction and application. The authors have analyzed the main characteristics of thesauri published in open access for scientific research and evaluated both their development trends and their effectiveness in solving natural language processing tasks. The authors have also studied statistical and linguistic methods of thesaurus construction that automate thesaurus development and reduce the labor costs of expert linguists. In particular, algorithms for extracting keywords and semantic thesaurus relations of all types have been considered, and the quality of the thesauri generated with these tools has been assessed. To illustrate the features of various methods of constructing thesaurus relations, the authors developed a combined method that fully automatically generates a specialized thesaurus from a text corpus of a selected domain and several existing linguistic resources. The proposed method was used in experiments on two Russian-language text corpora from two different domains: articles on migration and tweets. The resulting thesauri were analyzed using an integrated assessment, developed by the authors in a previous study, that characterizes various aspects of a thesaurus and appraises the quality of the methods used to generate it. The analysis revealed the main advantages and disadvantages of various approaches to thesaurus construction and to the extraction of semantic relations of different types, and also made it possible to identify potential focus areas for future research.
Ethics declarations
The authors declare that they have no conflicts of interest.
Additional information
Translated by A. Ovchinnikova
About this article
Cite this article
Lagutina, N.S., Lagutina, K.V., Adrianov, A.S. et al. Russian-Language Thesauri: Automatic Construction and Application for Natural Language Processing Tasks. Aut. Control Comp. Sci. 53, 705–718 (2019). https://doi.org/10.3103/S0146411619070149
DOI: https://doi.org/10.3103/S0146411619070149