RusNLP: Semantic Search Engine for Russian NLP Conference Papers

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2018)

Abstract

We present RusNLP, a web service implementing a semantic search engine and recommendation system over the proceedings of three major Russian NLP conferences (Dialogue, AIST and AINL). The collected corpus spans 12 years and contains about 400 academic papers in English. The web service allows searching for publications semantically similar to arbitrary user queries or to any given paper. Search results can be filtered by author, affiliation, conference or year. They are also interlinked with the NLPub.ru service, making it easier to quickly capture the general focus of each paper. The search engine source code and the publication metadata are freely available to all interested researchers.

In the course of preparing the web service, we evaluated several well-known techniques for representing and comparing documents: TF-IDF, LDA, and Paragraph Vector. On our comparatively small corpus, TF-IDF yielded the best results and thus was chosen as the primary algorithm working under the hood of RusNLP.
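To illustrate the kind of TF-IDF similarity search described above, the following is a minimal sketch and not the RusNLP implementation (the paper's models were built with Gensim). It assumes whitespace tokenization, a smoothed IDF of log(N/df) + 1, and cosine similarity over length-normalized vectors; the paper titles, the function names and the `top_n` parameter are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real corpus holds about 400 full papers.
PAPERS = [
    "neural networks for syntax parsing of russian",
    "topic modelling of social network texts",
    "sentiment classification with neural networks",
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def weigh(tokens, idf):
    """Turn tokens into a unit-length sparse TF-IDF vector (dict term -> weight)."""
    tf = Counter(t for t in tokens if t in idf)  # ignore out-of-vocabulary terms
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """Compute IDF weights over the corpus and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per doc
    n = len(docs)
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}  # assumed smoothing
    return idf, [weigh(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def search(query, docs, idf, vectors, top_n=3):
    """Rank documents by similarity to a free-text query, dropping zero scores."""
    q = weigh(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_n] if score > 0]


idf, vectors = build_index(PAPERS)
results = search("syntax neural classification", PAPERS, idf, vectors)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

Recommending papers similar to a given paper can reuse the same index by passing that paper's text as the query. Documents that share no terms with the query receive a score of zero and are left out of the results.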

Notes

  1. http://www.dialog-21.ru/en/.

  2. https://aistconf.org/.

  3. http://ainlconf.ru/.

  4. The models were trained on our English sub-corpus, using the algorithm implementations in the Gensim library [13].

  5. Initially, the agreement levels for the TF-IDF and LDA-10 models were below 0.5. We performed a reconciliation round with the assessors discussing their choices for these models, which resulted in changing some of the scores for particular documents. This increased the inter-rater agreement, but did not influence the final ranking.

  6. https://nlpub.ru.

  7. http://aclanthology.info/.

  8. https://github.com/bakarov/rusnlp/tree/master/code/web.

  9. http://nlp.rusvectores.org/about/.

References

  1. Bakarov, A., Kutuzov, A., Nikishina, I.: Russian computational linguistics: topical structure in 2007–2017 conference papers. In: Proceedings of Dialogue-2018, online papers. ABBYY (2018), http://www.dialog-21.ru/media/4249/bakarov_kutuzov.pdf

  2. Bhagavatula, C., Feldman, S., Power, R., Ammar, W.: Content-based citation recommendation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 238–251. Association for Computational Linguistics (2018), http://aclweb.org/anthology/N18-1022

  3. Bird, S., Dale, R., Dorr, B., Gibson, B., Joseph, M., Kan, M.Y., Lee, D., Powley, B., Radev, D., Tan, Y.F.: The ACL Anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In: LREC 2008 (2008), http://www.aclweb.org/anthology/L08-1005

  4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)

  5. Faessler, E., Hahn, U.: Semedico: a comprehensive semantic search engine for the life sciences. Proceedings of ACL 2017, System Demonstrations pp. 91–96 (2017)

  6. Kern, R., Jack, K., Hristakeva, M., Granitzer, M.: Teambeam - meta-data extraction from scientific literature. In: Knoth, P., Zdrahal, Z., Juffinger, A. (eds.) Special Issue on Mining Scientific Publications, D-Lib Magazine, vol. 18, number 7/8. Corporation for National Research Initiatives (July 2012)

  7. Khoroshevsky, V.: Semantic Web, 3 (Knowledge spaces in the Internet and Semantic Web, part 3); in Russian. (Artificial Intelligence and Decision Making) pp. 3–38 (2012)

  8. Krippendorff, K.: Content analysis: An introduction to its methodology. Sage (2012)

  9. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning. pp. 1188–1196 (2014)

  10. Medlar, A., Ilves, K., Wang, P., Buntine, W., Glowacka, D.: Pulp: A system for exploratory search of scientific literature. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. pp. 1133–1136. ACM (2016)

  11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, 3111–3119 (2013)

  12. Nanni, F., Dietz, L., Faralli, S., Glavaš, G., Ponzetto, S.P.: Capturing interdisciplinarity in academic abstracts. D-lib magazine 22(9/10) (2016)

  13. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50. Valletta, Malta (May 2010)

  14. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. Journal of documentation 28(1), 11–21 (1972)

  15. Straka, M., Straková, J.: Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. pp. 88–99 (2017)

  16. Ustalov, D.: NLPub: a catalogue and a community for Russian linguistic resources. In: Selected Papers of XVI All-Russian Scientific Conference “Digital libraries: Advanced Methods and Technologies, Digital Collections”. vol. 1297, pp. 56–60. RWTH (2014)

  17. Yoneda, T., Mori, K., Miwa, M., Sasaki, Y.: Bib2vec: Embedding-based search system for bibliographic information. In: Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics. pp. 112–115. Association for Computational Linguistics (2017), http://aclweb.org/anthology/E17-3028

Acknowledgments

We thank numerous VPNs and the Tor Project. At the time of finalizing this paper, they were the only ways for Russia-based scholars to collaborate with colleagues abroad, because of Internet censorship carried out by the Russian governmental agency Roskomnadzor. It accidentally managed to temporarily block a whole bunch of academic resources, including Softconf, Overleaf, etc.

Author information

Corresponding author

Correspondence to Irina Nikishina.

Appendix A

Fig. 1. Searching scholarly papers in the database by user-provided query words (in this case, ‘syntax neural classification’).

Fig. 2. RusNLP recommending papers similar to the query paper.

Fig. 3. RusNLP searching for all papers by a certain author in the Dialogue and AIST conferences.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Nikishina, I., Bakarov, A., Kutuzov, A. (2018). RusNLP: Semantic Search Engine for Russian NLP Conference Papers. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2018. Lecture Notes in Computer Science, vol 11179. Springer, Cham. https://doi.org/10.1007/978-3-030-11027-7_11

  • DOI: https://doi.org/10.1007/978-3-030-11027-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11026-0

  • Online ISBN: 978-3-030-11027-7

  • eBook Packages: Computer Science (R0)
