Comparison of Cross-Lingual Similar Documents Retrieval Methods

  • Conference paper
Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020)

Abstract

In this paper, we compare methods for cross-lingual similar document retrieval for a distant language pair, Russian and English. The methods compared include classical Cross-Lingual Explicit Semantic Analysis (CL-ESA), machine translation methods, and approaches based on cross-lingual embeddings. We introduce two datasets for evaluating this task: Russian-English aligned Wikipedia articles and an automatically translated version of Paraplag. Our experiments show that an approach based on an inverted index, with an extra step that maps the top keywords from one language to the other using cross-lingual word embeddings, achieves higher recall and mean average precision (MAP) than the other methods on both datasets.
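The best-performing approach named in the abstract (inverted-index retrieval preceded by keyword mapping through cross-lingual word embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the function names (`map_keywords`, `build_inverted_index`, `retrieve`), the `n_neighbors` parameter, and the simple term-overlap scoring are all assumptions made for illustration. The two helper functions at the end compute recall at a cutoff and average precision for one query; MAP is the mean of average precision over all queries.

```python
import math
from collections import defaultdict

# Toy vectors in a shared cross-lingual space (hypothetical values; a real
# system would load pre-aligned Russian and English embeddings instead).
RU_VECS = {"кошка": [0.9, 0.1, 0.0], "поиск": [0.0, 0.2, 0.9]}
EN_VECS = {
    "cat": [0.88, 0.12, 0.05],
    "dog": [0.7, 0.6, 0.0],
    "search": [0.05, 0.1, 0.95],
    "retrieval": [0.1, 0.3, 0.85],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_keywords(keywords, src_vecs, tgt_vecs, n_neighbors=2):
    """Map each source keyword to its nearest target-language words."""
    mapped = []
    for word in keywords:
        if word not in src_vecs:
            continue  # out-of-vocabulary keywords are skipped in this sketch
        ranked = sorted(
            tgt_vecs,
            key=lambda t: cosine(src_vecs[word], tgt_vecs[t]),
            reverse=True,
        )
        mapped.extend(ranked[:n_neighbors])
    return mapped


def build_inverted_index(docs):
    """Build a term -> set of document ids mapping from tokenized documents."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def retrieve(query_terms, index, top_k=10):
    """Rank documents by the number of distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query_terms):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def average_precision(ranked, relevant):
    """Average precision for one query; MAP is its mean over all queries."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Usage: Russian query keywords are mapped to English terms, which are then
# looked up in an index built over English documents.
english_docs = {"d1": ["cat", "search"], "d2": ["dog"], "d3": ["retrieval", "search"]}
index = build_inverted_index(english_docs)
terms = map_keywords(["кошка", "поиск"], RU_VECS, EN_VECS)
ranked = retrieve(terms, index)
```

Mapping each keyword to several nearest neighbours rather than a single translation is one plausible way to tolerate imperfect embedding alignment; the number of neighbours and the scoring function are tuning choices that this sketch does not attempt to settle.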

Notes

  1. russian-syntagrus-ud-2.5-191206 and english-ewt-ud-2.5-191206 models.

  2. Russian-English parallel corpus: https://translate.yandex.ru/corpus?lang=en/.

  3. A library for Multilingual Unsupervised or Supervised word Embeddings: https://github.com/facebookresearch/MUSE.

  4. version 1.1.1.

  5. http://nlp.isa.ru/ru-en-src-retr-dataset/.

  6. http://opus.nlpl.eu/WMT-News.php.

  7. http://opus.nlpl.eu/News-Commentary.php.

Acknowledgement

The reported study was funded by RFBR under research projects No. 18-37-20017 and No. 18-29-03187. This research was also partially supported by the Ministry of Science and Higher Education of the Russian Federation under the agreement between Lomonosov Moscow State University and the Foundation of project support of the National Technology Initiative No. 13/1251/2018 dated 11.12.2018, within the Research Program “Center of Big Data Storage and Analysis” of the National Technology Initiative Competence Center (project “Text mining tools for big data”).

Author information

Corresponding author

Correspondence to Denis Zubarev.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zubarev, D., Sochenkov, I. (2021). Comparison of Cross-Lingual Similar Documents Retrieval Methods. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_16

  • DOI: https://doi.org/10.1007/978-3-030-81200-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81199-0

  • Online ISBN: 978-3-030-81200-3

  • eBook Packages: Computer Science, Computer Science (R0)
