Comparison of Cross-Lingual Similar Documents Retrieval Methods

Zubarev, Denis; Sochenkov, Ilya

doi:10.1007/978-3-030-81200-3_16

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1427)

Included in the following conference series:

International Conference on Data Analytics and Management in Data Intensive Domains

Abstract

In this paper, we compare methods for cross-lingual similar document retrieval for a distant language pair, Russian and English. The methods include classical Cross-Lingual Explicit Semantic Analysis (CL-ESA), machine translation methods, and approaches based on cross-lingual embeddings. We introduce two datasets for evaluating this task: Russian-English aligned Wikipedia articles and an automatically translated Paraplag. Our experiments show that an inverted-index approach with an extra step that maps top keywords from one language to the other using cross-lingual word embeddings achieves better recall and MAP than the other methods on both datasets.
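The keyword-mapping idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-d vectors, the document collection, and the helper names (`map_keywords`, `build_inverted_index`, `retrieve`, `average_precision`) are all hypothetical, and the source keywords are assumed to be already selected. The sketch maps each Russian keyword to its nearest English word in a shared embedding space, looks the mapped words up in an inverted index, and scores the ranking with average precision (MAP is the mean of this value over all queries).

```python
import math
from collections import defaultdict

# Hypothetical toy embeddings, assumed to live in one shared cross-lingual space.
RU_VECS = {"кошка": (1.0, 0.1), "собака": (0.1, 1.0)}
EN_VECS = {"cat": (0.9, 0.2), "dog": (0.2, 0.9), "car": (-1.0, 0.0)}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_keywords(keywords, src_vecs, tgt_vecs, n_neighbors=1):
    """Replace each source keyword with its nearest target-language words."""
    mapped = []
    for word in keywords:
        if word not in src_vecs:
            continue  # out-of-vocabulary keywords are skipped in this sketch
        ranked = sorted(
            tgt_vecs,
            key=lambda t: cosine(src_vecs[word], tgt_vecs[t]),
            reverse=True,
        )
        mapped.extend(ranked[:n_neighbors])
    return mapped


def build_inverted_index(docs):
    """Map every token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def retrieve(query_terms, index):
    """Rank documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Average precision of one ranking against a set of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical English collection and a Russian query given as keywords.
EN_DOCS = {"d1": ["cat", "dog"], "d2": ["car"], "d3": ["dog"]}
index = build_inverted_index(EN_DOCS)
query = map_keywords(["кошка", "собака"], RU_VECS, EN_VECS)
ranking = retrieve(query, index)
print(query, ranking, average_precision(ranking, {"d3"}))
```

A real system would presumably weight terms (for example with TF-IDF) rather than count matches, and would take several nearest neighbors per keyword; both are simplified away here.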

Notes

1. russian-syntagrus-ud-2.5-191206 and english-ewt-ud-2.5-191206 models.
2. Russian-English parallel corpus: https://translate.yandex.ru/corpus?lang=en/.
3. A library for Multilingual Unsupervised or Supervised word Embeddings: https://github.com/facebookresearch/MUSE.
4. Version 1.1.1.
5. http://nlp.isa.ru/ru-en-src-retr-dataset/.
6. http://opus.nlpl.eu/WMT-News.php.
7. http://opus.nlpl.eu/News-Commentary.php.

References

Antonova, A., Misyurev, A.: Building a web-based parallel corpus and filtering out machine-translated text. In: Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 136–144 (2011)
Artetxe, M., Labaka, G., Agirre, E.: Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In: AAAI, pp. 5012–5019 (2018)
Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 7, 597–610 (2019)
Bakhteev, O., Ogaltsov, A., Khazov, A., Safin, K., Kuznetsova, R.: Crosslang: the system of cross-lingual plagiarism detection. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)
Barrón-Cedeño, A., Gupta, P., Rosso, P.: Methods for cross-language plagiarism detection. Knowl.-Based Syst. 50, 211–217 (2013)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017)
Ferrero, J., Agnes, F., Besacier, L., Schwab, D.: Using word embedding for cross-language plagiarism detection. arXiv preprint arXiv:1702.03082 (2017)
Franco-Salvador, M., Gupta, P., Rosso, P., Banchs, R.E.: Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowl.-Based Syst. 111, 87–99 (2016)
Gabrilovich, E., Markovitch, S., et al.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI, vol. 7, pp. 1606–1611 (2007)
Gillick, D., Presta, A., Tomar, G.S.: End-to-end retrieval in continuous space. arXiv preprint arXiv:1811.08008 (2018)
Jiang, J.Y., Zhang, M., Li, C., Bendersky, M., Golbandi, N., Najork, M.: Semantic text matching for long-form documents. In: The World Wide Web Conference, pp. 795–806 (2019)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data (2019)
Kutuzov, A., Kopotev, M., Sviridenko, T., Ivanova, L.: Clustering comparable corpora of Russian and Ukrainian academic texts: Word embeddings and semantic fingerprints. arXiv preprint arXiv:1604.05372 (2016)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Lang. Res. Eval. 45(1), 45–62 (2011)
Rekabsaz, N., Lupu, M., Hanbury, A., Zuccon, G.: Generalizing translation models in the probabilistic relevance framework. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 711–720 (2016)
Romanov, A., Kuznetsova, R., Bakhteev, O., Khritankov, A.: Machine-translated text detection in a collection of Russian scientific papers. Dialogue, p. 2 (2016)
Schwenk, H., Chaudhary, V., Sun, S., Gong, H., Guzmán, F.: WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791 (2019)
Sochenkov, I.V., Zubarev, D.V., Tikhomirov, I.A.: Exploratory patent search. Inf. its Appl. 12(1), 89–94 (2018)
Sochenkov, I., Zubarev, D., Smirnov, I.: The Paraplag: Russian dataset for paraphrased plagiarism detection. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference Dialogue, vol. 1, pp. 284–297 (2017)
Straka, M., Hajič, J., Straková, J.: UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 4290–4297 (2016)
Tang, S., Mousavi, M., de Sa, V.R.: An empirical study on post-processing methods for word embeddings. arXiv preprint arXiv:1905.10971 (2019)
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: LREC 2012, pp. 2214–2218 (2012)
Vulić, I., et al.: Multi-SimLex: A large-scale evaluation of multilingual and cross-lingual lexical semantic similarity. arXiv preprint arXiv:2003.04866 (2020)
Vulić, I., Moens, M.F.: Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), vol. 2, pp. 719–725. ACL; East Stroudsburg, PA (2015)
Vulić, I., Moens, M.F.: Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 363–372 (2015)
Zubarev, D.V., Sochenkov, I.V.: Cross-lingual similar document retrieval methods. In: Proceedings of the ISP RAS, 31(5) (2019)
Zubarev, D., Sochenkov, I.: Cross-language text alignment for plagiarism detection based on contextual and context-free models. In: Proceedings of the Annual International Conference Dialogue, vol. 1, pp. 799–810 (2019)
Zweigenbaum, P., Sharoff, S., Rapp, R.: Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In: Proceedings of 11th Workshop on Building and Using Comparable Corpora, pp. 39–42 (2018)

Acknowledgement

The reported study was funded by RFBR under research projects No. 18-37-20017 and No. 18-29-03187. This research is also partially supported by the Ministry of Science and Higher Education of the Russian Federation under the agreement between the Lomonosov Moscow State University and the Foundation of project support of the National Technology Initiative No. 13/1251/2018 dated 11.12.2018, within the Research Program “Center of Big Data Storage and Analysis” of the National Technology Initiative Competence Center (project “Text mining tools for big data”).

Author information

Authors and Affiliations

Federal Research Center ‘Computer Science and Control’ of Russian Academy of Sciences, 44-2 Vavilov Street, Moscow, 119333, Russia
Denis Zubarev & Ilya Sochenkov

Authors

Denis Zubarev
Ilya Sochenkov

Corresponding author

Correspondence to Denis Zubarev.

Editor information

Editors and Affiliations

Voronezh State University, Voronezh, Russia
Alexander Sychev
Voronezh State University, Voronezh, Russia
Sergey Makhortov
Christian-Albrecht University of Kiel, Kiel, Schleswig-Holstein, Germany
Bernhard Thalheim

About this paper

Cite this paper

Zubarev, D., Sochenkov, I. (2021). Comparison of Cross-Lingual Similar Documents Retrieval Methods. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_16

DOI: https://doi.org/10.1007/978-3-030-81200-3_16
Published: 16 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81199-0
Online ISBN: 978-3-030-81200-3
eBook Packages: Computer Science, Computer Science (R0)
