Abstract
In this paper, we compare methods for cross-lingual similar document retrieval for a distant language pair, namely Russian and English. The methods compared include classical Cross-Lingual Explicit Semantic Analysis (CL-ESA), machine translation methods, and approaches based on cross-lingual embeddings. We introduce two datasets for evaluating this task: Russian-English aligned Wikipedia articles and an automatically translated version of Paraplag. Our experiments show that an inverted-index approach, extended with a step that maps the top keywords from one language to the other using cross-lingual word embeddings, achieves better recall and MAP than the other methods on both datasets.
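The best-performing approach named in the abstract, an inverted index combined with keyword mapping through cross-lingual word embeddings, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy vectors, the frequency-based keyword selection, the overlap-count scoring, and all function names are hypothetical. The sketch also shows how recall at k and average precision are computed; MAP is the mean of average precision over all queries.

```python
import math
from collections import Counter, defaultdict

# Toy vectors standing in for cross-lingual embeddings in a shared space
# (hypothetical values, not taken from any trained model).
RU_VECS = {"кошка": (0.9, 0.1, 0.0), "собака": (0.1, 0.9, 0.0), "банк": (0.0, 0.1, 0.9)}
EN_VECS = {"cat": (1.0, 0.0, 0.0), "dog": (0.0, 1.0, 0.0), "bank": (0.0, 0.0, 1.0)}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_keywords(tokens, n):
    """Pick the n most frequent tokens (a stand-in for a real keyword weighting)."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def map_keyword(word, n_neighbors=1):
    """Map a Russian keyword to its nearest English words in the shared space."""
    if word not in RU_VECS:
        return []
    ranked = sorted(EN_VECS, key=lambda en: cosine(RU_VECS[word], EN_VECS[en]), reverse=True)
    return ranked[:n_neighbors]


def build_index(docs):
    """Inverted index: token -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def retrieve(query_tokens, index, n_keywords=2, k=10):
    """Score documents by how many mapped query keywords they contain."""
    scores = Counter()
    for keyword in top_keywords(query_tokens, n_keywords):
        for en_word in map_keyword(keyword):
            for doc_id in index.get(en_word, ()):
                scores[doc_id] += 1
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


EN_DOCS = {"d1": ["cat", "dog", "cat"], "d2": ["bank", "bank"], "d3": ["dog"]}
INDEX = build_index(EN_DOCS)
QUERY = ["кошка", "кошка", "кошка", "собака", "собака", "банк"]
RANKED = retrieve(QUERY, INDEX)
```

In a realistic setting the toy dictionaries would be replaced by trained cross-lingual embeddings, the nearest-neighbour lookup by an approximate search structure, and the overlap count by a proper ranking function; none of these choices are specified in the abstract.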
Notes
1. russian-syntagrus-ud-2.5-191206 and english-ewt-ud-2.5-191206 models.
2. Russian-English parallel corpus: https://translate.yandex.ru/corpus?lang=en/.
3. A library for Multilingual Unsupervised or Supervised word Embeddings: https://github.com/facebookresearch/MUSE.
4. Version 1.1.1.
Acknowledgement
The reported study was funded by RFBR under research projects No. 18-37-20017 and No. 18-29-03187. This research is also partially supported by the Ministry of Science and Higher Education of the Russian Federation under the agreement between Lomonosov Moscow State University and the Foundation of project support of the National Technology Initiative No. 13/1251/2018 dated 11.12.2018, within the Research Program “Center of Big Data Storage and Analysis” of the National Technology Initiative Competence Center (project “Text mining tools for big data”).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zubarev, D., Sochenkov, I. (2021). Comparison of Cross-Lingual Similar Documents Retrieval Methods. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_16
DOI: https://doi.org/10.1007/978-3-030-81200-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81199-0
Online ISBN: 978-3-030-81200-3
eBook Packages: Computer Science, Computer Science (R0)