Abstract
In this article, we compare the quality of several cross-lingual embeddings on a cross-lingual text classification problem and explore the possibility of transferring knowledge between languages. We consider Multilingual Unsupervised and Supervised Embeddings (MUSE), multilingual BERT embeddings, XLM-RoBERTa (XLM-R) embeddings, and Language-Agnostic Sentence Representations (LASER). Various classification algorithms take these embeddings as inputs to solve the task of patent categorization. This is a zero-shot cross-lingual classification task, since the training and validation sets consist of English texts, while the test set consists of documents in Russian.
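The zero-shot setup described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the "embeddings" are synthetic vectors drawn around shared class centers, standing in for the output of a cross-lingual encoder, and the nearest-centroid classifier is a hypothetical stand-in for the classification algorithms mentioned in the abstract. All names (`make_embeddings`, `fit_centroids`, `predict`) and all numeric settings are assumptions chosen for illustration.

```python
import numpy as np


def make_embeddings(centers, n_per_class, noise, rng):
    """Draw synthetic 'document embeddings' around shared class centers.

    Assumption: a cross-lingual encoder maps documents of the same class
    to nearby points regardless of language. Here that is simulated.
    """
    n_classes, dim = centers.shape
    labels = np.repeat(np.arange(n_classes), n_per_class)
    vectors = centers[labels] + noise * rng.standard_normal((labels.size, dim))
    return vectors, labels


def fit_centroids(vectors, labels):
    """Compute one mean vector per class from the training embeddings."""
    classes = np.unique(labels)
    centroids = np.stack([vectors[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(vectors, classes, centroids):
    """Assign each vector to the class whose centroid has the highest cosine similarity."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return classes[np.argmax(v @ c.T, axis=1)]


rng = np.random.default_rng(0)
centers = rng.standard_normal((4, 32))  # 4 hypothetical patent classes, 32-dim space

# "English" training data and "Russian" test data share the same class centers;
# the test set is made noisier to mimic imperfect cross-lingual alignment.
en_train_x, en_train_y = make_embeddings(centers, 50, 0.5, rng)
ru_test_x, ru_test_y = make_embeddings(centers, 30, 0.8, rng)

# Train on English only, evaluate on Russian only (zero-shot transfer).
classes, centroids = fit_centroids(en_train_x, en_train_y)
ru_pred = predict(ru_test_x, classes, centroids)
accuracy = float(np.mean(ru_pred == ru_test_y))
print(f"zero-shot accuracy on the synthetic test set: {accuracy:.2f}")
```

The point of the sketch is the data split: no Russian example is seen during fitting, so any accuracy on the test set comes from the two languages sharing one embedding space. With real encoders, how well that assumption holds is exactly what an extrinsic evaluation measures.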
Acknowledgments
The reported study was funded by RFBR according to the research projects No 18-37-20017 & No 18-29-03187. This research is also partially supported by the Ministry of Science and Higher Education of the Russian Federation according to the agreement between the Lomonosov Moscow State University and the Foundation of project support of the National Technology Initiative No 13/1251/2018 dated 11.12.2018 within the Research Program “Center of Big Data Storage and Analysis” of the National Technology Initiative Competence Center (project “Text mining tools for big data”).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ryzhova, A., Sochenkov, I. (2021). Extrinsic Evaluation of Cross-Lingual Embeddings on the Patent Classification Task. In: Sychev, A., Makhortov, S., Thalheim, B. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2020. Communications in Computer and Information Science, vol 1427. Springer, Cham. https://doi.org/10.1007/978-3-030-81200-3_13
DOI: https://doi.org/10.1007/978-3-030-81200-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81199-0
Online ISBN: 978-3-030-81200-3
eBook Packages: Computer Science, Computer Science (R0)