
Applying Word Embeddings to Leverage Knowledge Available in One Language in Order to Solve a Practical Text Classification Problem in Another Language

  • Conference paper
  • First Online:
Analysis of Images, Social Networks and Texts (AIST 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 661)


Abstract

A text classification problem in the Kazakh language is examined. The amount of training data for the task in Kazakh is very limited, but plenty of labeled data in Russian is available. A transform between the two languages' word vector spaces is built and used to transfer knowledge from Russian into Kazakh. The resulting classification quality is comparable to that of an approach that employs a sophisticated automatic translation system.
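The abstract does not describe the transform in detail. As a point of reference, the sketch below shows one common way such a cross-lingual mapping can be built: a linear transform between monolingual embedding spaces fitted by least squares over a small bilingual seed dictionary, in the spirit of Mikolov et al.'s linear-mapping approach. All function names, dimensions, and the use of NumPy here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): learn a linear map W between
# monolingual word-embedding spaces using a Russian-Kazakh seed dictionary,
# then project vectors from one language into the other's space so that a
# classifier trained on Russian labeled data can be applied to Kazakh text.
import numpy as np


def learn_transform(src_vecs, tgt_vecs):
    """Fit W minimizing ||src_vecs @ W - tgt_vecs||^2 over dictionary pairs.

    src_vecs: (n_pairs, d_src) embeddings of source-language dictionary words
    tgt_vecs: (n_pairs, d_tgt) embeddings of their target-language translations
    """
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W  # shape (d_src, d_tgt)


def map_embedding(vec, W):
    """Project a single source-language word vector into the target space."""
    return vec @ W


if __name__ == "__main__":
    # Illustrative shapes only: ~2000 dictionary pairs, 100-dimensional vectors.
    rng = np.random.default_rng(0)
    kk = rng.normal(size=(2000, 100))   # stand-in Kazakh word embeddings
    ru = rng.normal(size=(2000, 100))   # stand-in aligned Russian translations
    W = learn_transform(kk, ru)
    print(map_embedding(kk[0], W).shape)  # -> (100,)
```

Under these assumptions, Kazakh document representations could be mapped into the Russian embedding space and fed to a classifier trained on the much larger Russian corpus, which is the kind of knowledge transfer the abstract refers to.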



Acknowledgments

This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.

Author information


Corresponding author

Correspondence to Valentin Mendelev.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Smirnov, A., Mendelev, V. (2017). Applying Word Embeddings to Leverage Knowledge Available in One Language in Order to Solve a Practical Text Classification Problem in Another Language. In: Ignatov, D., et al. Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science, vol 661. Springer, Cham. https://doi.org/10.1007/978-3-319-52920-2_23


  • DOI: https://doi.org/10.1007/978-3-319-52920-2_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52919-6

  • Online ISBN: 978-3-319-52920-2

  • eBook Packages: Computer Science, Computer Science (R0)
