
Semantic Space Transformations for Cross-Lingual Document Classification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11139)

Abstract

Cross-lingual document representation can be obtained by training monolingual semantic spaces and then using bilingual dictionaries with a transformation method to project word vectors into a unified space. The main goal of this paper is to evaluate three promising transformation methods on a cross-lingual document classification task. We also propose, evaluate and compare two cross-lingual document classification approaches. We use a popular convolutional neural network (CNN) and compare its performance with a standard maximum entropy classifier. The proposed methods are evaluated on four languages from the Reuters corpus, namely English, German, Spanish and Italian. We demonstrate that the results of all transformation methods are close to each other; however, the orthogonal transformation generally gives slightly better results when a CNN with trained embeddings is used. The experimental results also show that the convolutional network achieves better results than the maximum entropy classifier. We further show that the proposed methods are competitive with the state of the art.
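The orthogonal transformation mentioned in the abstract can be viewed as an orthogonal Procrustes problem: given word-vector pairs taken from a bilingual dictionary, find the orthogonal matrix that best maps source-language vectors onto target-language vectors. A minimal sketch in NumPy, under the assumption that this is the standard SVD-based Procrustes solution (function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def orthogonal_map(X, Y):
    """Solve min_W ||XW - Y||_F subject to W being orthogonal.

    X: (n, d) source-language vectors for n dictionary pairs
    Y: (n, d) target-language vectors for the same pairs
    Returns the (d, d) orthogonal matrix W = U V^T from the
    SVD of X^T Y (the classic Procrustes solution).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: build X as Y rotated by a known orthogonal matrix Q,
# then recover a mapping that sends X back onto Y.
rng = np.random.default_rng(0)
d = 4
Y = rng.normal(size=(10, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
X = Y @ Q.T
W = orthogonal_map(X, Y)

assert np.allclose(W.T @ W, np.eye(d), atol=1e-8)  # W is orthogonal
assert np.allclose(X @ W, Y, atol=1e-6)            # X is mapped onto Y
```

Because W is constrained to be orthogonal, the mapping preserves vector norms and pairwise angles, which is one reason such transformations are attractive for projecting monolingual semantic spaces into a unified space.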



Acknowledgements

This work has been partly supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by Grant No. SGS-2016-018 Data and Software Engineering for Advanced Applications.

Corresponding author

Correspondence to Pavel Král.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Martínek, J., Lenc, L., Král, P. (2018). Semantic Space Transformations for Cross-Lingual Document Classification. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. Lecture Notes in Computer Science, vol 11139. Springer, Cham. https://doi.org/10.1007/978-3-030-01418-6_60


  • DOI: https://doi.org/10.1007/978-3-030-01418-6_60


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01417-9

  • Online ISBN: 978-3-030-01418-6

  • eBook Packages: Computer Science, Computer Science (R0)
