Skip to main content

Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity

  • Conference paper
  • First Online:
  • 1477 Accesses

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1058))

Abstract

Joint learning of words and entities is advantageous to various NLP tasks, while most of the works focus on single language setting. Cross-lingual representations learning receives high attention recently, but is still restricted by the availability of parallel data. In this paper, a method is proposed to jointly embed texts and entities on comparable data. In addition to evaluate on public semantic textual similarity datasets, a task (cross-lingual text extraction) was proposed to assess the similarities between texts and contribute to this dataset. It shows that the proposed method outperforms cross-lingual representations methods using parallel data on cross-lingual tasks, and achieves competitive results on mono-lingual tasks.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    The Wikipedia dump was downloaded from the website: https://dumps.wikimedia.org.

  2. 2.

    https://en.wikipedia.org/wiki/Latin_music.

  3. 3.

    https://github.com/hsuehkuan-lu/KEBTR.

  4. 4.

    https://github.com/fxsjy/jieba.

References

  1. Agirre, E., et al.: Semeval-2014 task 10: multilingual semantic textual similarity. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 81–91. Association for Computational Linguistics (2014)

    Google Scholar 

  2. Agirre, E., et al.: Semeval-2016 task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 497–511 (2016)

    Google Scholar 

  3. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A.: Semeval-2012 task 6: a pilot on semantic textual similarity. In: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 385–393. Association for Computational Linguistics, Montréal, 7–8 June 2012

    Google Scholar 

  4. Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., Smith, N.A.: Massively multilingual word embeddings. CoRR abs/1602.01925 (2016)

    Google Scholar 

  5. Artetxe, M., Labaka, G., Agirre, E.: Learning bilingual word embeddings with (almost) no bilingual data. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 451–462 (2017)

    Google Scholar 

  6. Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798 (2018)

    Google Scholar 

  7. Aziz, W., Specia, L.: Fully automatic compilation of Portuguese-English and Portuguese-Spanish parallel corpora. In: STIL (2011)

    Google Scholar 

  8. Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using web search engines. In: WWW 2007: Proceedings of the 16th International Conference on World Wide Web, pp. 757–766 (2007)

    Google Scholar 

  9. Cao, Y., Huang, L., Ji, H., Chen, X., Li, J.: Bridge text and knowledge by learning multi-prototype entity mention embedding. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1623–1633 (2017)

    Google Scholar 

  10. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: Semeval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14. Association for Computational Linguistics (2017)

    Google Scholar 

  11. Chandar, A.P.S., et al.: An autoencoder approach to learning bilingual word representations. CoRR abs/1402.1454 (2014)

    Google Scholar 

  12. Faruqui, M., Dyer, C.: Improving vector space word representations using multilingual correlation. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 462–471. Association for Computational Linguistics (2014)

    Google Scholar 

  13. Franco-Salvador, M., Rosso, P., Navigli, R.: A knowledge-based representation for cross-language document retrieval and categorization. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, April 2014

    Google Scholar 

  14. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6–12 (2007)

    Google Scholar 

  15. Gouws, S., Bengio, Y., Corrado, G.: BilBOWA: fast bilingual distributed representations without word alignments. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 748–756, 07–09 July 2015

    Google Scholar 

  16. He, H., Gimpel, K., Lin, J.: Multi-perspective sentence similarity modeling with convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1576–1586. Association for Computational Linguistics (2015)

    Google Scholar 

  17. He, H., Lin, J.: Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 937–948. Association for Computational Linguistics (2016)

    Google Scholar 

  18. Hermann, K.M., Blunsom, P.: Multilingual models for compositional distributional semantics. In: Proceedings of ACL, June 2014

    Google Scholar 

  19. Hliaoutakis, A., Varelas, G., Voutsakis, E., Petrakis, E., Milios, E.: Information retrieval by semantic similarity. Int. J. Semant. Web Inf. Syst. 2, 55–73 (2006)

    Article  Google Scholar 

  20. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the 6th New Zealand Computer Science Research Student Conference, pp. 49–56 (2008)

    Google Scholar 

  21. Kenter, T., de Rijke, M.: Short text similarity with word embeddings. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1411–1420 (2015)

    Google Scholar 

  22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2015)

    Google Scholar 

  23. Kiros, R., et al.: Skip-thought vectors. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 3294–3302. Curran Associates, Inc. (2015)

    Google Scholar 

  24. Klementiev, A., Titov, I., Bhattarai, B.: Inducing crosslingual distributed representations of words. In: COLING (2012)

    Google Scholar 

  25. Lavie, A., Denkowski, M.J.: The meteor metric for automatic evaluation of machine translation. Mach. Transl. 23(2–3), 105–115 (2009)

    Article  Google Scholar 

  26. Luong, M.T., Pham, H., Manning, C.D.: Bilingual word representations with monolingual quality in mind. In: NAACL Workshop on Vector Space Modeling for NLP, Denver, United States (2015)

    Google Scholar 

  27. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. CoRR (2013)

    Google Scholar 

  28. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119 (2013)

    Google Scholar 

  29. Mogadala, A., Rettinger, A.: Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 692–702. Association for Computational Linguistics (2016)

    Google Scholar 

  30. Mohammad, S.M., Hirst, G.: Distributional measures as proxies for semantic relatedness (2012)

    Google Scholar 

  31. Mohler, M., Bunescu, R., Mihalcea, R.: Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, HLT 2011, vol. 1, pp. 752–762 (2011)

    Google Scholar 

  32. Pang, L., Lan, Y., Guo, J., Xu, J., Wan, S., Cheng, X.: Text matching as image recognition. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 12–17 February 2016, pp. 2793–2799 (2016)

    Google Scholar 

  33. Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Knowl.-Based Syst. 45, 45–62 (2011)

    Article  Google Scholar 

  34. Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29, 349–380 (2003)

    Article  Google Scholar 

  35. Ruder, S.: A survey of cross-lingual embedding models. CoRR abs/1706.04902 (2017). http://arxiv.org/abs/1706.04902

  36. Schwenk, H., Douze, M.: Learning joint multilingual sentence representations with neural machine translation. In: ACL workshop on Representation Learning for NLP (2017)

    Google Scholar 

  37. Severyn, A., Moschitti, A.: Learning to rank short text pairs with convolutional deep neural networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, pp. 373–382 (2015)

    Google Scholar 

  38. Søgaard, A., Agić, Ž., Martínez Alonso, H., Plank, B., Bohnet, B., Johannsen, A.: Inverted indexing for cross-lingual NLP. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1713–1722 (2015)

    Google Scholar 

  39. Vulić, I., Moens, M.F.: Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 719–725. Association for Computational Linguistics (2015)

    Google Scholar 

  40. Vulic, I., Moens, M.: Bilingual distributed word representations from document-aligned comparable data. J. Artif. Intell. Res. 55, 953–994 (2016)

    Article  MathSciNet  Google Scholar 

  41. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph and text jointly embedding. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1591–1601 (2014)

    Google Scholar 

  42. Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Joint learning of the embedding of words and entities for named entity disambiguation. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 250–259 (2016)

    Google Scholar 

  43. Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Learning distributed representations of texts and entities from knowledge base. Trans. Assoc. Comput. Linguis. 5, 397–411 (2017)

    Article  Google Scholar 

  44. Yang, L., Ai, Q., Guo, J., Croft, W.B.: aNMM: ranking short answer texts with attention-based neural matching model. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, pp. 287–296 (2016)

    Google Scholar 

  45. Yin, W., Schütze, H., Xiang, B., Zhou, B.: ABCNN: attention-based convolutional neural network for modeling sentence pairs. Trans. Assoc. Comput. Linguis. 4, 259–272 (2016)

    Article  Google Scholar 

Download references

Acknowledgement

The work is supported by NSFC key project (U1736204, 61533018, 61661146007), Ministry of Education and China Mobile Joint Fund (MCM20170301), and THUNUS NExT Co-Lab.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hsuehkuan Lu .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Lu, H., Cao, Y., Lei, H., Li, J. (2019). Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity. In: Cheng, X., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2019. Communications in Computer and Information Science, vol 1058. Springer, Singapore. https://doi.org/10.1007/978-981-15-0118-0_33

Download citation

  • DOI: https://doi.org/10.1007/978-981-15-0118-0_33

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-0117-3

  • Online ISBN: 978-981-15-0118-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics