Cross-lingual document similarity estimation and dictionary generation with comparable corpora

Štajner, Tadej; Mladenić, Dunja

doi:10.1007/s10115-018-1179-9

Short Paper
Published: 28 March 2018

Volume 58, pages 729–743 (2019)

Knowledge and Information Systems

Abstract

This paper proposes an approach to bilingual dictionary generation that works even when trained on widely available comparable bilingual corpora. We also show that it provides cross-lingual similarity estimates that correlate well with human judgments. The approach uses a nonlinear bilingual translation model trained on comparable corpora, and we propose a method that combines word embeddings with kernel approximation to train scalable nonlinear transformations. We demonstrate that this novel method works better on a majority of the evaluated language pairs.
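
The abstract does not say which kernel approximation or training procedure is used, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes random Fourier features as the kernel approximation, closed-form ridge regression as the learner, and synthetic vectors standing in for source- and target-language word embeddings of seed translation pairs. All function names and parameter values (`rff_map`, `fit_ridge`, `nearest_targets`, `gamma`, `lam`) are hypothetical.

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier features whose inner products approximate an RBF kernel."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def fit_ridge(Z, Y, lam=1e-3):
    """Closed-form ridge regression from features Z to target embeddings Y."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ Y)

def nearest_targets(P, T):
    """For each row of P, index of the most cosine-similar row of T."""
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    return np.argmax(Pn @ Tn.T, axis=1)

rng = np.random.default_rng(0)
n, d_src, d_tgt, D, gamma = 200, 10, 8, 500, 0.05

# Assumption: synthetic data replaces real embeddings of seed translation pairs.
X = rng.normal(size=(n, d_src))                    # source-language vectors
Y = np.tanh(X @ rng.normal(size=(d_src, d_tgt)))   # nonlinearly related targets

# Frequencies drawn from N(0, 2*gamma) approximate exp(-gamma * ||x - y||^2).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d_src, D))
b = rng.uniform(0, 2 * np.pi, size=D)

Z = rff_map(X, W, b)                 # explicit nonlinear feature map
M = fit_ridge(Z, Y)                  # linear map in feature space
pred = Z @ M                         # predicted target-language vectors
candidates = nearest_targets(pred, Y)  # dictionary candidates by cosine similarity
```

Because the nonlinearity lives in an explicit, fixed-size feature map, training reduces to a linear problem whose cost grows with the number of features rather than with the square of the number of training pairs, which is one way a nonlinear transformation can be kept scalable.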

Notes

http://dbpedia.org/Wiktionary.

Acknowledgements

This work was supported by the Slovenian Research Agency and the IST Programme of the EC under XLike (ICT-STREP-288342), LT-Web (ICT-287815-CSA) and RENDER (ICT-257790-STREP).

Author information

Authors and Affiliations

Jožef Stefan Institute, Jožef Stefan International Postgraduate School, Jamova ulica 39, 1000, Ljubljana, Slovenia
Tadej Štajner & Dunja Mladenić

Corresponding author

Correspondence to Tadej Štajner.

About this article

Cite this article

Štajner, T., Mladenić, D. Cross-lingual document similarity estimation and dictionary generation with comparable corpora. Knowl Inf Syst 58, 729–743 (2019). https://doi.org/10.1007/s10115-018-1179-9

Received: 08 October 2014
Revised: 15 November 2017
Accepted: 14 March 2018
Published: 28 March 2018
Issue Date: 05 March 2019
DOI: https://doi.org/10.1007/s10115-018-1179-9
