short-paper

A Study of Efficacy of Cross-lingual Word Embeddings for Indian Languages

Authors:
Jyotsana Khatri (Indian Institute of Technology Bombay),
Rudra Murthy (Indian Institute of Technology Bombay),
Pushpak Bhattacharyya (Indian Institute of Technology Bombay)

CoDS COMAD 2020: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, January 2020, Pages 347–348. https://doi.org/10.1145/3371158.3371219

Published: 15 January 2020

ABSTRACT

Cross-lingual word embeddings have become ubiquitous in NLP tasks. The existing literature primarily evaluates the quality of cross-lingual word embeddings on the task of Bilingual Lexicon Induction (BLI) and reports very high accuracies for European languages. In this paper, we report BLI accuracy for Indian languages using cross-lingual word embeddings generated by two mapping-based unsupervised approaches, VecMap and MUSE, on a dataset created from the linked Indian WordNet. We also compare these approaches with a simple baseline in which the embeddings for all languages are trained using fastText on the combined corpora of 11 Indian languages. Our experiments show that existing cross-lingual word embedding approaches give low accuracy on bilingual lexicon induction for cognate words. Given the high cognate overlap among several Indian languages, this is a serious limitation of existing approaches.
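To make the evaluation measure concrete, the following is a minimal sketch of how BLI accuracy (precision@1) can be computed once source and target embeddings share a space. It is not the authors' evaluation code: the function names, the toy vectors, and the use of plain cosine nearest-neighbour retrieval are assumptions made for illustration only.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)

def bli_precision_at_1(src_emb, tgt_emb, gold):
    """Fraction of source words whose nearest target word is a gold translation.

    src_emb, tgt_emb: arrays of shape (n_words, dim), already mapped into a shared space.
    gold: dict mapping a source row index to the set of acceptable target row indices
          (hypothetical stand-in for a bilingual test dictionary).
    """
    src = normalize_rows(np.asarray(src_emb, dtype=float))
    tgt = normalize_rows(np.asarray(tgt_emb, dtype=float))
    queries = sorted(gold)
    # Cosine similarity between every query word and every target word.
    similarities = src[queries] @ tgt.T
    predictions = similarities.argmax(axis=1)
    hits = sum(int(pred) in gold[query] for query, pred in zip(queries, predictions))
    return hits / len(queries)

# Toy data (invented for illustration): three source and three target words in 2-D.
toy_src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_tgt = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
toy_gold = {0: {0}, 1: {1}, 2: {2}}

toy_score = bli_precision_at_1(toy_src, toy_tgt, toy_gold)
print(f"precision@1 = {toy_score:.3f}")
```

In this toy case the third source word is retrieved incorrectly, so two of three queries count as hits. Other retrieval criteria can be swapped in by replacing the similarity computation.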

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 789--798.
Pushpak Bhattacharyya. 2017. IndoWordNet. In The WordNet in Indian Languages. Springer, 1--18.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135--146.
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word Translation Without Parallel Data. In Proceedings of ICLR 2018.
Anoop Kunchukuttan, Ratish Puduppully, and Pushpak Bhattacharyya. 2015. Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 81--85.
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A Survey of Cross-lingual Word Embedding Models. Journal of Artificial Intelligence Research 65 (2019), 569--631.

Published in
CoDS COMAD 2020: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD
January 2020
399 pages
ISBN:9781450377386
DOI:10.1145/3371158
General Chairs: Vasudeva Varma, Subbarao Kambhampati
Program Chairs: Arnab Bhattacharya, Sriraam Natarajan
Publications Chair: Rishiraj Saha Roy
Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
CoDS COMAD 2020 paper acceptance rate: 78 of 275 submissions (28%). Overall acceptance rate: 197 of 680 submissions (29%).