short-paper

word2vec or JoBimText?: A Comparison for Lexical Expansion of Hindi Words

Authors:

Nitin Ramrakhiyani,

Girish Palshikar

FIRE '15: Proceedings of the 7th Annual Meeting of the Forum for Information Retrieval Evaluation

Pages 39 - 42

https://doi.org/10.1145/2838706.2838713

Published: 04 December 2015

Abstract

Distributional semantics has rarely been explored for NLP tasks in Indian languages. This work presents a comparative analysis of two recent, high-performing distributional semantics techniques, word2vec and JoBimText, on the task of lexical expansion of Hindi words. The techniques are evaluated through a manual similarity assessment of the lexical expansions they produce. The word2vec framework is observed to perform better than JoBimText across various corpus sizes. Analysis of the results also offers insights into how the systems perform on various word types.
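To make the task concrete, lexical expansion in an embedding-based setting can be pictured as retrieving the nearest neighbours of a word under cosine similarity. The sketch below is only an illustration under stated assumptions: the three-dimensional vectors, the `TOY_VECTORS` table, and the `expand` helper are hypothetical, and the code reproduces neither the word2vec training procedure nor the JoBimText pipeline compared in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand(word, vectors, k=2):
    """Return the k words most similar to `word`, with their scores."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors, invented for illustration (not learned from a corpus).
TOY_VECTORS = {
    "नदी": [0.9, 0.1, 0.0],      # river
    "सागर": [0.8, 0.2, 0.1],     # sea
    "झील": [0.85, 0.15, 0.05],   # lake
    "किताब": [0.0, 0.9, 0.3],    # book
}

expansions = expand("नदी", TOY_VECTORS, k=2)
print(expansions)
```

In a manual assessment of the kind the abstract describes, a human judge would then rate how similar each returned expansion is to the query word.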


Cited By

M. Jain, R. Jindal, and A. Jain (2023). Code-mixed Hindi-English text correction using fuzzy graph and word embedding. Expert Systems. Online publication date: 14-May-2023. https://doi.org/10.1111/exsy.13328

K. Garg and D. Lobiyal (2021). KL-NF technique for sentiment classification. Multimedia Tools and Applications, 80:13 (19885-19907). Online publication date: 1-May-2021. https://dl.acm.org/doi/10.1007/s11042-021-10559-y

K. Garg and D. Lobiyal (2018). Multi-class Classification of Sentiments in Hindi Sentences Based on Intensities. Towards Extensible and Adaptable Methods in Computing (251-266). Online publication date: 5-Nov-2018. https://doi.org/10.1007/978-981-13-2348-5_19

Published In

FIRE '15: Proceedings of the 7th Annual Meeting of the Forum for Information Retrieval Evaluation

December 2015

57 pages

ISBN:9781450340045

DOI:10.1145/2838706

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

FIRE '15

FIRE '15: Forum for Information Retrieval Evaluation

December 4 - 6, 2015

Gandhinagar, India

Acceptance Rates

FIRE '15 Paper Acceptance Rate 12 of 42 submissions, 29%;

Overall Acceptance Rate 19 of 64 submissions, 30%
