Generation of Hindi Word Embeddings and Their Utilization in Ranking Documents Using Negative Sampling Architecture, t-SNE Visualization and TF-IDF Based Weighted Average of Vectors

Prabhudesai, Arya

doi:10.1007/978-3-030-16681-6_28

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 939)

Included in the following conference series:

International Conference on Innovations in Bio-Inspired Computing and Applications

Abstract

Hindi is the official language of India and has over 500 million speakers worldwide. Its dominance and widespread impact imply a need for technologies that cater to its native speakers. In this paper, a text-mining-based information retrieval model is developed that generates Hindi word embeddings and applies them to ranking documents in order of relevance to an input query. Word embeddings are multi-dimensional vectors created from the linguistic context of words in a large corpus. To generate the embeddings, a corpus was created from the Hindi Wikipedia dump, and the skip-gram approach was applied to it using a neural-network-based negative-sampling architecture. Each document's embedding was then computed as the weighted average of its word embeddings, with each word weighted by its tf-idf score. The cosine similarity was calculated between each document vector and the query vector, and the documents were ranked in descending order of these similarity scores. Highly relevant rankings were obtained in response to query inputs. The results of the model were visualized using the t-SNE method. The accuracy of this method indicates that the semantic context of the words was preserved when they were converted to numeric vectors.
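The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the three-dimensional embeddings, the English toy tokens, the smoothed idf formula, the default idf of 1.0 for unseen words, and all function names are assumptions chosen for illustration. In the paper's setting the embeddings would come from the skip-gram model trained on the Hindi Wikipedia corpus.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional embeddings; a trained skip-gram model
# would supply real vectors of much higher dimension.
EMBEDDINGS = {
    "cricket": [0.9, 0.1, 0.0],
    "match":   [0.8, 0.2, 0.1],
    "river":   [0.0, 0.9, 0.2],
    "water":   [0.1, 0.8, 0.3],
    "india":   [0.4, 0.4, 0.4],
}

def idf_table(docs):
    """Smoothed idf (an assumed variant): log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log((1 + n) / (1 + count)) + 1.0
            for word, count in df.items()}

def weighted_vector(tokens, idf):
    """tf-idf weighted average of the embeddings of the known tokens."""
    tf = Counter(t for t in tokens if t in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        # Assumption: words absent from the collection get idf 1.0.
        weight = count * idf.get(word, 1.0)
        weight_sum += weight
        for i, x in enumerate(EMBEDDINGS[word]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [x / weight_sum for x in total]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_index) pairs in descending order of similarity."""
    idf = idf_table(docs)
    q = weighted_vector(query, idf)
    scores = [(cosine(q, weighted_vector(doc, idf)), i)
              for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[0], reverse=True)

# Toy tokenized documents and query (hypothetical data).
DOCS = [
    ["india", "cricket", "match"],
    ["river", "water", "india"],
    ["cricket", "match", "match"],
]
RANKING = rank(["cricket"], DOCS)
```

On this toy data, the query `["cricket"]` ranks document 2 first, then document 0, then document 1, since the first two are dominated by vectors close to the query vector.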

Acknowledgements

I would like to thank Mr. Vaibhav Khatavkar, ME CSE-IT, Assistant Professor, Department of Computer Engineering and Information Technology at College of Engineering, Pune, for providing invaluable guidance and advice during the process of conducting this research.

Author information

Authors and Affiliations

College of Engineering, Pune, Pune, India
Arya Prabhudesai

Corresponding author

Correspondence to Arya Prabhudesai.

Editor information

Editors and Affiliations

Machine Intelligence Research Labs, Auburn, WA, USA
Ajith Abraham
Machine Intelligence Research Labs, Auburn, WA, USA
Niketa Gandhi
Department of Applied Science and Engineering, Indian Institute of Technology, Roorkee, India
Millie Pant

About this paper

Cite this paper

Prabhudesai, A. (2019). Generation of Hindi Word Embeddings and Their Utilization in Ranking Documents Using Negative Sampling Architecture, t-SNE Visualization and TF-IDF Based Weighted Average of Vectors. In: Abraham, A., Gandhi, N., Pant, M. (eds) Innovations in Bio-Inspired Computing and Applications. IBICA 2018. Advances in Intelligent Systems and Computing, vol 939. Springer, Cham. https://doi.org/10.1007/978-3-030-16681-6_28

DOI: https://doi.org/10.1007/978-3-030-16681-6_28
Published: 21 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16680-9
Online ISBN: 978-3-030-16681-6
eBook Packages: Intelligent Technologies and Robotics (R0)
