Abstract
Hindi is an official language of India and has over 500 million speakers worldwide. Its dominance and widespread impact imply a need for technologies that cater to its native speakers. In this paper, a text-mining-based information retrieval model is developed that generates Hindi word embeddings and applies them to rank documents in order of relevance to an input query. Word embeddings are multi-dimensional vectors created from the linguistic context of words in a large corpus. To generate the embeddings, a corpus was created from the Hindi Wikipedia dump, and the skip-gram approach was applied to it using a neural-network-based negative sampling architecture. The embedding of each document was computed as the weighted average of its word embeddings, with each word weighted by its tf-idf score. The cosine similarity was then calculated between each document vector and the query vector. Using these similarity scores, the documents were ranked in descending order of relevance to the query. Highly relevant rankings were obtained in response to a query input. The results of the model were visualized using t-SNE. The accuracy of this method indicates that the semantic context of the words was preserved when they were converted to numeric vectors.
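The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the two-dimensional embeddings, the toy documents, the tf-idf variant (term frequency times `log(N/df) + 1`) and the unweighted average used for the query vector are all assumptions made for illustration, and the function names are hypothetical. In practice the embeddings would come from the trained skip-gram model.

```python
import math
from collections import Counter

import numpy as np


def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * (math.log(n_docs / doc_freq[word]) + 1.0)
            for word, count in counts.items()
        })
    return weights


def weighted_average(weights, embeddings, dim):
    """Weighted mean of the embeddings of all words that have a vector."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for word, weight in weights.items():
        if word in embeddings:
            total += weight * embeddings[word]
            weight_sum += weight
    return total / weight_sum if weight_sum > 0 else total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def rank(query_tokens, docs, embeddings, dim):
    """Return (document index, score) pairs, most similar document first."""
    doc_vecs = [weighted_average(w, embeddings, dim) for w in tfidf_weights(docs)]
    # Assumption: the query vector is a plain (unweighted) average.
    query_vec = weighted_average({w: 1.0 for w in query_tokens}, embeddings, dim)
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-D embeddings; real ones would be learned from the corpus.
embeddings = {
    "नदी": np.array([1.0, 0.0]),
    "पानी": np.array([0.9, 0.1]),
    "क्रिकेट": np.array([0.0, 1.0]),
    "खेल": np.array([0.1, 0.9]),
}
docs = [["नदी", "पानी"], ["क्रिकेट", "खेल"]]
ranking = rank(["पानी"], docs, embeddings, dim=2)
print(ranking)
```

With these toy vectors, the document about rivers and water ranks above the one about cricket for the query "पानी", because its averaged vector points in nearly the same direction as the query vector.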
Acknowledgements
I would like to thank Mr. Vaibhav Khatavkar, ME CSE-IT, Assistant Professor, Department of Computer Engineering and Information Technology at College of Engineering, Pune, for providing invaluable guidance and advice during the course of this research.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Prabhudesai, A. (2019). Generation of Hindi Word Embeddings and Their Utilization in Ranking Documents Using Negative Sampling Architecture, t-SNE Visualization and TF-IDF Based Weighted Average of Vectors. In: Abraham, A., Gandhi, N., Pant, M. (eds) Innovations in Bio-Inspired Computing and Applications. IBICA 2018. Advances in Intelligent Systems and Computing, vol 939. Springer, Cham. https://doi.org/10.1007/978-3-030-16681-6_28
DOI: https://doi.org/10.1007/978-3-030-16681-6_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16680-9
Online ISBN: 978-3-030-16681-6
eBook Packages: Intelligent Technologies and Robotics (R0)