Generation of Hindi Word Embeddings and Their Utilization in Ranking Documents Using Negative Sampling Architecture, t-SNE Visualization and TF-IDF Based Weighted Average of Vectors

  • Conference paper
Innovations in Bio-Inspired Computing and Applications (IBICA 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 939)

Abstract

Hindi is an official language of India and has over 500 million speakers worldwide. Its status as a dominant language with widespread impact implies the need to develop technologies that cater to its native speakers. In this paper, a text-mining-based information retrieval model is developed to generate Hindi word embeddings and to apply them to ranking documents in order of relevance to an input query. Word embeddings are multi-dimensional vectors that can be created by utilizing the linguistic context of words in a large corpus. To generate the embeddings, a corpus was created from the Hindi Wikipedia dump, and the skip-gram approach was applied to it using a neural-network-based negative-sampling architecture. The embedding of each document was computed as the weighted average of its word embeddings, with each word weighted by its tf-idf score. The cosine similarity was then calculated between each document vector and the query vector. Using these similarity scores, the documents were ranked in descending order of relevance to the query. Highly relevant rankings were obtained in response to a query input. The results of the model were visualized using the t-SNE visualization method. The accuracy of this method indicates that the semantic context of the words was preserved when they were converted to numeric vectors.
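
The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The three-dimensional embeddings, the toy documents, the smoothed idf formula, and the choice to embed the query as an unweighted mean of its word vectors are all assumptions made for illustration; in the paper the embeddings come from a skip-gram model trained on the Hindi Wikipedia dump.

```python
import math
from collections import Counter

import numpy as np


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Smoothed idf (an assumption) keeps every weight positive.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return weights


def document_vector(weights, embeddings, dim):
    """TF-IDF weighted average of the embeddings of a document's words."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for term, weight in weights.items():
        vec = embeddings.get(term)
        if vec is None:  # skip out-of-vocabulary words
            continue
        total += weight * vec
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def rank_documents(query, docs, embeddings, dim):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    # Assumption: the query vector is the plain mean of its word vectors.
    known = [embeddings[t] for t in query if t in embeddings]
    query_vec = np.mean(known, axis=0) if known else np.zeros(dim)
    doc_vecs = [document_vector(w, embeddings, dim) for w in tfidf_weights(docs)]
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings: sports words lie near the first axis,
# music words near the second.
toy_embeddings = {
    "क्रिकेट": np.array([0.9, 0.1, 0.0]),  # cricket
    "मैच": np.array([0.8, 0.2, 0.1]),      # match
    "खेल": np.array([0.7, 0.3, 0.0]),      # sport
    "संगीत": np.array([0.1, 0.9, 0.0]),    # music
    "गीत": np.array([0.2, 0.8, 0.1]),      # song
    "राग": np.array([0.0, 0.9, 0.2]),      # raga
}
toy_docs = [
    ["क्रिकेट", "मैच", "खेल"],
    ["संगीत", "गीत", "राग"],
    ["खेल", "संगीत"],
]

ranking = rank_documents(["क्रिकेट"], toy_docs, toy_embeddings, dim=3)
for index, score in ranking:
    print(index, round(score, 3))
```

On this toy data the sports document ranks first for the query "क्रिकेट", the mixed document second, and the music document last. With real embeddings, only `toy_embeddings` and `toy_docs` would need to be replaced.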

Acknowledgements

I would like to thank Mr. Vaibhav Khatavkar, ME CSE-IT, Assistant Professor, Department of Computer Engineering and Information Technology at College of Engineering, Pune, for providing invaluable guidance and advice while this research was conducted.

Author information

Corresponding author

Correspondence to Arya Prabhudesai.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Prabhudesai, A. (2019). Generation of Hindi Word Embeddings and Their Utilization in Ranking Documents Using Negative Sampling Architecture, t-SNE Visualization and TF-IDF Based Weighted Average of Vectors. In: Abraham, A., Gandhi, N., Pant, M. (eds) Innovations in Bio-Inspired Computing and Applications. IBICA 2018. Advances in Intelligent Systems and Computing, vol 939. Springer, Cham. https://doi.org/10.1007/978-3-030-16681-6_28
