Language Identification and Disambiguation in Indian Mixed-Script

Gupta, Bhumika; Bhatt, Gaurav; Mittal, Ankush

doi:10.1007/978-3-319-28034-9_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9581)

Included in the following conference series:

International Conference on Distributed Computing and Internet Technology

Abstract

The algorithm proposed in this paper segregates words from several languages (namely Hindi, English, Bengali and Gujarati) and provides relevant replacements for misspelled or unknown words in a given query, thereby generating a relevant query in which the original language of each word is known. First, the words are matched directly against the dictionaries of each language transliterated into English. Then, for the words that do not match, a set of probable words, those closest to the given spelling, is shortlisted from all the dictionaries using the Levenshtein algorithm. After this, to achieve a higher level of generalization, we use a list of probabilities of doublets and triplets of words occurring together, computed from a training database. These probabilities determine the relevance of the shortlisted words in the given text, allowing us to pick the most relevant match.
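
The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionaries, the function names (`levenshtein`, `candidates`, `train_bigrams`, `resolve`), the edit-distance threshold of 2, and the use of raw doublet counts in place of the doublet and triplet probabilities are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy dictionaries of romanized words; the paper uses full
# transliterated dictionaries for Hindi, English, Bengali and Gujarati.
DICTIONARIES = {
    "english": {"the", "main", "song", "free"},
    "hindi": {"dil", "mein", "gaana", "pyar"},
}


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidates(word, max_distance=2):
    """Stage 1: exact dictionary matches. Stage 2: otherwise, near matches."""
    exact = sorted((word, lang) for lang, vocab in DICTIONARIES.items()
                   if word in vocab)
    if exact:
        return exact
    return sorted((entry, lang)
                  for lang, vocab in DICTIONARIES.items()
                  for entry in vocab
                  if levenshtein(word, entry) <= max_distance)


def train_bigrams(sentences):
    """Count word doublets in a training corpus (assumed whitespace-tokenized)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def resolve(query, bigrams):
    """Stage 3: pick, per word, the candidate best supported by the context."""
    resolved = []
    for word in query.lower().split():
        options = candidates(word)
        if not options:
            resolved.append((word, "unknown"))
            continue
        prev = resolved[-1][0] if resolved else None
        # Rank by doublet count with the previous word, then by closeness.
        best = max(options, key=lambda o: (bigrams[(prev, o[0])],
                                           -levenshtein(word, o[0])))
        resolved.append(best)
    return resolved


bigrams = train_bigrams(["dil mein pyar", "free song"])
print(resolve("dil mwin", bigrams))
# [('dil', 'hindi'), ('mein', 'hindi')]
```

Here the misspelled token "mwin" is one edit away from both "main" (English) and "mein" (Hindi); the doublet ("dil", "mein") seen in the training sentences tips the choice toward the Hindi word. A fuller version would add triplets and normalize the counts into probabilities, as the abstract describes.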

Author information

Authors and Affiliations

College of Engineering Roorkee, Roorkee, Uttarakhand, India
Bhumika Gupta
Indian Institute of Technology, Roorkee, Uttarakhand, India
Gaurav Bhatt
Graphic Era University, Dehradun, Uttarakhand, India
Ankush Mittal

Corresponding author

Correspondence to Bhumika Gupta.

Editor information

Editors and Affiliations

Microsoft Research, Redmond, Washington, USA
Nikolaj Bjørner
Indian Institute of Technology Delhi, New Delhi, India
Sanjiva Prasad
IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA
Laxmi Parida

About this paper

Cite this paper

Gupta, B., Bhatt, G., Mittal, A. (2016). Language Identification and Disambiguation in Indian Mixed-Script. In: Bjørner, N., Prasad, S., Parida, L. (eds) Distributed Computing and Internet Technology. ICDCIT 2016. Lecture Notes in Computer Science, vol 9581. Springer, Cham. https://doi.org/10.1007/978-3-319-28034-9_14

DOI: https://doi.org/10.1007/978-3-319-28034-9_14
Published: 25 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28033-2
Online ISBN: 978-3-319-28034-9
eBook Packages: Computer Science, Computer Science (R0)
