research-article

Query Expansion for Transliterated Text Retrieval

Published: 20 July 2021

Abstract

With Web 2.0, the number of Web users and the volume of Web content have grown exponentially. Most of these users are not only consumers of information but also producers of it. People express themselves on the Web in colloquial languages but write them in Roman script (transliteration). These texts are mostly informal and casual, and therefore seldom follow grammar rules. Moreover, no prescribed set of spelling rules exists for transliterated text. This freedom leads to large-scale spelling variation, which is a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach, in two variants, designed for Hindi transliteration. Experiments on Hindi song lyrics retrieval in the mixed-script domain with three retrieval models show that the proposed approaches outperform the existing techniques in a majority of cases (sometimes statistically significantly) on metrics such as nDCG@1, nDCG@5, nDCG@10, MAP, MRR, and Recall.
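To make the idea of phonetic query expansion concrete, the sketch below uses classic American Soundex as a stand-in for an existing phonetic algorithm. It is not the encoding proposed in the article, and the abstract does not say which baselines were evaluated. The function names (`soundex`, `build_phonetic_index`, `expand_query`) and the small Romanized Hindi vocabulary are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Standard American Soundex digit classes (illustrative baseline only;
# this is NOT the phonetic encoding proposed in the article).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex key of an ASCII word ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return "".join(key)[:4].ljust(4, "0")


def build_phonetic_index(vocabulary):
    """Group vocabulary terms by their phonetic key."""
    index = defaultdict(set)
    for term in vocabulary:
        index[soundex(term)].add(term)
    return index


def expand_query(query, index):
    """For each query term, return the sorted set of same-key spellings."""
    expanded = []
    for term in query.lower().split():
        variants = index.get(soundex(term), set()) | {term}
        expanded.append(sorted(variants))
    return expanded


if __name__ == "__main__":
    # Hypothetical vocabulary of Romanized Hindi spellings.
    vocab = ["pyar", "pyaar", "mohabbat", "muhabbat", "zindagi", "jindagi"]
    index = build_phonetic_index(vocab)
    print(expand_query("pyaar mohabat", index))
    # Soundex keeps the first letter verbatim, so these two keys differ:
    print(soundex("zindagi"), soundex("jindagi"))
```

In this sketch, vowel-length and doubled-consonant variants such as "pyar"/"pyaar" and "mohabbat"/"muhabbat" share a key, while "zindagi" and "jindagi" do not, because Soundex preserves the initial letter. This is one kind of limitation a transliteration-aware encoding might aim to address.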

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 4
July 2021
419 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3465463
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2021
Accepted: 01 January 2021
Revised: 01 November 2020
Received: 01 April 2020
Published in TALLIP Volume 20, Issue 4

Author Tags

  1. Mixed script information retrieval
  2. transliteration
  3. query expansion
  4. phonetics

Qualifiers

  • Research-article
  • Refereed
