research-article

Query Expansion for Transliterated Text Retrieval

Authors:

Dinesh Kumar Prabhakar,

Chiranjeev Kumar

Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 4

Article No.: 64, Pages 1 - 34

https://doi.org/10.1145/3447649

Published: 20 July 2021

Abstract

With Web 2.0, the number of Web users and the volume of Web content have grown exponentially. Most of these users are not only consumers of information but also producers of it. They often write in their colloquial languages but in Roman script (transliteration). Such text is mostly informal and casual, and therefore seldom follows grammar rules. Moreover, transliterated text has no prescribed set of spelling rules. This freedom leads to large-scale spelling variation, which is a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach, in two variants, in light of Hindi transliteration. Experiments on Hindi song lyrics retrieval in the mixed-script domain with three different retrieval models show that the proposed approaches outperform the existing techniques in a majority of cases (sometimes with statistical significance) on metrics including nDCG@1, nDCG@5, nDCG@10, MAP, MRR, and Recall.
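To make the idea of phonetic encoding for spelling variation concrete, the following is a minimal sketch that uses classic American Soundex, one of the existing phonetic algorithms, to group Romanized spelling variants and expand a query with them. It is not the encoding proposed in this article, whose details are not given in the abstract. The function names (`soundex`, `build_index`, `expand_query`) and the sample vocabulary are illustrative assumptions.

```python
from collections import defaultdict

# Standard Soundex consonant classes; vowels, y, h and w carry no digit.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a Roman-script word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()          # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")  # its digit still suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue                   # h and w do not separate equal digits
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                   # a vowel resets prev to ""
    return (code + "000")[:4]          # pad or truncate to length 4


def build_index(vocabulary):
    """Map each phonetic code to the set of vocabulary terms sharing it."""
    index = defaultdict(set)
    for term in vocabulary:
        index[soundex(term)].add(term)
    return index


def expand_query(query, index):
    """Replace each query term with the sorted list of its phonetic variants."""
    expanded = []
    for term in query.split():
        variants = index.get(soundex(term), set())
        expanded.append(sorted(variants | {term}))
    return expanded


# Hypothetical vocabulary of Romanized Hindi spellings (illustration only).
vocabulary = ["mohabbat", "muhabbat", "mohabat", "zindagi", "jindagi", "dil"]
index = build_index(vocabulary)
expanded = expand_query("mohabbat zindagi", index)
```

In this sketch the three spellings of "mohabbat" share one code and are retrieved together, while "zindagi" and "jindagi" receive different codes because Soundex keeps the first letter unchanged. That is one well-known way a general-purpose phonetic code can miss transliteration variants.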

Index Terms

Query Expansion for Transliterated Text Retrieval
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query representation

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 20, Issue 4

July 2021

419 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3465463

Editor:
Imed Zitouni
Google, USA

Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2021

Accepted: 01 January 2021

Revised: 01 November 2020

Received: 01 April 2020

Qualifiers

Research-article
Refereed
