research-article

Query expansion for mixed-script information retrieval

Authors:

Rafael E. Banchs,

Monojit Choudhury,

Paolo Rosso

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

Pages 677 - 686

https://doi.org/10.1145/2600428.2609622

Published: 03 July 2014

Abstract

For many languages that use non-Roman indigenous scripts (e.g., Arabic, Greek, and Indic languages), one can often find a large amount of user-generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multilingual space with more than one script, which we refer to as the Mixed-Script space. IR in the mixed-script space is challenging because queries written in either the native or the Roman script need to be matched to documents written in both scripts. Moreover, transliterated content features extensive spelling variation. In this paper, we formally introduce the concept of Mixed-Script IR and, through analysis of the query logs of the Bing search engine, estimate the prevalence and thereby establish the importance of this problem. We also give a principled solution for handling mixed-script term matching and spelling variation, in which terms across the scripts are modelled jointly in a deep-learning architecture and can be compared in a low-dimensional abstract space. We present an extensive empirical analysis of the proposed method, along with evaluation results in an ad-hoc retrieval setting of mixed-script IR, where the proposed method achieves significantly better results (12% increase in MRR and 29% increase in MAP) than other state-of-the-art baselines.
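To make the idea of comparing terms "in a low-dimensional abstract space" concrete, the following is a minimal sketch of query expansion by nearest-neighbour lookup over term vectors. It is not the authors' model: the three-dimensional vectors are hand-picked toy values standing in for what a trained joint model might produce, and the names `EMBEDDINGS`, `expand_term`, `expand_query`, and the `0.95` similarity threshold are assumptions made for illustration only.

```python
import math

# Hypothetical toy embeddings (hand-picked, NOT learned). A trained joint
# model would be expected to place spelling variants and native-script
# forms of the same word close together in the shared space.
EMBEDDINGS = {
    "pyar":  [0.90, 0.10, 0.05],  # Roman-script spelling variant
    "pyaar": [0.88, 0.12, 0.04],  # another Roman-script variant
    "प्यार": [0.91, 0.09, 0.06],  # native-script (Devanagari) form
    "dil":   [0.10, 0.92, 0.05],  # Roman-script form of a different word
    "दिल":   [0.12, 0.90, 0.07],  # its native-script form
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_term(term, embeddings, threshold=0.95):
    """Return the other terms whose vectors lie within the similarity
    threshold of `term`, most similar first. Unknown terms expand to []."""
    if term not in embeddings:
        return []
    query_vec = embeddings[term]
    scored = [
        (other, cosine(query_vec, vec))
        for other, vec in embeddings.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, score in scored if score >= threshold]


def expand_query(query, embeddings, threshold=0.95):
    """Map each whitespace-separated query term to itself plus its
    cross-script and spelling-variant neighbours."""
    return {
        term: [term] + expand_term(term, embeddings, threshold)
        for term in query.split()
    }


if __name__ == "__main__":
    for term, variants in expand_query("dil pyar", EMBEDDINGS).items():
        print(term, "->", variants)
```

With these toy vectors, each query term is expanded with its spelling variants and its native-script equivalent, so a query typed in the Roman script could also be matched against documents in the native script. The threshold trades recall for precision and would need tuning on real data.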


Cited By

Chaudhary A, Pradhan R, Shekhar S (2024). A Novel Framework for Multilingual Script Detection and Pattern Analysis in Mixed Script Queries. International Journal of Experimental Research and Review, 43, pp. 214-228. Online publication date: 30-Sep-2024.
https://doi.org/10.52756/ijerr.2024.v43spl.016
M M, S A, Vijayan R (2023). Synonym Insensitive Searching: A Novel Synonym Weighted-Vector Space Model for Document Retrieval. 2023 2nd International Conference on Computational Systems and Communication (ICCSC), pp. 1-7. Online publication date: 3-Mar-2023.
https://doi.org/10.1109/ICCSC56913.2023.10142977
Chanda S, Pal S (2023). The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media. SN Computer Science, 4:5. Online publication date: 27-Jun-2023.
https://doi.org/10.1007/s42979-023-01942-7

Index Terms

Query expansion for mixed-script information retrieval
1. Information systems
  1. Information retrieval


Published In

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

July 2014

1330 pages

ISBN:9781450322577

DOI:10.1145/2600428

General Chairs: Shlomo Geva (Queensland University of Technology), Andrew Trotman (University of Dunedin)

Program Chairs: Peter Bruza (Queensland University of Technology), Charles L.A. Clarke (University of Waterloo), Kal Järvelin (University of Tampere)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SIGIR '14

Sponsor:

SIGIR

SIGIR '14: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 6 - 11, 2014

Gold Coast, Queensland, Australia

Acceptance Rates

SIGIR '14 Paper Acceptance Rate 82 of 387 submissions, 21%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%


Bibliometrics

Article Metrics

Total Citations: 48
Total Downloads: 914
Downloads (last 12 months): 28
Downloads (last 6 weeks): 11

Reflects downloads up to 05 Mar 2025


Citations

Cited By

Joshi P, Joshi D (2023). Code Mixed Information Retrieval for Gujarati Script News Articles. Advances in Computing and Data Sciences, pp. 265-276. Online publication date: 23-Jul-2023.
https://doi.org/10.1007/978-3-031-37940-6_22
Abdollah Pour M, Farinneya P, Toroghi A, Korikov A, Pesaranghader A, Sajed T, Bharadwaj M, Mavrin B, Sanner S (2023). Self-supervised Contrastive BERT Fine-tuning for Fusion-Based Reviewed-Item Retrieval. Advances in Information Retrieval, pp. 3-17. Online publication date: 2-Apr-2023.
https://dl.acm.org/doi/10.1007/978-3-031-28244-7_1
Satapara S, Modha B, Modha S, Mehta P (2022). FIRE 2022 ILSUM Track: Indian Language Summarization. Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, pp. 8-11. Online publication date: 9-Dec-2022.
https://dl.acm.org/doi/10.1145/3574318.3574328
Zhang Y, Jiang M, Zhao Q (2022). Query and Attention Augmentation for Knowledge-Based Explainable Reasoning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15555-15564. Online publication date: Jun-2022.
https://doi.org/10.1109/CVPR52688.2022.01513
Alqahtani A, Saravanan P, Maheswari M, Alshmrany S (2022). An automatic query expansion based on hybrid CMO-COOT algorithm for optimized information retrieval. The Journal of Supercomputing, 78:6, pp. 8625-8643. Online publication date: 12-Jan-2022.
https://doi.org/10.1007/s11227-021-04171-y
Kumar S, Kumar S, Pati J (2022). A Review on Transliterated Text Retrieval for Indian Languages. Proceedings of International Conference on Computational Intelligence, pp. 137-146. Online publication date: 4-Oct-2022.
https://doi.org/10.1007/978-981-19-2126-1_10
Prabhakar D, Pal S, Kumar C (2021). Query Expansion for Transliterated Text Retrieval. ACM Transactions on Asian and Low-Resource Language Information Processing, 20:4, pp. 1-34. Online publication date: 20-Jul-2021.
https://dl.acm.org/doi/10.1145/3447649
