DOI: 10.1145/2600428.2609622
research-article

Query expansion for mixed-script information retrieval

Published: 03 July 2014

Abstract

For many languages that use non-Roman indigenous scripts (e.g., Arabic, Greek and Indic languages), a large amount of user-generated transliterated content can be found on the Web in the Roman script. Such content creates a monolingual or multilingual space with more than one script, which we refer to as the Mixed-Script space. IR in the mixed-script space is challenging because queries written in either the native or the Roman script need to be matched to documents written in both scripts. Moreover, transliterated content features extensive spelling variation. In this paper, we formally introduce the concept of Mixed-Script IR and, through analysis of the query logs of the Bing search engine, estimate its prevalence and thereby establish the importance of the problem. We also give a principled solution for handling mixed-script term matching and spelling variation, in which terms across the scripts are modelled jointly in a deep-learning architecture and can be compared in a low-dimensional abstract space. We present an extensive empirical analysis of the proposed method, along with evaluation results in an ad-hoc retrieval setting of mixed-script IR, where the proposed method achieves significantly better results (12% increase in MRR and 29% increase in MAP) than other state-of-the-art baselines.
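
To make the idea of comparing terms in a shared low-dimensional space more concrete, the following is a minimal sketch of query expansion by cosine similarity. It is not the paper's model: the vectors in `TOY_SPACE` are hand-picked toy values standing in for the output of a jointly trained cross-script encoder, and the names `expand_term`, `expand_query` and the `threshold` parameter are hypothetical and introduced only for illustration.

```python
import math

# ASSUMPTION: hand-picked 3-d toy vectors, NOT learned by any model and NOT
# taken from the paper. They only mimic the property that spelling variants
# and native-script forms of one word lie close together in a shared space.
TOY_SPACE = {
    "pyar":  (0.90, 0.10, 0.05),   # Roman-script spelling
    "pyaar": (0.88, 0.12, 0.07),   # Roman-script spelling variant
    "प्यार":  (0.91, 0.09, 0.06),   # native (Devanagari) form
    "dil":   (0.10, 0.90, 0.10),   # unrelated word, Roman script
    "दिल":    (0.12, 0.88, 0.09),   # unrelated word, native script
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_term(term, space, threshold=0.95):
    """Return the other terms in `space` whose similarity to `term` is at
    least `threshold`, most similar first. Unknown terms get no expansion."""
    if term not in space:
        return []
    query_vec = space[term]
    scored = [
        (cosine(query_vec, vec), other)
        for other, vec in space.items()
        if other != term
    ]
    return [other for score, other in sorted(scored, reverse=True)
            if score >= threshold]


def expand_query(query, space, threshold=0.95):
    """Expand each whitespace-separated query term into a list holding the
    original term followed by its cross-script and spelling variants."""
    return [[term] + expand_term(term, space, threshold)
            for term in query.split()]


if __name__ == "__main__":
    for group in expand_query("pyar dil", TOY_SPACE):
        print(group)
```

In a real system the expanded groups would be passed to the retrieval engine as alternatives for each query term; the threshold trades recall of spelling variants against the risk of pulling in unrelated words.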

    Published In

    SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
    July 2014
    1330 pages
    ISBN: 9781450322577
    DOI: 10.1145/2600428

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep-learning
    2. mixed-script information retrieval
    3. transliteration

    Conference

    SIGIR '14

    Acceptance Rates

    SIGIR '14 Paper Acceptance Rate 82 of 387 submissions, 21%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2024) A Novel Framework for Multilingual Script Detection and Pattern Analysis in Mixed Script Queries. International Journal of Experimental Research and Review 43, 214-228. DOI: 10.52756/ijerr.2024.v43spl.016. Online publication date: 30-Sep-2024
    • (2023) Synonym Insensitive Searching: A Novel Synonym Weighted-Vector Space Model for Document Retrieval. 2023 2nd International Conference on Computational Systems and Communication (ICCSC), 1-7. DOI: 10.1109/ICCSC56913.2023.10142977. Online publication date: 3-Mar-2023
    • (2023) The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media. SN Computer Science 4:5. DOI: 10.1007/s42979-023-01942-7. Online publication date: 27-Jun-2023
    • (2023) Code Mixed Information Retrieval for Gujarati Script News Articles. Advances in Computing and Data Sciences, 265-276. DOI: 10.1007/978-3-031-37940-6_22. Online publication date: 23-Jul-2023
    • (2023) Self-supervised Contrastive BERT Fine-tuning for Fusion-Based Reviewed-Item Retrieval. Advances in Information Retrieval, 3-17. DOI: 10.1007/978-3-031-28244-7_1. Online publication date: 2-Apr-2023
    • (2022) FIRE 2022 ILSUM Track: Indian Language Summarization. Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 8-11. DOI: 10.1145/3574318.3574328. Online publication date: 9-Dec-2022
    • (2022) Query and Attention Augmentation for Knowledge-Based Explainable Reasoning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15555-15564. DOI: 10.1109/CVPR52688.2022.01513. Online publication date: Jun-2022
    • (2022) An automatic query expansion based on hybrid CMO-COOT algorithm for optimized information retrieval. The Journal of Supercomputing 78:6, 8625-8643. DOI: 10.1007/s11227-021-04171-y. Online publication date: 12-Jan-2022
    • (2022) A Review on Transliterated Text Retrieval for Indian Languages. Proceedings of International Conference on Computational Intelligence, 137-146. DOI: 10.1007/978-981-19-2126-1_10. Online publication date: 4-Oct-2022
    • (2021) Query Expansion for Transliterated Text Retrieval. ACM Transactions on Asian and Low-Resource Language Information Processing 20:4, 1-34. DOI: 10.1145/3447649. Online publication date: 20-Jul-2021
