skip to main content
10.1145/2600428.2610380acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
abstract

Modelling of terms across scripts through autoencoders

Published: 03 July 2014 Publication History

Abstract

cripts (e.g., Arabic, Greek and Indic languages) one can often find a large amount of user generated transliterated content on the Web in the Roman script. Such content creates a monolingual or cross-lingual space with more than one scripts which is referred as mixed-script space and information retrieval in this space is referred as mixed-script information retrieval (MSIR) [1]. In mixed-script space, the documents and queries may either be in the native script and/or the Roman transliterated script for a language (mono-lingual scenario). There can be further extension of MSIR such as multi-lingual MSIR in which terms can be in multiple scripts in multiple languages. Since there are no standard ways of spelling a word in a non-native script, transliteration content almost always features extensive spelling variations. This phenomenon presents a non-trivial term matching problem for search engines to match the native-script or Roman-transliterated query with the documents in multiple scripts taking into account the spelling variations. This problem, although prevalent inWeb search for users of many languages around the world, has received very little attention till date. Very recently we have formally defined the problem of MSIR and presented the quantitative study on it through Bing query log analysis.

References

[1]
P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In Proceedings of SIGIR, Gold Coast, Australia, 2014.
[2]
K. Knight and J. Graehl. Machine transliteration. Comput. Linguist., 24(4):599--612, Dec. 1998.
[3]
S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In Proceedings of IJCAI, pages 1360--1365, Barcelona, Spain, July 2011.
[4]
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of ICML, pages 689--696, Bellevue, USA, June 2011.

Index Terms

  1. Modelling of terms across scripts through autoencoders

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
    July 2014
    1330 pages
    ISBN:9781450322577
    DOI:10.1145/2600428
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 July 2014

    Check for updates

    Author Tags

    1. deep-learning
    2. mixed-script information retrieval
    3. term modelling

    Qualifiers

    • Abstract

    Conference

    SIGIR '14
    Sponsor:

    Acceptance Rates

    SIGIR '14 Paper Acceptance Rate 82 of 387 submissions, 21%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 184
      Total Downloads
    • Downloads (Last 12 months)1
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 05 Mar 2025

    Other Metrics

    Citations

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media