Modelling of terms across scripts through autoencoders

Author:

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

Page 1279

https://doi.org/10.1145/2600428.2610380

Published: 03 July 2014

Abstract

For many languages that use non-Roman based indigenous scripts (e.g., Arabic, Greek and Indic languages), one can often find a large amount of user-generated transliterated content on the Web in the Roman script. Such content creates a monolingual or cross-lingual space with more than one script, which is referred to as mixed-script space; information retrieval in this space is referred to as mixed-script information retrieval (MSIR) [1]. In mixed-script space, the documents and queries may be in the native script, the Roman transliterated script, or both, for a single language (the monolingual scenario). MSIR can be extended further, for example to multilingual MSIR, in which terms can be in multiple scripts in multiple languages. Since there is no standard way of spelling a word in a non-native script, transliterated content almost always features extensive spelling variations. This phenomenon presents a non-trivial term matching problem for search engines: matching a native-script or Roman-transliterated query against documents in multiple scripts while taking the spelling variations into account. This problem, although prevalent in Web search for users of many languages around the world, has received very little attention to date. Very recently we formally defined the problem of MSIR and presented a quantitative study of it through Bing query log analysis.
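
To make the term matching problem concrete, the following minimal sketch shows why exact string matching fails on transliteration spelling variants. It is not the method of this paper: the character-bigram Dice similarity is a generic stand-in chosen only for illustration, the function names are hypothetical, and the example spellings are assumed Roman renderings rather than data from the query log study. It also covers only Roman-script variants, not matching against native-script terms.

```python
def char_bigrams(term: str) -> set[str]:
    """Return the set of character bigrams of a term, with boundary markers."""
    padded = f"^{term.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets (1.0 means identical sets)."""
    x, y = char_bigrams(a), char_bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


# Assumed spelling variants of one transliterated word, plus an unrelated term.
query, variant, other = "pyar", "pyaar", "dil"

print(query == variant)                  # exact matching treats the variants as different terms
print(dice_similarity(query, variant))   # high bigram overlap between the two variants
print(dice_similarity(query, other))     # no bigram overlap with the unrelated term
```

A surface-level measure like this can group close Roman spellings, but it cannot relate a Roman-transliterated term to its native-script form, which is the harder part of the problem described above.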

References

[1] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In Proceedings of SIGIR, Gold Coast, Australia, 2014.

[2] K. Knight and J. Graehl. Machine transliteration. Comput. Linguist., 24(4):599–612, Dec. 1998.

[3] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In Proceedings of IJCAI, pages 1360–1365, Barcelona, Spain, July 2011.

[4] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of ICML, pages 689–696, Bellevue, USA, June 2011.

Index Terms

Modelling of terms across scripts through autoencoders
1. Information systems
  1. Information retrieval

Published In

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

July 2014

1330 pages

ISBN:9781450322577

DOI:10.1145/2600428

General Chairs: Shlomo Geva (Queensland University of Technology), Andrew Trotman (University of Dunedin)

Program Chairs: Peter Bruza (Queensland University of Technology), Charles L.A. Clarke (University of Waterloo), Kal Järvelin (University of Tampere)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Conference

SIGIR '14

Sponsor:

SIGIR

SIGIR '14: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 6 - 11, 2014

Queensland, Gold Coast, Australia

Acceptance Rates

SIGIR '14 Paper Acceptance Rate 82 of 387 submissions, 21%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%
