ABSTRACT
For many languages that use non-Roman based indigenous scripts (e.g., Arabic, Greek and Indic languages) one can often find a large amount of user-generated transliterated content on the Web in the Roman script. Such content creates a monolingual or cross-lingual space with more than one script, which is referred to as mixed-script space; information retrieval in this space is referred to as mixed-script information retrieval (MSIR) [1]. In mixed-script space, the documents and queries may be in the native script, the Roman transliterated script, or both, for a single language (the monolingual scenario). MSIR can be extended further, for example to multilingual MSIR, in which terms can be in multiple scripts in multiple languages. Since there is no standard way of spelling a word in a non-native script, transliterated content almost always features extensive spelling variation. This presents a non-trivial term matching problem for search engines, which must match a native-script or Roman-transliterated query with documents in multiple scripts while taking the spelling variations into account. This problem, although prevalent in Web search for users of many languages around the world, has received very little attention to date. Very recently we formally defined the problem of MSIR and presented a quantitative study of it through Bing query log analysis.
- P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In Proceedings of SIGIR, Gold Coast, Australia, 2014.
- K. Knight and J. Graehl. Machine transliteration. Comput. Linguist., 24(4):599--612, Dec. 1998.
- S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In Proceedings of IJCAI, pages 1360--1365, Barcelona, Spain, July 2011.
- J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of ICML, pages 689--696, Bellevue, USA, June 2011.
Index Terms
- Modelling of terms across scripts through autoencoders