Approximate String Matching Techniques for Effective CLIR Among Indian Languages

Makin, Ranbeer; Pandey, Nikita; Pingali, Prasad; Varma, Vasudeva

doi:10.1007/978-3-540-73400-0_54

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4578)

Included in the following conference series:

International Workshop on Fuzzy Logic and Applications

Abstract

The vocabulary commonly used in Indian-language documents on the web contains many words of Sanskrit, Persian, or English origin. However, such words may be written in different scripts, with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit the relatively large number of cognates shared among Indian languages, which is higher than the number shared between an Indian language and a non-Indian language. We present an approach to identify cognates and use them to improve dictionary-based cross-language information retrieval (CLIR) when the query and the documents are in two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries, and report the improvement due to cognate recognition and translation.
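
As a rough illustration of how approximate string matching can surface cognates, the sketch below scores candidate pairs with Jaro-Winkler similarity, one widely used measure. The abstract does not say which measure or threshold the authors use, so the choice of Jaro-Winkler, the function names (`jaro`, `jaro_winkler`, `find_cognates`), the 0.9 threshold, and the sample words are all assumptions made for illustration. It also assumes that terms from both languages have already been romanized into a common scheme, since Hindi and Telugu are written in different scripts.

```python
from typing import List, Tuple


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def find_cognates(query_term: str, vocabulary: List[str],
                  threshold: float = 0.9) -> List[Tuple[str, float]]:
    """Return vocabulary words scoring at or above threshold, best first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    scored = [(word, jaro_winkler(query_term, word)) for word in vocabulary]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical romanized terms: a Telugu query word and a few Hindi words.
    hindi_vocabulary = ["vidyalaya", "pustak", "kitab"]
    for word, score in find_cognates("vidyalayam", hindi_vocabulary):
        print(f"{word}\t{score:.3f}")
```

In a dictionary-based CLIR pipeline, a matcher of this kind could serve as a fallback for query terms the bilingual dictionary does not cover: a high-scoring cognate in the document vocabulary stands in for a missing translation. Whether that matches the authors' pipeline is not stated in the abstract.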

Author information

Authors and Affiliations

International Institute of Information Technology, Hyderabad, India
Ranbeer Makin, Nikita Pandey, Prasad Pingali & Vasudeva Varma

Editor information

Francesco Masulli, Sushmita Mitra, Gabriella Pasi

Cite this paper

Makin, R., Pandey, N., Pingali, P., Varma, V. (2007). Approximate String Matching Techniques for Effective CLIR Among Indian Languages. In: Masulli, F., Mitra, S., Pasi, G. (eds) Applications of Fuzzy Sets Theory. WILF 2007. Lecture Notes in Computer Science, vol 4578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73400-0_54

DOI: https://doi.org/10.1007/978-3-540-73400-0_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73399-7
Online ISBN: 978-3-540-73400-0
eBook Packages: Computer Science, Computer Science (R0)
