Abstract
Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
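The matching idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure or how adjacent and non-adjacent digrams are combined, so the Dice coefficient, the pooling of both digram types into one set, the skip distance of one character, and all function names and sample words below are assumptions made for illustration. "Binary" is read here as recording only the presence or absence of a digram, not its frequency.

```python
def digrams(word, skip=0):
    """Return the set of character pairs in `word` whose two members are
    separated by `skip` intervening characters (skip=0 means adjacent)."""
    step = skip + 1
    return {word[i] + word[i + step] for i in range(len(word) - step)}


def profile(word, skips=(0, 1)):
    """Pool the digram sets for every skip distance into one binary profile.
    Pooling into a single set is an assumption of this sketch."""
    grams = set()
    for skip in skips:
        grams |= digrams(word.lower(), skip)
    return grams


def similarity(source, target, skips=(0, 1)):
    """Dice coefficient between two digram profiles (assumed measure)."""
    a, b = profile(source, skips), profile(target, skips)
    if not a and not b:
        return 0.0  # words too short to yield any digram
    return 2 * len(a & b) / (len(a) + len(b))


def rank(source_key, index_terms, skips=(0, 1)):
    """Order target index terms by decreasing similarity to the source key."""
    return sorted(index_terms,
                  key=lambda term: similarity(source_key, term, skips),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical source key and a toy stand-in for a target index.
    toy_index = ["harmony", "pneumatic", "pneumonia"]
    for term in rank("pneumonie", toy_index):
        print(term, round(similarity("pneumonie", term), 3))
```

Passing `skips=(0,)` restricts the profile to adjacent digrams only, which gives a conventional digram baseline to compare against under the same assumptions.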
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Keskustalo, H., Pirkola, A., Visala, K., Leppänen, E., Järvelin, K. (2003). Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_19
DOI: https://doi.org/10.1007/978-3-540-39984-1_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive