Abstract
Technical term translations are important for cross-lingual information retrieval. In many languages, new technical terms share a common origin but are rendered with different spellings of the underlying sounds; such terms are known as cross-lingual spelling variants (CLSV).
To find the best CLSV in a text database index, we formulate the problem in a probabilistic framework and implement it as an instance of the general edit distance using weighted finite-state transducers. Estimating the costs for the general edit distance requires some training data. We demonstrate that, after some basic training, our new multilingual model is robust and requires little or no adaptation to cover additional languages, because the model takes advantage of language-independent transliteration patterns.
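To make the idea of a general edit distance concrete, the following is a minimal sketch, not the method of the article. It uses plain dynamic programming in place of weighted finite-state transducers, and every cost in `ILLUSTRATIVE_COSTS` is a made-up value chosen for illustration, whereas the article estimates its costs from training data. The function name, the `max_len` parameter, and the cost table layout are assumptions introduced here.

```python
from typing import Dict, Tuple

# Hypothetical edit costs, for illustration only. Keys are
# (source substring, target substring); "" stands for insertion or deletion.
ILLUSTRATIVE_COSTS: Dict[Tuple[str, str], float] = {
    ("c", "k"): 0.2,
    ("ph", "f"): 0.2,
    ("e", ""): 0.5,
}


def general_edit_distance(
    source: str,
    target: str,
    costs: Dict[Tuple[str, str], float],
    default_cost: float = 1.0,
    max_len: int = 2,
) -> float:
    """Cheapest way to rewrite source into target.

    Substring pairs listed in `costs` use the listed cost. Identical
    substrings cost nothing. Any other single-character substitution,
    insertion, or deletion costs `default_cost`.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            base = dist[i][j]
            if base == inf:
                continue
            # Try consuming a chars of source and b chars of target.
            for a in range(max_len + 1):
                if i + a > n:
                    break
                for b in range(max_len + 1):
                    if a == 0 and b == 0:
                        continue
                    if j + b > m:
                        break
                    s, t = source[i:i + a], target[j:j + b]
                    if s == t:
                        step = 0.0
                    elif (s, t) in costs:
                        step = costs[(s, t)]
                    elif a <= 1 and b <= 1:
                        step = default_cost
                    else:
                        continue  # unlisted multi-character edit
                    if base + step < dist[i + a][j + b]:
                        dist[i + a][j + b] = base + step
    return dist[n][m]
```

With an empty cost table the function reduces to the simple edit distance, which is the baseline the abstract compares against. Ranking every index term by `general_edit_distance(query, term, costs)` and taking the lowest-cost term is one way to pick a best CLSV candidate under these assumed costs.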
We train the model with medical terms in seven languages and test it with terms from varied domains in six languages, two of which are not in the training data. Against a large text database index, we achieve 64–78% precision at the point of 100% recall, a relative improvement of 22% over the simple edit distance.
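The evaluation measure named above can be sketched as follows. This is one common reading of "precision at the point of 100% recall" for a single ranked candidate list; the abstract does not say how the figure is averaged over queries, so the function names and the data in the usage note are assumptions for illustration, not results from the article.

```python
from typing import Iterable, Sequence


def precision_at_full_recall(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Precision at the shallowest rank where every relevant item is retrieved.

    Returns 0.0 if the ranking never reaches full recall.
    """
    wanted = set(relevant)
    if not wanted:
        raise ValueError("at least one relevant item is required")
    found = set()
    for rank, item in enumerate(ranked, start=1):
        if item in wanted:
            found.add(item)
            if len(found) == len(wanted):
                return len(found) / rank
    return 0.0


def relative_improvement(baseline: float, improved: float) -> float:
    """Improvement of `improved` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline
```

For example, if the only correct variant appears at rank 2 of the candidate list, `precision_at_full_recall(["x", "target", "y"], {"target"})` gives 0.5. Likewise, `relative_improvement(0.50, 0.61)` is about 0.22 for these made-up inputs.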
Cite this article
Lindén, K. Multilingual modeling of cross-lingual spelling variants. Inf Retrieval 9, 295–310 (2006). https://doi.org/10.1007/s10791-006-1541-5
DOI: https://doi.org/10.1007/s10791-006-1541-5