Translating cross-lingual spelling variants using transformation rules

https://doi.org/10.1016/j.ipm.2004.02.001

Abstract

Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first step, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second step, the intermediate forms obtained in the first step are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The two-step technique performed better, in some cases considerably better, than fuzzy matching alone. Even using the first step as such showed promising results.

Introduction

Technical terms and proper names are often central keys in requests for information. In dictionary-based cross-language information retrieval (CLIR) they constitute a major problem, since general translation dictionaries contain only the most commonly used terms and names. Untranslatable query keys are typically passed to the target language query in their original source language forms. Unless they are identical to the corresponding database index terms, they fail to match, causing a significant loss of retrieval effectiveness. However, technical terms and proper names in different languages often share the same Latin or Greek origin and are thus spelling variants of each other, such as German konstruktion and English construction. This allows the use of fuzzy matching (approximate string matching) techniques to find the target language correspondents of source language keys.

Approximate matching techniques include Soundex and Phonix, which compare words on the basis of their phonetic similarity (Gadd, 1990), edit distance (Zobel & Dart, 1996), and n-gram based matching (Robertson & Willett, 1998). In n-gram matching, text strings are decomposed into n-grams, i.e., substrings of length n, which usually consist of adjacent characters. The degree of similarity between two strings is computed from the number of n-grams they share and the total number of unique n-grams in the strings.
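The n-gram comparison described above can be sketched in a few lines of Python. This is only an illustration: the exact similarity formula is not given here, so the sketch assumes a ratio of shared n-grams to all unique n-grams, and the function names `ngrams` and `ngram_similarity` are our own.

```python
def ngrams(word: str, n: int = 2) -> set[str]:
    """Return the set of unique adjacent character n-grams in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by all unique n-grams (an assumed formula)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# The spelling variants from the Introduction share 7 of 13 unique digrams.
print(ngram_similarity("konstruktion", "construction"))
```

With digrams (n = 2), konstruktion and construction differ only in the digrams that contain k or c, which is why such variants still score well above unrelated words.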

Transliteration refers to phonetic translation across languages with different orthographies (Knight & Graehl, 1998), such as Arabic to English (Stalls & Knight, 1998) or Japanese to English (Qu, Grefenstette, & Evans, 2003). In this paper we present a novel two-step fuzzy translation technique for cross-lingual technical terms and proper names. It is similar to transliteration, but no phonetic elements are included. The technique bears some resemblance to the query translation and transliteration reported in Fujii and Ishikawa (2001). Fujii and Ishikawa use character-based rules to establish a mapping between English characters and romanized Japanese katakana characters. They also utilize probabilistic character-based language models, which can be seen as a variation of the fuzzy matching technique. Their technique, however, is designed for languages with different orthographies and thus differs in focus from ours.

In the first step of our technique, source language words are transformed into intermediate forms by means of transformation rules. The intermediate forms are often correct translations, or at least closer to their target language equivalents than the original source language words. We call this step transformation rule based translation (TRT). A transformation rule refers to an automatically extracted regular correspondence between the characters in two languages. For instance, the Spanish character string ia corresponds to the English character y, as in the term pair somatologia–somatology.
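As a rough illustration of how such a rule might be applied, consider the sketch below. The rule representation is hypothetical: the `Rule` tuple, its `position` field, and the functions `apply_rule` and `transform` are assumptions made for this example and are not the rule format used in the study.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str    # character string in the source language word
    target: str    # corresponding character string in the target language
    position: str  # hypothetical constraint: "start", "end", or "any"


def apply_rule(word: str, rule: Rule) -> str:
    """Apply one rule; return the word unchanged if the rule does not match."""
    if rule.position == "start":
        if word.startswith(rule.source):
            return rule.target + word[len(rule.source):]
        return word
    if rule.position == "end":
        if word.endswith(rule.source):
            return word[:len(word) - len(rule.source)] + rule.target
        return word
    return word.replace(rule.source, rule.target)


def transform(word: str, rules: list[Rule]) -> str:
    """Apply the rules in order to produce an intermediate form."""
    for rule in rules:
        word = apply_rule(word, rule)
    return word


# The Spanish-English correspondence mentioned in the text, restricted to word endings.
spanish_rules = [Rule("ia", "y", "end")]
print(transform("somatologia", spanish_rules))  # somatology
```

Restricting the rule to word endings is a design choice of this sketch; it keeps the rule from rewriting ia in the middle of a word.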

In the second step of fuzzy translation, the intermediate forms obtained in the first step are matched with their target language equivalents through fuzzy matching. The combined technique is beneficial in cases where TRT does not yield a correct translation but renders the source word more similar to its target language equivalent. This allows n-gram matching to rank the correct equivalents high.
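The benefit of an imperfect intermediate form can be illustrated with a small sketch. Everything here is an assumption made for illustration: the digram-based scoring formula, the toy target word list, and the intermediate form konstruction, which stands for a case where step one has rewritten kt as ct but left the initial k untouched.

```python
def digrams(word: str) -> set[str]:
    """Unique adjacent character pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def similarity(a: str, b: str) -> float:
    """Shared digrams over all unique digrams (assumed scoring formula)."""
    union = digrams(a) | digrams(b)
    return len(digrams(a) & digrams(b)) / len(union) if union else 0.0


def rank_targets(form: str, target_words: list[str]) -> list[tuple[str, float]]:
    """Rank target words by similarity to `form`, best first (ties alphabetical)."""
    scored = [(word, similarity(form, word)) for word in target_words]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy target word list, made up for illustration.
targets = ["constriction", "construction", "contribution", "destruction"]

source_word = "konstruktion"        # German source word
intermediate_form = "konstruction"  # assumed, only partly corrected output of step one

print(similarity(source_word, "construction"))
print(similarity(intermediate_form, "construction"))
print(rank_targets(intermediate_form, targets)[0][0])
```

In this toy setting the score of the correct equivalent rises from 7/13 for the source word to 9/11 for the intermediate form, even though the intermediate form is not itself a correct translation.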

The transformation rules were generated automatically by extracting equivalent term pairs from translation dictionaries. The terms were then aligned pairwise, and regular correspondences were identified using the edit distance measure. The rules were generated for five language pairs, with English always the target language and Finnish, French, German, Spanish, and Swedish the source languages.
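The idea of aligning term pairs and counting recurring correspondences can be sketched as follows. Note that this sketch substitutes Python's `difflib.SequenceMatcher` for the edit distance alignment used in the study, so it is a stand-in rather than the actual procedure; the function names and the two term pairs are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_correspondences(source: str, target: str) -> list[tuple[str, str]]:
    """Return (source_string, target_string) pairs where the aligned words differ."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "replace", "delete" and "insert" spans are candidate correspondences.
            pairs.append((source[i1:i2], target[j1:j2]))
    return pairs


def count_correspondences(term_pairs: list[tuple[str, str]]) -> Counter:
    """Count how often each correspondence occurs across all term pairs."""
    counts: Counter = Counter()
    for source, target in term_pairs:
        counts.update(extract_correspondences(source, target))
    return counts


# Toy term pairs; a real run would use pairs extracted from a dictionary.
term_pairs = [("somatologia", "somatology"), ("biologia", "biology")]
print(count_correspondences(term_pairs).most_common(1))
```

A correspondence that recurs across many term pairs, such as ia and y here, is a candidate for a transformation rule, while one-off differences are more likely to be noise.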

The effectiveness of the two-step fuzzy translation technique was evaluated by means of test words in five different domains. The intermediate forms obtained using TRT were matched through n-gram matching against an English target word list of 189,000 words, including the correct equivalents of the source words. As an evaluation measure we used precision at the rank where all the equivalents of the source words have been retrieved. We will demonstrate that the combined fuzzy translation technique performs better, sometimes considerably better, than n-grams alone.
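The evaluation measure can be made concrete with a short sketch. The function name and the handling of the case where an equivalent is never retrieved (returning 0.0) are assumptions of this example, and the ranking is assumed to contain no duplicate words.

```python
def precision_at_full_recall(ranked_words: list[str], equivalents: set[str]) -> float:
    """Precision at the rank where the last correct equivalent has been retrieved.

    Returns 0.0 if some equivalent never appears in the ranking (assumed convention).
    """
    remaining = set(equivalents)
    for rank, word in enumerate(ranked_words, start=1):
        remaining.discard(word)
        if not remaining:
            # All equivalents found within the top `rank` words.
            return len(equivalents) / rank
    return 0.0


# One correct equivalent found at rank 2 gives a precision of 1/2.
ranking = ["destruction", "construction", "constriction"]
print(precision_at_full_recall(ranking, {"construction"}))  # 0.5
```

For a source word with a single equivalent, the measure reduces to the reciprocal of the rank at which that equivalent is found.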

In Pirkola, Toivonen, Keskustalo, Visala, and Järvelin (2003) we presented first results on the fuzzy translation technique. In this paper we present more detailed results and extend the first study by exploring how effective TRT is as such when used alone without fuzzy matching. This is an important question when TRT applications are considered. We evaluated the effectiveness of TRT by considering what proportion of word forms obtained through TRT are correct translations (translation precision) and what proportion of source words are translated correctly (translation recall).
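The two TRT measures can be illustrated with the sketch below. It assumes that a source word counts as correctly translated when at least one of its generated forms is a correct equivalent; the function name, the data layout, and the example data are hypothetical.

```python
def translation_precision_recall(
    generated: dict[str, list[str]],
    gold: dict[str, set[str]],
) -> tuple[float, float]:
    """Return (translation precision, translation recall).

    precision = correct generated forms / all generated forms
    recall    = source words with at least one correct form / all source words
    """
    total_forms = 0
    correct_forms = 0
    translated_sources = 0
    for source, forms in generated.items():
        unique_forms = set(forms)
        correct = unique_forms & gold.get(source, set())
        total_forms += len(unique_forms)
        correct_forms += len(correct)
        if correct:
            translated_sources += 1
    precision = correct_forms / total_forms if total_forms else 0.0
    recall = translated_sources / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: three generated forms, one of them correct; two source words.
generated = {
    "somatologia": ["somatology", "somatologia"],
    "konstruktion": ["konstruction"],
}
gold = {"somatologia": {"somatology"}, "konstruktion": {"construction"}}
print(translation_precision_recall(generated, gold))
```

In this example one of the three generated forms is correct (precision 1/3) and one of the two source words is translated correctly (recall 1/2).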

The rest of this paper is organized as follows. Section 2 presents the methodology and data, and Section 3 the findings. Section 4 contains the discussion and conclusions.

Section snippets

Automatic generation of rules

This section describes the automatic rule generation process. Fig. 1 illustrates the process by means of examples. The process consisted of the following main steps:

  • Extracting similar terms from a dictionary.

  • Selection of transformations.

  • Generation of transformation rules.

Two-step fuzzy translation

For Swedish, transformation rules were produced from only 657 term pairs. Here the combined TRT and fuzzy matching technique was not useful: it performed as well as or slightly worse than fuzzy matching alone. The Swedish results suggest that the rules should be formed on the basis of thousands rather than hundreds of term pairs.

The results of fuzzy translation tests are presented in Tables 2–5 (HCF strategy) and Tables 6–9 (LCF strategy). There

Discussion and conclusions

Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries. This has a detrimental effect on CLIR performance, as such expressions are often central keys in queries. In this study we presented a novel fuzzy translation technique based on automatically generated transformation rules and fuzzy matching. Two translation strategies were tested. In the high confidence factor (HCF) strategy the aim was to minimize the number of incorrect transformations by

Acknowledgements

The Multilingual Medical Technical Dictionary (http://www.interfold.com/translator/) was provided by André Fairchild of Denver, Colorado, USA, whom we thank for permission to use it.

ENGTWOL morphological analyzer was used for the morphological analysis of the English data. ENGTWOL (Morphological Transducer Lexicon Description of English): Copyright (c) 1989–1992 Atro Voutilainen and Juha Heikkilä. TWOL-R (Run-Time Two-Level Program): Copyright (c) Kimmo

References (16)

  • U. Pfeifer et al. Retrieval effectiveness of proper name search methods. Information Processing & Management (1996).
  • C. Charras, T. Lecroq. Sequence comparison (1998). Available:...
  • M.A. Covington. An algorithm to align words for historical comparison. Computational Linguistics (1996).
  • A. Fujii et al. Japanese/English cross-language information retrieval: Exploration of query translation and transliteration. Computers and the Humanities (2001).
  • T. Gadd. Phonix: the algorithm. Program (1990).
  • H. Keskustalo et al. Non-adjacent digrams improve matching of cross-lingual spelling variants.
  • K. Knight et al. Machine transliteration. Computational Linguistics (1998).
  • C. Peters. Cross-language evaluation forum (CLEF) (2002). Available:...
There are more references available in the full text version of this article.


A shorter version of this paper was presented at the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, Canada, July 28–August 1, 2003).
