Transitive dictionary translation challenges direct dictionary translation in CLIR

https://doi.org/10.1016/j.ipm.2003.10.005

Abstract

The paper reports on experiments carried out in transitive translation, a branch of cross-language information retrieval (CLIR). By transitive translation we mean translation of search queries into the language of the document collection through an intermediate (or pivot) language. In our experiments, queries constructed from CLEF 2000 and 2001 Swedish, Finnish and German topics were translated into English through Finnish and Swedish by an automated translation process using morphological analyzers, stopword lists, electronic dictionaries, n-gramming of untranslatable words, and structured and unstructured queries. The results of the transitive runs were compared to the results of the bilingual runs, i.e. runs translating the same queries directly into English. The transitive runs using structured target queries performed well. The differences ranged from −6.6% to +2.9% units (or −25.5% to +7.8%) between the approaches. Thus transitive translation challenges direct translation and considerably simplifies global CLIR efforts.

Introduction

The amount of accessible electronic information has exploded in recent years thanks to the Internet and other international networks. Texts are written in a great diversity of languages, and the more languages there are, the more language barriers there are to be crossed. Thus it is understandable that cross-language information retrieval (CLIR) has become an important area in both research and practice. For overviews of CLIR, see Oard and Diekema (1998) and Pirkola, Hedlund, Keskustalo, and Järvelin (2001).

Information retrieval is traditionally based on matching the words of a query with the words of a document. In CLIR, this kind of direct matching is impossible because the query and the document collection are in different languages. Translation is needed: either the query has to be translated into the language of the documents or the documents have to be translated into the language of the query. Translating the whole document collection requires more resources, which is why query translation is more common in CLIR. The query in one language (called the source language) is translated into the language of the documents (called the target language). The basic methods in query translation are machine translation, corpus-based translation and dictionary-based translation (Hull & Grefenstette, 1996). There are also methods that bypass direct translation of query words or documents; see, e.g., recent work based on language models (Lavrenko, Choquette, & Croft, 2002).

Machine translation is not an ideal method of translating queries unless the queries are formulated as grammatically correct sentences. On the other hand, parallel or comparable corpora are seldom available in the topic areas of all queries. Dictionaries that can be used in CLIR are easier to find. These are usually bilingual machine readable dictionaries (MRDs) designed for a human reader and converted for CLIR purposes by removing superfluous material. Bilingual or multilingual thesauri have also been developed for CLIR purposes (see, e.g., Gilarranz, Gonzalo, & Verdejo, 1997). Translation in CLIR is a simpler process than what is normally meant by translation: query words are most often translated separately, one by one, without taking into consideration their relations to each other. When using an MRD, each word of a query is simply replaced with all of its translation equivalents in the target language. All the translation equivalents are taken into the final CLIR query (Ballesteros, 2000; Pirkola, 1998). Dictionary-based translation in CLIR has been used by a number of researchers, among them Ballesteros (2000), Ballesteros and Croft (1996, 1997, 1998), Gollins (2000), Gollins and Sanderson (2001), Hedlund, Keskustalo, Pirkola, Sepponen, and Järvelin (2001), Hull and Grefenstette (1996), Pirkola (1998), Pirkola et al. (2000, 2001), and Pirkola, Puolamäki, and Järvelin (2003).

However, it is not always easy to find suitable MRDs between languages. There are not always good dictionaries even between common European languages. Direct translation from language A into language B may therefore not be possible. However, there might be a dictionary between language A and language C, and one between language C and language B, which means that translation would be possible first from A into C and then from C into B. This kind of translation through an intermediate (also a pivot) language is called transitive translation.

One of the basic problems associated with MRD translation is translation ambiguity (Ballesteros, 2000; Pirkola et al., 2000). Natural language words often have more than one sense. When a word is translated, most often all the senses are automatically taken into the translated query even though not all of them are relevant. In dictionary-based CLIR, methods for choosing between senses to translate have been explored but have not yet proven effective (Sperer & Oard, 2000). If we translate, for example, the Swedish 2001 CLEF title “Reservat för valar” (“Reserve for whales”) into English, the inflected word form ‘valar’ is first normalized into the base form ‘val’ (noun singular nominative), which has three senses: (1) election, (2) whale, (3) selection, choice. If all these senses are included in the dictionary, we might have the following words in the target language query: election poll whale choice choosing selecting selection, of which only whale is correct.
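The word-by-word replacement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: the lookup tables are toy stand-ins for a morphological analyzer, a stopword list and an MRD, the function name `translate_query` is hypothetical, and the "reservat" entry is an assumption made for the example.

```python
# Toy resources (assumptions for illustration, not a real MRD or analyzer).
# The "val" entry follows the target words listed in the text.
SV_EN = {
    "reservat": ["reserve"],
    "val": ["election", "poll", "whale", "choice",
            "choosing", "selecting", "selection"],
}
SV_STOPWORDS = {"för"}
# Stand-in for a morphological analyzer: inflected form -> base form.
SV_BASE_FORMS = {"valar": "val"}


def translate_query(words, base_forms, stopwords, dictionary):
    """Replace each query word with all of its translation equivalents."""
    target_query = []
    for word in words:
        word = word.lower()
        if word in stopwords:
            continue  # stopwords are dropped before translation
        base = base_forms.get(word, word)  # normalize to base form
        # Every equivalent is kept; no sense disambiguation is attempted.
        target_query.extend(dictionary.get(base, []))
    return target_query


english = translate_query(
    "Reservat för valar".split(), SV_BASE_FORMS, SV_STOPWORDS, SV_EN
)
```

Here `english` holds one word for "reservat" followed by all seven equivalents of "val", so six of the eight target words come from senses the topic never intended.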

Ambiguity may occur at every stage of the translation process because the query words in the source, pivot and target languages may all be ambiguous. The number of irrelevant words in a query is likely to increase every time a translation is performed. It is easy to imagine that ambiguity would be a problem of transitive translation in particular because of the additional translation phases needed.
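How an extra translation phase can add irrelevant words is easy to see in a small sketch. The dictionaries below are hypothetical toy entries (German "Wal" through Swedish "val" into English), and `transitive_translate` is an illustrative helper, not a function from the study.

```python
# Hypothetical toy dictionaries, for illustration only.
DE_SV = {"wal": ["val"]}          # German -> Swedish (pivot)
SV_EN = {"val": ["election", "poll", "whale", "choice",
                 "choosing", "selecting", "selection"]}  # Swedish -> English
DE_EN = {"wal": ["whale"]}        # German -> English (direct)


def translate_word(word, dictionary):
    """Return all translation equivalents of a word (empty if unknown)."""
    return list(dictionary.get(word, []))


def transitive_translate(word, source_to_pivot, pivot_to_target):
    """Translate via a pivot language, keeping every equivalent at each step."""
    targets = []
    for pivot_word in translate_word(word, source_to_pivot):
        for target_word in translate_word(pivot_word, pivot_to_target):
            if target_word not in targets:  # avoid duplicates, keep order
                targets.append(target_word)
    return targets


direct = translate_word("wal", DE_EN)
transitive = transitive_translate("wal", DE_SV, SV_EN)
```

In this toy case the direct route yields a single word, while the route through the ambiguous pivot word yields seven, only one of which is relevant.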

In most cases transitive translation has performed worse than bilingual translation. This is probably because of the ambiguity introduced by double translation. Some techniques have been experimented with to improve the performance of transitive translation. Gollins and Sanderson (2001) tried to solve the problem of ambiguity in transitive translation through triangulation, i.e. by using several translation routes. They used several pivot languages and merged the translation results from the different routes. This indeed had a favourable effect. On the whole, the effectiveness was low, mainly because of the poor translation resources used. Ballesteros (2000), for her part, reduced the ambiguity of transitive translation by query structuring and various expansion techniques. In this paper we study how well transitive translation performs compared to the baseline direct bilingual translation when morphological analyzers, electronic dictionaries, stopword lists and n-gramming of untranslatable words are used in the translation process. The choice of languages to be used as source and pivot languages and the use/non-use of structured target queries are the variables tested in this study. The effect of triangulation is also tested.
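Triangulation can be illustrated with a short sketch. The text says only that the results from different routes were merged; the version below assumes one plausible merge rule, keeping the target words that every route agrees on and falling back to the union when the routes share nothing. The function name `triangulate` and the example word lists are hypothetical.

```python
def triangulate(route_translations):
    """Merge target-word lists from several pivot routes.

    Assumed rule: keep the words proposed by every route; if the routes
    have nothing in common, fall back to the union of all proposals.
    """
    sets = [set(words) for words in route_translations if words]
    if not sets:
        return set()
    common = set.intersection(*sets)
    return common if common else set.union(*sets)


# Toy outputs of two routes for the same source word (illustrative only).
via_swedish = ["election", "poll", "whale", "choice",
               "choosing", "selecting", "selection"]
via_finnish = ["whale"]

merged = triangulate([via_swedish, via_finnish])
```

With these toy lists the ambiguous Swedish route is filtered by the Finnish one, leaving only the shared word in `merged`.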

This paper does not introduce new techniques to CLIR. However, the present combination of techniques has not been used before in transitive CLIR (only in direct CLIR). In particular, compound splitting and component translation, as well as n-gram translation of problem words have not been used as transitive CLIR techniques. They are, however, interesting components in the process since

  • compound splitting in the transitive process gives rise to much ambiguity, which needs to be managed;

  • problem words (proper names) are translated directly from the source language into the target language by n-gram matching in the target language index, i.e. avoiding transitive translation; this is both necessary (there is no pivot word list) and effective.
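The n-gram matching of problem words mentioned in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: character digrams and the Dice coefficient are chosen here for simplicity and may differ from the matching actually used in the study, and the index terms and the helper names `char_ngrams`, `dice` and `best_matches` are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b):
    """Dice coefficient between two n-gram sets."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def best_matches(source_word, index_terms, n=2, top_k=3):
    """Rank target-index terms by n-gram similarity to the source word."""
    source_grams = char_ngrams(source_word, n)
    scored = [(dice(source_grams, char_ngrams(term, n)), term)
              for term in index_terms]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for score, term in scored[:top_k] if score > 0]


# A source-language spelling of a proper name matched against a toy
# target-language index; no pivot language is involved.
toy_index = ["election", "yeltsin", "berlin", "whale"]
candidates = best_matches("Jeltsin", toy_index, top_k=1)
```

Because the source word is compared directly with the target index, a name missing from every dictionary can still reach the target query without passing through the pivot language.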


We are able to

  • confirm earlier results for new language pairs;

  • show that transitive CLIR is a competitive technique, with the present mix of tools, at a much higher performance level than reported previously;

  • show that at high performance levels triangulation is only useful in the case of unstructured queries. With structured queries it is not helpful.


This paper is organized as follows: methods and data used are presented in Section 2, and findings in Section 3. In Section 4 the findings are further discussed, and some suggestions for future research are given. Section 5 concludes the paper.

The test database

As a test collection we used the English collection of CLEF, which contains newspaper articles from the Los Angeles Times and consists of 113,005 indexed documents. CLEF provided 33 test topics in the year 2000 campaign and 47 test topics in the 2001 campaign. These topics (the Finnish, Swedish and German versions), together with CLEF relevance assessments against the Los Angeles Times collection, were used in the tests. The two topic

Findings

Altogether eight transitive runs were carried out, two using Swedish, two Finnish and four German as the source language. The effectiveness results of the runs are presented in Tables 1 and 2. The results were evaluated as average precision over 10 recall points (10–100%), using the deval evaluation program of InQuery. The results of the baseline direct translations and the monolingual English runs are also given, as well as the differences between the bilingual and the transitive runs. The

Discussion

Transitive translation, i.e. translation through an intermediate language, may be the only means of translation between two languages when there is a lack of suitable translation resources between the languages. Second, it may reduce the number of translation routes needed when translations have to be performed between a large number of languages. If there are, for example, 50 languages and a translation system is needed between each pair of these, there will be no less than 2450 translation
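The figure of 2450 quoted above is the number of ordered language pairs among 50 languages. The short check below reproduces it; the single-pivot count is an added assumption for comparison (one directed dictionary into and one out of a shared pivot for every other language) and is not taken from the text. The function names are hypothetical.

```python
def direct_routes(n_languages):
    """Directed translation routes when every ordered pair is covered."""
    return n_languages * (n_languages - 1)


def pivot_routes(n_languages):
    """Directed routes when all translation goes through one pivot language.

    Assumption: the pivot is one of the n languages, and each of the other
    n - 1 languages needs a route into the pivot and a route out of it.
    """
    return 2 * (n_languages - 1)


n = 50
all_pairs = direct_routes(n)   # 50 * 49
via_pivot = pivot_routes(n)    # 2 * 49
```

Under these assumptions the number of routes grows quadratically for direct translation but only linearly when a pivot is used.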

Conclusions

In this study, transitive translations were carried out using three source languages, Swedish, Finnish and German, two pivot languages, Finnish and Swedish, and English as a target language. The results of the transitive translations were compared to the results of the direct translations between the three source languages and the target language. The transitive translations performed better than expected given the results of previous transitive translation studies. The difference from the

Acknowledgements

The InQuery search engine was provided by the Center for Intelligent Information Retrieval at the University of Massachusetts.

ENGTWOL (Morphological Transducer Lexicon Description of English): Copyright (c) 1989–1992 Arto Voutilainen and Juha Heikkilä.

FINTWOL (Morphological Description of Finnish): Copyright (c) Kimmo Koskenniemi and Lingsoft plc. 1983–1993.

GERTWOL (Morphological Transducer Lexicon Description of German): Copyright (c) 1997 Kimmo Koskenniemi and Lingsoft plc.

SWETWOL

References (24)

  • A. Pirkola et al. Applying query structuring in cross-language retrieval. Information Processing and Management (2003)
  • L.A. Ballesteros. Cross language retrieval via transitive translation
  • L. Ballesteros et al. Dictionary methods for cross-lingual information retrieval
  • L. Ballesteros et al. Phrasal translation and query expansion techniques for cross-language information retrieval
  • L. Ballesteros et al. Resolving ambiguity for cross-language retrieval
  • J. Broglio et al. INQUERY system overview
  • E. Cosijn et al. Information access in indigenous languages: a case study in Zulu
  • Gilarranz, J., Gonzalo, J., & Verdejo, F. (1997). An approach to conceptual text retrieval using the EuroWordNet...
  • Gollins, T. J. (2000). Dictionary-based transitive cross-language information retrieval using lexical triangulation....
  • T. Gollins et al. Improving cross language retrieval with triangulated translation
  • T. Hedlund et al. Bilingual tests with Swedish, Finnish and German queries: dealing with morphology, compound words and query structure
  • D. Hull et al. Querying across languages: a dictionary-based approach to multilingual information retrieval
