Transitive dictionary translation challenges direct dictionary translation in CLIR
Introduction
The amount of accessible electronic information has exploded in recent years thanks to the Internet and other international networks. The texts are written in a great diversity of languages, and the more languages there are, the more language barriers there are to be crossed. It is therefore understandable that cross-language information retrieval (CLIR) has become an important area in both research and practice. For overviews of CLIR, see Oard and Diekema (1998); Pirkola, Hedlund, Keskustalo, and Järvelin (2001).
Information retrieval is traditionally based on matching the words of a query with the words of a document. In CLIR, this kind of direct matching is impossible because the query and the document collection are in different languages. Translation is needed: either the query has to be translated into the language of the documents or the documents have to be translated into the language of the query. Translating the whole document collection requires more resources, which is why query translation is more common in CLIR. The query in one language (the source language) is translated into the language of the documents (the target language). The basic methods in query translation are machine translation, corpus-based translation and dictionary-based translation (Hull & Grefenstette, 1996). There are also methods that bypass direct translation of query words or documents; see, e.g., recent work based on language models (Lavrenko, Choquette, & Croft, 2002).
Machine translation is not an ideal method of translating queries unless the queries are formulated as grammatically correct sentences. Parallel or comparable corpora, on the other hand, are seldom available in the topic areas of all queries. Dictionaries that can be used in CLIR are easier to find. These are usually bilingual machine-readable dictionaries (MRDs) designed for a human reader and converted for CLIR purposes by removing superfluous material. Bilingual or multilingual thesauri have also been developed for CLIR purposes (see, e.g., Gilarranz, Gonzalo, & Verdejo, 1997). Translation in CLIR is a simpler process than what is normally meant by translation: query words are most often translated separately, one by one, without taking into consideration their relations to each other. When using an MRD, each word of a query is simply replaced with all of its translation equivalents in the target language. All the translation equivalents are taken into the final CLIR query (Ballesteros, 2000; Pirkola, 1998). Dictionary-based translation in CLIR has been used by a number of researchers, among them Ballesteros (2000), Ballesteros and Croft (1996, 1997, 1998), Gollins (2000), Gollins and Sanderson (2001), Hedlund, Keskustalo, Pirkola, Sepponen, and Järvelin (2001), Hull and Grefenstette (1996), Pirkola (1998), Pirkola et al. (2000, 2001), and Pirkola, Puolamäki, and Järvelin (2003).
However, it is not always easy to find suitable MRDs between languages. Good dictionaries do not always exist even between common European languages. Direct translation from language A into language B may therefore not be possible. There might, however, be a dictionary between language A and language C, and one between language C and language B, which means that translation would be possible first from A into C and then from C into B. This kind of translation through an intermediate language (also called a pivot language) is called transitive translation.
One of the basic problems associated with MRD translation is translation ambiguity (Ballesteros, 2000; Pirkola et al., 2000). Natural language words often have more than one sense. When a word is translated, most often all the senses are automatically taken into the translated query even though not all of them are relevant. In dictionary-based CLIR, methods for choosing between senses to translate have been explored but have not yet proven effective (Sperer & Oard, 2000). If we translate, for example, the Swedish 2001 CLEF title “Reservat för valar” (“Reserve for whales”) into English, the inflected word form 'valar' is first normalized into the base form 'val' (noun, singular, nominative), which has three senses: (1) election, (2) whale, (3) selection, choice. If all these senses are included in the dictionary, we might have the following words in the target language query: election, poll, whale, choice, choosing, selecting, selection, of which only whale is correct.
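The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables `BASE_FORMS`, `STOPWORDS` and `SV_EN` are hypothetical stand-ins for a morphological analyzer, a stopword list and a machine-readable dictionary, and the single translation given for 'reservat' is assumed for the sake of the example.

```python
from typing import Dict, List

# Hypothetical lookup tables for illustration only; a real system would use
# a morphological analyzer, a stopword list and a full dictionary.
BASE_FORMS: Dict[str, str] = {"valar": "val"}
STOPWORDS = {"för"}
SV_EN: Dict[str, List[str]] = {
    "reservat": ["reserve"],  # assumed entry
    "val": ["election", "poll", "whale", "choice",
            "choosing", "selecting", "selection"],
}


def translate_query(query: str) -> List[str]:
    """Normalize each word, drop stopwords, keep every translation equivalent."""
    target: List[str] = []
    for token in query.lower().split():
        if token in STOPWORDS:
            continue
        base = BASE_FORMS.get(token, token)
        # Words missing from the dictionary are passed through unchanged.
        target.extend(SV_EN.get(base, [base]))
    return target


english_query = translate_query("Reservat för valar")
print(english_query)
```

Of the eight words in the resulting query, only 'reserve' and 'whale' reflect the intended meaning of the title; the rest are noise introduced by the ambiguity of 'val'.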
Ambiguity may occur at every stage of the translation process because the query words in the source, pivot and target languages may all be ambiguous. The number of irrelevant words in a query is likely to increase every time a translation is performed. It is easy to imagine that ambiguity would be a problem of transitive translation in particular because of the additional translation phases needed.
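How the word count can grow across the two translation phases may be illustrated with a minimal sketch. The two toy dictionaries below (Swedish to Finnish, Finnish to English) are hypothetical and far smaller than any real dictionary; they serve only to show the mechanism, not to reproduce the resources used in the study.

```python
from typing import Dict, List


def translate(words: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace every word with all of its equivalents, dropping duplicates."""
    out: List[str] = []
    for word in words:
        # Words missing from the dictionary are passed through unchanged.
        for equivalent in dictionary.get(word, [word]):
            if equivalent not in out:
                out.append(equivalent)
    return out


# Hypothetical toy dictionaries, illustrative only.
SV_FI = {"val": ["vaali", "valas", "valinta"]}
FI_EN = {
    "vaali": ["election", "poll"],
    "valas": ["whale"],
    "valinta": ["choice", "selection"],
}

pivot_words = translate(["val"], SV_FI)        # source -> pivot
target_words = translate(pivot_words, FI_EN)   # pivot -> target
print(len(pivot_words), pivot_words)
print(len(target_words), target_words)
```

In this toy case one source word becomes three pivot words and five target words, of which a single one is the intended sense.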
In most cases transitive translation has performed worse than bilingual translation, probably because of the ambiguity introduced by double translation. Several techniques have been tried to improve the performance of transitive translation. Gollins and Sanderson (2001) tried to solve the problem of ambiguity in transitive translation through triangulation, i.e. by using several translation routes. They used several pivot languages and merged the translation results from the different routes. This indeed had a favourable effect, although overall effectiveness remained low, mainly because of the poor translation resources used. Ballesteros (2000), for her part, reduced the ambiguity of transitive translation by query structuring and various expansion techniques. In this paper we study how well transitive translation performs compared to the baseline direct bilingual translation when morphological analyzers, electronic dictionaries, stopword lists and n-gramming of untranslatable words are used in the translation process. The variables tested in this study are the choice of source and pivot languages and the use or non-use of structured target queries. The effect of triangulation is also tested.
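The idea of triangulation can be sketched as follows. The merge rule used here, keeping only the target words produced by every route, is an assumption made for illustration; the text above says only that the results of the routes were merged. The German source word and all dictionary entries are hypothetical toy data.

```python
from typing import Dict, List, Set

Dictionary = Dict[str, List[str]]


def transitive(word: str, first: Dictionary, second: Dictionary) -> Set[str]:
    """Translate source -> pivot -> target, keeping all equivalents."""
    result: Set[str] = set()
    for pivot_word in first.get(word, []):
        result.update(second.get(pivot_word, []))
    return result


def triangulate(routes: List[Set[str]]) -> Set[str]:
    """Keep only target words on which all routes agree (assumed merge rule)."""
    if not routes:
        return set()
    return set.intersection(*routes)


# Hypothetical toy dictionaries for two routes from German to English.
DE_SV = {"wahl": ["val"]}
SV_EN = {"val": ["election", "poll", "whale", "choice", "selection"]}
DE_FI = {"wahl": ["vaali", "valinta"]}
FI_EN = {"vaali": ["election", "poll"], "valinta": ["choice", "selection"]}

via_swedish = transitive("wahl", DE_SV, SV_EN)
via_finnish = transitive("wahl", DE_FI, FI_EN)
merged = triangulate([via_swedish, via_finnish])
print(sorted(merged))
```

In this toy case the spurious sense 'whale', which enters only through the Swedish pivot, is filtered out because the Finnish route does not confirm it.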
This paper does not introduce new techniques to CLIR. However, the present combination of techniques has not been used before in transitive CLIR (only in direct CLIR). In particular, compound splitting and component translation, as well as n-gram translation of problem words have not been used as transitive CLIR techniques. They are, however, interesting components in the process since
- compound splitting in the transitive process gives rise to much ambiguity, which needs to be managed;
- problem words (proper names) are translated directly from the source language into the target language by n-gram matching in the target language index, i.e. avoiding transitive translation; this is both necessary (there is no pivot word list) and effective.
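The n-gram matching of problem words can be illustrated with a small sketch. The choice of character digrams, the Dice coefficient as the similarity measure, and the tiny `index_words` list are all assumptions made for this example; the text above does not specify these details.

```python
from typing import List, Set, Tuple


def ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient between two n-gram sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def best_matches(source_word: str, index_words: List[str],
                 top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank target index words by n-gram similarity to an untranslated word."""
    source_grams = ngrams(source_word)
    scored = [(w, dice(source_grams, ngrams(w))) for w in index_words]
    # Highest score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical target language index and an untranslatable source word.
index_words = ["california", "caliph", "carolina", "whale"]
matches = best_matches("Kalifornia", index_words)
print(matches)
```

Because the source word is compared directly against the target index, no pivot language entry is needed for it.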
We are able to
- confirm earlier results for new language pairs;
- show that transitive CLIR is a competitive technique, with the present mix of tools, at a much higher performance level than reported previously;
- show that at high performance levels triangulation is only useful in the case of unstructured queries. With structured queries it is not helpful.
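The difference between unstructured and structured target queries can be sketched as below. This is an assumption-laden illustration: grouping the translation equivalents of each source word into a synonym set is one common way of structuring a dictionary-translated query, and the `#sum` and `#syn` operator names follow the InQuery query language as far as we are aware; the exact syntax accepted by a given InQuery version is not confirmed here. The function `build_queries` and the toy translations are hypothetical.

```python
from typing import Dict, List, Tuple


def build_queries(translations: Dict[str, List[str]]) -> Tuple[str, str]:
    """Build an unstructured and a structured query from per-word translations."""
    # Unstructured: every translation equivalent is an independent query term.
    flat = [w for equivalents in translations.values() for w in equivalents]
    unstructured = "#sum(" + " ".join(flat) + ")"

    # Structured: equivalents of one source word form a single synonym set.
    groups: List[str] = []
    for equivalents in translations.values():
        if len(equivalents) == 1:
            groups.append(equivalents[0])
        else:
            groups.append("#syn(" + " ".join(equivalents) + ")")
    structured = "#sum(" + " ".join(groups) + ")"
    return unstructured, structured


# Hypothetical translations of two source words.
translations = {
    "reservat": ["reserve"],
    "val": ["election", "poll", "whale"],
}
unstructured, structured = build_queries(translations)
print(unstructured)
print(structured)
```

In the structured form an ambiguous source word contributes one grouped term, however many equivalents it has, rather than one term per equivalent.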
This paper is organized as follows: methods and data used are presented in Section 2, and findings in Section 3. In Section 4 the findings are further discussed, and some suggestions for future research are given. Section 5 concludes the paper.
Section snippets
The test database
As a test collection we used the English collection of CLEF, which contains newspaper articles from the Los Angeles Times and consists of 113,005 indexed documents. CLEF provided 33 test topics in the year 2000 campaign and 47 test topics in the 2001 campaign. These topics (the Finnish, Swedish and German versions), together with CLEF relevance assessments against the Los Angeles Times collection, were used in the tests. The two topic
Findings
Altogether eight transitive runs were carried out, two using Swedish, two Finnish and four German as the source language. The effectiveness results of the runs are presented in Tables 1 and 2. The results were evaluated as average precision over 10 recall points (10–100%), using the deval evaluation program of InQuery. The results of the baseline direct translations and the monolingual English runs are also given, as well as the differences between the bilingual and the transitive runs. The
Discussion
Transitive translation, i.e. translation through an intermediate language, may be the only means of translation between two languages when there is a lack of suitable translation resources between the languages. Second, it may reduce the number of translation routes needed when translations have to be performed between a large number of languages. If there are, for example, 50 languages and a translation system is needed between each pair of these, there will be no fewer than 2450 translation
Conclusions
In this study, transitive translations were carried out using three source languages, Swedish, Finnish and German, two pivot languages, Finnish and Swedish, and English as a target language. The results of the transitive translations were compared to the results of the direct translations between the three source languages and the target language. The transitive translations performed better than expected given the results of previous transitive translation studies. The difference from the
Acknowledgements
The InQuery search engine was provided by the Center for Intelligent Information Retrieval at the University of Massachusetts.
ENGTWOL (Morphological Transducer Lexicon Description of English): Copyright (c) 1989–1992 Arto Voutilainen and Juha Heikkilä.
FINTWOL (Morphological Description of Finnish): Copyright (c) Kimmo Koskenniemi and Lingsoft plc. 1983–1993.
GERTWOL (Morphological Transducer Lexicon Description of German): Copyright (c) 1997 Kimmo Koskenniemi and Lingsoft plc.
SWETWOL
References (24)
- et al. Applying query structuring in cross-language retrieval. Information Processing and Management (2003)
- Cross language retrieval via transitive translation
- et al. Dictionary methods for cross-lingual information retrieval
- et al. Phrasal translation and query expansion techniques for cross-language information retrieval
- et al. Resolving ambiguity for cross-language retrieval
- et al. INQUERY system overview
- et al. Information access in indigenous languages: a case study in Zulu
- Gilarranz, J., Gonzalo, J., & Verdejo, F. (1997). An approach to conceptual text retrieval using the EuroWordNet...
- Gollins, T. J. (2000). Dictionary-based transitive cross-language information retrieval using lexical triangulation....
- et al. Improving cross language retrieval with triangulated translation
- Bilingual tests with Swedish, Finnish and German queries: dealing with morphology, compound words and query structure
- Querying across languages: a dictionary-based approach to multilingual information retrieval