1 Introduction

Cross-language information retrieval (CLIR) is an important technique that aims to enable universal access to information across languages, allowing users to find documents written in languages other than that of their query. Due to its importance, CLIR has been studied extensively. Most existing work on CLIR assumes the availability of at least some reasonable linguistic resources, such as bilingual dictionaries or parallel corpora, but creating such resources is expensive and requires manual effort.

In this paper, we study the feasibility of doing CLIR using comparable corpora alone, without relying on other rich linguistic resources. Comparable corpora are collections of text documents in two different languages that cover similar topics. For example, news articles published in two different languages in the same time period naturally form comparable corpora, as major events tend to be covered by news articles in multiple languages; of course, such articles form a comparable corpus only if they are topically related rather than covering completely different subjects. Comparable corpora are expected to become increasingly available due to the rapid growth of online text data such as news reports, blog articles, and forum posts.

Our aim in this study is to evaluate the feasibility of using only comparable corpora to do CLIR and to study the effectiveness of different ways of leveraging comparable corpora for CLIR. Tao and Zhai (2005) have shown that Pearson correlations between the frequency distributions of words in different languages in a comparable corpus can be used to discover related words. We would like to see whether such word relations are reliable enough for doing CLIR. One question here is how to incorporate word correlations into a CLIR model. We study this issue in the language modeling framework: we first use cross-lingual word relations to translate a query from one language to another, then estimate a query language model in the target language and use a standard retrieval method to score documents in that language.

A key step in our approach is to transform word associations extracted from comparable corpora into probabilities and construct a query language model in the target language accordingly. There are two challenges here. First, unlike word translation relations obtained from a bilingual dictionary, the word associations learned from comparable corpora are inherently unreliable. Second, the number of word association pairs that can be reliably learned from comparable corpora is relatively small, so it is important to address the issue of coverage. We propose and study methods for addressing these two challenges. For the first challenge, we propose to transform the word association scores using a non-linear transformation function, which allows us to fully exploit all the potential word associations while controlling the influence of unreliable ones. For the second challenge, we apply a probabilistic propagation method proposed in Cao et al. (2007) that exploits word co-occurrences in monolingual data in the two languages, as well as word associations extracted from time correlations, to alleviate data sparseness and better estimate the target query language models.

We use the data set of the TREC-2002 Arabic–English retrieval task (Oard and Gey 2002) in our experiments. For the comparable corpus, we use an Arabic–English corpus built from news articles published by the Agence France Presse and Xinhua news agencies between 1994 and 2004, aligned by date of publication. Evaluation results on this data set show that, compared to the monolingual baseline, the basic CLIR method achieves up to 64.3 % of monolingual mean average precision (MAP), 67.7 % of monolingual precision at 5 documents (Prec@5), 69.4 % of monolingual precision at 10 documents (Prec@10) and 84.7 % of monolingual recall, as shown in Sect. 4.3.1. Appropriate transformation of the raw correlation scores improves the performance to 70.8 % of monolingual MAP, 70.6 % of monolingual Prec@5, 75.3 % of monolingual Prec@10 and 91.5 % of monolingual recall (see Sect. 4.3.3), and the probabilistic propagation model helps us achieve up to 75.9 % of monolingual MAP, 76.5 % of monolingual Prec@5, 77.2 % of monolingual Prec@10 and 94.4 % of monolingual recall (see Sect. 4.3.4). The experimental results further show that the proposed methods are more effective in leveraging comparable corpora than an existing thesaurus-based method.

The rest of the paper is organized as follows: we first present some previous work in Sect. 2. We then introduce our proposed CLIR methods in Sect. 3, discuss the experiment results in Sect. 4 and finally conclude in Sect. 5.

2 Previous work

Cross-language information retrieval (CLIR) deals with finding information in one language in response to a query in another language. Since the query and the documents are expressed in different languages, direct matching of the query and the documents is impossible. Over the years, different methods have been proposed for crossing the language barrier.

One approach for achieving this goal is to assume that words in one language are misspelled forms of words in another language (Buckley et al. 2000; Gey 2004; He et al. 2003), but such an assumption only works for related languages, such as Italian and French, or Chinese and Japanese (Savoy 2005).

A more general approach is to use translation resources for this purpose, and a specific issue in CLIR is where to obtain the translation knowledge (Oard and Diekema 1998). The most common translation resources are bilingual dictionaries, machine translation systems, parallel corpora and comparable corpora. Parallel corpora consist of pairs of documents in two languages which are translations of each other, while comparable corpora are composed of text pairs which are topically similar, without being parallel.

A substantial body of research has investigated the use of bilingual dictionaries (Aljlayl and Frieder 2001; Ballesteros and Croft 1997; Hedlund et al. 2004; Hull and Grefenstette 1996; Levow et al. 2005; Xu and Weischedel 2000), machine translation systems (Aljlayl and Frieder 2001; Chen and Gey 2004; Dolamic and Savoy 2010; Kwok 1999; Oard and Hackett 1997) and parallel corpora (Hiemstra et al. 2001; Nie and Simard 2002; Nie et al. 1999) for CLIR. However, machine translation systems, bilingual dictionaries and parallel corpora are expensive resources that are not available for many minority language pairs, whereas comparable corpora are much easier to obtain. Zanettin (1998) introduced several available bilingual comparable corpora. Comparable corpora are generally obtained from news articles (Munteanu and Marcu 2005; Steinberger et al. 2005; Tao and Zhai 2005), novels (Dimitrova et al. 1998), available research corpora such as the CLEF or TREC collections (Braschler and Schäuble 1998; Sheridan and Ballerini 1996; Talvensaari et al. 2007), or by crawling the web (Talvensaari et al. 2008; Utsuro et al. 2002). Although comparable corpora are more readily available than parallel corpora, extracting knowledge from them is significantly more challenging.

Using comparable corpora as a language resource for CLIR has been studied extensively in the existing literature (Abdul-Rauf and Schwenk 2009; Braschler et al. 2002; Braschler and Schäuble 2001; Franz et al. 1999; Fung and Yee 1998; Masuichi et al. 2000; Munteanu and Marcu 2005; Picchi and Peters 1996; Rapp 1995; Sadat et al. 2003a, b; Sheridan et al. 1998; Sheridan and Ballerini 1996; Talvensaari et al. 2007; Tao and Zhai 2005; Vu et al. 2009; Yu and Tsujii 2009). However, most of these works, except Sheridan and Ballerini (1996), make use of other kinds of linguistic resources as well. Picchi and Peters (1996), Franz et al. (1999), Fung and Yee (1998), Sadat et al. (2003a, b), Talvensaari et al. (2007), Vu et al. (2009) and Yu and Tsujii (2009) use some kind of bilingual dictionary or bilingual lexical database on top of comparable corpora. Abdul-Rauf and Schwenk (2009) require an automatic machine translator, Masuichi et al. (2000) use a small parallel corpus as training data, and Munteanu and Marcu (2005) require both a bilingual dictionary and a small amount of parallel data. In Sheridan and Ballerini (1996), the authors proposed a thesaurus-based method to do CLIR based solely on comparable corpora; however, the results of this work mostly demonstrated the benefit of such a method in improving recall rather than precision. With a similar goal, we propose a different strategy for exploiting comparable corpora based on correlation analysis of terms, which is more effective than the thesaurus-based method, as will be shown in Sect. 4. Another limitation of the work of Sheridan and Ballerini (1996) is that the comparable corpus used is the same as the document collection for retrieval, which limits the generality of the conclusions drawn; in contrast, we show the feasibility of exploiting a separate comparable corpus to improve retrieval accuracy on another document collection.

Tao and Zhai (2005) are among the few who do not use any linguistic resources other than comparable corpora to discover knowledge of potential word translations. In their work, they exploit frequency correlations of words in different languages in the comparable corpora to discover mappings between words in different languages. In this paper, we study how to effectively estimate the target query language models using these word mappings to do CLIR. We further exploit word co-occurrences in the two languages to identify related and similar terms by applying a probabilistic propagation method proposed in Cao et al. (2007). In Cao et al. (2007), the authors proposed a Markov chain model to integrate monolingual and cross-lingual term relations for query expansion and evaluated the framework with reliable translations from a bilingual dictionary. We extend this work by further experimenting with such a framework in the setting of unreliable translations learned from comparable corpora.

3 Cross-language information retrieval (CLIR) with comparable corpora

In this section, we present the details of our proposed approach for doing CLIR when the only available resource is comparable corpora. Our idea is to use word correlations (correlations of the frequency distributions of two terms over time) mined from comparable corpora to construct word translation probabilities between word pairs. Given a query in one language, we then construct the query language model in the second language using these estimated probabilities, which allows us to match the target documents to the user query. Figure 1 shows a sketch of our approach. In the first step, we mine word correlations from bilingual time-aligned documents in the comparable corpus. In the second step, we use these learned correlations to obtain word translation probabilities; here we study how to effectively transform a word association extracted from a time correlation into a probability, and propose to transform the raw association values with a non-linear function to address the problem of unreliable word associations. In the third step, we use the translation probabilities to estimate the query language model in the target language. We study different methods for this purpose: simple estimation of the target query language model using the top correlated words of query terms, as well as a propagation method which exploits word co-occurrences in the monolingual data for better estimation of the term scores in the target language. The final step ranks the target documents using a standard retrieval method. In the rest of this section, we present each step of our approach in more detail.

Fig. 1 Proposed CLIR steps

3.1 Extracting word correlations

Having time-aligned comparable corpora, we use the method proposed by Tao and Zhai (2005) to discover correlations between words in different languages. In this method, frequency correlations of words in different languages in the comparable corpora are used to discover mappings between words. The main idea is based on the observation that words that are translations of each other, or that are about the same topic, tend to occur in the comparable corpora in the same time periods. Such correlations are exploited to discover associations between words in different languages.

In this method, each word is represented by a vector of frequencies and each pair of words in different languages is scored based on the similarity of their frequency vectors. The Pearson’s correlation coefficient is used to score every word in one language against every word in the other language.

Formally, let \(C = \{(d_1, d'_1), \ldots, (d_n, d'_n)\}\) be the comparable corpus, where \(d_i\) and \(d'_i\) are sets of documents with the same time stamp in languages \(L_1\) and \(L_2\) respectively. Also let \(a\) be a word in \(L_1\) and \(b\) be a word in \(L_2\). The normalized frequency vectors for \(a\) and \(b\) are \(\overrightarrow{a} = (a_1, \ldots, a_n)\) and \(\overrightarrow{b} = (b_1, \ldots, b_n)\) respectively, where

$$ a_i = \frac{c(a, d_i)}{\sum_{j = 1}^n c(a, d_j)}, b_i = \frac{c(b, d'_i)}{\sum_{j = 1}^n c(b, d'_j)} $$

and \(c(a, d_i)\) is the count of word \(a\) in \(d_i\). The similarity of these two words is computed using Pearson's correlation coefficient:

$$ r(a, b) = \frac{\sum_{i = 1}^n a_ib_i - \frac{1}{n}\sum_{i = 1}^n a_i \sum_{i = 1}^n b_i}{\sqrt{\left(\sum_{i = 1}^n {a_i}^2 - \frac{1}{n} (\sum_{i = 1}^n a_i)^2\right)\left(\sum_{i = 1}^n {b_i}^2 - \frac{1}{n} (\sum_{i = 1}^n b_i)^2\right)}} $$
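To make this computation concrete, the following minimal Python sketch (with hypothetical toy counts) builds the normalized frequency vectors and evaluates Pearson's correlation coefficient exactly as defined above.

```python
import numpy as np

def normalized_freq_vector(counts):
    """Turn raw per-day counts c(w, d_i) into the normalized vector (w_1, ..., w_n)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def pearson_correlation(a_counts, b_counts):
    """Pearson's correlation r(a, b) between the normalized frequency vectors."""
    a = normalized_freq_vector(a_counts)
    b = normalized_freq_vector(b_counts)
    n = len(a)
    cov = np.sum(a * b) - np.sum(a) * np.sum(b) / n
    var_a = np.sum(a ** 2) - np.sum(a) ** 2 / n
    var_b = np.sum(b ** 2) - np.sum(b) ** 2 / n
    return cov / np.sqrt(var_a * var_b)

# Hypothetical daily counts of an English word and an Arabic word over five aligned days
english_counts = [12, 0, 3, 25, 7]
arabic_counts = [10, 1, 2, 30, 5]
print(pearson_correlation(english_counts, arabic_counts))
```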

Figure 2 shows a sample set of top English–Arabic word pairs extracted from the English–Arabic comparable corpus used in our experiments, most of which are of very high quality. Note that we used the Porter stemmer for English words and the light10 stemmer for Arabic words, both as implemented in the Lemur toolkit; this is why some prefixes and/or suffixes are stripped off.

Fig. 2 Sample English–Arabic extracted word pairs

3.2 Estimating word translation probabilities

Having the correlations between words in different languages, in the next step we estimate translation probabilities. From the obtained word correlations, we select those above a specific positive threshold, with the intuition that these correlations are more reliable. A natural baseline method is to use the normalized correlation scores as translation probabilities. Formally, let \(w\) be a word in \(L_1\) and \(u_1, \ldots, u_m\) be the top \(m\) correlated words in \(L_2\) with correlation scores \(r_1, \ldots, r_m\) respectively, i.e., \(r_i = r(w, u_i)\), \(1 \le i \le m\), are the top \(m\) scores among \(r(w, u), u \in V\). We construct the probabilities by normalizing these raw correlation scores:

$$ p(u_i|w) = \frac{r_i}{\sum_{j = 1}^m r_j } $$

where \(p(u_i|w)\) is the probability of \(u_i\) being the translation of word \(w\) in \(L_2\).

One deficiency of this naive method is that it trusts low correlations too much. Intuitively, high correlations are trustworthy, but low correlations are not, so the probabilities should drop sharply as the correlations become smaller. To capture this intuition, we transform the scores with a transformation function that penalizes low correlation scores. For the transformation function, we chose an exponential transformation of the general form:

$$ f(x) = a e^{bx} + c. $$

We set two restrictions on the transformation function: it should map the highest possible correlation score (1) to 1 and the lowest possible correlation score (0) to 0, i.e., f(1) = 1 and f(0) = 0. The exponential transformation function that satisfies these two constraints is:

$$ f(r) = \frac{1}{e^b - 1} e^{br} - \frac{1}{e^b - 1} $$

where b is a parameter that controls how much we want to penalize low correlations. Figure 3 shows the effect of exponential transformation for different values of b. As can be seen from this figure, higher values of b penalize low correlation values more.

Fig. 3 Exponential transformation with different values of b

We then construct the probabilities from these converted scores:

$$ p(u_i|w) = \frac{f(r_i)}{\sum_{j = 1}^N f(r_j) } = \frac{\frac{1}{e^b - 1} e^{br_i} - \frac{1}{e^b - 1}}{\sum_{j = 1}^N \left(\frac{1}{e^b - 1} e^{br_j} - \frac{1}{e^b - 1}\right)}. $$

These probabilities drop sharply as the correlations become smaller and thus unreliable.
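As an illustration, the following Python sketch applies the transformation and normalization to a set of raw correlation scores; the candidate words, the value of b, and the pruning threshold are illustrative (their effect is studied in Sect. 4).

```python
import math

def exp_transform(r, b):
    """Exponential transformation f(r) = (e^{br} - 1) / (e^b - 1), so f(0) = 0 and f(1) = 1."""
    return (math.exp(b * r) - 1.0) / (math.exp(b) - 1.0)

def translation_probabilities(correlations, b=6.0, threshold=0.3):
    """Turn raw correlation scores {target word: r(w, u)} into probabilities p(u|w).

    Scores at or below the threshold are discarded; the remaining scores are
    transformed and normalized, so low (unreliable) correlations receive
    sharply reduced probability mass.
    """
    kept = {u: r for u, r in correlations.items() if r > threshold}
    transformed = {u: exp_transform(r, b) for u, r in kept.items()}
    total = sum(transformed.values())
    if total == 0:
        return {}
    return {u: f / total for u, f in transformed.items()}

# Hypothetical correlation scores of Arabic candidate translations for one English word
scores = {"arabic_word_1": 0.92, "arabic_word_2": 0.55, "arabic_word_3": 0.34}
print(translation_probabilities(scores, b=6.0, threshold=0.3))
```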

3.3 Constructing query language models and ranking the documents

Given the query Q in language \(L_1\), our goal is to find related documents in language \(L_2\). Intuitively, if we can somehow map the query Q in \(L_1\) to a corresponding query in \(L_2\), then we can easily find related documents in \(L_2\) by comparing them to this translated query using a typical retrieval method. Thus what we have to do is to estimate the query language model of the translated query in \(L_2\). Here we propose two methods for constructing the query language model of the translated query. As a basic method, we propose to use the translation probabilities estimated in step 2 directly to translate query words in \(L_1\) to the corresponding words in \(L_2\) and to construct the query language model in \(L_2\) from these translated query words. To further exploit word co-occurrences in the monolingual data, we also propose a second method which uses a propagation framework to construct the query language model. We will present each method in more detail in the following.

3.3.1 Basic query translation method

Having estimated the translation probabilities between words in the two languages \(L_1\) and \(L_2\), we construct a query language model in \(L_2\) corresponding to the given query Q (in \(L_1\)) using our “Top-k translation” method. In this method, for each query word in \(L_1\), we use the top k correlated words in \(L_2\) as its translation and construct the translation of the whole query. We assume all query words to be equally important in this method and thus to have equal weights in constructing the query language model. The influence of each translation word depends on the estimated translation probability of the word.

Formally, let \(Q = q_1, \ldots, q_n\) be an input query in \(L_1\). We estimate the query language model in \(L_2\) using:

$$ p(w|\hat{\Uptheta}_Q) = \sum^n_{i = 1} \frac{1}{n} \frac{p(w|q_i)}{\sum_{j = 1}^k p(w_j|q_i)} $$
(1)

Here \(\hat{\Uptheta}_Q\) is the estimated query language model in \(L_2\) and \(p(w|\hat{\Uptheta}_Q)\) is the estimated probability of a word w in this language model. \(p(w|q_i) > 0\) if w is among the top k correlated words of \(q_i\), and \(p(w|q_i) = 0\) otherwise. In this way, we have constructed a basic translation of the query in \(L_2\).
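A minimal Python sketch of Eq. (1) follows, assuming each source query term comes with a dictionary of estimated translation probabilities; the word identifiers are placeholders.

```python
def topk_query_language_model(query_terms, translations, k=2):
    """Estimate the target-language query model of Eq. (1).

    translations maps each source query term q_i to a dict {target word: p(w|q_i)}.
    Only the top-k translations of each query term contribute, and every query
    term receives equal weight 1/n.
    """
    n = len(query_terms)
    model = {}
    for q in query_terms:
        # keep the k most probable translations of q and renormalize over them
        top_k = sorted(translations.get(q, {}).items(),
                       key=lambda item: item[1], reverse=True)[:k]
        norm = sum(p for _, p in top_k)
        if norm == 0:
            continue  # no reliable translation found for this query term
        for w, p in top_k:
            model[w] = model.get(w, 0.0) + (1.0 / n) * (p / norm)
    return model

# Hypothetical translation probabilities for the query "Kurdistan Independence"
translations = {
    "kurdistan": {"ar_word_1": 0.7, "ar_word_2": 0.2, "ar_word_3": 0.1},
    "independence": {"ar_word_4": 0.6, "ar_word_5": 0.4},
}
print(topk_query_language_model(["kurdistan", "independence"], translations, k=2))
```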

As a simple example, consider the query “Kurdistan Independence” in the task of English–Arabic CLIR with an English–Arabic comparable corpus. Figure 4a shows the mined Arabic translations of each English query word, along with the estimated translation probabilities. Using our “Top-k translation” method with k = 2, the top 2 correlated words of each query word are used to construct the target query. The weight of each Arabic word in the Arabic query language model is computed based on the estimated translation probabilities and the fact that the two query words are equally important. The Arabic query language model is estimated using Eq. 1 and is shown in Fig. 4b.

Fig. 4 Basic query translation: a mined translations, b Arabic query language model

3.3.2 Propagation method

In the proposed basic method, we only consider those words in \(L_2\) that are highly correlated with the query words in \(L_1\) as candidate words in the translated query, and we use the translation probabilities to construct the query language model. However, we can also exploit word co-occurrences in the monolingual data to better estimate the query language models. Using co-occurrence information can introduce related words to the queries in both languages, resulting in a better estimation of the query language models. Intuitively, a word has a high chance of being in the translated query if it is highly correlated with a word in the source language that has a high probability of being in the query language model, and/or it co-occurs frequently with a word in the same language that has a high probability of being in the translated query.

In Shakery and Zhai (2006), a general probabilistic framework for information retrieval was proposed. A similar framework was applied to CLIR with promising results in Cao et al. (2007) based on reliable word translation relations obtained from a bilingual dictionary. We further experiment with such a framework in the setting of unreliable translations learned from comparable corpora.

To implement the idea, we first construct a network over all the words in language \(L_1\) and all the words in language \(L_2\), where the edges between words in the same language are weighted by the mutual information between words, and the edges between words in different languages are correlation edges. The structure of the network is shown in Fig. 5. We define a probability distribution over the words in this network such that the probability of a word intuitively indicates the probability of the word in our query language model. In order for each word in the network to receive influence from neighboring nodes in both languages, and for the term scores in the two languages to be comparable, we do not distinguish between words of the two languages, and the probabilities of all words in both languages sum to one.

Fig. 5 Word network structure
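The exact construction of the mutual-information edge weights is not spelled out here, so the following Python sketch shows one plausible instantiation under stated assumptions: mutual information between the document-level occurrence events of two words, normalized over each word's outgoing edges. The function name and the document-level granularity are illustrative choices, not the definitive implementation.

```python
import math

def mi_edge_weights(documents, vocab):
    """One plausible construction of the monolingual edges p_MI(x -> y):
    mutual information between the document-level occurrence events of two
    words, normalized over the outgoing edges of each word."""
    n = len(documents)
    occurs = {w: set() for w in vocab}
    for i, doc in enumerate(documents):
        for w in set(doc):
            if w in occurs:
                occurs[w].add(i)

    def mi(x, y):
        total = 0.0
        all_docs = set(range(n))
        for docs_x in (occurs[x], all_docs - occurs[x]):
            for docs_y in (occurs[y], all_docs - occurs[y]):
                p_xy = len(docs_x & docs_y) / n
                p_x, p_y = len(docs_x) / n, len(docs_y) / n
                if p_xy > 0:
                    total += p_xy * math.log(p_xy / (p_x * p_y))
        return max(total, 0.0)  # MI is non-negative; guard against float noise

    weights = {}
    for x in vocab:
        raw = {y: mi(x, y) for y in vocab if y != x}
        s = sum(raw.values())
        if s > 0:
            weights[x] = {y: v / s for y, v in raw.items()}  # normalize outgoing edges
    return weights
```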

We define the basic term probabilities in this network as:

$$ p_0(w) = \left\{\begin{array}{ll} \frac{1}{2} \times \frac{1}{|Q|} & \hbox{if }w \in L_1\hbox{ and }w \in Q\\ 0 & \hbox{if }w \in L_1\hbox{ and }w \notin Q\\ \frac{1}{2} \times \sum_{q \in Q} \frac{1}{|Q|} \times p(w|q) & \hbox{if }w \in L_2 \end{array}\right. $$

The rationale for this definition is as follows. A term has a non-zero score as its basic term probability if it is a query term in the source language or it is a direct translation of a query term in the target language, i.e., it is detected as a translation of a query word in the word association mining step. All query terms are considered equally important here and the weights of the target words depend on the estimated translation probabilities. These probabilities are essentially the same as our basic estimation method, which does not consider propagation within monolingual languages. Initially, we give equal weight to the words in the two languages, so that we would not bias toward a language, i.e., \(\sum_{w \in L_1}p_0(w) = \sum_{w' \in L_2} p_0(w') = \frac{1}{2}.\)

In order to incorporate monolingual co-occurrences, we assume that a word would have a higher probability score in this network if it is surrounded by highly scored words, and define the probability score of each word as:

$$ \begin{aligned} p(w^1) &= \alpha_0\, p_0(w^1)\\ &\quad + \alpha_{\rm MI} \sum_{x \in L_1} p(x)\, p_{\rm MI}(x \rightarrow w^1)\\ &\quad + \alpha_{\rm trans} \sum_{y \in L_2} p(y)\, p_{\rm trans}(y \rightarrow w^1) \quad \hbox{if } w^1 \in L_1 \end{aligned} $$
(2)
$$ \begin{aligned} p(w^2) &= \alpha_0\, p_0(w^2)\\ &\quad + \alpha_{\rm trans} \sum_{x \in L_1} p(x)\, p_{\rm trans}(x \rightarrow w^2)\\ &\quad + \alpha_{\rm MI} \sum_{y \in L_2} p(y)\, p_{\rm MI}(y \rightarrow w^2) \quad \hbox{if } w^2 \in L_2 \end{aligned} $$
(3)
$$ \alpha_0 + \alpha_{\rm {MI}} + \alpha_{\rm {trans}} = 1 $$

i.e., a linear combination of its basic term probability and the influence of its neighbors in the word network. Here \(p_0(w)\) is the basic term probability of word \(w\), \(p_{\rm MI}(w_i \rightarrow w_j)\) is the co-occurrence probability, i.e., the normalized weight of the co-occurrence edge between two words in the same language, and \(p_{\rm trans}(w_i \rightarrow w_j)\) is the translation probability. \(\alpha_0\), \(\alpha_{\rm MI}\) and \(\alpha_{\rm trans}\) are parameters in [0, 1] which control the influence of each component on the total score of each word and are set empirically. To ensure the contribution of the basic term probabilities, \(\alpha_0\) should not be too small. \(\alpha_{\rm MI}\) controls the influence of words in the same language, while \(\alpha_{\rm trans}\) controls the influence of words in the other language. To combine all the evidence, we expect the best performance when an appropriate balance is achieved between these different sources of evidence, which is confirmed in our experiments.

The probability scores defined in Eqs. (2) and (3) are computed iteratively, updating the probability score of each word using the updated probability scores of the neighbors until they converge to a limit.

We then estimate the query language model in L 2 by normalizing these probability scores:

$$ p(w|\hat{\Uptheta}_Q) = \frac{p(w)}{\sum_{v \in L_2} p(v)}\quad \hbox{for each }w \in L_2 $$

The proposed updating formula is a special case of the general probabilistic relevance propagation framework proposed in Shakery and Zhai (2006). At each step, the score of each word is propagated to its outgoing neighbors in the word network in a weighted manner, and the score of each word is updated to a combination of the sum of its incoming (propagated) scores and its own score.

The score definitions in Eqs. (2) and (3) correspond to the stationary probability distribution of a random walk on the word network. Think of a random surfer traversing the set of words looking for terms related to query Q. At each step, the surfer, currently at some word, jumps to a co-occurring word with probability \(\alpha_{\rm MI}\), jumps to a correlated word in the other language with probability \(\alpha_{\rm trans}\), and jumps to a random word according to its basic term probability with probability \(\alpha_0\). The surfer repeats this process indefinitely, and the final score of each word is the stationary probability of the surfer visiting that word.

In order to compute the scores, we construct a matrix \(M = \alpha_0 M_0 + \alpha_{\rm MI} M_{\rm MI} + \alpha_{\rm trans} M_{\rm trans}\). Here \(M_0(m, n) = \beta\, p_0(w_n) + (1 - \beta)/|N|\), where \(|N|\) is the total number of words in the network and β is a number very close to 1 (0.99 in our experiments), used to give unreachable words a very tiny probability; \(M_{\rm MI}(m, n) = p_{\rm MI}(w_m \rightarrow w_n)\) and \(M_{\rm trans}(m, n) = p_{\rm trans}(w_m \rightarrow w_n)\). We then compute the probability scores using matrix multiplication: \(\overrightarrow{P} = M^T \overrightarrow{P}\), where \(\overrightarrow{P}\) is the vector of probability values. The probability scores are computed iteratively until they converge to a unique probability distribution, in a very similar way to existing link-based scoring algorithms such as PageRank (Page et al. 1999). Efficient methods similar to those proposed for computing PageRank can be used to further speed up the scoring (Desikan 2009; Haveliwala 1999). The final scores are the values of the stationary probability distribution of the Markov chain defined by M. The way we have defined \(M_0\) ensures reachability of each word from every other word in one step; thus, by the ergodicity theorem for Markov chains (Grimmett and Stirzaker 2001), the Markov chain defined by such a transition matrix M has a unique stationary probability distribution.
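As a minimal sketch of this computation, assuming the three transition matrices have already been built as row-stochastic numpy arrays over the joint English and Arabic vocabulary (the parameter values and the convergence tolerance are illustrative), the iteration can be written as:

```python
import numpy as np

def propagate_scores(M0, M_MI, M_trans, alpha0=0.4, alpha_MI=0.3, alpha_trans=0.3,
                     tol=1e-8, max_iter=200):
    """Iterate P <- M^T P for M = alpha0*M0 + alpha_MI*M_MI + alpha_trans*M_trans.

    The three matrices are |V| x |V| row-stochastic transition matrices over the
    joint English+Arabic vocabulary (basic-probability jumps, monolingual
    co-occurrence edges, cross-lingual correlation edges). The returned vector
    is the stationary distribution, i.e. the final word scores p(w).
    """
    assert abs(alpha0 + alpha_MI + alpha_trans - 1.0) < 1e-9
    M = alpha0 * M0 + alpha_MI * M_MI + alpha_trans * M_trans
    n = M.shape[0]
    p = np.full(n, 1.0 / n)          # any starting distribution converges
    for _ in range(max_iter):
        p_next = M.T @ p
        p_next /= p_next.sum()       # guard against numerical drift
        if np.abs(p_next - p).sum() < tol:
            p = p_next
            break
        p = p_next
    return p

def target_query_model(p, is_target_word):
    """Normalize the converged scores over the target-language words only."""
    mask = np.asarray(is_target_word, dtype=bool)
    q = np.where(mask, p, 0.0)
    return q / q.sum()
```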

3.4 Ranking documents

Having generated the query language model in \(L_2\), we then rank the documents in \(L_2\) based on the KL-divergence between the estimated query language model and the estimated document language models (Zhai and Lafferty 2001). We assume that each document is generated from a unigram document language model \(\Uptheta_D\). We estimate the document language model \(\hat{\Uptheta}_D\) of each document using the maximum likelihood estimator and smooth the estimated document language models using Dirichlet prior smoothing (Zhai and Lafferty 2004).

Assuming \(\hat{\Uptheta}_Q\) and \(\hat{\Uptheta}_D\) to be the estimated query and document language models respectively, the document D is ranked based on the KL-divergence between \(\hat{\Uptheta}_Q\) and \(\hat{\Uptheta}_D\):

$$ -D(\hat{\Uptheta}_Q||\hat{\Uptheta}_D) = -\sum_{w \in V} p(w|\hat{\Uptheta}_Q) \hbox{log} \frac{p(w|\hat{\Uptheta}_Q)}{p(w|\hat{\Uptheta}_D)} $$

where V is the set of words in our vocabulary.
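A minimal sketch of this ranking step, assuming the query model estimated in the previous step and plain dictionaries of document term counts and collection probabilities; the helper names, toy data, and the Dirichlet prior mu = 1000 are illustrative.

```python
import math

def dirichlet_doc_model(doc_counts, collection_model, mu=1000.0):
    """Dirichlet-smoothed document language model p(w | Theta_D)."""
    doc_len = sum(doc_counts.values())
    def p(w):
        return (doc_counts.get(w, 0) + mu * collection_model.get(w, 1e-12)) / (doc_len + mu)
    return p

def kl_divergence_score(query_model, doc_counts, collection_model, mu=1000.0):
    """Rank score -D(Theta_Q || Theta_D); higher means a better match."""
    p_doc = dirichlet_doc_model(doc_counts, collection_model, mu)
    score = 0.0
    for w, p_q in query_model.items():
        if p_q > 0:
            score -= p_q * math.log(p_q / p_doc(w))
    return score

# Hypothetical usage: rank two documents for an estimated Arabic query model
query_model = {"ar_word_1": 0.6, "ar_word_2": 0.4}
collection_model = {"ar_word_1": 0.001, "ar_word_2": 0.002, "ar_word_3": 0.01}
docs = {"doc1": {"ar_word_1": 3, "ar_word_3": 7}, "doc2": {"ar_word_2": 1, "ar_word_3": 2}}
ranking = sorted(docs, reverse=True,
                 key=lambda d: kl_divergence_score(query_model, docs[d], collection_model))
print(ranking)
```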

4 Experiments

In our experiments, we focus on CLIR with comparable bilingual corpora as the only available resource. We will show that with such limited linguistic resources, we can achieve up to 75.9 % of monolingual MAP, up to 76.5 % of monolingual precision at 5 documents, up to 77.2 % of monolingual precision at 10 documents and up to 94.4 % of monolingual recall on the data set that we experimented with.

4.1 Data set and queries

As the comparable corpus, we used an Arabic–English comparable corpus built from news articles published by the Agence France Presse (1994–1997, 2001–2004) and Xinhua (2001–2004) news agencies. They are parts of the Arabic Gigaword corpus (second edition) and the English Gigaword corpus (second edition) released by the Linguistic Data Consortium. Articles from the same agency with the same publication date are aligned to form the comparable corpus, with an average of 173 Arabic articles and 461 English articles per day. This corpus was the only Arabic–English comparable corpus available at the time of this research, and it has also been used by other researchers in the area.

As the CLIR task, we focus on the TREC-2002 CLIR task (Oard and Gey 2002): retrieval of Arabic documents from topics in English. The document collection for this task contains 383,872 newswire stories that appeared on the Agence France Presse Arabic Newswire from 1994 to 2000. The queries are 50 topic descriptions in English, together with their Arabic translations; the Arabic translations are used for the monolingual retrieval.

4.2 Monolingual (Arabic–Arabic) retrieval

We use monolingual Arabic–Arabic retrieval as a baseline to which we compare the cross-language results. In our monolingual Arabic runs, we only use the title field of each Arabic query topic as the query words. We used the light10 Arabic stemmer in the Lemur toolkit to stem the Arabic words. This light stemmer strips off initials, definite articles and suffixes. We did our experiments with two versions of the monolingual run: one that does not do query expansion, and one which does query expansion with pseudo-feedback. For the pseudo-feedback runs, we used the top 10 retrieved documents to perform feedback using the mixture model approach implemented in the Lemur toolkit (Zhai and Lafferty 2001). As the parameters, we used 0.5 for both background noise and feedback coefficient and used 100 terms for expanding the query model. We have tuned the parameters to achieve sufficiently good performance. Table 1 shows the MAP, Precision at 5 documents (Prec@5) and Precision at 10 documents (Prec@10) of our monolingual runs.

Table 1 Monolingual Arabic–Arabic retrieval performance

Among the nine teams participating in the TREC-2002 cross-language information retrieval track, three did monolingual Arabic retrieval with title fields only: Hummingbird Technologies (Tomlinson 2002), IBM Research (Franz and McCarley 2002) and University of Neuchatel (Savoy and Rasolofo 2002). Table 2 shows the title-only monolingual results of these three runs. All three teams used blind query expansion for these monolingual runs. Our monolingual results with query expansion are better than the results obtained by Hummingbird Technologies and IBM Research, but slightly worse than the results of University of Neuchatel. As the tables show, our results are comparable to these monolingual results and form a reasonable baseline against which to compare our cross-language results.

Table 2 Title-only monolingual performance of TREC-2002 teams

4.3 English–Arabic CLIR

Our approach for CLIR is composed of four major steps (see Sect. 3). In the first step, we construct the word mappings between all the English query words and their possible Arabic translations. As the possible Arabic translations, for each English query word, we consider all the Arabic words occurring in any Arabic document time-aligned with the English document the English query term has occurred in. We further prune those Arabic words which occur very frequently and those with very low frequency.

In the second step, we estimate the word translation probabilities. We conduct two different sets of experiments in this step. In the first set of experiments, from the obtained word correlations, we select those above a specific threshold with the intuition that these correlations are more reliable. We then use the naive probability estimation method to estimate translation probabilities of English and Arabic words from the raw correlation scores:

$$ p(u_i|w) = \frac{r_i}{\sum_{j = 1}^N r_j} $$

In the second set of experiments, we use exponential transformation of correlations to penalize low correlations, which we assume are unreliable, and estimate the translation probabilities from the transformed correlation scores. Recall that the transformation function we use is:

$$ f(r) = \frac{1}{e^b - 1} e^{br} - \frac{1}{e^b - 1} $$

These probabilities drop sharply as the correlation scores become smaller.

After estimating the translation probabilities, we construct the Arabic query language models in the third step. Again we experiment with two different methods in this step. In the first set of experiments, we use our proposed Top-k translation method to construct the Arabic query language model corresponding to each English query using the estimated probabilities. In the second set of experiments, we use our probabilistic propagation method in the hope of better estimating the query language models. Recall that with this method, we consider both word co-occurrences and word correlations when constructing the Arabic query language models.

Having estimated the Arabic query language models, we rank the documents based on the KL-divergence between the estimated query language models and the document language models.

We conducted several series of experiments to evaluate our proposed approach for CLIR. The methods used in steps 1 and 4 are common to all the experiments and we have two different methods for steps 2 and 3 each, giving us four combinations. We will discuss the details of the experiments and results in the following.

4.3.1 Naive probability estimation + basic query translation

In the first set of experiments, we used “naive probability estimation” for estimating the translation probabilities and “Top-k translation” for constructing the Arabic query language models.

Table 3(a) and (b) shows the results of top-2 to top-10 translations, i.e., using the top 2 to top 10 correlated words as the translations of each query word, without query expansion and with query expansion using pseudo feedback respectively. In this set of experiments, we set the correlation threshold to 0.5, only counting words with correlation scores above this threshold as translations of the query words. Setting the threshold may limit the number of translations, leading to fewer than k translations for some query words.

Table 3 Basic query translation and naive probability estimation (threshold = 0.5)

As the table shows, using this naive probability estimation and basic query translation, we can achieve up to 67.7 % of monolingual MAP, 77.3 % of monolingual precision at 5 documents, 70.8 % of monolingual precision at 10 documents and 89.6 % of monolingual recall when not doing query expansion, and about 64.3 % of monolingual MAP, 67.7 % of monolingual precision at 5, 69.4 % of monolingual precision at 10 documents and 84.7 % of monolingual recall when we expand queries. These results are very promising given that we use very limited linguistic resources for this task, suggesting the feasibility of using comparable corpora alone for CLIR.

As we stated earlier, we prune those Arabic words with correlation scores below the threshold to eliminate unreliable translations. We further tried to change the value of this threshold to see how it affects the performance. Figure 6a and b shows the MAP for different number of translation words as we vary the threshold from 0.3 to 0.8, when we do not expand the queries and when we do query expansion with pseudo feedback respectively.

Fig. 6 Using different thresholds for pruning Arabic translations. a No query expansion. b Query expansion with pseudo feedback

When we set the threshold to 0.3, we allow almost all correlated words, even those with small correlation scores, to be counted as translation words. In this case, increasing the number of translation words per English query word allows inaccurate translations to enter the query translation and hurts performance. As we increase the threshold, these unreliable translation words are pruned, so increasing the number of translation words does not hurt the performance as much.

With no query expansion, we get the best performance when we set the threshold to 0.5; beyond that, increasing the threshold decreases the MAP because the threshold becomes too tight and very few translation words remain for the English query words. The situation is different when we do query expansion. As can be seen in the figure, a high threshold such as 0.7 results in a high MAP. This can be because, with such a high threshold, only a few accurate translation words remain, which places accurate relevant documents at the top of the ranking; these top documents are then used for blind query expansion, which leads to better performance. Increasing the threshold further to 0.8 again hurts performance, because most of the correlated words are pruned, leaving no translation words for many of the English query words.

4.3.2 Naive probability estimation + query translation using propagation method

In our next set of experiments, we again use the “Naive Probability Estimation” method for estimating the translation probabilities, but we use our probabilistic propagation method for estimating the Arabic query language models, in the hope of obtaining better estimates. Table 4(a) and (b) shows the results of this set of experiments when we do not expand queries and when we do query expansion. In this table, we report MAP, precision at 5 documents, precision at 10 documents and recall when we use the top 2 to top 10 translation words as correlation neighbors. We performed the Wilcoxon signed rank test at the 0.05 level of significance to see whether the improvements over the basic translation method are statistically significant. Statistically significant improvements are marked with stars (*).

Table 4 Query translation using propagation and naive probability estimation

As the table shows, both when we do not expand queries and when we do query expansion, we can improve the performance in most cases using our propagation method. We get the best performance when we use the top 2 translation words as correlation neighbors.

From Table 4(a), it is clear that we can improve the performance substantially when we use the top 2 translation words as correlation neighbors, use propagation to construct the query language models, and do not expand the queries. The reason is that the propagation framework itself provides a kind of feedback effect, allowing co-occurring words to appear in the query translation and affect the estimated probabilities. Compared to the monolingual baseline without query expansion, we get up to 72.8 % of monolingual MAP, up to 83.8 % of monolingual precision at 5 documents, up to 77.9 % of monolingual precision at 10 documents and up to 93.8 % of monolingual recall. Compared to basic query translation, this method shows an 8.3 % improvement in MAP, a 13 % improvement in precision at 5 documents, a 10.2 % improvement in precision at 10 documents and a 4.7 % improvement in recall. We also see improvements from propagation when we expand queries, getting up to 68.1 % of monolingual MAP, up to 72 % of monolingual precision at 5, up to 74.1 % of monolingual precision at 10 and up to 87.8 % of monolingual recall. Compared to basic query translation, this method shows a 6.5 % improvement in MAP, a 12 % improvement in precision at 5 documents, a 6.7 % improvement in precision at 10 documents and a 3.9 % improvement in recall. The improvement with query expansion is smaller than without it, because query expansion itself introduces some related words to the query. These results show that the probabilistic propagation method is effective for improving CLIR accuracy by leveraging co-occurrences in monolingual data.

4.3.3 Exponential transformation of correlations + basic query translation

In this set of experiments, we used our basic query translation method to construct the Arabic queries again, but with the new transformed probability scores in which low, unreliable correlations are penalized. As we expected, the exponential transformation of the correlations helped us improve the results significantly. Table 5(a) and (b) shows two sample sets of results with b = 6 (b is the parameter of the exponential transformation) and thr = 0.3 (thr is the correlation threshold). We performed the Wilcoxon signed rank test at the 0.05 level of significance to see whether the improvements over the basic translation method are statistically significant. Statistically significant improvements are marked with stars (*).

Table 5 Basic query translation and Exponential transformation of the correlations (b = 6, threshold = 0.3)

We further tried different values of b, the parameter of the exponential transformation which controls how much we penalize low correlations, to see how the performance changes. Figure 7a and b shows the performance for different values of b. The horizontal axis shows the number of translation words we use for each query word and the vertical axis shows the MAP. In this set of experiments, we set the correlation threshold to 0.3.

Fig. 7 Exponential transformation with different values of b (threshold = 0.3). a No query expansion. b Query expansion with pseudo feedback

As the charts show, exponential transformation improves the performance over the naive probability estimation baseline with all the different values of b, showing that it is a reasonable transformation when computing the probabilities. We get the best performance at b = 8, getting up to 76 % of monolingual MAP with no query expansion and up to 72.2 % of monolingual MAP with query expansion.

One interesting observation is that as we increase the number of translation words, the baseline performance with no exponential transformation degrades considerably. With the exponential transformation, especially with larger values of b, the performance instead improves and stabilizes at some point. This is exactly what we expected: with the exponential transformation, increasing the number of translation words does not allow inaccurate translation words to hurt the performance, since they receive only very small probabilities.

In this set of experiments, we had fixed the correlation threshold to 0.3. We further tried different thresholds for our best set of results (b = 8) to see how it affects the performance. Figure 8a and b shows the performance when we change the threshold from 0.3 to 0.8.

Fig. 8 Exponential transformation with different thresholds (b = 8). a No query expansion. b Query expansion with pseudo feedback

The performance trend this time is different from what we saw when we used the correlation values directly. When we use the exponential transformation to compute the probabilities, words with small correlation values are penalized, so increasing the number of translation words does not hurt the performance. Moreover, we get the best performance at thr = 0.3, i.e., when we keep most of the correlated words. Setting higher thresholds prunes some correct translation words that would have improved performance had we kept them, even with small probability values.

Figure 9a and b shows the performance for different values of b when we set the threshold to 0.5.

Fig. 9 Exponential transformation with different values of b (threshold = 0.5). a Query expansion. b No query expansion

As can be seen from the charts, the best performance in this case is worse than when the threshold is set to 0.3, because the higher threshold prunes some effective translation words. Otherwise the trend is similar.

4.3.4 Exponential transformation of correlations + query translation using propagation method

We ran the final set of experiments using both proposed methods, i.e., the propagation method and the transformed probability scores. Table 6(a) and (b) shows the results of this set of experiments when we do not expand queries and when we do query expansion respectively. In this table, we report the performance results when we use the top 2 to top 10 translation words as correlation neighbors. Statistically significant improvements over the basic translation method, confirmed by the Wilcoxon signed rank test at the 0.05 level of significance, are marked with stars (*).

Table 6 Query translation using propagation and exponential transformation of correlations

As the table shows, in both cases, when we do not expand queries and when we do query expansion with pseudo feedback, we get substantial improvements using our propagation method over the basic method. In the case when we do not expand queries, we achieve up to 80.9 % of monolingual MAP, up to 86.9 % of monolingual precision at 5 documents, up to 83.4 % of monolingual precision at 10 documents and up to 98.8 % of monolingual recall. When we expand queries, we achieve up to 75.9 % of monolingual MAP, up to 76.5 % of monolingual precision at 5 documents, up to 77.2 % of monolingual precision at 10 documents and up to 94.4 % of monolingual recall.

These results show that combining the two methods performs better than either one alone. Exponential transformation of correlations penalizes low, unreliable correlations and the probabilistic propagation method exploits word co-occurrences to better estimate the term scores.

4.3.5 Sensitivity analysis

We have so far reported the best performance achieved by tuning the parameters \(\alpha_{\rm MI}\) and \(\alpha_{\rm trans}\) of the propagation model, which control the influence of each group of neighbors. The question now is how sensitive the method is to the setting of these parameters.

To answer this question, we compute the range of parameter values for which using propagation outperforms the baseline method. Figures 10 and 11 show the performance results with different parameter values for the propagation method. The dark gray cells indicate the case where propagation outperforms the baseline, light gray cells indicate equal performance and white cells indicate the case where the baseline outperforms the propagation method. It can easily be seen that the optimal range is quite wide. Specifically, the propagation method improves the MAP for all the values of the parameters. Precision at 5 documents and precision at 10 documents are more sensitive to the setting of these parameters, but still for a wide range of parameter values, the propagation method outperforms the baseline method, indicating that using the propagation method is useful in improving the performance in general.

Fig. 10 Ranges of \(\alpha_{\rm MI}\) and \(\alpha_{\rm trans}\) for improving the baseline (no query expansion). Dark gray cells indicate cases where propagation outperforms the baseline, light gray cells indicate equal performance and white cells indicate cases where the baseline outperforms the propagation method

Fig. 11 Ranges of \(\alpha_{\rm MI}\) and \(\alpha_{\rm trans}\) for improving the baseline (query expansion with pseudo feedback). Dark gray cells indicate cases where propagation outperforms the baseline, light gray cells indicate equal performance and white cells indicate cases where the baseline outperforms the propagation method

4.3.6 Closer examination of the results

We further examined the results to understand query-by-query performance differences, looking into the best achieved performance reported in Table 6. This examination reveals that for 42 % of the queries, our best performing method outperforms the monolingual baseline, and for 10 % of the queries, its performance is twice as good as the monolingual baseline.

For 14 % of queries, the performance of the proposed method is less than 5 % of monolingual MAP. Studying these queries reveals that 71.4 % of these queries are hard queries, for which the monolingual MAP is less than 50 % of the average MAP over all queries. Different factors have caused the poor performance of these queries.

Examination of the queries shows that in some cases, the English queries are not direct translations of Arabic queries, which results in the poor performance of cross-lingual English–Arabic IR. For example, for the query AR26, the English query is “Kurdistan Independence”, while the exact translation of the Arabic query is “Kurdistan National Council of Resistance” and the description of the query is “How does the National Council of Resistance relate to the potential independence of Kurdistan?”. Although our method is able to find a good translation for the query “Kurdistan Independence”, the query does not exactly match the information need and will not result in good performance. Another example is query AR38, whose English query is “traditional arab arts” and the exact translation of the Arabic query is “traditional arab music and dance”. Again our method is able to find good Arabic translations for the word “art”, but the result is still not good compared to the more specific Arabic query.

For some other queries, the poor performance is due to the limitations of using comparable corpora for translation. In order to extract translations of a term from a comparable corpus, the term should appear frequently enough in the corpus so that we have enough evidence of the correlations. We cannot expect to translate queries containing terms that occur very rarely, or not at all, in the comparable corpus. Two examples of such poorly performing queries in our data set are “AR35: Mina Sulman/Umm Qasr Sea Link” and “AR55: Umm Kalthoum’s Influence”.

4.3.7 Summary

In this section, we have reported the results of several sets of experiments evaluating our proposed approach. The basic CLIR method achieves about 68 % of monolingual retrieval performance with no query expansion and about 64 % of monolingual performance when we expand queries using pseudo feedback. The experimental results show that an appropriate transformation of the raw correlation scores helps achieve better performance. Specifically, our proposed exponential transformation of time correlations improves performance to about 74 % of monolingual performance with no query expansion and up to 71 % of monolingual performance when we expand queries. The results further show that the probabilistic propagation method is effective for better estimation of the target language models, improving the performance to 81 % of monolingual retrieval with no query expansion and 76 % of monolingual performance when we expand queries. The propagation method is shown to be effective over a wide range of parameter values.

4.4 Comparison with thesaurus-based query expansion technique for CLIR

Sheridan and Ballerini (1996) proposed to use thesaurus-based query expansion techniques applied over a comparable corpus to do CLIR. This work is among the few works which use comparable corpora as the only resource for CLIR.

Given a comparable corpus, Sheridan and Ballerini first merge the aligned documents of the comparable corpus to construct a collection of multilingual documents. Each document in this collection contains text in the two languages of the comparable corpus. They then apply thesaurus-based query expansion across the collection of multilingual documents to come up with terms in different languages related to a query term. The result is an expanded multilingual query. Filtering the expanded query by the second language yields a query in the target language, which is used to retrieve documents of the second language. The details of this work can be found in Sheridan and Ballerini (1996).

In their experiments, Sheridan and Ballerini (1996) translate each query word in the source language to k words in the target language. We have implemented this method, and Table 7(a) and (b) shows the results for different values of k. As the table shows, this thesaurus-based method does not perform as well as our correlation-based method, achieving up to 23.3 % of monolingual MAP without query expansion and 27.1 % of monolingual MAP with pseudo feedback; in contrast, our methods achieve up to 80.9 % and 75.9 % of monolingual MAP without and with query expansion respectively [see Table 6(a) and (b)].

Table 7 Thesaurus-based query expansion for multilingual information retrieval

5 Conclusions and future directions

Existing work on CLIR has mostly relied on rich, high quality linguistic resources such as machine translation systems, bilingual dictionaries, or parallel corpora. But such high quality resources may not be at hand or may be very expensive for some language pairs, making it a challenge to perform CLIR for such languages. We observed that for these language pairs, we often naturally have available comparable corpora, and studied how to use just comparable corpora to do CLIR. Our basic idea is to use word associations extracted from comparable corpora based on time correlations to translate queries from the source language to the target language. In order to improve the estimation of the target query language model, we propose an exponential transformation function to increase the robustness of term weighting. We further apply a probabilistic propagation framework to exploit word co-occurrences in the monolingual data as well to address the data sparseness problem and better estimate the query language model.

The experiment results show that it is feasible to use just comparable corpora to do CLIR and using the proposed transformation function can effectively address the issue of unreliable associations and achieve up to 70.8 % of monolingual MAP, 70.6 % of monolingual precision at 5 documents, 75.3 % of monolingual precision at 10 documents and 91.5 % of monolingual recall. We further observed that the propagation method is an effective method for enriching bilingual associations with monolingual co-occurrences which can improve the performance substantially, achieving up to 75.9 % of monolingual MAP, 76.5 % of monolingual precision at 5 documents, 77.2 % of monolingual precision at 10 documents and 94.4 % of monolingual recall. The experimental results also show that the proposed methods, which are based on correlations of words in different languages, are more effective in leveraging comparable corpora than the existing thesaurus-based method. The achieved results are quite promising since we are using very limited naturally available linguistic resources, thus the method can potentially be applied in many minority language pairs to support CLIR without any special manual effort.

There are several interesting directions for further research in this area:

  1. In our current experiment setup, the queries and the comparable corpora are from the same domain: we use news articles published in Arabic and English as our comparable corpus, and our CLIR task is to retrieve Arabic newswire stories in response to English queries. Intuitively, if the query and the comparable corpora are not from the same domain, the CLIR task should be harder. An interesting future research direction is to look into cases where queries are slightly out of the domain of the comparable corpora to see how our method performs. Specifically, we should look into the coverage of query words in the comparable corpora and its impact on the performance. We can also consider other sources for comparable corpora, for example scientific literature, to extract translation knowledge.

  2. In our current research, we have studied the feasibility of using comparable corpora for CLIR. In the future, we can investigate the effectiveness of using comparable corpora for CLIR by exploring different configurations of the corpora, e.g., the impact of the size of the comparable corpora on CLIR performance.

  3. We are currently using comparable corpora as the only available language resource, but our method can potentially benefit CLIR methods that use other linguistic resources such as bilingual dictionaries. Exploiting comparable corpora can be expected to help when the vocabulary coverage of common dictionaries is poor for a language; using comparable corpora on top of bilingual dictionaries can help find translations of newly introduced words in a language.

  4. A different solution to the CLIR problem for a resource-lean language pair (e.g., Arabic–Lithuanian) for which we do not have rich linguistic resources is to go through a third, popular language. For example, to translate a query from Arabic to Lithuanian, we can first translate the Arabic query to English and then translate the English translation to Lithuanian (assuming we have linguistic resources for the Arabic–English and English–Lithuanian language pairs). An interesting research direction is how to use our propagation framework for such a double translation approach.