1 Introduction

With the rapid proliferation of Web content in different languages, multilingual information retrieval (MLIR) has become inevitably linked to Web search. MLIR enables retrieving documents in multiple languages in response to a user’s query. MLIR is more challenging than mono- and cross-lingual information retrieval because its document collection contains documents in multiple languages, whereas in the latter two settings all documents of the collection are in a single language. A major challenge in MLIR is the appropriate use of translation knowledge to score documents that may or may not be in the query language.

There are two main architectures for MLIR, referred to as distributed and centralized architectures in (Lin and Hsi 2003), and as query translation and document translation approaches in (Peters et al. 2012). Herein, to avoid confusion with federated MLIR (Si et al. 2008) and cross-language IR terminology (Nie 2010), the two architectures are called fusion-based and direct approaches, respectively. In direct approaches, the entire multilingual collection has a single index (Lin and Hsi 2003) and thus the multilingual result list can be obtained in one retrieval phase. In fusion-based approaches, on the other hand, the multilingual retrieval problem is transformed into a number of retrieval tasks, each of which corresponds to a language in the collection. These retrieval tasks are then followed by a merging phase to create the final ranked list.

Using direct approaches is motivated by the substantial performance degradation observed in the merging phase of fusion-based approaches (Peters et al. 2012). However, existing direct approaches are not practical either, due to time and memory overhead or to over-weighting of index terms. For example, in one direct approach, each document is translated into all languages (typically by using machine translation systems), which not only is time-consuming but also needs updating whenever translation resources improve (Peters et al. 2012). To ameliorate these problems, the interlingua approach can be adopted (Kraaij and de Jong 2004; Sorg and Cimiano 2012). Although this approach reduces the required translation resources by a factor of the number of languages, it may decrease retrieval accuracy, because a query and a document in two languages other than the pivot language are matched only indirectly, through translation into the pivot language; employing direct translation resources, when they are available, usually outperforms the use of transitive translations (Nie 2010). In another direct approach (Nie and Jin 2002), instead of translating documents, the query is expanded with translations of query terms in all languages. The drawback of this method is that terms of a language having fewer documents may be over-weighted (Kishida 2005). Therefore, taking advantage of direct approaches requires overcoming the mentioned problems.

In this paper, we propose a direct approach for MLIR with only one retrieval phase which does not suffer from memory or time overhead of translating all documents, and also preserves relative term statistics. Our approach is based on the language modeling framework (Lafferty and Zhai 2001). In this framework, documents are represented by a language model. The language model of a document is a representation of the queries that would be submitted by users interested in that document. The idea behind our approach is that a document in a multilingual environment is relevant to queries in different languages. Thus, the language models of these documents should reflect this fact. To achieve this, we represent each document by a multilingual unigram language model (MULM).

This paper improves the performance of MLIR through the following contributions:

1. We propose a novel way of estimating document language models in a multilingual environment. Using these document models in the KL-divergence framework, documents in multiple languages are ranked in only one retrieval phase, without translating all documents into all languages.

2. We provide more accurate global estimates of two retrieval heuristics, namely term and document frequencies, by simultaneously considering all subcollections in different languages. This way, we avoid over-weighting of index terms.

3. In items 1 and 2 above, we adjust each estimation approach in such a way that it also performs well when translation resources do not provide full coverage of collection words.

4. We show that feedback information from documents in multiple languages can be naturally incorporated to improve the ranking of documents in each subcollection. This feature prevents query drift when documents in one language do not cover the topic of a query.

Results of our experiments reveal that our approach outperforms the previous MLIR approaches, namely fusion-based methods (Powell et al. 2000; Savoy 2003; Martinez-Santiago et al. 2006) and a direct method (Nie and Jin 2003). Furthermore, we achieve MLIR effectiveness between 68 and 93 % of the theoretical optimal effectiveness achievable by any fusion-based method. This is also higher than the percentage reported in the previous work (Martinez-Santiago et al. 2006), which is the state of the art among unsupervised fusion-based methods.

The rest of the paper is organized as follows. In Sect. 2, we review previous work on MLIR. Section 3 discusses basics of the KL-divergence retrieval model. The main approach is presented in Sect. 4 and its features are discussed in Sect. 5. Section 6 describes the experimental setup and evaluation of our approach. Finally, we conclude in Sect. 7.

2 Related work

We divide the MLIR approaches into two groups, fusion-based and direct approaches; fusion-based approaches merge retrieval results from separate indexes, while direct approaches avoid the merging phase by retrieving documents from a single index.

Direct approaches In the direct method proposed by Braschler et al. (2002), all documents of a multilingual collection are translated into the query language. Thus, the multilingual retrieval task is reduced to monolingual retrieval on the translated document collection. Subsequent approaches overcome the problem of translating all documents into all languages. Nie and Jin (2002, 2003) build a unique index for a multilingual collection in which each word has a language tag. To rank the documents with respect to a query, the query is expanded by adding the translations of each query term in all languages with their associated probabilities. Documents and multilingual queries are then matched using a TF-IDF weighting scheme. The inverse document frequency (IDF) of a term is estimated on the unified multilingual index and is therefore higher than the IDF calculated on the term’s respective subcollection. However, the amount of increase in IDF is not equal for terms of different languages, due to the difference in the sizes of the respective subcollections. This can cause improperly high weights for terms of a language with a smaller number of documents. In another study, Sorg and Cimiano (2012) build a unique index from inter-lingual representations of documents using Wikipedia.

Fusion-based approaches are further divided into three sub-groups based on their merging strategy.

The first group of fusion-based methods uses only the ranks and/or scores of documents to merge the result lists from mono- and cross-lingual runs. Traditional approaches such as round-robin merging (Savoy 2002; Chen and Gey 2004), raw-score merging (Savoy 2003), and normalized-score merging (Savoy 2004b; Savoy and Berger 2005; Savoy 2004a; Jones et al. 2005) belong to this group. These approaches rely on assumptions about the distribution of relevant documents across subcollections or about the comparability of retrieval scores from different subcollections.

The second group of fusion-based methods tries to partially relax the aforementioned limiting assumptions by exploiting more information from the underlying subcollections or retrieved lists. For this purpose, Braschler and Schäuble (2000) align documents in different languages in the collection and use this information to make the scores of retrieved documents comparable. Braschler (2004) investigates improving simple merging strategies by letting intermediate lists contribute to the final result according to their respective subcollection sizes. Lin and Hsi (2004) consider subcollection characteristics and translation qualities in the merging phase, and propose a weighted combination of intermediate results; multiple parameters are used to determine the combination weights, which makes tuning difficult. Martinez-Santiago et al. (2006) perform an additional retrieval step to merge the intermediate results: the second retrieval step re-ranks the top-k documents retrieved in the first step with respect to a query expanded with translations of its terms in all languages.

The third group of fusion-based approaches applies machine learning techniques to the merging problem. Le Calvé and Savoy (2000) explore the use of logistic regression to learn the probability of relevance for each document based on its score and the logarithm of its rank. Si and Callan (2006) also propose a query- and language-specific result merging algorithm based on the logistic model. Gao et al. (2009) define features of the learning method based on a document’s similarities with the query, with other retrieved documents in its language, and with retrieved documents in other languages. Tsai et al. (2008) define several features, such as technical terms and person and organization names, for learning the document ranks. Extracting some of these features depends heavily on the availability of language-specific tools; therefore, they might not be available for languages with limited resources.

Language modeling framework In addition to monolingual information retrieval, the language modeling framework has been used for cross-lingual (bilingual) information retrieval (Xu et al. 2001; Lavrenko et al. 2002; Kraaij et al. 2003). These approaches adapt the idea of translation models for monolingual information retrieval, proposed in (Berger and Lafferty 1999), to CLIR. Thus, they are applicable when the documents of a collection are written in one language and the goal is to rank these documents w.r.t. a query in another language. Adapting these approaches to MLIR requires a merging phase to combine the results that are separately generated by the language modeling framework for each language (monolingual and bilingual runs). Although the language modeling framework performs well in ranking the documents of each subcollection, the merging phase can degrade the performance of MLIR. Therefore, ranking the documents of a multilingual collection in one retrieval phase is preferable, as it avoids performance degradation through merging. To the best of our knowledge, this work is the first direct approach to MLIR using the language modeling framework. We use different merging algorithms on the individual results produced by the language modeling framework as baselines for evaluating our approach.

Our approach is similar to the approaches of (Kraaij et al. 2003; Xu et al. 2001) in terms of building new language models for documents. But, in (Kraaij et al. 2003; Xu et al. 2001), the new language model of a document has the same number of parameters as the number of words in the source (query) language, while in our approach, the number of parameters is the same as the number of words in all languages in order to enable direct retrieval.

In addition to removing the merging phase, our approach naturally allows using feedback information from one subcollection to improve the retrieval performance on other subcollections. Getting assistance from subcollections in other languages is not directly possible in fusion-based methods for MLIR, regardless of which retrieval model is used for generating the individual lists. Therefore, using the language modeling approaches of (Kraaij et al. 2003; Xu et al. 2001) for MLIR cannot provide this advantage.

3 Language modeling approach

In this section, we briefly review the basics of the language modeling framework which are required to describe our approach.

Monolingual information retrieval The Kullback–Leibler (KL) divergence retrieval model is considered the state of the art for retrieval using the language modeling approach. With the KL-divergence model, the score of a document D with respect to a query Q is calculated as (Lafferty and Zhai 2001):

$$\begin{aligned} {\mathrm {Score}}(Q,D) & = - D_{KL}(\theta _{Q} \Vert \theta _{D}) \nonumber \\ & \mathop = \limits ^{{\mathrm {rank}}} \sum _{w \in v}{p(w| \theta _{Q}) \log p(w| \theta _{D})}, \end{aligned}$$
(1)

where \(\theta _{Q}\) and \(\theta _{D}\) are the estimated query and document language models, respectively, and v is the vocabulary set. Assuming a multinomial model, the basic approach to estimating document language models is the maximum likelihood (ML) estimator, which estimates word probabilities as follows:

$$\begin{aligned} p_{ml}(w|\theta _{D}) = \frac{c(w,D)}{|D|}, \end{aligned}$$
(2)

where \(c(w,D)\) is the count of word \(w\) in document \(D\) and \(|D|\) is the length of \(D\). The maximum likelihood estimator assigns zero probability to words unseen in a document, causing problems when scoring the document using Eq. (1) (Zhai and Lafferty 2001b). Smoothing methods address this problem by discounting the probabilities of words observed in the document and assigning non-zero probabilities to unseen words. One commonly used smoothing technique is Dirichlet prior smoothing (Zhai and Lafferty 2001b), in which the language model for a document \(D\) is estimated as:

$$\begin{aligned} p(w|\theta _D) = \frac{|D|}{|D| + \mu } p_{ml} (w|D) + \frac{\mu }{|D| + \mu } p(w|C), \end{aligned}$$
(3)

where \(\mu\) is the smoothing parameter and \(p(.|C)\) is the collection language model.
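To make the scoring concrete, the following minimal Python sketch ranks the documents of a toy, hypothetical monolingual collection with the rank-equivalent form of Eq. (1), using a maximum likelihood query model and Dirichlet-smoothed document models as in Eqs. (2) and (3); all documents, the query, and the parameter values are illustrative assumptions.

```python
import math
from collections import Counter

def dirichlet_lm(doc_counts, coll_lm, mu=2000):
    """Smoothed document model p(w|theta_D), Eq. (3)."""
    dlen = sum(doc_counts.values())
    return lambda w: (doc_counts.get(w, 0) + mu * coll_lm.get(w, 0.0)) / (dlen + mu)

def kl_score(query_lm, doc_lm):
    """Rank-equivalent KL-divergence score of Eq. (1)."""
    return sum(p * math.log(doc_lm(w)) for w, p in query_lm.items() if doc_lm(w) > 0)

# Hypothetical documents and query.
docs = {"d1": "language model retrieval model".split(),
        "d2": "machine translation model".split()}
doc_counts = {d: Counter(t) for d, t in docs.items()}

# Collection model p(w|C): maximum likelihood over the whole collection.
coll_counts = sum(doc_counts.values(), Counter())
coll_len = sum(coll_counts.values())
coll_lm = {w: n / coll_len for w, n in coll_counts.items()}

# Query model p(w|theta_Q) by maximum likelihood.
query = "retrieval model".split()
query_lm = {w: n / len(query) for w, n in Counter(query).items()}

ranked = sorted(doc_counts,
                key=lambda d: kl_score(query_lm, dirichlet_lm(doc_counts[d], coll_lm)),
                reverse=True)
print(ranked)  # documents ordered by score; "d1" ranks first here
```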

Feedback The KL-divergence framework provides a principled way to leverage feedback information in order to improve the estimation of the query language model. In model-based feedback (Zhai and Lafferty 2001a), the query language model is updated using the feedback model estimated based on the feedback documents:

$$\begin{aligned} p\left( w|\theta '_Q\right) = \lambda p(w|\theta _Q) + (1 - \lambda ) p(w|\theta _F), \end{aligned}$$
(4)

where \(\theta '_Q\) is the new language model for the query, \(\theta _F\) is the estimated feedback model, and \(0\le \lambda \le 1\) is the interpolation parameter.

Cross-Lingual IR The KL-divergence retrieval model can also be used for cross-language information retrieval (CLIR). In CLIR, the language of the query is different from that of the documents. Therefore, to score documents with respect to a given query, we need to integrate a translation model into either the query or the document language model (Nie 2010). The translation model, in its basic form, includes a translation probability for each pair of source- and target-language words. In the query translation approach, a new language model is built for the query and documents are ranked using:

$$\begin{aligned} {\mathrm {Score}}(Q,D)= & {} \sum _{w_t \in V_t} p(w_t | \tilde{\theta }_Q) \log p(w_t | \theta _D), \end{aligned}$$
(5)
$$\begin{aligned} p(w_t | \tilde{\theta }_Q)= & {} \sum _{w_s \in V_s} p(w_t | w_s, \theta _Q) p(w_s | \theta _Q) \nonumber \\\approx & {} \sum _{w_s \in V_s} p(w_t | w_s) p(w_s | \theta _Q), \end{aligned}$$
(6)

where \(w_s\) and \(w_t\) are source and target words belonging to the source and target vocabularies \(V_s\) and \(V_t\), respectively, and \(p(w_t|w_s)\) indicates the probability of translating the source word \(w_s\) to the target word \(w_t\). In the document translation approach, the translation model is integrated into the document language models and the score of a document \(D\) is given by:

$$\begin{aligned} {\mathrm {Score}}(Q,D)= & {} \sum _{w_s \in V_s} p(w_s | \theta _Q) \log p(w_s | \tilde{\theta }_D), \end{aligned}$$
(7)
$$\begin{aligned} p(w_s | \tilde{\theta }_D)= & {} \sum _{w_t \in V_t} p(w_s | w_t, \theta _D) p(w_t | \theta _D) \nonumber \\\approx & {} \sum _{w_t \in V_t} p(w_s | w_t) p(w_t | \theta _D), \end{aligned}$$
(8)

where \(p(w_s|w_t)\) indicates the probability of translating the target word \(w_t\) to the source word \(w_s\). We call this approach LM-based document translation to avoid confusion with the traditional document translation approach, which literally translates the whole document and then indexes the translated document.
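As an illustration of Eq. (6), the sketch below builds a translated query model from a hypothetical word-to-word translation table \(p(w_t|w_s)\); the resulting model can then be scored against target-language documents exactly as in the monolingual case. The document translation direction of Eqs. (7)–(8) is symmetric, applying the same summation to the document model with \(p(w_s|w_t)\) instead.

```python
from collections import defaultdict

def translate_query_model(query_lm_src, trans_probs):
    """p(w_t|~theta_Q) = sum_{w_s} p(w_t|w_s) p(w_s|theta_Q), Eq. (6)."""
    q_lm_tgt = defaultdict(float)
    for w_s, p_s in query_lm_src.items():
        for w_t, p_t_given_s in trans_probs.get(w_s, {}).items():
            q_lm_tgt[w_t] += p_t_given_s * p_s
    return dict(q_lm_tgt)

# Hypothetical English-to-French translation probabilities p(w_t|w_s).
trans_probs = {"drug":    {"drogue": 0.7, "medicament": 0.3},
               "traffic": {"trafic": 0.9, "circulation": 0.1}}
query_lm_src = {"drug": 0.5, "traffic": 0.5}

print(translate_query_model(query_lm_src, trans_probs))
# {'drogue': 0.35, 'medicament': 0.15, 'trafic': 0.45, 'circulation': 0.05}
```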

4 Extension of the language modeling framework to MLIR

In this section, we describe our approach for MLIR.

Problem definition Suppose that we have a multilingual collection \(C\) where its documents are written in \(N\) different languages \(\{l_i\}_{i=1}^N\). For each language pair, we are given a translation model of the form \(p(w|u)\) which indicates the probability of translating word \(u\) in one language to word \(w\) in another language. The goal is to optimize the effectiveness of multilingual information retrieval given the available translation models. In particular, we aim to estimate the score of a document \(D\) (\(D \in C\)) with respect to a query \(Q\) in language \(l_j\) (\(1\le j\le N\)) in order to provide a ranking of documents in multiple languages \(\{l_i\}_{i=1}^N\).

Before presenting the solution, let us first define some notations. By dividing the collection \(C\) based on document languages, we get \(N\) subcollections \(\{C_i\}_{i=1}^N\) where \(C=\cup _{i=1}^{N}{C_i}\) and all documents in each subcollection \(C_i\) are in the same language \(l_i\). We define \(v_i\) as the vocabulary set of language \(l_i\) and \(V=\cup _{i=1}^{N}{v_i}\) as the vocabulary set of the entire collection. Words of each language are labeled with a language tag similar to (Nie and Jin 2002). Thus, common words between languages are considered as separate words and \({|V|=\sum _{i=1}^N|v_i|}\). The reason behind this decision is described in Sect. 5.1. The probabilities in the translation model for each language pair \({(l_i,l_j)}\), \({1\le i,j\le N \,{\mathrm {and}}\,i\ne j}\), are denoted by \(p_{ij}(w|u)\) which indicates the probability of translating word \(u\) in language \(l_j\) to word \(w\) in language \(l_i\). The given translation probabilities are normalized such that \(\sum\nolimits_{w\in v_i}p_{ij}(w|u)=1\) for each pair \((l_i,l_j)\), \(1\le i,j\le N\), of languages. In addition, self translation probabilities are set to one, i.e., \(p_{ii}(w|w)=1\). In the following, we describe how to perform multilingual information retrieval using the language modeling framework.
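The sketch below illustrates, on hypothetical data, the two conventions assumed above: renormalizing each word's translation probabilities so that \(\sum _{w\in v_i}p_{ij}(w|u)=1\), and fixing the self-translation probability to one within a language.

```python
def normalize_translation_model(raw_probs, target_vocab):
    """Renormalize p_ij(w|u) over the target vocabulary v_i so it sums to 1 per source word u."""
    model = {}
    for u, candidates in raw_probs.items():
        kept = {w: p for w, p in candidates.items() if w in target_vocab}
        total = sum(kept.values())
        if total > 0:
            model[u] = {w: p / total for w, p in kept.items()}
    return model

# Hypothetical raw probabilities from a word-alignment tool (need not sum to 1).
raw = {"house": {"maison": 0.6, "domicile": 0.2}}
print(normalize_translation_model(raw, {"maison", "domicile"}))
# {'house': {'maison': 0.75, 'domicile': 0.25}}

# Within a language, the self-translation probability is fixed: p_ii(w|w) = 1.
```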

4.1 Multilingual unigram language model

Multilingual language model for document representation As mentioned before, a document in a multilingual environment can be retrieved with respect to queries in different languages. Hence, we should extend the basic estimation of document language models to support these queries. In all basic language modeling approaches, the parameters of the document language model are estimated by considering the document as the observed data, which is problematic because the document may be only a small sample of its language model. Extending the language models of documents to support queries in different languages makes the estimation of document language models even more challenging, because the document has no term in many of the query languages. To tackle this issue, we define the probabilistic count of term \(w\in v_i\) (belonging to language \(l_i\)) in document \(D\) written in language \(l_j\) as:

$$\begin{aligned} c_{p}(w, D) = \sum _{u \in D}{p_{ij}(w|u)c(u,D)}, \end{aligned}$$
(9)

where \(c(u,D)\) is the real count of term \(u\) in document \(D\). Defining probabilistic counts of terms is intuitively analogous to expanding each document with terms of other languages than the document language. The expanded documents are then considered as bags of multilingual words, i.e., we assume independence between terms of expanded documents. This simplifying assumption is also made in estimating document language models in monolingual information retrieval (Zhai 2008).

We build a new multilingual language model for document \(D\), denoted by \({\hat{\theta }}_{D}\), considering the probabilistic counts of words instead of only the real counts. To estimate the parameters of \({\hat{\theta }}_{D}\), we follow the well-established unigram multinomial language model, where the probability of generating a sequence of words is obtained by multiplying the probabilities of generating each of its words, assuming that words are generated independently. Therefore, the parameters of this multilingual model are \(\{p(w_i|{\hat{\theta }}_{D})\}_{i=1}^{|V|}\), i.e. all terms of all languages. The maximum likelihood estimator gives us:

$$\begin{aligned} p_{ml}\left(w|{\hat{\theta }}_{D}\right) = \frac{c_{p}(w, D)}{\sum _{u \in V}{c_{p}(u, D)}} = \frac{c_{p}(w, D)}{N|D|}, \end{aligned}$$
(10)

where \(N\) is the number of different languages in the collection, \(|D|\) is the length of document \(D\) accounting real counts of its terms, and \(N|D|\) represents the new size of document \(D\) considering the probabilistic counts. For illustration, consider the example in Fig. 1a. In the basic ML-estimated language model for document \(D_1\) in the figure, term \(a\) has probability 1, while term \(a\) as well as its translation, term \(\alpha\), has probability 0.5 within the multilingual language model built by our approach (Eq. (10)) using the probabilistic counts shown in Fig. 1b.
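A minimal sketch of Eqs. (9) and (10), assuming language-tagged words and a hypothetical dictionary in the spirit of Fig. 1 with translation probability 1 from \(a\) to \(\alpha\); under these assumptions it reproduces the probabilities of 0.5 mentioned above for document \(D_1\).

```python
from collections import Counter, defaultdict

def probabilistic_counts(tokens, doc_lang, trans, languages):
    """c_p(w, D) = sum_u p_ij(w|u) c(u, D), Eq. (9).
    trans[(l_i, l_j)][u][w] = p_ij(w|u); words are kept as (language, form) pairs."""
    real = Counter(tokens)
    cp = defaultdict(float)
    for u, c in real.items():
        cp[(doc_lang, u)] += c                      # self translation, p_jj(u|u) = 1
        for l_i in languages:
            if l_i == doc_lang:
                continue
            for w, p in trans.get((l_i, doc_lang), {}).get(u, {}).items():
                cp[(l_i, w)] += p * c
    return cp

def mulm_ml(cp, doc_len, n_langs):
    """p_ml(w | theta_hat_D) = c_p(w, D) / (N |D|), Eq. (10)."""
    return {w: c / (n_langs * doc_len) for w, c in cp.items()}

trans = {("l2", "l1"): {"a": {"alpha": 1.0}}}       # hypothetical dictionary
cp = probabilistic_counts(["a"], "l1", trans, ["l1", "l2"])
print(mulm_ml(cp, doc_len=1, n_langs=2))
# {('l1', 'a'): 0.5, ('l2', 'alpha'): 0.5}
```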

Fig. 1 A sample collection. a A sample collection and dictionary. Numbers above the arrows indicate translation probabilities. b Conceptual representations of documents considering probabilistic counts of terms (numbers in parentheses). c Literally translated documents into one language \(l_1\)

The estimates provided in Eq. (10) suffer from the same problem of underestimating probabilities for words that have zero probabilistic counts in a document, computed according to Eq. (9). To address this issue, we can generalize existing smoothing techniques to be applicable on our multilingual language model. We consider here a smoothing method that uses a reference language model. First, we proceed with the estimation of the reference language model in our retrieval model.

New reference language model To estimate the reference language model for smoothing techniques, probabilistic counts of words in all documents of the entire collection are used. That is,

$$\begin{aligned} p'(w|C) = \frac{\sum _{D \in C}{c_{p}(w,D)}}{\sum _{D \in C}{\sum _{u \in V}{c_{p}(u,D)}}} = \frac{\sum _{D \in C}{c_{p}(w,D)}}{N\sum _{D \in C}{|D|}}, \end{aligned}$$
(11)

where \(p'(.|C)\) denotes the new reference language model. This estimate of the reference language model considers the probabilistic counts of each word in all subcollections, rather than only the subcollection that actually includes that word. Therefore, this collection language model can be considered as an expanded estimate of the reference language model, compared to the ML-estimate,

$$\begin{aligned} p(w|C)=\frac{\sum _{D \in C}{c(w,D)}}{\sum _{D \in C}{|D|}}, \end{aligned}$$
(12)

which is equal to \({\frac{\sum _{D \in {\varvec{C}}_{\varvec{i}}}{c(w,D)}}{\sum _{D \in C}{|D|}}}\) if word \(w\) belongs to language \(l_i\). The effect of counting word occurrences in all subcollections is investigated in more detail in Sect. 5.3.
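Continuing the sketch above (same hypothetical data structures), the expanded reference model of Eq. (11) pools the probabilistic counts of all documents, whereas the ML estimate of Eq. (12) uses real counts only:

```python
from collections import defaultdict

def expanded_reference_lm(prob_counts_per_doc, doc_lengths, n_langs):
    """p'(w|C), Eq. (11): probabilistic counts summed over the entire collection."""
    totals = defaultdict(float)
    for cp in prob_counts_per_doc:
        for w, c in cp.items():
            totals[w] += c
    denom = n_langs * sum(doc_lengths)
    return {w: c / denom for w, c in totals.items()}

def ml_reference_lm(real_counts_per_doc, doc_lengths):
    """p(w|C), Eq. (12): real counts only, i.e. each word is counted in its own subcollection."""
    totals = defaultdict(float)
    for counts in real_counts_per_doc:
        for w, c in counts.items():
            totals[w] += c
    denom = sum(doc_lengths)
    return {w: c / denom for w, c in totals.items()}
```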

Smoothing The new reference language model (Eq. (11)) should be used in smoothing techniques that need a fallback model. For example, the smoothed multilingual language model for a document \(D\) using Dirichlet Prior smoothing technique is estimated as:

$$\begin{aligned} p\left(w|\hat{\theta }_{D}\right) = \frac{N|D|}{N|D| + \mu } p_{\mathrm {ml}} \left(w|\hat{\theta }_{D}\right) + \frac{\mu }{N|D| + \mu } p'(w|C). \end{aligned}$$
(13)

Ranking documents By substituting the smoothed multilingual language models for documents in Eq. (1), the score of each document in the collection can be calculated with respect to any given query, independent of the original language of the document. Ranking based on these scores gives us a multilingual result list without the need to merge different ranked lists.
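Putting the pieces together, a minimal sketch of Eq. (13) and the resulting one-phase ranking, reusing the hypothetical `probabilistic_counts` and `expanded_reference_lm` helpers from the previous sketches:

```python
import math

def smoothed_mulm(cp, doc_len, n_langs, ref_lm, mu=2000):
    """Dirichlet prior smoothing of the multilingual document model, Eq. (13)."""
    denom = n_langs * doc_len + mu
    return lambda w: (cp.get(w, 0.0) + mu * ref_lm.get(w, 0.0)) / denom

def rank_collection(query_lm, docs_cp, doc_lens, n_langs, ref_lm):
    """Score every document of the multilingual collection with Eq. (1); no merging step."""
    scores = {}
    for doc_id, cp in docs_cp.items():
        model = smoothed_mulm(cp, doc_lens[doc_id], n_langs, ref_lm)
        scores[doc_id] = sum(p * math.log(model(w))
                             for w, p in query_lm.items() if model(w) > 0)
    return sorted(scores, key=scores.get, reverse=True)
```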

4.2 Dictionary coverage

The estimates of the document and reference language models described in the previous subsection are valid when the translation resource fully covers the words in the vocabulary set \(V\), which may not hold in practice. Even with high-quality translation resources, there may be many words in the collection with no entry in them, e.g., out-of-vocabulary, misspelled, or informal words. A first remedy is to extend each translation model with translation relations implied by transitivity through a pivot language; however, this only reduces the severity of the problem and does not yield a translation resource with full coverage. In the case of incomplete dictionary coverage, \(\sum\nolimits _{u\in V}{c_p(u,D)}\) in Eq. (10) is not equal to \(N|D|\). Therefore, we have:

$$\begin{aligned} p_{ml}\left(w|\hat{\theta }_{D}\right) = \frac{c_{p}(w, D)}{\sum _{u \in V}{c_{p}(u, D)}}. \end{aligned}$$
(14)

Length ratio However, estimation of document language models using Eq. (14) implies disregarding words with no entry in the translation resource, which does not preserve the length ratio of documents. In particular, the length ratio of documents may differ when we count real occurrences of words in the documents compared to when we sum the probabilistic counts of words. This contradicts a retrieval axiom which is explored in detail in Sect. 5.2. To resolve this problem, we consider dummy words as the translations of words with no entry in the translation resource. These dummy words do not match any query term, but help to preserve the length ratio of documents. Therefore, the language model of a document \(D\) using the maximum likelihood estimator is estimated by the following equation considering dummy words:

$$\begin{aligned} p_{ml}\left(w|\hat{\theta }_{D}\right) = \frac{c_p(w,D)}{N|D|}. \end{aligned}$$
(15)

Particularly, it suffices to consider \(N|D|\) as the length of the document \(D\). The new reference language model is also estimated as:

$$\begin{aligned} p''(w|C) = \frac{\sum _{D \in C}{c_{p}(w,D)}}{N\sum _{D \in C}{|D|}}. \end{aligned}$$
(16)
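The difference between the two estimators lies only in the normalizer, as the short sketch below shows (same assumptions as in the earlier sketches); Eq. (15) keeps \(N|D|\), which is equivalent to padding the expanded document with dummy, never-matching words.

```python
def mulm_ml_ignoring_gaps(cp):
    """Eq. (14): normalize by the probabilistic mass actually produced, so
    untranslatable words silently inflate the remaining probabilities."""
    total = sum(cp.values())
    return {w: c / total for w, c in cp.items()}

def mulm_ml_with_dummies(cp, doc_len, n_langs):
    """Eq. (15): normalize by N|D|; the missing mass is implicitly assigned to
    dummy words that never match a query term, preserving document length ratios."""
    return {w: c / (n_langs * doc_len) for w, c in cp.items()}
```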

Term discrimination value (IDF heuristic) Another side effect of words without translations emerges when discriminating query words based on the reference language model. A frequent term in the collection has a high probability in the reference language model \(p''(w|C)\). Smoothing document language models with the reference language model causes the weights of matched terms between a query and a document in Eq. (1) to include a factor of \({}^{1}\!/_{p''(w|C)}\), which consequently causes frequent terms to be penalized (Footnote 1) (Zhai 2008).

A word with no entry in the translation resource may get an artificially high discrimination value, because its frequency is not increased by documents of other subcollections even though its translations may be available there. Therefore, we cannot rely solely on the reference language model estimated from the expanded documents to determine the frequent terms. To address this issue, we propose two solutions, both of which additionally employ the reference language model estimated without considering translations.

The first solution is to combine the expanded and the maximum likelihood estimates of the reference language model. Toward this, we adopt linear interpolation:

$$\begin{aligned} \hat{p}(w|C) = \beta p''(w|C) + (1 - \beta ) p(w|C), \end{aligned}$$
(17)

where \(p''(.|C)\) is the global estimate of word statistics (the expanded estimate of the reference language model), while \(p(.|C)\) is the ML-estimate of the reference language model and depends on occurrences of words only in their respective subcollections (Eq. (12)), and \(0 \le \beta \le 1\) is a weighting parameter that can be determined based on the dictionary coverage. \(\hat{p}(.|C)\) can subsequently be employed as the reference language model for smoothing.

The second solution to avoid rewarding words with no entry in the translation resource is to use 2-stage smoothing to estimate the document language models as:

$$\begin{aligned} p\left(w|\hat{\theta }_D\right) = (1 - \lambda ) \frac{c_p(w,D) + \mu p''(w|C)}{N |D| + \mu }+ \lambda p(w | C). \end{aligned}$$
(18)

As mentioned in (Zhai and Lafferty 2002), the purpose of the first stage of smoothing is to explain unseen words in a document. Therefore, for the first stage, we use the reference language model estimated globally using probabilistic counts. The second stage of smoothing is then supposed to reduce the effect of noise words in ranking the documents, for which we use the reference language model estimated based only on the real counts of the words in the collection. In all that follows, we use only this solution and leave the first solution for future work.
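A sketch of the 2-stage estimate of Eq. (18), assuming the globally estimated reference model \(p''(\cdot|C)\) and the ML reference model \(p(\cdot|C)\) are given as dictionaries:

```python
def two_stage_mulm(cp, doc_len, n_langs, ref_global, ref_ml, mu=2000, lam=0.5):
    """Eq. (18): the first (Dirichlet) stage explains unseen words with p''(.|C);
    the second stage interpolates with the ML reference model p(.|C) to absorb noise."""
    def prob(w):
        first = (cp.get(w, 0.0) + mu * ref_global.get(w, 0.0)) / (n_langs * doc_len + mu)
        return (1.0 - lam) * first + lam * ref_ml.get(w, 0.0)
    return prob
```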

4.3 Incorporation of feedback information

In this section, we study the feedback concept in a multilingual environment and its incorporation in our MLIR approach. The purpose of using feedback in the retrieval task is to update a query with feedback information to achieve better performance. In the relevance feedback technique, feedback information is obtained from sample relevant documents, which are substituted in pseudo relevance feedback by documents that seem to match the query in an initial retrieval run.

Multilingual feedback model In multilingual retrieval, feedback information should be extracted from documents in different languages, since both the documents relevant to the query and the top-ranked documents of an initial run are generally in different languages. Feedback information extracted from either set of documents is therefore in multiple languages. Incorporating this view of feedback into the model-based feedback technique, the topic model (\(\theta _F\) in Eq. (4)) extracted from feedback documents should be multilingual. In particular, the parameters of the topic language model include terms of the different languages available in a multilingual collection. Building such a feedback model is not possible with the existing approaches mentioned in Sect. 2. In our approach, we can naturally build a multilingual topic model, since the language models of documents are multilingual.

Multilingual query After estimating the feedback topic model, the next step is to update the query language model using Eq. (4). Interpolating the query language model with the multilingual feedback model results in a query language model different from the initial query model. The new query model may have terms in different languages, i.e., incorporating feedback information results in a multilingual query. The next step is to score documents of the multilingual collection with respect to this new query model, which imposes additional complexity due to query terms in multiple languages.

To score documents of a multilingual collection with respect to a multilingual query using the basic retrieval models, the query should be translated into one language. Otherwise, only query terms in the language of a document have impact on the score of that document and thus documents of different subcollections are scored with respect to different parts of the query which are not equivalent. In contrast, our proposed approach allows directly retrieving relevant documents to a multilingual query, without any additional query translation, since the new document models have parameters equivalent to the terms of all languages.

Therefore, the great advantage of our approach is that all components of a retrieval framework including query expansion, relevance feedback, and pseudo relevance feedback are directly applicable to multilingual information retrieval.
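The sketch below illustrates the idea under simplifying assumptions: a crude feedback model built from the pooled language-tagged terms of the top-ranked documents (a full implementation would instead estimate \(\theta _F\) with the EM procedure of model-based feedback), interpolated into the query model as in Eq. (4). Because the document models already cover all languages, the resulting multilingual query needs no further translation.

```python
from collections import Counter

def pooled_feedback_model(top_doc_prob_counts, n_terms=10):
    """Crude stand-in for theta_F: most frequent language-tagged terms of the
    feedback documents, normalized to a distribution (hypothetical simplification)."""
    pooled = Counter()
    for cp in top_doc_prob_counts:
        pooled.update(cp)
    top = dict(pooled.most_common(n_terms))
    total = sum(top.values())
    return {w: c / total for w, c in top.items()}

def update_query_model(query_lm, feedback_lm, lam=0.5):
    """Eq. (4): interpolate the original query model with the multilingual feedback model."""
    vocab = set(query_lm) | set(feedback_lm)
    return {w: lam * query_lm.get(w, 0.0) + (1.0 - lam) * feedback_lm.get(w, 0.0)
            for w in vocab}
```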

Knowledge transfer One problem of query expansion through pseudo relevance feedback is query drift that occurs when the collection has few relevant documents with respect to a query. In retrieval on a multilingual collection, one subcollection may have fewer relevant documents to a query compared to the others. Merging the ranked list of documents in this subcollection, generated by applying the pseudo relevance feedback technique, may harm the overall multilingual performance.

Leveraging multilingual feedback information has the remarkable benefit of transferring knowledge between subcollections, which prevents query drift. Considering \(C_i\) as the subcollection with few relevant documents w.r.t. a query, most of the top-ranked documents, and hence most of the feedback terms, would be in languages other than \(l_i\). In our approach, these feedback terms can still increase the retrieval performance on subcollection \(C_i\) even though they are in other languages. Since the new language models of documents assign probabilities to the terms of all languages, feedback terms can directly match documents in subcollection \(C_i\) without translation. Therefore, feedback terms in other languages can help to increase recall in subcollection \(C_i\). We deal with this issue in a similar way to (Chinnakotla et al. 2010), but with a lower overhead. In (Chinnakotla et al. 2010), feedback information is obtained from top-ranked documents of an assisting collection in a language different from the query language, and the original query is then updated by translating the obtained feedback terms.

5 Discussions of the proposed multilingual retrieval model

In this section, we analytically study some aspects of the proposed retrieval model using axiomatic analysis (Fang et al. 2011) and also discuss the computational complexity of our approach. Axiomatic analysis is based on formal constraints that any reasonable retrieval model should satisfy.

5.1 Common terms between languages

We first discuss the reason for labeling terms with language tags in the presence of common terms between languages. By common terms, we mean words of two languages that have the same spelling after preprocessing steps such as normalization and stemming. Common terms thus include the subset of cognates whose spellings are identical, and may also include some proper nouns. The reason for labeling terms is described through the following constraint.

MLIR Constraint 1 Consider a collection in two languages \(l_1\) and \(l_2\) which share a common term \(x\). Let \({q=\{xz\}}\) be a two-term query in language \(l_1\). We are interested in the relative ranking of two documents \(D_1\) and \(D_2\) in language \(l_1\) w.r.t query \(q\), where \(D_1\) contains the common term but \(D_2\) does not. Suppose the following assumptions hold for the documents and given dictionaries:

  • \({|D_1| = |D_2|}\), \({c(x, D_1) = c(z, D_2)}\), \({z \notin D_1}\), and \({x \notin D_2}\).

  • Terms \(x\) and \(z\) have the same discrimination value considering the entire collection.

  • Term \(x\) in \(l_1\) translates to term \(x\) in \(l_2\) with probability greater than 0, but translations of \(z\) into \(l_2\) do not belong to \(v_1\). In particular, we have: \({p_{21}(x|x) > 0}\) and if \({p_{21}(\zeta | z) > 0 }\), then \({\zeta \notin v_1}\).

Given these assumptions, \(D_1\) and \(D_2\) should get the same score.

To analyze the mentioned constraint for our MLIR approach, we first calculate the probabilistic counts of words in each document given the probabilistic dictionary. If we do not distinguish term \(x\) in the two languages, then for calculating the probabilistic count of term \(x\) in document \(D_1\), we count \(x\) in both languages, i.e., \({c_p(x, D_1) > c(x, D_1)}\). On the other hand, for counting \(z\) in \(D_2\), we only have \(z\) in \(l_1\), i.e., \({c_p(z, D_2) = c(z, D_2)}\). Therefore, \({c_p(x, D_1) > c_p(z, D_2)}\). This causes document \(D_1\) to artificially have more occurrences of query terms and hence get a higher rank than document \(D_2\), which is not desirable. This problem does not arise when terms are labeled with language tags.

5.2 Incomplete dictionary coverage

Another point to discuss is the proposed estimation approach for the case that the available translation resources do not provide full coverage of words (Eqs. (15), (16)). Estimating document language models using Eq. (14) does not preserve the length ratio of documents. As a consequence, the terms of a document containing term(s) with no entry in the translation dictionary are artificially boosted compared to those of a document whose terms are all covered by the dictionary. This leads to an improper ranking of documents, which we show using the second MLIR constraint.

MLIR Constraint 2 Let \(D_1\) and \(D_2\) be two documents in the same language in a multilingual collection and the two documents differ only in one term. Therefore, \({|D_1| = |D_2|}\). Let terms \(x\) and \(y\) belonging to \(D_1\) and \(D_2\), respectively, represent the only difference between these two documents. Also, assume that the dictionary contains translations for \(x\), but not for \(y\). Let \(q\) be a query that contains neither \(x\) nor \(y\). Under these assumptions, \(D_1\) and \(D_2\) should get the same score w.r.t. \(q\).

Analyzing the second constraint on our approach The terms matched between a query and a document contribute to the document’s score in our approach. Let \(z\) denote a matched term between document \(D_1\) (\(D_2\)) and query \(q\). We are thus interested in \(p(z|D_1)\) and \(p(z|D_2)\) to determine the relative ranking of the two documents in the result list. If we estimate term probabilities using Eq. (14), then the value of the denominator for \(D_2\) is smaller than that for \(D_1\), which means that the length ratio of these documents changes despite their initially equal lengths. Since the probabilistic counts of \(z\) in both documents are equal, we have \(p(z|D_2)>p(z|D_1)\) and consequently \({\mathrm {Score}}(q,D_2)>{\mathrm {Score}}(q,D_1)\). This scoring is contrary to the reasonable scoring of the documents, \({\mathrm {Score}}(q,D_2)={\mathrm {Score}}(q,D_1)\). But, considering \({N|D|}\) as the denominator in Eq. (15), we achieve the expected ranking.

5.3 Term discrimination value

The term discrimination constraint (TDC), introduced in (Fang et al. 2004), regulates the impact of discrimination values of query terms on a document’s score. TDC states that between two equal-length documents with the same total occurrences of query terms, the document containing more occurrences of the more specific query term should get a higher score.

TDC axiom in a multilingual environment The main objective is to show that the discrimination values of terms in a multilingual collection should be determined by considering term occurrences in all documents of the collection, independent of their languages. To clarify, consider the example collection in Fig. 1a. Let \({q=\{ab\}}\) be a two-term query in language \(l_1\). The goal is to investigate the reasonable relative ranking of documents \(D_1\) and \(D_3\) in language \(l_1\). Note that their relative ranking depends only on the discrimination values of the query terms. To consider the entire collection in determining term discrimination values, all documents are translated into one language \((l_1)\) using the dictionary of Fig. 1a. Given the translated collection, depicted in Fig. 1c, term \(a\) occurs in three documents, while term \(b\) occurs in four documents. Therefore, term \(a\) is more specific than term \(b\) and, according to TDC, \(D_1\) should get a higher score than \(D_3\) in the final result.

Our MLIR approach Our approach satisfies the mentioned reasonable ranking of documents, because the reference language model is estimated using the probabilistic counts of words (Eq. (11)). As shown in Fig. 1b, terms \(a\) and \(b\) also have non-zero probabilistic counts in the documents in language \(l_2\). Therefore, in our reference language model, the probability of term \(a\) is less than that of term \(b\). Smoothing with this reference language model leads to the desired ranking of documents.

Fusion-based methods Almost all fusion-based methods fail to satisfy TDC, because of the limitation that the relative ranking of documents in the individual result lists should be preserved in the final ranked list. These methods combine the results of two retrieval runs to produce results for collection \(C\): monolingual retrieval on subcollection \(C_1\), and cross-lingual retrieval on subcollection \(C_2\). An ideal monolingual retrieval model should prefer \(D_3\) over \(D_1\) in response to \(q\), because in subcollection \(C_1\), query term \(b\) is more specific than term \(a\). Under the constraint of preserving relative ranks of documents in the individual results, the rank of \(D_3\) will be lower than that of \(D_1\) in the final result of these fusion-based methods in response to query \(q\) on collection \(C\), which is not desirable considering the statistics on the entire collection.

The mentioned sample case is common in practice. Documents in one language might cover the query topic, while documents in another language do not. Hence, decisions on term and document features to derive a ranked list of documents should be based on the entire collection, which is not possible in fusion-based methods, but is a strong point of our approach.

5.4 Computational complexity

The next point to consider is the run time analysis of the proposed MLIR approach. We first discuss the efficient implementation of monolingual retrieval based on the KL-divergence framework, investigated in (Zhai and Lafferty 2001b). If we use a smoothing technique in which the probability of an unseen word \(w\) in a document \(D\) is equal to \(\alpha _D p(w|C)\) (\(\alpha _D\) is a document-dependent constant), then Eq. (1) can be calculated very efficiently. The reason is that the summation in Eq. (1) is computed only for matched terms between the query and the document. In this case, the computational complexity of scoring documents with respect to query \(Q\) is estimated as \(O(K|Q|)\), where \(K\) is the average number of documents containing a query term.

Using multilingual language models, the summation in the KL-divergence scoring function of Eq. (1) can be computed only for words that have non-zero probabilities in the query language model, and non-zero probabilistic counts in a document \(D\) as follows:

$$\begin{aligned} {\mathrm {Score}}(Q,D) = \sum _{\begin{array}{c} w:p(w|\theta _Q) > 0, \\ c_p(w,D) > 0 \end{array}}p(w|\theta _Q ) \log \frac{p_{\mathrm {s}}(w|\theta _D)}{p_{\mathrm {u}}(w|\theta _D)} + \sum _{w: p(w|\theta _Q) > 0}{p(w|\theta _Q) \log p_{\mathrm {u}}(w|\theta _D)}, \end{aligned}$$
(19)

where \(p_s(w|D)\) is the smoothed probability of word \(w\) seen in document \(D\), and \(p_u(w|D)\) is the probability assigned to unseen word \(w\) in the document. The only efficiency issue for computing this equation is the estimation of \(c_p(w,D)\) in \(p_s(w|D)\).

We employ probabilistic word-to-word translation models to estimate the probabilistic counts of words. There are two strategies for filtering probabilistic translation models, obtained from parallel corpora, for use in CLIR: selecting the \(n\) best translations for each word, or selecting translations whose probabilities are higher than a threshold (Nie et al. 2012). After filtering and renormalizing the translation models, an inverted index is built on the translation models such that for each word \(u\), a list of the words in all languages that translate into \(u\), along with their translation probabilities, is kept. Model parameters are estimated using this inverted index on the translation models and the inverted index of the collection. This estimation can be done either at index time or at retrieval time. In the following, we discuss the complexity of the two options in more detail.

1. Estimating the parameters of document language models at index time: Probabilistic counts are precomputed for all documents at index time (Xu et al. 2001). In this case, each document is added to the document list of all translations of its words. The size of the new index, containing probabilistic counts of words, depends on the number of selected translations per word. On average, the new index will be about \(N\) times larger than the index on the original document collection, where \(N\) is the number of languages in the collection. The additional offline processing for building this index is just the calculation of the probabilistic counts of documents. Using the new probabilistic index, multilingual retrieval can be performed as efficiently as monolingual retrieval. This way, the scoring complexity of our MLIR approach for query \(Q\) is \(O(K'|Q|)\), where \(K'\) is the average number of documents that have a query term with a non-zero probabilistic count (Footnote 2).

2. Estimating the parameters of document language models at retrieval time: In this case, there is no increase in the size of the collection index or in offline processing time. The index is multilingual; however, it is built on the original documents, not the expanded ones. Thus, this index is identical to the concatenation of the individual indexes of the subcollections, since we use language tags and there cannot be a common term between documents of different languages. The probabilities of words given a document are then estimated at retrieval time. This estimation does not significantly increase the runtime complexity, because the retrieval score is computed only for documents that have a non-zero probabilistic count of a query term. For each query term in monolingual retrieval, the score of the documents containing that term, obtained using the inverted index of the collection, is increased. In this implementation of multilingual retrieval, in addition to the documents containing each query term, the scores of the documents that contain a term that translates to a query term are also increased; these documents are obtained using the inverted index built on the translation models. Therefore, the scoring complexity of our MLIR approach for query \(Q\) using this strategy is \(O(K(|Q|+|T|))\), where \(K\) is the average number of documents that have a query term or a term that translates to a query term (Footnote 3), and \(T\) is the set of terms that translate to query terms.

In our experiments, we follow the second strategy.
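A minimal sketch of this second strategy, under the same hypothetical data structures used in the earlier sketches: the translation models are inverted offline, and at query time only documents containing a query term, or a term that translates into a query term, are gathered for scoring.

```python
from collections import defaultdict

def invert_translation_models(trans):
    """For each language-tagged word u, record the words (in any language) that
    translate into u, with their probabilities; trans[(l_i, l_j)][v][u] = p_ij(u|v)."""
    inverted = defaultdict(dict)
    for (l_i, l_j), table in trans.items():
        for v, candidates in table.items():
            for u, p in candidates.items():
                inverted[(l_i, u)][(l_j, v)] = p
    return inverted

def candidate_documents(query_terms, postings, inverted_trans):
    """Documents that contain a query term or a term translating into a query term;
    only these candidates need a score, giving the O(K(|Q|+|T|)) behaviour."""
    terms_to_probe = set(query_terms)
    for q in query_terms:
        terms_to_probe.update(inverted_trans.get(q, {}))   # the set T for this query
    candidates = set()
    for t in terms_to_probe:
        candidates.update(postings.get(t, ()))
    return candidates
```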

6 Experiments

Datasets We use two CLEF datasets for evaluation: (1) CLEF 2001–2002 multilingual test collection, and (2) CLEF 2003-Multilingual 4 test collection. Table 1 lists statistics of these collections. We index the TEXT and TITLE fields of documents in both collections for retrieval and use the three query sets, CLEF2001, CLEF2002, and CLEF2003. Each query set includes equivalent topics in multiple languages.

Table 1 Dataset statistics

Translation models We build a word-to-word translation model for each language pair using the Europarl corpus (European Parliament Proceedings Parallel Corpus) (Tiedemann 2012). Statistical translation models (IBM Model 1) are obtained using the GIZA++ toolkit (Och and Ney 2003). Before word alignment, the preprocessing steps described below are applied to both sides of each parallel corpus. The obtained translation probabilities are then filtered by selecting the top \(x\) translations for each word and linearly renormalized, where \(x\) is 3 in our experiments unless otherwise specified.

Preprocessing Diacritic characters are mapped to the corresponding unmarked characters. Stopwords are removed using the stopword lists provided in the IR Multilingual Resources at UniNE (Footnote 4). Next, we use Snowball stemmers (Footnote 5) for all languages. The same preprocessing steps are applied to the documents of all collections, whether used for retrieval or for training a translation model (Kraaij et al. 2003).

Experimental setup All experiments are done using the Lemur toolkit (Footnote 6). We use only the TITLE field of topics for evaluation and retrieve 1000 results per query. We evaluate the performance of multilingual information retrieval with respect to query sets in different languages. For each experiment, we report Mean Average Precision (MAP) and Precision at the top 10 documents (Prec@10). A two-tailed paired t-test is used to test whether the differences between the performance of approaches are statistically significant. MAP values reported in the tables are marked with ▲ (▼) and △ (▽) to indicate statistical significance at the 0.01 and 0.05 levels, respectively.

6.1 Effectiveness of multilingual unigram language models

MULM performance In the first set of experiments, we examine the effectiveness of the proposed approach for MLIR. Since the translation models do not provide full coverage of the words in the test collections, we use Eq. (18) to estimate document language models. The parameters of the 2-stage smoothing method are not tuned; the default values \(\mu =2000\) and \(\lambda =0.5\) are used (Zhai and Lafferty 2001b). In these experiments, the top 3 translations of each word are used. The effectiveness of our model (MULM) on each test collection is reported in Table 2. The retrieval performance is measured using the English queries of each query set.

Table 2 Performance of the proposed MLIR approach with different estimation methods

In the results of MULM, we further explore the correlation between the number of retrieved documents in a language and the number of relevant documents in that language. The larger the number of relevant documents in subcollection \(C_i\) for a query, the more retrieved documents in language \(l_i\) are expected in the search results for that query. To check this correlation in the results of our approach, we estimate two ratios for each query in a query set:

1. The ratio between the number of relevant documents in language \(l_1\) and the number of relevant documents in language \(l_2\).

2. The ratio between the number of retrieved documents in \(l_1\) and the number of retrieved documents in \(l_2\), extracted from MULM results.

Any pair of languages can be selected for this study. Figure 2 shows the correlation between these two ratios, computed for French and English, in the MULM results for the English queries of each query set. Queries for which either ratio, or its denominator, is zero are removed. As indicated in the diagrams, our approach provides a high correlation between the two ratios, which is desirable.
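A small sketch of this analysis, assuming per-query dictionaries of relevant and retrieved counts per language (the names and data layout are hypothetical):

```python
def ratio(counts, lang_a="fr", lang_b="en"):
    """Ratio of counts in lang_a to counts in lang_b; None when undefined or zero."""
    a, b = counts.get(lang_a, 0), counts.get(lang_b, 0)
    return a / b if a > 0 and b > 0 else None

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def ratio_correlation(relevant_by_query, retrieved_by_query):
    """Correlate relevant vs. retrieved French/English ratios over queries,
    dropping queries with a zero ratio or denominator."""
    pairs = [(ratio(relevant_by_query[q]), ratio(retrieved_by_query.get(q, {})))
             for q in relevant_by_query]
    pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    return pearson([x for x, _ in pairs], [y for _, y in pairs])
```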

Fig. 2 Correlation between relevant and retrieved ratios of French to English. a CLEF2001 English query set. b CLEF2002 English query set. c CLEF2003 English query set

In addition, we report the performance of our approach on queries formed from the TITLE and DESCRIPTION fields of English topics in Table 3, in order to evaluate our approach on longer queries and to simplify the comparison with other approaches that use this type of query.

Table 3 Performance of the proposed MLIR approach on queries formed from TITLE + DESCRIPTION fields of topics

In the next step, we investigate the effect of each design decision we have made, mentioned in Sect. 5, on the performance of our approach.

Language tag effect First, we evaluate the performance of our multilingual unigram language model when terms are not labeled with language tags; hence, common terms between languages are treated as the same term. The results are shown in the “MULM without language tags” column of Table 2. Contrary to what we expected according to MLIR Constraint 1, we observe a substantial improvement in performance when we do not distinguish the common terms between languages (the improvements are not statistically significant except for CLEF2001 English queries). The main reason for this observation is that not using language tags impacts retrieval in two ways:

1. Asymmetric changes of term frequencies in a document: the frequencies of terms that are common between languages increase while the frequencies of other terms do not. This can result in an undesirable ranking of documents, as shown in Constraint 1.

2. Direct match between a document and a query in different languages when they both contain a common term: this direct match increases recall when translation models are noisy and common terms with similar meanings do not translate to each other. In particular, suppose a query \(Q\) in language \(l_i\) includes a query term \(q\) which is common to two or more languages, denoted by the set \(L_c\). If \(q\) does not translate to itself in the respective translation models, then using language tags prevents documents in languages \({L_c \setminus \{l_i\}}\) containing term \(q\) from matching query \(Q\). But not using language tags allows these documents to directly match the query term \(q\). Therefore, not using language tags in this case increases recall and consequently the MAP measure.

The results of our experiment show that, given translation models trained on the Europarl corpus, the second item has more impact on retrieval performance. To explore this impact in more detail, let us explain why query 44 (“Indurain Wins Tour”) has the highest increase in average precision when comparing the run with non-labeled terms to the run with labeled terms. The stems of all terms of this query, indurain, win, and tour, are available in the indexes of all languages. To show the impact of direct matching between query terms and documents, we perform an experiment in which query 44 in English is searched over the indexes of all other languages with and without employing translation models. In the former runs (bilingual runs using translation models), translations of document terms are matched against query terms, while in the latter runs (bilingual runs without a translation model), the direct match between query and document terms is what ranks the documents. Table 4 shows the average precision of query 44 under both strategies for bilingual retrieval. As the results show, for the French and Spanish indexes, where the numbers of relevant documents are greater than 5, direct matching significantly outperforms using translation models. Using language tags in multilingual information retrieval is equivalent to relying only on translation models for matching documents against queries, while not using language tags also allows direct matching between documents and queries. This is the reason that query 44 has higher average precision in the multilingual run with non-labeled terms than in the multilingual run with labeled terms. In addition, both runs, using labeled and non-labeled terms, have the same performance for queries that do not contain a common term, such as query 58 in English, “Euthanasia”. Our exploration shows that, on average, each English query term of the CLEF2001 query set appears in 3.97 of the five monolingual indexes built on the CLEF2001–2002 collection, which in turn shows a high overlap between the languages of the multilingual collection. We believe the possibility of direct matching when translation models are not perfect is the main reason for the performance improvement of the multilingual run without language tags. Therefore, an interesting future research direction is to predict which strategy, using or not using language tags, achieves higher performance given the available translation models.

Table 4 Comparison of AP performance of query 44 (CLEF2001 English query set) between the runs using terms with and without language tags

Dictionary coverage effect The next experiment studies the behavior of our approach when the dictionaries do not fully cover the words used in the collection. We demonstrated in Sect. 5.2 that ignoring dummy words can lead to an incorrect ranking of documents, which in turn results in a lower MAP value. To examine this, we measure the performance of multilingual retrieval when dummy words are not included in the estimation of document language models. For this purpose, we use the maximum likelihood estimate in Eq. (14) and a similar estimate of the reference language model given by:

$$\begin{aligned} \overline{p}(w|C) = \frac{\sum _{D \in C}{c_{p}(w,D)}}{\sum _{D \in C}{\sum _{u \in V}{c_{p}(u,D)}}} \; , \end{aligned}$$
(20)

which is used for smoothing as follows:

$$\begin{aligned} p(w|\overline{\theta }_D) = (1 - \lambda ) \frac{c_p(w,D) + \mu \overline{p}(w|C)}{\sum _{u \in V}{c_{p}(u,D)} + \mu }+ \lambda p(w | C)\,. \end{aligned}$$
(21)
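
The following sketch illustrates how Eqs. (20)–(21) can be computed, assuming the probabilistic counts \(c_p(w,D)\) are already available; the toy counts are placeholders, and p_C stands for the expanded reference language model (here replaced by the maximum likelihood estimate purely for illustration).

# Sketch of Eqs. (20)-(21): document model smoothing when dummy words are
# ignored, so document length is the sum of probabilistic counts.
def reference_model(c_p, vocab, docs):
    # Eq. (20): maximum likelihood estimate over the whole collection.
    total = sum(c_p.get((u, d), 0.0) for d in docs for u in vocab)
    return {w: sum(c_p.get((w, d), 0.0) for d in docs) / total for w in vocab}

def smoothed_doc_prob(w, d, c_p, vocab, p_ml, p_C, mu=2000.0, lam=0.5):
    # Eq. (21): Dirichlet smoothing with the MLE reference model p_ml,
    # interpolated with the reference model p_C.
    doc_len = sum(c_p.get((u, d), 0.0) for u in vocab)
    dirichlet = (c_p.get((w, d), 0.0) + mu * p_ml[w]) / (doc_len + mu)
    return (1.0 - lam) * dirichlet + lam * p_C[w]

# Toy usage (illustrative counts only; p_ml also stands in for p_C here):
vocab, docs = ["a", "b", "c"], ["d1", "d2"]
c_p = {("a", "d1"): 2.0, ("b", "d1"): 0.5, ("a", "d2"): 1.0, ("c", "d2"): 1.5}
p_ml = reference_model(c_p, vocab, docs)
print(smoothed_doc_prob("a", "d1", c_p, vocab, p_ml, p_ml))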

Table 2, column “MULM without dummy words”, summarizes the results of using these language models for documents, which show a slight reduction in MAP values compared to the “MULM” runs. To explore the reason for this slight reduction, we first examine the completeness of the trained translation models. Given the translation models trained for each pair of languages, Fig. 3a illustrates the number of unique words in relation to the number of translation models they have entries in. As the figure clearly shows, only a minority of unique words can be fully translated into all other languages, and about 81 % of unique words cannot be translated into any of the other languages. However, the frequencies of almost all untranslatable words in the collections are very low, mostly just one occurrence. Therefore, the diagram of the total frequencies of words according to their coverage in the translation models in Fig. 3b differs considerably from the previous one. According to this diagram, 92 % of total words in the collections are fully translated into all other languages. From this viewpoint, the translation models provide adequate coverage of the words of the collections. This implies that the number of added dummy words is low. To validate this, we consider all documents that have an English query term with a non-zero probabilistic count, and compare their lengths with and without dummy words. As depicted in Fig. 4, the ratio of the two possible document lengths is very close to one for a large number of documents. Therefore, replacing \(N|D|\) in Eq. (18) with \(\sum _{u\in V}{c_{p}(u,D)}\) in Eq. (21) does not yield significantly different values. This explains why we observe only a slight reduction in MAP values when dummy words are ignored.

Fig. 3

Statistics on the coverage of the words of the CLEF2001–2002 collection by the trained translation models. Note that the figures are not to the same scale. We only show the coverage statistics for the CLEF2001–2002 collection since they are representative of the statistics for the CLEF2003 collection. a Unique words. b Total words

Fig. 4

Ratio of document lengths with and without considering dummy words

To show the importance of the second MLIR constraint, we modify the translation models by reducing the coverage of the dictionaries: the top 15 % most frequent words are removed. The performance of our approach given the reduced translation models, with and without considering dummy words, is reported in Table 5. The statistically significant improvements of the runs with dummy words over those without dummy words across all three query sets show the importance of satisfying the second MLIR constraint.
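
A sketch of how such a reduced translation model can be produced is given below; the data structures are assumptions (a word-to-frequency map and a nested probabilistic translation table), not the actual format of our resources.

# Sketch: reduce dictionary coverage by dropping the most frequent words.
# coll_freq maps a word to its collection frequency; trans_model maps a
# source word to a dictionary {target word: probability}.
def reduce_coverage(trans_model, coll_freq, fraction=0.15):
    by_freq = sorted(coll_freq, key=coll_freq.get, reverse=True)
    dropped = set(by_freq[: int(fraction * len(by_freq))])
    return {s: t for s, t in trans_model.items() if s not in dropped}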

Table 5 Performance of the proposed MLIR approach with and without considering dummy words when the translation models are slightly modified

Global estimation Finally, we examine the effect of global estimations of retrieval heuristics on the retrieval performance. To explain the effect of this global estimation, consider two documents in the same language. Their relative ranking may differ in the multilingual results produced by our approach compared to the results obtained from monolingual retrieval targeted on their underlying subcollection. Although global estimation is necessary for MLIR, its benefit is debatable when a user is interested in finding only documents in the query language from a multilingual collection. Hence, we study the effect of global estimation on the performance of retrieving documents in the query language using our approach. For English queries of each query set, we restrict the results of our MLIR approach to the documents in English, and compare the MAP value of these results against the results obtained from the original monolingual language modeling approach applied only to the English subcollection. We repeat this experiment for the other languages of each query set. The results, shown in Table 6, indicate improvements (mostly not statistically significant) over the performance of monolingual retrieval in most cases. The improvements confirm our hypothesis that the global estimates of retrieval heuristics help to obtain more accurate term features.

Table 6 Comparing the performance of retrieving documents in the query language using the proposed MLIR approach with the performance of monolingual retrieval

The above experiment shows that using the expanded reference language model for smoothing allows specific query terms to be better distinguished in most cases. The reason for this better discrimination is that a subcollection may not cover the topic of the query or may contain far fewer documents relevant to the query than the other subcollections. In this case, estimating word discrimination values based only on that subcollection may not be reliable. Adopting the expanded reference language model can lead to a different relative order of two documents in the results produced by our multilingual retrieval model than in those generated by a monolingual retrieval model. Table 7 shows a sample of this change in the relative order of two documents, observed for query 45 of the CLEF2001 query set (“Israel/Jordan Peace Treaty”). As indicated in the table, our approach assigns a higher rank to the relevant document compared to the monolingual retrieval model. This is because the most specific query term differs under the two estimates of the reference language model. The reference language model estimated on the entire multilingual collection better discriminates the query terms than the one estimated only on the English subcollection. The reason is that the English subcollection has the fewest documents relevant to query 45 among all subcollections. The numbers of relevant documents are 34, 42, 47, 105, and 191 in the English, Italian, French, German, and Spanish subcollections, respectively.
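
The intuition can be sketched by contrasting term discrimination values under the two reference models; the counts below are placeholders chosen only to mimic the described situation, not the actual CLEF statistics.

import math

def discrimination(term, counts):
    # -log of the term probability under a reference model estimated from
    # the given counts: larger values indicate more specific terms.
    total = sum(counts.values())
    return -math.log(counts.get(term, 1) / total)

# Placeholder counts: "treaty" is rare in the English subcollection but
# frequent in the entire multilingual collection.
english_counts = {"treaty": 40, "israel": 300, "jordan": 350, "other": 100000}
global_counts  = {"treaty": 6000, "israel": 1200, "jordan": 1400, "other": 500000}

for t in ("treaty", "israel", "jordan"):
    print(t, discrimination(t, english_counts), discrimination(t, global_counts))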

Table 7 Statistics of terms of query 45 (CLEF2001 English query set) in two English documents, English subcollection, and the entire multilingual collection

Sensitivity to the retrieval parameters To gain a better understanding of the behavior of our MLIR approach, we study the sensitivity of its performance to three parameters. First, we vary the parameter \(\mu\) of the Dirichlet prior smoothing method in Eq. (18) and keep \(\lambda = 0.5\). Fig. 5a shows the effect of this parameter on the MLIR performance. All test collections show a decreasing trend for large \(\mu\) values. This behavior is consistent with the impact of smoothing parameters on the performance of the monolingual language modeling approach. Second, we sweep the \(\lambda\) parameter in Eq. (18) and set \(\mu\) to 2000. The obtained MAP values in Fig. 5b show a general preference for lower values of \(\lambda\), because the dictionaries adequately cover the words of the target test collections. The sharp decrease in MAP value for \(\lambda = 1\) is expected, because in this case we completely ignore the content of the documents and use only the reference language model. Third, to study the impact of the number of translations selected for each word, we vary this number from 1 to 10. Fig. 5c shows how the performance varies according to the number of selected translations. In general, using only the top translation for each word results in a substantially lower MAP value since we lose some synonymous or related translations. Therefore, all curves rise when the number of selected translations is small, and our approach then maintains stable performance as more translations are added. Finally, we optimize the three parameters with respect to MAP. Table 8 shows the results. We find that using the default parameter values does not yield results very different from those obtained with the optimized parameters.
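
The sensitivity analysis itself amounts to three simple parameter sweeps; the sketch below assumes a helper evaluate_map(mu, lam, n_trans) that runs the full retrieval pipeline and returns MAP, and the fixed number of translations used while sweeping the smoothing parameters is an arbitrary placeholder.

def sweep(evaluate_map):
    # evaluate_map(mu, lam, n_trans) is an assumed helper standing in for
    # the retrieval pipeline; it returns the MAP of the corresponding run.
    mu_curve  = [(mu, evaluate_map(mu, 0.5, 5))
                 for mu in (100, 500, 1000, 2000, 5000, 10000)]
    lam_curve = [(l / 10, evaluate_map(2000, l / 10, 5)) for l in range(11)]
    top_curve = [(n, evaluate_map(2000, 0.5, n)) for n in range(1, 11)]
    return mu_curve, lam_curve, top_curve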

Fig. 5

Sensitivity to the model parameters

Table 8 Performance of the proposed MLIR approach when the parameters (the smoothing parameters and the number of selected translations for each word) are tuned

6.2 Comparison with previous approaches

In this set of experiments, we compare our approach with the existing methods for MLIR. First, we provide the results of traditional merging algorithms for MLIR, i.e., Raw scoring, Round-Robin, Max normalized scoring, and Min-Max normalized scoring (sketched after the list below). Intermediate result lists for merging are produced using the language modeling framework to rule out the effect of retrieval models on the performance. For cross-language results using the language modeling framework, we use both strategies mentioned in Sect. 3:

  1. Integrating translation knowledge into the query language model (Eqs. (5)–(6)): We consider the fusion of lists produced using this strategy as a baseline for comparison with our approach. This is because most fusion-based methods use the query translation strategy for producing intermediate lists due to its efficiency.

  2. Integrating translation knowledge into the document language models (Eqs. (7)–(8)): This approach is exploited in (Xu et al. 2001; Kraaij et al. 2003) to perform cross-language (bilingual) information retrieval by representing documents with new language models in the query language. Similarly, our approach for MLIR is based on building new multilingual language models for documents. Therefore, to provide a fair comparison and preclude the impact of translation direction on the performance, we also regard the merging of lists produced using this strategy as a baseline.
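
For reference, the traditional merging baselines can be sketched as follows; each intermediate list is assumed to be a ranked sequence of (document id, score) pairs for one language, and the sketch is a simplified illustration rather than the exact implementation used in our experiments. Max normalized scoring is analogous to the min-max variant but divides each score by the list maximum only.

# Simplified sketches of raw score, round-robin, and min-max normalized merging.
def raw_score_merge(lists):
    merged = [pair for lst in lists for pair in lst]
    return sorted(merged, key=lambda p: p[1], reverse=True)

def round_robin_merge(lists):
    merged, depth = [], max(len(lst) for lst in lists)
    for i in range(depth):
        for lst in lists:
            if i < len(lst):
                merged.append(lst[i])
    return merged

def min_max_merge(lists):
    merged = []
    for lst in lists:
        scores = [s for _, s in lst]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        merged += [(d, (s - lo) / span) for d, s in lst]
    return sorted(merged, key=lambda p: p[1], reverse=True)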

Tables 9 and 11 show the performance of merging the lists generated using the first and second strategies, respectively, in terms of MAP. We also report precision at the top 10 documents for strategies 1 and 2 in Tables 10 and 12, respectively. We include P@10 performance only for English queries of each query set, since all approaches behave similarly for all metrics. The first observation about the results is the performance difference between queries in different languages within a query set, which reflects the varying translation quality of the corresponding language pairs. The second observation is that merging the lists generated using the document translation strategy outperforms merging the lists generated using the query translation strategy in all cases. Third, our approach outperforms traditional merging approaches on the results of both strategies across all three collections and all query languages, and the improvements are statistically significant in most cases.

Table 9 MAP performance of MULM and different merging strategies
Table 10 P@10 performance of MULM and different merging strategies
Table 11 MAP performance of MULM and different merging strategies
Table 12 P@10 performance of MULM and different merging strategies

To gain more insight into the results, we display the difference in average precision (AP) between our multilingual approach and the raw score merging method for each English query in Figs. 6 and 7. The figures clearly show that our approach behaves remarkably differently from raw score merging and outperforms it for many queries. For instance, our approach improves the average precision of query 45 (“Israel/Jordan Peace Treaty”) by 50 % compared to the raw score merging of results obtained by the LM-based document translation approach. The expanded estimate of the reference language model helps to effectively retrieve documents of interest w.r.t. this query. To further explain the reason for this improvement, we measure the popularity of query terms in both the entire multilingual collection and the individual subcollections. We find that the probabilities of the query terms vary significantly across subcollections. Query term “Treaty” has the lowest probability given the English subcollection among all query terms given any subcollection. This leads to higher ranks for English documents that have more occurrences of “Treaty”. However, query terms “Jordan” and “Israel” are more specific than “Treaty” in the expanded reference language model, as shown in Table 7. The more accurate estimation of term discrimination values using the entire collection helps to better rank the documents w.r.t. this query.

Fig. 6

Per query AP difference between MULM and raw score merging. The reported MAP values for raw score merging are for fusing the intermediate lists obtained using the query translation approach. a CLEF2001 query set. b CLEF2002 query set. c CLEF2003 query set

Fig. 7

Per query AP difference between MULM and raw score merging. The reported MAP values for raw score merging are for fusing the intermediate lists obtained using the document translation approach. a CLEF2001 query set. b CLEF2002 query set. c CLEF2003 query set

On the other hand, the performance of some queries is hurt by our approach. The highest performance loss compared to the raw score merging of the results of LM-based document translation is observed for query 59 of the CLEF2001 query set (“Computer Viruses”). Both the raw score merging and our approach retrieve 19 of the 20 documents relevant to this query. However, the better ranking of these relevant documents by the raw score merging is due to the higher discrimination value of query term “Computer” in the Spanish subcollection compared to the other subcollections. The probability of query term “Computer” given the Spanish subcollection is between 4 and 12 times smaller than that given the other subcollections. The smaller value of this probability leads to higher scores for Spanish documents compared to documents in other languages. This observation, together with the fact that 7 of the 20 documents relevant to this query are in Spanish, explains the higher average precision of the results retrieved by the raw score merging method. However, considering the two documents listed in Table 13 shows that our approach behaves reasonably w.r.t. this query. The non-relevant document “LASTAMPA94-022927”, which is ranked higher by our approach than by the raw score merging, has more occurrences of the two query terms. Therefore, independent of the discrimination values of the query terms, the provided ranking is reasonable. The provided ranking is compatible with the Term Frequency Constraint 1 (TFC1) defined in (Fang et al. 2011). The TFC1 axiom states that increasing the occurrences of query terms increases the retrieval score. The reason for the lower performance might then be inexact estimates of term frequencies in other languages due to noise in the translation models.

Table 13 Probabilistic counts of terms of query 59 (CLEF2001 English query set) in two documents

Our analysis also shows that when MULM or the raw score merging outperforms the other for a query in a language, the same trend does not necessarily hold for the corresponding query in other languages. The reason is that the raw score merging highly depends on the difference between the popularity of a query term in different subcollections, and this difference can vary according to the language of the query term. Our approach, on the other hand, exhibits relatively stable performance over different languages, as demonstrated in the robustness evaluation of MLIR approaches below. To summarize, the raw score merging can succeed when query terms are more specific with respect to a subcollection that has more relevant documents than the other subcollections. In contrast, our approach employs the expanded estimate of the reference language model, which can help to better distinguish query terms.

Moreover, we report the optimal performance achievable by fusion-based approaches given the intermediate results. In (Chen 2002), a method is presented for deriving the theoretically optimal performance that could possibly be achieved from merging multiple ranked lists. This method builds a merged list with the maximum MAP value such that the relative order of documents in each intermediate list is preserved in the final list. However, since the approach estimates an upper bound of MLIR performance using the relevance status of documents, it applies only to test collections. The optimal performance that could be achieved from merging the intermediate results produced using the LM-based query and document translation approaches is reported in the last columns of Tables 9 and 11, respectively. Our direct approach achieves 68–93 % of this optimal performance.
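
For intuition only, an order-preserving merge can be sketched as the following greedy heuristic, which takes a relevant document from the head of some list whenever one is available. It merely illustrates the order-preserving constraint and is not the exact upper-bound procedure of (Chen 2002).

# lists: ranked lists of document ids, one per language; relevant: set of
# relevant document ids for the query.
def greedy_merge(lists, relevant):
    heads, merged = [list(lst) for lst in lists], []
    while any(heads):
        pick = next((h for h in heads if h and h[0] in relevant), None)
        if pick is None:
            pick = next(h for h in heads if h)
        merged.append(pick.pop(0))
    return merged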

To provide a comprehensive evaluation, we also report the results of our approach in comparison with different merging strategies when the translation models are filtered such that only translations with probabilities higher than 0.1 are kept. These results, reported in Table 14, show that our approach outperforms the traditional merging strategies when translation models are filtered based on translation probabilities.
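
The filtering itself is straightforward; the sketch below assumes the translation model is stored as nested dictionaries of probabilities and also shows the top-k selection used later when only the top translation of each word is kept.

def filter_by_threshold(trans_model, threshold=0.1):
    # Keep only translations whose probability exceeds the threshold.
    return {s: {t: p for t, p in targets.items() if p > threshold}
            for s, targets in trans_model.items()}

def keep_top_k(trans_model, k=1):
    # Keep only the k most probable translations of each source word.
    return {s: dict(sorted(targets.items(), key=lambda tp: tp[1], reverse=True)[:k])
            for s, targets in trans_model.items()}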

Table 14 MAP performance of MULM and different merging strategies when translation models are filtered by selecting translations whose probabilities are greater than 0.1

We also compare our approach with 2-step RSV (Martinez-Santiago et al. 2006) and the direct method of (Nie and Jin 2003), which is reported as direct-NJ. Result lists of these methods are generated using the OKAPI retrieval model. In both methods, queries are expanded by translations of query terms in all languages, while document words are translated in our approach. This makes the comparison difficult because the effectiveness of an approach also depends on the translation direction (Nie 2010). We make comparisons between our approach, 2-step RSV, and direct-NJ methods in Tables 15 and 16 in terms of MAP and P@10, respectively. According to the results reported in Table 15, MULM outperforms both methods. The trend of the reported results for the 2-step RSV method differs from that of the results in (Martinez-Santiago et al. 2006); in contrast to the results of our experiments in Table 15, the 2-step RSV method outperforms traditional merging algorithms in (Martinez-Santiago et al. 2006). This difference can be attributed to three sources: (1) different translation resources: machine-readable bilingual dictionaries are used in (Martinez-Santiago et al. 2006), while we extract translations from parallel corpora in our experiments; (2) different queries: queries in (Martinez-Santiago et al. 2006) consist of the TITLE and DESCRIPTION fields, while queries in our experiments are formed only from the TITLE field of the topics; (3) different numbers of selected translations for each word. We hypothesize that the difference in results mainly derives from the third factor, as the authors of (Martinez-Santiago et al. 2006) also mention that they achieved better performance using only the top translation for each word.

Table 15 MAP performance comparison between MULM, 2-step, and Direct-NJ
Table 16 Comparison of P@10 performance between MULM, 2-step, and Direct-NJ

To mitigate this mismatch, we also report the performance of MULM, 2-step RSV, and direct-NJ methods when only the top translation for each word is used. The results of these runs are listed in Tables 17 and 18 for queries that are generated from only the TITLE field and from both the TITLE and DESCRIPTION fields of the topics, respectively. For comparison purposes, Table 19 also summarizes the performance of traditional merging algorithms on queries generated from the TITLE field of the topics when only the top translation of each word is used. The results in Tables 17, 18, and 19 follow the same trend as the results in (Martinez-Santiago et al. 2006): the 2-step RSV method outperforms traditional merging algorithms. In this setting, our approach also outperforms the other methods in all cases. In addition, MULM achieves consistently higher percentages of the theoretical optimal performance compared to the 2-step RSV method.

Table 17 MAP performance comparison between MULM, 2-step and Direct-NJ approaches when only the top translation for each word is used
Table 18 Comparison of MAP performance between MULM, 2-step and Direct-NJ approaches when queries are generated from TITLE and DESCRIPTION fields of the topics and only the top translation for each word is used
Table 19 MAP performance of MULM and different merging strategies when only the top translation for each word is used

Finally, in Fig. 8, we compare the precision-recall curves of the different MLIR approaches, which show that MULM performs better than all the other methods.
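
For completeness, the sketch below shows how interpolated precision-recall points can be computed for a single query from a ranked list and its relevance judgments; it uses the standard 11-point interpolation and is not necessarily identical to the evaluation tool settings behind Fig. 8.

def interpolated_pr(ranked, relevant):
    # ranked: retrieved document ids in rank order; relevant: set of ids.
    hits, points = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))  # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in (i / 10 for i in range(11))]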

Fig. 8

Precision-recall curves comparing MULM, raw score merging, 2-step RSV, direct-NJ, and optimal merging methods. The curves for the merging strategies are for fusing the intermediate lists obtained using the document translation approach. a CLEF2001 query set. b CLEF2002 query set. c CLEF2003 query set

Robustness evaluation The robustness of a model for multilingual information retrieval is interpreted as having “stable performance over all topics instead of high average performance” (Mandl et al. 2008). Robustness has been assessed using Geometric Mean Average Precision (GMAP) in the robust tasks organized at CLEF 2006 and 2007 (Di Nunzio et al. 2007; Nunzio et al. 2008). The experiments conducted in these CLEF tasks demonstrate that MLIR models behave more differently w.r.t. performance measures than monolingual retrieval models, since more topics become difficult in MLIR. Therefore, from a practical point of view, evaluating MLIR models based on measures that better reveal robustness is important. In this regard, we also compare our approach with existing approaches for MLIR based on GMAP. The results are listed in Tables 20 and 21, which indicate that our approach has more robust performance across different query languages compared to the different merging strategies and the 2-step and Direct-NJ methods.
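
GMAP differs from MAP only in how per-query AP values are averaged; a minimal sketch follows, where the small epsilon added to zero AP values is a common convention and an assumption here rather than the exact CLEF setting.

import math

def gmap(ap_values, eps=1e-5):
    # Geometric mean of per-query AP values; eps prevents zero APs from
    # collapsing the geometric mean.
    return math.exp(sum(math.log(ap + eps) for ap in ap_values) / len(ap_values))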

Table 20 GMAP performance of MULM and different merging strategies
Table 21 Comparison of GMAP performance between MULM, 2-step, and Direct-NJ

6.3 Impact of feedback on retrieval performance

In these experiments, we study the effect of pseudo relevance feedback on the performance of multilingual retrieval. The topic model of feedback documents is built by adopting mixture model-based feedback. The mixture model involves two parameters: (1) the feedback mixture noise and (2) the feedback coefficient. We do not tune these parameters; both are set to the default value of 0.5. In addition, feedback information is extracted from the top 10 previously retrieved documents and is integrated into the query language model, similar to Eq. (4), as:

$$\begin{aligned} p\left( w|{\hat{\theta }}'_Q\right) =\lambda p(w|\theta _Q) + (1 - \lambda ) p(w|{\hat{\theta }}_F)\,, \end{aligned}$$
(22)

where \({\hat{\theta }}_F\) is the multilingual topic model, estimated based on the multilingual language models of the top retrieved documents. In addition, the new language model of the query, \({\hat{\theta }}'_Q\), is a multilingual unigram language model. To investigate the effect of feedback, we also report Recall at 1000 documents (R@1000), since one purpose of feedback techniques is to increase the recall measure. We can see the positive effect of feedback in Table 22, which shows improvements in all measures for all query sets.
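
A sketch of Eq. (22) is given below, assuming the multilingual feedback topic model has already been estimated from the top-ranked documents (the mixture-model estimation itself is not shown); both models are represented as dictionaries from terms to probabilities.

def update_query_model(p_query, p_feedback, lam=0.5):
    # Eq. (22): interpolate the multilingual query model with the
    # multilingual feedback topic model; lam weights the original query.
    vocab = set(p_query) | set(p_feedback)
    return {w: lam * p_query.get(w, 0.0) + (1.0 - lam) * p_feedback.get(w, 0.0)
            for w in vocab}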

Table 22 Performance of accommodating pseudo relevance feedback in MULM

We also consider the effect of feedback on fusion-based methods. To incorporate feedback in these methods, we update each individually retrieved list with feedback information before the merging phase. The individually retrieved lists that need to be updated are obtained using either monolingual or cross-lingual information retrieval within the language modeling framework. The language models of queries in monolingual runs are expanded with feedback information according to Eq. (4). Employing feedback information in cross-lingual retrieval depends on the strategy selected for the initial retrieval phase. In the case of the query translation approach (Eqs. (5)–(6)), the new query language model can be directly updated with feedback information, similarly to the monolingual runs, by using:

$$\begin{aligned} p\left( w|{\tilde{\theta }}'_Q\right) = \lambda p(w|{\tilde{\theta }}_Q) + (1 - \lambda ) p(w|\theta _F)\,, \end{aligned}$$
(23)

where the language models of the query and feedback documents are monolingual and their parameters are words of the target language. In the document translation approach, the feedback topic model can be estimated based on the new language models of the documents, \({\tilde{\theta }}_D\), in the source language. Estimating such a topic model requires a reference language model in the source language. Therefore, we build a new reference language model as follows:

$$\begin{aligned} \tilde{p}(w_s|C) = \sum _{w_t \in V_t} p(w_s | w_t) p(w_t | C)\,. \end{aligned}$$
(24)

The query language model is then updated based on the estimated feedback topic model in the source language, denoted by \({\tilde{\theta }}_{F}\), as:

$$\begin{aligned} p\left( w| \theta '_Q\right) = \lambda p(w|\theta _Q) + (1 - \lambda ) p(w|{\tilde{\theta }}_F), \end{aligned}$$
(25)

where parameters of all language models are words of the source language.
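
A sketch of Eq. (24) is shown below, assuming the translation table p(w_s | w_t) and the target-language reference model are available as dictionaries; Eq. (25) then reuses the same interpolation as in the sketch of Eq. (22) above, with the feedback topic model estimated against this translated reference model.

def translated_reference_model(trans, p_ref_target):
    # Eq. (24): build a source-language reference model by translating the
    # target-language reference model. trans maps a target word w_t to a
    # dictionary {source word w_s: p(w_s | w_t)}.
    p_ref_source = {}
    for w_t, p_t in p_ref_target.items():
        for w_s, p_st in trans.get(w_t, {}).items():
            p_ref_source[w_s] = p_ref_source.get(w_s, 0.0) + p_st * p_t
    return p_ref_source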

The MAP performance of different merging strategies on the intermediate lists updated with feedback information is listed in Table 23. We also report the P@10 and R@1000 performance measures for these methods in Tables 24 and 25, respectively. The results show that MULM outperforms the merging strategies in terms of MAP and P@10 across all three datasets and all query languages, and the improvements are statistically significant in all cases. Except for one case (English queries of CLEF2001), MULM also achieves the best R@1000. Therefore, our approach, which accommodates multilingual feedback information, performs better than approaches that use only local feedback information. This confirms that feedback information from one subcollection can also help to improve the retrieval performance on other subcollections, which is made possible by multilingual feedback information.

Table 23 MAP performance of MULM and different merging strategies
Table 24 P@10 performance of MULM and different merging strategies
Table 25 R@1000 performance of MULM and different merging strategies

7 Conclusion and future work

In this paper, we have investigated the estimation of multilingual unigram language models for documents, as well as global estimations of retrieval statistics in MLIR. These estimations enable retrieving a ranked list of documents in multiple languages in a single retrieval phase and incorporating multilingual feedback information. We have further adapted the proposed estimation approaches to the common case of incomplete dictionary coverage. Experimental results demonstrate that the MLIR performance of the proposed approach is higher than that of the existing approaches in almost all cases. Meanwhile, the proposed approach maintains the following two advantages. First, it is independent of any assumption about the distribution of relevant documents across the subcollections. Second, tuning the performance of the proposed MLIR approach is straightforward, similar to monolingual IR using the KL-divergence model.

The obtained results stimulate further research. A promising line is to study MLIR performance with long, verbose queries using our approach. Another direction is to extend the proposed document language model in a way that efficiently considers document contexts when estimating the probabilistic counts of words. In addition, determining, based on the available translation models, whether using or not using language tags yields higher performance is an important research question to be investigated in the future. How to compensate for the incomplete coverage of translation models is also a crucial research direction for resource-limited languages in a multilingual collection. Finally, evaluating MLIR approaches on collections with documents in resource-limited languages or on collections containing mixed-language documents is another interesting and valuable future direction.