1 Introduction

Information Retrieval (IR) is concerned with situations where a user, having an information need, issues queries over a collection of documents to find a limited subset of the most relevant ones. The performance of an IR system is measured by its ability to search and retrieve relevant documents as efficiently and effectively as possible. In this paper, efficiency, which refers to the ability of a system to provide results within reasonable response times, is not our main concern. We primarily focus on retrieval effectiveness, which refers to the ability of a system to deliver the most relevant results first. Relevance is indeed the main challenge for most search engines, as shown by several comparative studies reporting limitations in their performance (Hawking et al. 2001).

In the literature, a wide range of models have been proposed to rank documents according to their relevance to queries. They result in different rankings depending on the way they define relevance. In fact, relevance is reflected by the sources of evidence that are considered, as well as the way they are combined.

Most of the current approaches assess document relevance by computing a single score which aggregates values of elementary attributes related to the query terms, the document or the relationship between these two entities. For instance, in the Vector Space Model (Salton et al. 1975), the Okapi BM25 probabilistic model (Robertson et al. 1994) as well as language models (Cao et al. 2005), term frequency (tf), document frequency (df) and document length (dl) are the main attributes which come into play. These attributes are combined in the term weighting formulation which corresponds to a first aggregation phase. The resulting scores are in turn considered to compute document relevance status value (rsv) to queries, as a second aggregation phase.
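To make these two aggregation phases concrete, the following minimal sketch (purely illustrative, not taken from the cited models) computes a BM25-like term weight from tf, df and dl in a first phase, and sums the term weights into a relevance status value in a second phase; the data structures and parameter values are assumptions.

```python
import math

def term_weight(tf, df, dl, n_docs, avg_dl, k1=1.2, b=0.75):
    """First aggregation phase: tf, df and dl combined into one term weight
    (a BM25-like formulation; many alternative formulations exist)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))

def rsv(query_terms, doc_tf, doc_len, df, n_docs, avg_dl):
    """Second aggregation phase: term weights aggregated (here, summed) into
    the document's relevance status value (rsv) for the query."""
    return sum(term_weight(doc_tf.get(t, 0), df.get(t, 0), doc_len,
                           n_docs, avg_dl)
               for t in query_terms)
```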

With the advent of hypertext collections, such as the Web, attributes characterizing the hyperlink structure have also been considered, leading to link-based measures such as Kleinberg’s HITS scores (Kleinberg 1999), PageRank scores (Brin and Page 1998) and HostRank scores (Amento et al. 2000).

All these text- and link-based attributes can be combined to get better performance. A variety of aggregation operators have been used such as the min and max operators in (Fox and Shaw 1994) or the weighted linear operator in (Craswell et al. 2005). Other aggregation operators include similarity-based measures (Van Rijsbergen 1979; Salton and McGill 1983; Frakes and Baeza-Yates 1992), P-norms (Salton et al. 1983), or fuzzy-logic conjunctive and disjunctive operators (Dubois and Prade 1984).

In some cases, aggregation is performed in an ad-hoc manner. For instance, in (Kraaij et al. 2002), link-based attributes such as in-degree and URL are used as priors in language models. Another way consists in aggregating evidence in two stages. In the first stage, text-based attributes are combined to get document scores. In the second stage, the resulting top-ranked documents are re-ordered according to link information, using techniques such as spreading activation or probabilistic argumentation (Savoy and Rasolofo 2000). Thus, these approaches do not explicitly use link-based attributes.

Each aggregation operator conveys a specific aggregation logic which reflects the degree of compensation we are ready to accept. In the IR literature, two main classes of operators are in use. The first class corresponds to a totally compensatory logic. It consists of building a single score using a more or less complex operator such as the weighted sum. For such operators, a very bad score on one criterion can be compensated by one or several good scores on other criteria. These operators often require inter-criteria information such as weights, which are sometimes difficult to define and interpret. Indeed, these weights aim at capturing both the relative importance of criteria and a normalization factor when criteria are expressed on different scales.

The second class corresponds to a non-compensatory logic. In this case, aggregation is mainly based on one criterion value such as the worst score or the score of the most important criterion. The remaining criteria are only used to discriminate documents with similar scores. This gives rise to min-based or lexicographic-based operators, variations of which are the discrimin and leximin operators (Boughanem et al. 2005). A clear weakness of this class of operators is that a large part of the scores is ignored or plays a minor role.
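As a toy illustration of the two logics (not tied to any particular retrieval model; the scores and weights below are made up), compare a weighted sum with a min-based operator on two hypothetical documents:

```python
def weighted_sum(scores, weights):
    # Totally compensatory: a bad score can be offset by good scores elsewhere.
    return sum(w * s for w, s in zip(weights, scores))

def min_score(scores):
    # Non-compensatory: the worst score alone determines the result.
    return min(scores)

# Document A is balanced; document B is excellent on two criteria, very poor on one.
a, b = [0.6, 0.6, 0.6], [0.9, 0.9, 0.1]
w = [1 / 3] * 3
print(weighted_sum(a, w), weighted_sum(b, w))  # ~0.60 vs ~0.63: B ranked first
print(min_score(a), min_score(b))              # 0.6 vs 0.1: A ranked first
```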

Moreover, in both classes, the imprecision underlying criterion design is not considered. This imprecision results from the fact that there are many acceptable formulations of the same criterion: for instance, Anh and Moffat (2002) proposed four alternative formulations of the tf criterion. Therefore, it is important to give a limited interpretation to criterion values, i.e., we should consider that slight differences in values are often not meaningful. This makes the resulting rankings more robust.

In this paper, we propose a multiple criteria framework which combines any set of criteria while taking into consideration the imprecision underlying the criteria design process. We first put emphasis on the importance of the design of good criterion families capturing complementary aspects of relevance and give clues to the design of such families. Then, we describe ranking procedures based on natural decision rules.

Multiple criteria techniques were previously used in IR, especially in information filtering (Pasi et al. 2007) as well as in data fusion (Bordogna et al. 2003; Bordogna and Pasi 2004). Nevertheless, the proposed methods are basically grounded in fuzzy set theory. In this paper, we use a different kind of aggregation mechanism.

The paper is organized as follows. We first introduce the multiple criteria framework where we describe the overall approach and its component phases (Sect. 2). Then, we highlight some specificities of the IR problem which are addressed in the proposed approach (Sect. 3). Section 4 deals with the modeling phase which consists in designing a set of relevance criteria. We present in Sect. 5, a filtering procedure whose purpose is to obtain a reduced set of potentially relevant documents. Section 6 shows how to aggregate such criteria and build the final ranking. The complexity of the whole approach is investigated at the end of this section. We report experimental results in Sect. 7 and provide conclusions in a final section.

2 A multiple criteria framework for IR

Many studies argued that the reason why no consensus has been reached on the relevance concept is that there are many kinds of relevance, not just one, as stated by Borlund (2003). Moreover, different sources of evidence are contributing to capture the relevance concept. Therefore, being able to make effective use of these sources of evidence can significantly improve retrieval effectiveness.

We propose a formal approach for IR where relevance is explicitly defined as multidimensional (by a set of criteria) and ranking is derived from pairwise comparisons of document performance vectors (document profiles) using decision rules identifying positive and negative reasons for judging whether or not a document should get a better ranking than another. The overall approach can be split into four phases (see Fig. 1) which will be detailed in the following sections:

  • The modeling phase consists in identifying various attributes affecting relevance. These attributes are used to develop a set of appropriate decision criteria which model different aspects of relevance. Each criterion will give rise to a partial preference relation (binary relation) modeling the way two documents are compared, according to that criterion.

  • The filtering phase aims at identifying the set of potentially relevant documents with respect either to the query structure or to the criterion family. In the first case, a boolean filter selects documents that match query terms and query formula. In the second case, a profile-based filter selects documents that satisfy an acceptance profile defined by minimal required values on some or all criteria.

  • The aggregation phase aggregates partial preference relations derived from pairwise comparisons of documents with respect to each criterion, into one or several global preference relations. A global preference relation indicates how two documents are compared with respect to all the considered criteria.

  • The exploitation phase processes global preference relations resulting from the previous phase in order to derive the final ranking.

The last two phases correspond to the ranking phase.

Fig. 1 Overall approach

It is worth noting that the proposed method is collection- and representation-independent to some extent. It can thus be used for any type of collection and combined with the best representation available. In fact, the context is mainly considered in the modeling phase in order to devise relevant criterion families.

3 Specificities of the IR problem

The IR problem can be considered as a multiple criteria decision problem when we explicitly consider the multidimensional nature of relevance. Nevertheless, it has some particularities that have an impact on the modeling phase as well as on the aggregation and exploitation phases.

3.1 Specificities for the modeling phase

Specificity 1: Two kinds of criteria need to be considered to assess document relevance: query-dependent and query-independent criteria.

Query-dependent criteria measure semantic proximity between documents and queries and are derived from attributes about the form of occurrences of query terms in the document and the collection. Examples of such attributes are term frequency (tf) and document frequency (df).

The evaluation of query-dependent criteria depends on the structure of the query. In fact, we should distinguish one-term queries from multi-term queries. Some criteria are only relevant in the second case. Moreover, for multi-term queries, two evaluation levels are required: (i) evaluation for each term of the query, and (ii) aggregation of these evaluations. Therefore, the design of such criteria deserves thorough analysis. This is addressed in Sect. 4.1.

Query-independent criteria mainly refer to characteristics of the document and the collection. They can be evaluated independently of the query. Examples of such criteria are document length (dl) and PageRank. We need such criteria to better discriminate between documents. In fact, queries consist of two or three terms on average, which is rarely sufficient to rank thousands or millions of documents.

Specificity 2: Criteria can play different roles depending on which phase they are used in. In the filtering phase, they are primarily used to build acceptance profiles which help separate potentially relevant documents from the rest. In the ranking phase, they are used for pairwise comparisons.

3.2 Specificities for the ranking phase

Specificity 3: Criteria to be used to establish relevance are not specified by the user. They are rather based on attributes that the IR community has found to best capture relevance. Consequently, it is difficult to get precise preference information regarding their relative importance. In this case, we assume that each criterion is neither prevailing nor negligible. Therefore, we should use appropriate ranking procedures.

Specificity 4: The query is too poor to justify a precise ranking of documents. One can expect that many of the ‘most relevant’ documents should be present at the head of the ranking, but their exact ranking is meaningless. This can also be justified in terms of user behavior when interacting with the result pages of search engines. In fact, eye-tracking studies of user behavior have shown that once users have started scrolling, rank becomes less of an influence on attention (Granka et al. 2004). Therefore, even if a ranking is a handy way of presenting results, its significance should not be overemphasized.

4 Modeling phase

In our context, a criterion models the relevance of documents with respect to a specific point of view. It is represented by a real-valued function g defined on the set of documents, which aims at comparing any pair of documents d and d′ on that point of view, as follows:

$$ g(d) \geq g(d^{\prime}) \Rightarrow d \hbox{ `is at least as relevant as' } d^{\prime} $$

For instance, considering the term frequency criterion (tf), it is common to consider that when a query term occurs more frequently in the body of document d than in document d′, then d is judged more relevant than d′, ceteris paribus: \(tf(d) \geq tf(d^{\prime}) \Rightarrow\,d\) ‘is at least as relevant as’ d′ according to criterion tf.

Choosing the right criterion family depends on the task at hand as well as the type of information that documents encompass. In fact, retrieving images or video sequences differs greatly from retrieving textual documents since each kind of information encompasses specific features. This choice should be undertaken with great care since it has an important impact on the final ranking.

Although many candidate criterion families could be derived from the same set of relevant attributes, we should nevertheless try to fulfill the following desirable requirements:

  • each criterion should be concerned with a specific point of view,

  • all attributes deemed to be important in comparing two documents should be captured by the set of criteria,

  • we should avoid redundancy, i.e., we should not consider the same attribute more than once; it is therefore better to have independent criteria in order not to favor some attributes over others, and

  • while building the criterion family, we should have in mind the way it will be used in the ranking process.

It is worth noting that many formulations of the same criterion are possible. Therefore, we should not overemphasize the criterion scores of documents. We briefly discuss two important issues of the modeling phase.

4.1 Evaluation of query-dependent criteria

To build some query-dependent criteria, such as the tf-like criterion, we need to make a clear distinction between one-term and multi-term queries. For one-term queries, building criteria poses no specific difficulty, but to deal with multi-term queries, i.e., conjunctive and/or disjunctive queries, we can proceed in two steps:

  • build a sub-criterion corresponding to each term of the query. Each literal of the query formula can therefore be evaluated accordingly,

  • select an aggregation operator corresponding to each query-type (conjunctive query, disjunctive query or a combination of both). This sub-aggregation step aggregates homogeneous partial measures derived from the previous step.

Since elements being aggregated in the sub-aggregation step are homogeneous, we can use analytic aggregation operators like conjunctive, disjunctive or compensatory operators (Dubois and Prade 1984), depending on the aggregation logic we wish to use and on the interpretation given to the juxtaposition of terms.

For instance, let us suppose that we want to assess the relevance of documents to some query \(q=t_{1} t_{2} \ldots t_{n_{q}}\) according to the tf criterion, where t k is a query term. In the first step, we compute the score of each document d for each query term t k , i.e., tf(d, t k ). In the second step, we combine these different scores into a single score using some aggregation operator such as the average operator, i.e., \(tf(d) = \frac{1}{n_q}\sum_{k=1}^{n_q} tf(d,\, t_k).\)
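A minimal sketch of this two-step evaluation (the per-document term-frequency map is a hypothetical input structure):

```python
def tf_criterion(doc_tf, query_terms):
    """Two-step evaluation of a query-dependent criterion for a multi-term query."""
    # Step 1: sub-criterion evaluated for each query term.
    per_term = [doc_tf.get(t, 0) for t in query_terms]
    # Step 2: sub-aggregation of these homogeneous scores (here, the average operator).
    return sum(per_term) / len(query_terms)
```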

4.2 Modeling imprecision

It is often inadequate to consider that slight differences in evaluation should give rise to clear-cut distinctions. This is particularly true when different formulations of criteria are acceptable. Imprecision underlying criteria design can be modeled using the following discrimination thresholds (Roy 1989):

  • An indifference threshold allows for two documents with close criterion values to be judged as equivalent. The indifference threshold basically draws the boundary between an indifference and a preference situation.

  • A preference threshold is introduced when we want or need to be more precise when describing a preference situation. It establishes the boundary between a situation of strict preference and a hesitation between indifference and preference, namely a weak preference.

A criterion g j , having indifference and preference thresholds, q j and p j , respectively (p j ≥ q j ≥ 0), is called a pseudo-criterion. Comparing two documents d and d′ according to a pseudo-criterion g j leads to the following partial preference relations:

$$ \left\{ \begin{aligned} d I_j d^{\prime} &\Leftrightarrow |g_j(d) - g_j(d^{\prime})| \leq q_j \\ d Q_j d^{\prime} &\Leftrightarrow q_j < g_j(d) - g_j(d^{\prime}) \leq p_j \\ d P_j d^{\prime} &\Leftrightarrow g_j(d) - g_j(d^{\prime}) > p_j \end{aligned}\right. $$

where I j , Q j and P j represent respectively indifference, weak preference and strict preference relations restricted to criterion g j . These three relations can be grouped into an outranking relation \(S_j = (I_j \cup Q_j \cup P_j)\) such that \(dS_j d^{\prime} \Leftrightarrow g_j(d)-g_j(d^{\prime}) \geq -q_j,\) which corresponds to the assertion d ‘is at least as relevant as’ d′ with respect to the aspects covered by criterion g j .

To model situations where a very low score of a document d′ with respect to d, according to some criterion g j , cannot be compensated by a good score on one or several other criteria, we use a veto threshold v j   (v j  ≥ p j ) and define the following veto relation \(V_j: d V_j d^{\prime} \Leftrightarrow g_j(d)-g_j(d^{\prime}) > v_j.\) In this case, d′ cannot be considered as `at least as relevant as’ d, whatever the scores on other criteria.
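A sketch of these definitions, assuming the scores and thresholds are given as plain numbers; it returns the partial relations holding for the ordered pair (d, d′) on one pseudo-criterion:

```python
def partial_relations(g_d, g_dp, q, p, v):
    """Relations holding for the ordered pair (d, d') on one pseudo-criterion,
    given indifference (q), preference (p) and veto (v) thresholds
    (v >= p >= q >= 0). A minimal sketch of the definitions above."""
    diff = g_d - g_dp
    rel = set()
    if abs(diff) <= q:
        rel.add("I")        # d I_j d' : indifference
    elif q < diff <= p:
        rel.add("Q")        # d Q_j d' : weak preference of d over d'
    elif diff > p:
        rel.add("P")        # d P_j d' : strict preference of d over d'
    if diff >= -q:
        rel.add("S")        # d S_j d' : outranking (I ∪ Q ∪ P)
    if diff > v:
        rel.add("V")        # d V_j d' : veto against "d' at least as relevant as d"
    return rel
```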

Figure 2 summarizes the different preference situations that can be derived from the comparison of two documents d and d′.

Fig. 2 Preference relations

We illustrate these different preference relations using the following example. Let us consider Table 1 which gives the scores of five documents evaluated according to a pseudo-criterion g. Table 2 gives the different thresholds of this criterion. In this illustration, we denote g ij  = g(d i ) − g(d j ), i.e., the difference of the scores of documents d i and d j according to criterion g. Table 3 reports the differences of document scores and Table 4 gives the relational interpretation of such differences. For instance, since \(q < g_{13}=0.3 \leq p,\) the weak preference relation holds between d 1 and d 3. Moreover, since \(g_{15} > v > p,\) both the strict preference relation and the veto relation hold between d 1 and d 5. This implies, in particular, that criterion g imposes its veto on the assertion ‘d 5 is at least as good as d 1’, whatever the scores on the other criteria.

Table 1 Documents scores according to g
Table 2 Threshold values of g
Table 3 Difference of document scores w.r.t. g
Table 4 Partial preference relations between documents w.r.t. g

5 Filtering procedure

In this section, we show how it is possible to get the top k most relevant documents using acceptance profiles. In fact, acceptance profiles allow us to single out the documents that can be considered better than the profile. Different procedures can be used to obtain the top k documents; we give one such procedure below.

Suppose that we have a set D of n documents, possibly resulting from the application of a first boolean filter, and we need to retain only the top k documents, using an acceptance profile, i.e., acceptance thresholds a j on each criterion g j . The problem is to define these values \(a_j\,(j=1, \ldots, p)\) such that the set of acceptable documents \(A=\{d\,{\in}\,D \mid g_j(d) \geq a_j\,\,(j=1, \ldots, p)\}\) has an approximate cardinality of k. A simple way of setting and adjusting the values \(a_j\,(j = 1,\ldots,p)\) is to use a single parameter α corresponding to a percentile applied to all criteria scales. Considering that we want to retain, for each criterion, a proportion α of the documents so as to retain globally a proportion \(\frac{k}{n}\) of the documents of D, α can be set to an initial value of \(\sqrt[p]{\frac{k}{n}}.\) Using a dichotomic procedure, α can then be adjusted so as to obtain the required size for the filtered set A.
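One possible implementation of this filter is sketched below (assumptions: `scores` maps each document to its vector of p criterion values, and the acceptance threshold a_j is taken as the score of the document at rank ⌈αn⌉ on criterion g_j):

```python
def profile_filter(scores, k, max_iter=30):
    """Profile-based filter of Sect. 5 (a minimal sketch): alpha starts at
    (k/n)^(1/p) and is adjusted dichotomically until roughly k documents
    reach the acceptance threshold on every criterion."""
    docs = list(scores)
    n = len(docs)
    p = len(next(iter(scores.values())))
    # Criterion scores sorted in decreasing order, once per criterion.
    sorted_cols = [sorted((s[j] for s in scores.values()), reverse=True)
                   for j in range(p)]

    def accepted(alpha):
        # a_j = score of the document at rank round(alpha * n) on criterion j.
        cut = max(1, min(n, round(alpha * n)))
        thresholds = [col[cut - 1] for col in sorted_cols]
        return [d for d in docs
                if all(scores[d][j] >= thresholds[j] for j in range(p))]

    lo, hi = 0.0, 1.0
    alpha = (k / n) ** (1.0 / p)          # initial value of alpha
    best = accepted(alpha)
    for _ in range(max_iter):             # dichotomic adjustment of alpha
        if len(best) > k:
            hi = alpha                    # too many documents: tighten
        elif len(best) < k:
            lo = alpha                    # too few documents: relax
        else:
            break
        alpha = (lo + hi) / 2
        best = accepted(alpha)
    return best
```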

6 Ranking procedure

In order to obtain a global relevance model on the set of documents, we use outranking approaches (Roy 1991), which are quite appropriate given the specificities of Sect. 3.2 and are based on a partially compensatory logic. They consist of two phases: an aggregation phase and an exploitation phase. We hereafter give details about both phases. In Sect. 6.3, we illustrate precisely how they work using a simple example.

6.1 Aggregation phase

Outranking approaches take as input the partial preference relations induced by the criterion family and aggregate them into one or more global preference relation(s) S. They are particularly relevant in our context since they (i) permit considering imprecision in document evaluations, (ii) can handle criteria expressed on heterogeneous scales, (iii) use all the available information on document performances, and (iv) do not necessarily require inter-criteria information, such as weights.

In order to accept the assertion dSd′, stating that ‘document d is at least as relevant as document d′’, the following conditions should be met:

  • a concordance condition which ensures that a majority of criteria are concordant with dSd′ (majority principle).

  • a discordance condition which ensures that none of the discordant criteria strongly refutes dSd′ (respect of minorities principle).

In this paper, we suppose that there is no available information on the relative importance of criteria. In this case, to accept the assertion dSd′, we use decision rules based on the criteria supporting (positive reasons) or refuting (negative reasons) this assertion. Obviously, the rules for defining this support may be more or less demanding, resulting in different outranking relations. For example, let

  • F = {g 1,…,g p } be a family of p criteria,

  • H be a global preference relation, where H is P, Q, I, V or S,

  • \(H^{-}\) be a relation such that \(d H^{-} d^{\prime} \Longleftrightarrow d^{\prime} H d,\)

  • H j be a partial preference relation, i.e., restricted to criterion g j ,

  • \(C(d H d^{\prime})=\{j \in F : d H_j d^{\prime}\}\) be the concordance coalition of criteria in favor of establishing dHd′, and

  • c(dHd′) be the number of criteria in C(dHd′)

A candidate outranking relation is:

$$ d S^1 d^{\prime} \Leftrightarrow C(dSd^{\prime}) = F $$
(1)

which is a well established, but usually poor, relation since it only holds if all the criteria are concordant with dSd′.

We can also use less demanding outranking relations such as:

$$ d S^2 d^{\prime} \Leftrightarrow c(d P d^{\prime}) \geq c(d P^- \cup Q^- d^{\prime}) \quad \hbox{and} \quad C(d V^- d^{\prime}) = \emptyset $$
(2)

To accept dS 2 d′, there should be more criteria concordant with dPd′ than criteria supporting a strict or weak preference in favor of d′. This corresponds to the concordance condition. At the same time, no discordant criterion should strongly disagree with this assertion. This corresponds to the discordance condition.

$$ \begin{aligned} d S^3 d^{\prime} &\Leftrightarrow c(d P d^{\prime}) \geq c(d P^- d^{\prime})\\ & \hbox{and}\,\, c(d P \cup Q d^{\prime}) \geq c(d P^- \cup Q^- d^{\prime})\\ &\hbox{and}\,\, C(d V^- d^{\prime}) = \emptyset\\ \end{aligned} $$
(3)

To accept dS 3 d′, only criteria that are concordant with dPd′ can counterbalance criteria supporting a strict preference in favor of d′, whereas criteria supporting a weak preference in favor of d′ can be counterbalanced by criteria concordant with either a strict or a weak preference in favor of d. At the same time, no discordant criterion should strongly disagree with this assertion.

Observe that these three relations get richer and richer, i.e., we have \(S^1 \subseteq S^2 \subseteq S^3,\) but they are less and less well-established.
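The relations S 1 and S 2 can be computed directly from the partial relations, e.g. as sketched below (assuming `partial(j, a, b)` returns the set of partial relations holding for the ordered pair (a, b) on criterion g j , such as the `partial_relations` sketch of Sect. 4.2):

```python
def outranks_s1(d, dp, criteria, partial):
    # Eq. 1: d S^1 d' iff every criterion is concordant with d S d'.
    return all("S" in partial(j, d, dp) for j in criteria)

def outranks_s2(d, dp, criteria, partial):
    # Eq. 2: d S^2 d' iff the criteria strictly preferring d are at least as
    # numerous as those strictly or weakly preferring d' (concordance), and
    # no criterion imposes a veto in favor of d' (discordance).
    c_p = sum(1 for j in criteria if "P" in partial(j, d, dp))
    c_pq_inv = sum(1 for j in criteria if {"P", "Q"} & partial(j, dp, d))
    no_veto = not any("V" in partial(j, dp, d) for j in criteria)
    return c_p >= c_pq_inv and no_veto
```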

It is worth noting that the proposed aggregation mechanism, which compares the size of various coalitions of criteria, does not require that criteria are defined on a common scale. Therefore, normalizing criterion scales, which is always somewhat arbitrary, is unnecessary in our approach.

6.2 Exploitation phase

Outranking relations are not necessarily transitive and do not lend themselves to immediate exploitation to get the final ranking. Therefore, we need exploitation procedures in order to derive the final document ranking. We propose the following procedure which finds its roots in (Roy and Hugonnard 1982). It consists in partitioning the set of documents into r ranked classes where each class C h contains documents with the same score. This is coherent with specificity 4 of Sect. 3.2. Considering that s outranking relations \(S^1 \subseteq \cdots \subseteq S^s\) have been defined, let:

  • R be the set of potential relevant documents for a query,

  • \(F_i(d,E) = card (\{d^{\prime} \in E: d S^i d^{\prime}\})\) be the number of documents in \(E (E \subseteq R)\) that could be considered ‘worse’ than d according to the global relation S i,

  • \(f_i(d,E)=card (\{d^{\prime} \in E : d^{\prime} S^i d\})\) be the number of documents in E that could be considered ‘better’ than d according to S i,

  • \(s_i(d,E) = F_i(d,E) - f_i(d,E)\) be the qualification of d in E according to S i.

Each class C h results from a distillation process. It corresponds to the last distillate of a series of sets \(E_0 \supseteq E_1 \supseteq \cdots \supseteq E_r\,\, (r \geq 1),\) where \(E_0 = R \setminus (C_1 \cup \cdots \cup C_{h-1})\) and E i is a reduced subset of \(E_{i-1}\) resulting from the application of the following procedure:

  1. compute for each \(d \in E_{i-1}\) its qualification according to S i, i.e., \(s_i(d,E_{i-1}),\)

  2. choose \(s_{\rm max} = \hbox{max}_{d \in E_{i-1}} \{s_i(d,E_{i-1})\},\) then

  3. set \(E_i = \{d \in E_{i-1} : s_i(d,E_{i-1}) = s_{\rm max}\}.\)

The distillation stops either when card(E r ) = 1 or when r = s.
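A sketch of this distillation process, assuming each outranking relation is given as a boolean function `S(d, dp)` (e.g. built from the relations of Sect. 6.1); it returns the ordered classes C 1, C 2, …:

```python
def rank_by_distillation(R, relations):
    """Exploitation procedure of Sect. 6.2 (a minimal sketch).
    `relations` is the list of nested outranking relations [S^1, ..., S^s]."""
    def qualification(d, E, S):
        F = sum(1 for dp in E if dp != d and S(d, dp))   # documents 'worse' than d
        f = sum(1 for dp in E if dp != d and S(dp, d))   # documents 'better' than d
        return F - f                                      # s_i(d, E)

    classes, remaining = [], list(R)
    while remaining:
        E = list(remaining)                # E_0 = R \ (C_1 ∪ ... ∪ C_{h-1})
        for S in relations:                # successive distillates E_1 ⊇ E_2 ⊇ ...
            quals = {d: qualification(d, E, S) for d in E}
            s_max = max(quals.values())
            E = [d for d in E if quals[d] == s_max]
            if len(E) == 1:                # stop when card(E_r) = 1
                break
        classes.append(E)                  # class C_h = last distillate
        remaining = [d for d in remaining if d not in E]
    return classes
```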

6.3 Illustrative example

This section tries to illustrate the concepts and procedures introduced previously. Let us consider a set of candidate documents R = {d 1, d 2, d 3, d 4, d 5}. Table 5 gives the performance vectors of the documents of R w.r.t. a family of four pseudo-criteria F = {g 1, g 2, g 3, g 4}. Indifference, preference and veto thresholds of these criteria are summarized in Table 6.

Table 5 Documents profiles
Table 6 Threshold values

We retain the outranking relations of Eqs. 1 and 2 to carry out the pairwise contests of the aggregation phase. We hereafter give details about the computation of the outranking relations S 1 and S 2: we thus give in Tables 7–10 all matrices corresponding to the outranking (S j ), weak preference (Q j ), strict preference (P j ) and veto (V j ) relations for each considered criterion g j . All these matrices derive directly from Tables 5 and 6. For instance, since |g 1(d 2) − g 1(d 1)| = |0.7−0.8| = 0.1 ≤ q 1, d 2 S 1 d 1 holds.

Table 7 Partial outranking relations
Table 8 Partial weak preference relations
Table 9 Partial strict preference relations
Table 10 Partial veto relations
Table 11 Global outranking relations

For the aggregation phase, let us consider the outcome of the pairwise comparison of document d 2 against d 5: on the one hand, we have C(d 2 S d 5) = {g 1, g 2, g 4} since d 2 S 1 d 5, d 2 S 2 d 5 and d 2 S 4 d 5 hold, i.e., criteria g 1, g 2 and g 4 are concordant with d 2 S d 5. On the other hand, criterion g 3 does not support this assertion, as shown in the matrix corresponding to relation S 3. Therefore, d 2 S 1 d 5 does not hold according to Eq. 1.

Criteria g 1 and g 2 are concordant with the assertion d 2 P d 5, therefore c(d 2 P d 5) = 2. At the same time, according to criterion g 3, d 5 is strictly preferred to d 2, therefore \(c(d_2 P^- \cup Q^- d_{5}) = 1.\) Moreover, the performance difference w.r.t. the same criterion g 3 is large enough for a veto to occur: \(c(d_2 V^- d_{5}) = 1.\) Finally, according to the definition of outranking relation S 2 of Eq. 2, although the concordance condition is met \((c(d_2 P d_{5}) \geq c(d_2 P^- \cup Q^- d_{5}))\) and argues for establishing d 2 S 2 d 5, the discordance condition refutes this assertion \((c(d_2 V^- d_{5}) \neq 0);\) therefore, \(d_2 S^2 d_{5}\) does not hold.

The computation of the global outranking relations S 1 and S 2 is reported in Table 11.

We now move to the illustration of the exploitation phase which is responsible for building the final ranking of the documents.

Observing that F k (d i , R) (resp. f k (d i , R)) is given by summing the values of the ith row (resp. column) of the matrix of outranking relation S k, the consensus ranking is obtained as follows:

Iteration 1: To get the first class C 1, we compute the qualifications of all the documents of E 0 = R with respect to S 1. They are respectively 0, 1, 2, −2 and −1. For instance, since F 1(d 3, R) = 3 and f 1(d 3, R) = 1, we have s 1(d 3, R) = 3 − 1 = 2. Therefore s max equals 2 and C 1 = E 1 = {d 3} since d 3 is the only document of the first distillate.

Iteration 2: To run a new iteration and compute the next class C 2, we first remove document d 3 from the outranking matrices by deleting its corresponding row and column in both matrices. We then compute the new qualifications of the documents of the new starting set \(E_0=R \setminus C_1 = \{d_1, d_2, d_4, d_5\}.\) They are respectively 0, 1, −1 and 0. Therefore, document d 2, having the maximum qualification 1, constitutes the only document of the second class C 2.

Iteration 3: To get the third class, we remove d 2 from both matrices of S 1 and S 2. The remaining documents d 1d 4 and d 5 have the same qualification value 0. Thus, the first distillate of this class is E 1 = {d 1, d 4, d 5}. We use the second outranking relation S 2 to reduce this set. The qualifications of the documents of E 1 are respectively 2, −1 and −1. Therefore, the second distillate is E 2 = {d 1} and corresponds to the third class C 3.

Iteration 4: Computing the qualifications of the remaining documents d 4 and d 5 gives the same value 0 with both S 1 and S 2. Therefore the last class C 4 contains both documents: neither relation permits building more refined ranked classes.

The consensus ranking is finally {d 3} → {d 2} → {d 1} → {d 4d 5}.

It is worth noting that using more standard aggregation operators leads to different rankings. For instance, supposing that document performances are normalized and that all criteria have similar weights, ranking documents according to the sum aggregation operator leads to the following ranking: {d 1} → {d 2} → {d 3} → {d 4} → {d 5}. In particular, document d 3, which is ranked first by the outranking approach, is ranked only third by the sum operator. This shows an important feature of outranking approaches: documents with acceptable and more balanced profiles (e.g., d 3) are preferred to documents with rather more contrasted profiles (e.g., d 1 and d 2), as shown in Fig. 3.

Fig. 3 Document profiles

6.4 Complexity of our approach

Before presenting experimental results, we briefly investigate and comment on the complexity of our approach.

Considering n documents to be processed and a family of p fixed criteria, the computation of their scores on the criteria can be performed in linear time O(n). The filtering phase also requires O(n) time, whereas the ranking phase requires O(n 2) time due to the computation of the various preference matrices during the aggregation phase. Note, however, that the filtering phase drastically reduces the number of documents to be processed by the ranking phase, which makes the whole approach quite efficient.

7 Experiments and results

7.1 Test setting

To facilitate empirical investigation of the proposed methodology, we developed a prototype search engine, named WIRES, that implements a preliminary version of our multiple criteria approach. In this paper, we apply our approach to the Topic Distillation (TD) task of the TREC-13 Web track (Craswell and Hawking 2004). In this task, there are 75 topics, for each of which only a short description is given. For the experiments, we translated each topic into a conjunctive query, following most search engine strategies. We built an inverted index of the ‘.GOV’ TREC test collection, where we consider word stems as index terms using the Porter stemming algorithm and discard common English stopwords. We also used the hyperlink structure of this collection to build link-based criteria.

At a first level, we had to define the set F of criteria, for which we used the following elementary features which are the main attributes used in the literature:

  • tf k : frequency of term t k in document d,

  • df k : number of documents the term t k occurs in,

  • max tf: maximum frequency tf k of all terms \(t_k \in d,\)

  • l k,a : a binary value which equals 1 if term t k occurs in location L a and 0 otherwise. The considered locations are the URL (L 1), the title (L 2), the keywords tag (L 3) and the description tag (L 4),

  • Γ(d): set of incoming hyperlinks to d,

  • Child(d): set of children documents of d. Document d′ is in Child(d) if it appears in a lower hierarchical level than d according to their site map,

  • prox: proximity of query terms in document d. It corresponds to the size (number of terms) of the smallest text excerpt from the document that contains all the query terms. It equals 0 if not all the query terms are in d,

  • ql: query length, i.e., the number of terms of the query,

  • dl: document length, and

  • depth(d): depth of the URL of d, which is the number of intermediary sub-directories between document d and the root of its corresponding site map.

Based on these features, we defined the following candidate criteria:

  • Frequency: For one-term queries (i.e., q = t k ), \(g_1(d,\,t_k)=\frac{tf_{k}}{\hbox{max}\, tf}\)

  • Position: For one-term queries, \(g_2(d,t_k)=\sum_{a=1}^{4} l_{k,a}\)

  • Authority: g 3(d) = card(Γ(d))

  • Prominence: g 4(d) = card(Child(d))

  • Proximity: \(g_5(d) = \frac{ql}{prox}\) if prox ≠ 0, and 0 otherwise

  • Document length: g 6(d) = dl(d)

  • Rareness: For one-term queries, g 7(dt k ) = df k

For multi-term queries, we used the average operator.
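For illustration, the criteria above could be computed from the elementary features roughly as follows (a sketch; the `doc` record and its field names are hypothetical and do not describe the actual WIRES implementation):

```python
def criterion_scores(doc, query_terms):
    """Scores of a document on the seven candidate criteria (illustrative only)."""
    nq = len(query_terms)

    def avg(values):                       # average operator for multi-term queries
        return sum(values) / nq

    g1 = avg([doc["tf"].get(t, 0) / doc["max_tf"] for t in query_terms])   # Frequency
    g2 = avg([sum(doc["loc"].get(t, [0, 0, 0, 0])) for t in query_terms])  # Position
    g3 = len(doc["in_links"])                                              # Authority
    g4 = len(doc["children"])                                              # Prominence
    g5 = nq / doc["prox"] if doc["prox"] else 0                            # Proximity
    g6 = doc["dl"]                                                         # Document length
    g7 = avg([doc["df"].get(t, 0) for t in query_terms])                   # Rareness
    return [g1, g2, g3, g4, g5, g6, g7]
```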

It is worth noting that the concrete choice of the features as well as of the criterion family should be tailored to the specific application context. For our experiments, we focused on features which capture the major well-known sources of IR evidence. The exact definition of each criterion tries to capture intuitive preferences but remains somewhat arbitrary; we therefore considered simple formulations, although more refined formulations could be used as well.

In the TD task, a successful relevance ranking should favour ‘good entry points’ although they could contain little detailed information. This is captured by the prominence criterion (g 4).

For evaluation, we used the standard ‘trec_eval’ tool, used by the TREC community to calculate the standard measures of system effectiveness: Average precision \({\tt (AvP)},\) \({\tt R}\)-\({\tt precision}\) (\({\tt R}\)-\({\tt p}\)), Reciprocal rank \({\tt (r}\)-\({\tt r)}\) and Success@n \({\tt (S@n)}\) (see, e.g., Craswell and Hawking (2004)).

The effectiveness of our approach is compared against some high performing official results from TREC-13 using the paired t-test, which is shown to be highly reliable (more so than the sign or Wilcoxon tests) according to Sanderson and Zobel (2005). In the experiments, significance testing is mainly based on the \({\tt t}\)-\({\tt student}\) statistic, computed on the basis of the \({\tt AvP}\) values of the compared runs. In the tables of the following section, statistically significant differences are marked with an asterisk.

7.2 Results

With the criteria described above, we performed several retrieval runs. In the first set of runs, we ranked documents according to each single criterion and report performances in Table 12. The aim is to show which criteria are really relevant for the TD task.

Table 12 Performances of single criterion runs

Table 12 shows that the run with the prominence criterion (g 4) performs significantly better than the others. Runs carried out with the first four criteria perform significantly better than runs carried out with the last three. Moreover, the random run \({\tt random}\) performs better than these last three. Therefore proximity (g 5), document length (g 6) and rareness (g 7) do not play an important role for the TD task.

In the second set of runs, we only considered the best four criteria, i.e., criteria g 1–g 4. In our baseline run \({\tt (mcm)},\) the set R of potentially relevant documents is obtained in two stages: we first use the boolean filter to identify a first set A which is then extended to a set A + that includes each document pointing to at least two documents in A. Many of the added documents should, in fact, correspond to good entry points to relevant sites. In the aggregation procedure of Sect. 6.1, each criterion is supposed to be a pseudo-criterion whose indifference, preference and veto thresholds are set to 20%, 60% and 90%, respectively. These thresholds were set after some tuning carried out on the TREC-12 Web track TD topics. We suppose that there is no information on the relative importance of criteria and use the outranking relation S 2 defined by (2). We implement the exploitation procedure of Sect. 6.2.

We now assess the impact of profile filtering on performance using the procedure presented in Sect. 5, which allows us to get a reasonably small set R of documents. We carried out runs in which we varied the number of filtered documents: for each run \({\tt mcm}\)-\({\hbox{\tt filter}}\)-\({\tt x},\) x corresponds to the number of filtered documents.

Table 13 shows that the \({\tt mcm}\)-\({\hbox{\tt filter}}\)-\({\tt x}\) runs differ only with respect to \({\tt AvP}\) and \({\tt R}\)-\({\tt p}.\) All the other measures remain the same, because all these runs have the same ranking at the top. When we filter 50 documents, \({\tt AvP}\) decreases rather significantly, whereas the \({\tt R}\)-\({\tt p}\) measure slightly increases. Performances do not significantly decrease with respect to those of \({\tt mcm}\) when we filter 1,000, 800 or 500 documents. We can conclude that filtering is beneficial for IR since it considerably reduces the size of the set of documents to be compared in the ranking procedure without leading to a significant performance drop.

Table 13 Impact of filtering procedure

We now compare our baseline run \({\tt mcm}\) with other aggregation strategies.

In Table 14, we report the performances of four aggregation operators: the max, min, sum and product operators. For these runs, document performances are normalized so that they range in the [0,1] interval. The best performing run is the \({\tt sum}\) run, but its performances are significantly worse than those of \({\tt mcm}\). This shows that a totally compensatory logic (e.g., the \({\tt sum}\) and \({\tt prod}\) runs) as well as a non-compensatory logic (e.g., the \({\tt max}\) and \({\tt min}\) runs) perform worse than a partially compensatory logic (e.g., the \({\tt mcm}\) run) based, for example, on outranking approaches.

Table 14 Different aggregation strategies

We end this section by reporting the performances of the official runs from TREC-13 (Craswell and Hawking 2004) and comparing our approach against them.

In Table 15, we first report the performances of the best runs of the first five teams which participated in the track. Then, we computed the average and median performances of all the submitted runs. From this table, we can see that \({\tt mcm}\) has performances similar to those of the best runs. Moreover, \({\tt mcm}\) performs significantly better than the \({\tt average}\) or the \({\tt median}\) runs.

Table 15 Performance comparison with official runs

8 Conclusions

In this paper, we propose a multiple criteria framework for evidence combination where a set of candidate relevance criteria are proposed and used to determine how documents should be ranked using a set of decision rules.

The proposed approach overcomes the limits of classical analytical retrieval formulas, which do not allow considering complex aggregation logics when combining various criteria. It is also straightforward to show that the proposed approach, based on decision rules, fulfills the intuitive and desirable formal requirements that any reasonable retrieval method should satisfy according to the work of Fang et al. (2004). Interestingly, these authors show that none of the formulas used in the vector space model, the probabilistic model, or the language model satisfies these requirements unconditionally.

From these first TREC experiments, this work seems to have the potential for high impact in the field of IR, given the possible applications of evidence combination. It presents the advantage of being applicable whatever the collection under consideration, provided that a pertinent criterion family is used. It also overcomes criteria heterogeneity problems by using a set of decision rules which are easy to grasp. Moreover, the proposed approach makes it easy to consider domain- and context-specific criteria in a natural way, rather than using complex formulas which are difficult to interpret.

Approaches from multiple criteria decision theory, and especially outranking approaches, are generally used as an aid for decision makers. In the TREC context, there are various assessors judging documents with different and even conflicting preferences. This is the main reason why it seems difficult to obtain significantly better performances. At the same time, we can outline an advantage of the proposed approach: the study can easily be carried out from the user's perspective by setting a criterion family according to his/her preferences, giving rise to a personalized and valuable aid.

Future work will consist of additional experiments to strengthen the results. More specifically, applying our method in a human-centered context would be an interesting extension of our work. Also, in this paper we considered that each criterion is neither prevailing nor negligible. When there is some evidence that some criteria are more important than others, without being able to assign precise values, specific outranking approaches such as MELCHIOR (Leclercq 1984) are more appropriate.