Parameters and optimization

This paper addresses some basic features of many of the measures of retrieval effectiveness in current or past use. The objective is to understand aspects of their behaviour, with particular reference to the optimization of parameters of a retrieval model or ranking function. This is a theoretical paper.

Many current models of retrieval, and ranking functions derived from them, contain free tunable parameters. A constantly recurring theme of work in this field is the use of experimental data of the sort traditionally used for retrieval system evaluation as training data on which to tune such parameters. Almost all contributions to (for example) the annual TREC test involve some degree of learning/training some aspects of the method on the basis of previous test data. Retrieval methods and ranking functions may have any number of tunable features or parameters, from one to several hundred (typically, web search engines come in at the upper end of this scale).

Before attempting any form of optimization, we have to decide which measure or measures we would like to optimize: the measure of retrieval effectiveness. In the present paper, we consider only the problem of optimizing a single measure. For a training set of documents, queries and relevance judgements, and any particular set of parameter values for our ranking function, we can evaluate our chosen effectiveness measure. Thus, we regard the measure as a function on the space of all possible combinations of parameter values; we seek an optimum in this space. In principle, if the measure has some convenient mathematical properties in this space (specifically continuity and convexity), there exist optimization methods that tune the parameters and find the effectiveness-maximizing setting in reasonable time. However, as we shall see, most such measures (including all the usual IR effectiveness metrics) have no such convenient properties.

With one or a handful of parameters, it is often possible to explore the parameter space exhaustively (at least to some degree of granularity). That is, we may define a grid of points over the parameter space, and specifically evaluate the measure at every point on the grid. With (say) 10 continuous parameters, this starts to become infeasible; one might instead explore parameters singly or in smaller sets, covering all in a heuristically specified sequence of separate optimizations, possibly iterating the sequence. With (say) 100, it becomes almost imperative to consider other methods. The field of machine learning offers some such methods, often based on some form of gradient descent (or ascent) of the objective function, the function that we want to optimize. Essentially such methods follow a trajectory of points through parameter space—at each point a decision is taken as to where the next point should be.
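As a concrete illustration of the grid approach mentioned above, here is a minimal sketch, assuming a hypothetical evaluate function that ranks the training collection under a given parameter vector and returns the chosen effectiveness measure:

```python
import itertools

def grid_search(evaluate, grids):
    """Exhaustively evaluate the chosen effectiveness measure at every point
    of a grid; `grids` is a list of per-parameter value lists."""
    best_params, best_score = None, float("-inf")
    for point in itertools.product(*grids):
        score = evaluate(point)   # one full ranking + evaluation of the training set
        if score > best_score:
            best_params, best_score = point, score
    return best_params, best_score

# e.g. two parameters, 11 grid points each: 121 evaluations
# best, score = grid_search(evaluate, [[i / 10 for i in range(11)]] * 2)
```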

But here is the rub: not one of the measures of retrieval effectiveness in common use is suitable for gradient ascent. Measures based on ranked output are discontinuous, non-differentiable functions of the parameters, and cannot be used (at least in any simple, straightforward way) for this purpose.

The primary purpose of this paper is to explore and further understand the characteristics of traditional rank-based effectiveness measures when these are regarded as functions in the parameter space. The basic tool is an analysis of the behaviour of the measures as we move along trajectories in parameter space. It turns out that all rank-based measures share some interesting properties in this regard. In part at least, this behaviour is probably understood at an intuitive level by people working in the field. However, we believe that the formal analysis does yield some surprising results, which will help to improve our understanding. The results apply to parameter spaces of any dimensionality.

The paper does not attempt to provide solutions to the problems raised. Some practical consequences are discussed, but the main aim is understanding.

In Section 2, we develop a characterization of traditional effectiveness measures, which allows us in Section 3 to generalize about a wide range of rank-based measures. In this section we develop the central idea of the paper, which concerns the behaviour of measures of effectiveness as we move around the parameter space; the theory is explored more fully in the Appendix. In Section 4, we consider some alternative approaches, score-based measures and non-gradient-descent methods. We present some conclusions in the final section.

Measures of effectiveness

The measures proposed for or used in retrieval tests may be characterized in any number of different ways. The present characterization is primarily intended to emphasize the similarities rather than the differences, and to identify a significant subset of the totality of measures about which we can generalize.

Set-based measures

The original measures of Recall and Precision were defined in terms of a set of retrieved documents. The output of the system was assumed to be a yes–no judgement on each document, thus a single undifferentiated set of documents is retrieved from the collection. The F and FBeta measures commonly used in text categorization are similar, as are the various measures (often forms of utility) used in the various Filtering tasks at past TRECs (see e.g., Robertson and Soboroff, 2002). As we have seen in the latter case (Robertson, 2002), a retrieval function typically involves both a ranking function and a threshold, both of which have to be optimized. This is a complication not considered further in the present paper.

For convenience, we reiterate the definitions of Recall (proportion of all the relevant documents that are retrieved) and Precision (proportion of all the retrieved documents that are relevant). F (FBeta) is a (weighted) harmonic mean of Recall and Precision, and Utility is typically defined as a function of the number of relevant documents retrieved discounted by the number of non-relevant retrieved.
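For concreteness, a small sketch of these set-based measures for a single query; the utility weights shown are illustrative assumptions rather than values prescribed by any particular task:

```python
def set_measures(retrieved, relevant, beta=1.0, u_rel=2.0, u_nonrel=-1.0):
    """Recall, Precision, F-beta and a simple linear utility for one query;
    `retrieved` and `relevant` are sets of document ids."""
    rel_ret = len(retrieved & relevant)
    recall = rel_ret / len(relevant) if relevant else 0.0
    precision = rel_ret / len(retrieved) if retrieved else 0.0
    if precision + recall == 0:
        f_beta = 0.0
    else:
        b2 = beta ** 2
        f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    utility = u_rel * rel_ret + u_nonrel * (len(retrieved) - rel_ret)
    return recall, precision, f_beta, utility
```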

Rank-based measures

Most measures in common use, including some adaptations of the above set-based measures, assume system output in the form of a ranking of documents. This ranking normally derives from a scoring function: in relation to a given query, each document is assigned a real-valued score, and the documents are ranked according to this score. Ranking is assumed to start at the highest score (the document which seems most likely to satisfy the information need).

For convenience in later discussion, some of these measures are briefly defined below; it is assumed for the purpose of definition that (a) the ranking is complete, with no ties; (b) all documents are judged for relevance, on either a binary or a discrete ordered scale; and (c) we are dealing with a single query/topic only. All these matters are discussed further below.

Average Precision (AP)

Determined by locating the relevant documents in the ranking, calculating precision at those points, and averaging over all relevant documents for the query (this is ‘non-interpolated’ AP).

Precision at n (P@n)

Defined as precision calculated at rank n, for any integer n.

RPrecision (RPrec)

Defined as precision calculated at rank R, where R is the total number of relevant documents for this query.

Recall (Rec)

In the form used in TREC and elsewhere, this is defined as recall at rank n, for some integer n which is usually set to 1000.

Discounted Cumulative Gain (DCG)

To each relevance grade, a utility or gain figure is assigned. To each rank position, a discount is assigned (usually following a simple formula). DCG is calculated by accumulating the gain, discounted by the discount, as we traverse the ranking. Note that this definition specifies a whole family of measures, depending on the choice of both gain parameters and discount function (Järvelin and Kekäläinen, 2000).

Search Length

The number of non-relevant documents seen by the time the user reaches the nth relevant document, for some integer n (Cooper, 1968).

Correct Pairs

The number of pairs of documents which are ranked in the correct order, taken from a list of n candidate pairs of documents (relevant/non-relevant or more/less relevant). Note: this measure is not itself used in evaluation, but is the basis for \({\mathbf{bpref}}\), discussed below.

Success at n (S@n)

Defined as whether or not a relevant document has been retrieved (irrespective of how many) by rank n, for some integer n.

Reciprocal Rank

Defined as the reciprocal of the rank of the first relevant document.
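To make these definitions concrete, a minimal sketch for a single query, taking the binary or graded judgements in rank order; the log2 discount shown for DCG is one common choice from the family just described, not the only one:

```python
import math

def average_precision(ranked_rel, R):
    """Non-interpolated AP: `ranked_rel` holds binary judgements in rank
    order; R is the total number of relevant documents for the query."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / i       # precision at this relevant document
    return total / R if R else 0.0

def precision_at(ranked_rel, n):
    return sum(ranked_rel[:n]) / n

def reciprocal_rank(ranked_rel):
    return next((1.0 / i for i, rel in enumerate(ranked_rel, start=1) if rel), 0.0)

def dcg(gains, discount=lambda rank: 1.0 / math.log2(rank + 1)):
    """`gains` are the graded gains in rank order; the discount shown is one
    common member of the DCG family."""
    return sum(g * discount(i) for i, g in enumerate(gains, start=1))
```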

All of the above measures depend to some degree on the actual ranking of the documents. All of them also place very strong emphasis on the early part of the ranking—at least in normal retrieval circumstances, where both the total number of relevant documents and the chosen value of n in any of the above is orders of magnitude smaller than the collection size (but see below for discussion of Correct Pairs and bpref).

Multiple queries

All the above measures are typically averaged over a set of queries. In the case of measures which naturally lie between zero and one, this is straightforward (below we refer to Mean Average Precision, MAP, and Mean Reciprocal Rank, MRR). For those with arbitrary scales, it is usual to normalize them per-query, so as to give equal weight to each query. Thus:

Normalized Discounted Cumulative Gain (NDCG)

DCG is normalized by dividing by its maximum possible value for the query. This maximum is calculated by notionally ranking the documents in the optimum order for the query (all documents of higher relevance before all documents of lower relevance). NDCG can then be averaged over queries.
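A self-contained sketch of this normalization, again with an assumed log2 discount:

```python
import math

def ndcg(gains, discount=lambda rank: 1.0 / math.log2(rank + 1)):
    """DCG divided by the DCG of the ideal (gain-sorted) ordering."""
    def dcg(gs):
        return sum(g * discount(i) for i, g in enumerate(gs, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# e.g. graded gains in rank order for one query: ndcg([0, 3, 0, 1, 0])
```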

Search Length

Cooper does not propose a method for averaging search length over queries.

Correct Pairs

As with DCG, this must be divided by its maximum, which is the number of candidate pairs chosen for the measure (see discussion of bpref below).

All the above methods of averaging start from the principle that each query is to be given equal weight (as opposed, for example, to averaging non-normalized DCG, which would give more weight to queries with many relevant documents). Without departing from this principle, we may seek to reweight different ranges of the measure concerned. A current approach to emphasizing poor performance is to take a geometric mean of the measure rather than the usual arithmetic mean—or equivalently to apply a log transformation to the measure before averaging. All the arguments in this paper apply to geometric means such as GMAP as well as to arithmetic means such as MAP.
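A sketch of the log-transform view of the geometric mean; the epsilon guarding against zero values is an illustrative choice, not a prescribed constant:

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values, eps=1e-5):
    # equivalent to averaging the log-transformed values and exponentiating;
    # eps is an illustrative guard against zero per-query values
    return math.exp(sum(math.log(max(v, eps)) for v in values) / len(values))

# MAP  = arithmetic_mean(per_query_AP)
# GMAP = geometric_mean(per_query_AP)
```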

Unjudged documents

Usually not all documents in the collection have been judged for relevance. We may have the situation that a pool including all the top-ranked documents from all the searches being evaluated has been judged. Or it may be that such a pool has been judged, but we are evaluating a new search, which might retrieve some unjudged documents. In this case it might be a reasonable assumption that most of the top-ranked documents have been judged, and furthermore that those that have not been judged are most likely to be non-relevant. For most of the above measures, the assumption that all unjudged documents are non-relevant provides a reasonable approximation in such circumstances, because of the built-in bias to the early part of the ranking. The exception is Correct Pairs, which has no such bias and would be unworkable if we counted all known-relevant/unjudged pairs as well as all known-relevant/known-non-relevant pairs. The simple choice for Correct Pairs is to consider only judged documents (but see the discussion of bpref that follows).

When the relevance judgements are thought to be seriously incomplete, these arguments look more suspect. The bpref measure was introduced by Buckley and Voorhees (2000) to deal with this situation. It involves the Correct Pairs measure (normalized by its maximum per-query and then averaged over queries). However, the main issue concerns the choice of pairs. Out of the various methods investigated by Buckley and Voorhees, they chose the following: take all R known relevant documents for the query, and the top R ranked known non-relevant documents, making \({R^2}\) candidate pairs (this value \({R^2}\) is now the maximum number of correct pairs). This heuristic gives bpref the characteristic that the other measures have, of being strongly biased towards the early part of the ranking. It also gives some degree of stability and correlation with MAP, according to their experiments.
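A sketch of the pair-selection heuristic just described; this is a simplified reading of the construction, not a drop-in replacement for the official bpref implementation:

```python
def correct_pairs_bpref(ranking, judgements):
    """All R judged-relevant documents vs. the top R ranked judged-non-relevant
    documents, with correct pairs normalized by the maximum R*R.
    `ranking` is a list of doc ids in rank order; `judgements` maps judged
    doc ids to True (relevant) or False (non-relevant)."""
    relevant = {d for d, rel in judgements.items() if rel}
    R = len(relevant)
    if R == 0:
        return 0.0
    top_nonrel = [d for d in ranking if d in judgements and not judgements[d]][:R]
    rank_of = {d: i for i, d in enumerate(ranking)}
    correct = 0
    for rel_doc in relevant:
        for non_doc in top_nonrel:
            # a pair is correct if the relevant document is ranked above the
            # non-relevant one; unranked relevant documents contribute nothing
            if rel_doc in rank_of and rank_of[rel_doc] < rank_of[non_doc]:
                correct += 1
    return correct / (R * R)
```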

Note that such heuristics may depend on the judging regime. In the case of TREC corpora (used by Buckley and Voorhees), judgements have usually been made on deep pools of documents taken from the output of a variety of different search systems for each query. This means that there is almost always a large number and variety of judged non-relevant documents—certainly many times more than judged relevant. Even when they sample to simulate missing judgements, this remains the case. However, if judgements are based on shallow pools from one or a small number of systems, different methods may be called for.

Truncation and ties

The above discussion assumes that there are no ties and that the ranking for each query is complete (the entire collection is ranked). Most practical ranking methods involve far-from-complete rankings, which we might regard as having a single bucket at the end containing a very large number of ties. In addition, there may be ties earlier in the ranking.

Measures like P@n depend only on the top n ranks; n is usually chosen so that the tail-end bucket is not an issue. Measures like MAP are usually truncated at some n; the precision at the rank of any relevant document that is not in the top n is assumed to be zero; this is a reasonable approximation. Similar considerations apply to NDCG. Correct Pairs in the form of bpref is similarly not affected by the tail-end bucket.

Ties earlier in the ranking have to be dealt with; the full specification of any measure has to include a suitable method. Note however that many modern ranking algorithms use such a variety of features that ties (apart from the large end bucket) are exceedingly unlikely. Note also that (in a somewhat earlier era) Cooper (1968) devoted most of his Search Length paper to dealing with ties. The name he gave to the measure is Expected Search Length; the ‘expected’ refers solely to the tie situation.

We may deal with (i.e., break) ties in a way that depends on the measure (e.g., pessimistic), or in an arbitrary way, either random or determinate. In effect the Cooper method was random. In the determinate category, we may have some arbitrary unique document id, and break ties by means of a secondary ranking by id. For the purposes of the present paper, we will assume some arbitrary deterministic tie-breaking, determined by the documents only: that is, given any pair of documents which may obtain equal scores for one or more queries, we apply a predetermined ordering of the two documents, which is independent of the query.
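A minimal sketch of such deterministic, query-independent tie-breaking, using an arbitrary but fixed document id as the secondary key:

```python
def rank_documents(scores):
    """Sort by score (descending), breaking ties by document id (ascending);
    `scores` maps doc id -> score."""
    return sorted(scores, key=lambda d: (-scores[d], d))

# rank_documents({"d3": 1.2, "d1": 1.2, "d2": 0.7})  ->  ["d1", "d3", "d2"]
```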

We will encounter the problem of ties in a slightly different way below.

Redundancy

The above measures make no reference to possible duplication or redundancy. The effect of redundancy is that a document that might on its own be judged relevant becomes less useful because a document seen earlier by the user has made it more or less redundant. This effect was partially simulated in the filtering task at TREC by means of a non-linear utility measure, and possibly also by Cooper’s Search Length. But the issue is attacked much more directly by various authors, including in recent work on INEX (Kazai et al., 2004). Note that this is not at all the same as the discount in DCG: the rank-based discount is applied irrespective of how many relevant documents have been seen earlier.

Score-based measures

A few measures have been based on the actual scores given to documents, rather than the resulting ranks. These are discussed further below.

Trajectories in parameter space

We now return to the consideration of parameters and parameter optimization. Assume that we have a ranking function with n free continuous parameters, and that we are exploring the resulting n-dimensional parameter space. Further assume that the mode of exploration is a smooth trajectory through the space. Note that in practical optimization trajectories are not smooth—they normally involve a series of discrete steps. However, looking in principle at smooth trajectories will allow us to see characteristics which may not be visible in stepwise trajectories.

Now consider the behaviour (as we follow the trajectory) of any one of the above rank-based measures—let us say measure M.

Flips

A rank-based measure can only change when two documents switch ranks. Without loss of generality we can look only at pairs of documents that are currently neighbours in the ranking for any particular query. We assume for simplicity of this discussion that we are looking at just one query, although the argument does not depend on this assumption. A switch between two neighbouring documents in a ranking is described as a flip. As we follow the trajectory, our chosen measure M will remain the same until we encounter a flip, at which point it may change. A flip is good or bad or neutral, in the obvious sense that moving a relevant document above a non-relevant one (or a more relevant document above a less relevant one) is good.

More formally, we have relevance judgements \({\hbox{Rel}(q,d)}\), which may be binary or multi-valued discrete, and a scoring function \({s_{\mathbf{P}}(q,d)}\), where \({\mathbf{P}}\) is the vector of parameters defining the specific scoring/ranking function, and q and d are query and document, respectively. A flip occurs between documents \({d_1}\) and \({d_2}\) if the trajectory passes a point at which the sign of \({s_{\mathbf{P}}(q,d_1) - s_{\mathbf{P}}(q,d_2)}\) reverses. A flip is good if e.g., \({\hbox{Rel}(q,d_1) > \hbox{Rel}(q,d_2)}\) and the flip causes \({s_{\mathbf{P}}(q,d_1) - s_{\mathbf{P}}(q,d_2)}\) to go from negative to positive.
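A sketch of flip detection along a discretized trajectory, assuming a hypothetical parameterized scoring function score_fn(params, query, doc):

```python
def find_flips(score_fn, query, d1, d2, trajectory):
    """Walk a discretized trajectory of parameter vectors and report the steps
    at which the sign of s_P(q, d1) - s_P(q, d2) reverses, i.e. the flips
    between the two documents."""
    flips, prev_sign = [], None
    for step, params in enumerate(trajectory):
        diff = score_fn(params, query, d1) - score_fn(params, query, d2)
        sign = (diff > 0) - (diff < 0)
        if prev_sign is not None and sign != 0 and sign != prev_sign:
            flips.append(step)       # a flip occurred between step-1 and step
        if sign != 0:
            prev_sign = sign
    return flips
```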

As before, we need to consider also the point at which the two scores are equal. We have already assumed a tie-breaking method, so that even given equal scores, the ranking is fully defined, and the flip actually occurs either upon reaching the flipping point, or upon leaving it. This matter is discussed further in the Appendix.

M may or may not respond to a flip—that is, it may ignore (treat as neutral) some good or bad flips. But if it does respond, it will reward a good flip and penalize a bad one; and it will always be neutral about a neutral one. This statement applies precisely to every single one of the above rank-based measures, both for individual queries and for averages across queries. It is at least arguable that a measure which behaved differently would not be a sensible measure of effectiveness. Measures vary in regard to (a) which flips they respond to, and (b) how much they respond, but they never vary in the direction of response.

To make a formal statement of the above:

Assertion 1

Flips may be identified unambiguously (modulo the relevance judgements in a particular training set) as good or bad or neutral. Any reasonable rank-based measure of effectiveness, including each of those defined above, satisfies the following: as we move along a trajectory in parameter space, the measure:

  (a) does not change until a flip is encountered for one of the queries in the training set;

  (b) does not change if the flip is neutral;

  (c) may or may not change if the flip is good;

  (d) may or may not change if the flip is bad;

  but

  (e) will only reward a good flip or penalize a bad one.

A short summary illustrates the question of which flips each measure responds to: MAP and NDCG respond to all good or bad flips (or, if truncated, to all such flips down to the truncation point). P@n responds only to flips between ranks n and n + 1; S@n similarly, but then only if there are no higher-ranked relevant documents. MRR responds only to flips involving the top-ranked relevant document. bpref responds only to flips involving the chosen pairs.

The diagrams at the end of the Appendix represent a two-parameter space; the straight lines represent flip boundaries. That is, a trajectory in this space encounters a flip when it crosses a boundary. The full details of this example are discussed in the Appendix, but the illustration may help the reader to visualize the situation.

Optima

Now suppose we have two measures M1 and M2 which respond to the same flips (such as MAP and NDCG based on binary relevance, or two different NDCG functions, with the same truncation point, or MAP and GMAP). Further, we assume that we seek to optimize effectiveness, in a certain model parameter space, according to one or other of these measures, and on a specified training set of documents, queries and relevance judgements. We have the following result:

Theorem 1

On the assumption that a trajectory encounters only one flip at a time, M1 and M2 will share all local optima.

Suppose we start at a local optimum of M1: that is, at a point in the parameter space such that, if we follow any trajectory away from this point, M1 remains constant or gets worse. This means that, whatever trajectory we take, the first flip we encounter and to which M1 responds must be a bad flip. Since M2 responds in the same direction to the same flips as M1, we must also have started at a local optimum of M2. \({\square}\)

The mathematical detail of this result is discussed further in the Appendix. The assumption sounds rather strong, but the further discussion in the Appendix specifies it more precisely and relaxes it somewhat; the relaxed assumption appears to be at least reasonable. The resulting version of the theorem is quite robust.

Example 1

We here show an example of the way in which flips control the optima. The example is a very simple one, consisting of the following:

  • A single query.

  • A collection of 100 non-relevant documents and two relevant documents, \({\mathsf{a}}\) and \({\mathsf{b}}\). We assume that \({\mathsf{a}}\) is highly relevant and \({\mathsf{b}}\) less so.

  • A ranking function in the form of a linear combination of two features, controlled by a single parameter (the weight of feature 2). The trajectory in this case is simply a scan along this parameter.

  • Effectiveness measures AP and NDCG. Since there is a single query involved, no averaging is required. For NDCG, the gain for \({\mathsf{b}}\) is fixed at 1, while that for \({\mathsf{a}}\) is varied, giving three variant NDCGs, denoted NDCG(\({g_{\mathsf a}}\)) where \({g_{\mathsf a}=1,3,10}\).

The distribution of feature values is indicated in Fig. 1; Fig. 2 shows the four effectiveness measures as functions of the single parameter. Note the following:

  • The measures agree almost entirely on which flips to respond to. The only disagreement is about a flip involving the two relevant documents \({\mathsf{a,b}}\) (which occurs once on this trajectory). For AP and NDCG(1) this flip is neutral; NDCG(3) and NDCG(10) both prefer \({\mathsf{a}}\) to rank above \({\mathsf{b}}\). The flip is represented by a small section of the graph (weight 2.5–3) where the former two measures are flat, but the latter two are not.

  • With this single exception, the four graphs agree everywhere about local optima. Even in this simple example, there are 7 or 8 local optima.

  • Each optimum is actually a small flat region; properly, the graph should be step-like, consisting of a series of horizontal line segments, joined by vertical drops at the exact points where flips occur. However, for simplicity a single point has been chosen in each flat region.

  • AP and NDCG(1) agree on placing the global optimum at a weight of approximately 0.8, where \({\mathsf{a}}\) has rank 12 and \({\mathsf{b}}\) is at 4; the other two agree on a global optimum at about 2, where the ranks are 8 and 7, respectively.

  • Other variations (for example different effectiveness measures, or NDCG with different discounts as well as gains) would promote to global status other local optima. Indeed, it would be possible to discover or invent a sensible measure to promote any one of the 8 local optima to global status.
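A minimal sketch of how curves of this kind can be produced; the feature values below are randomly drawn stand-ins for the actual data of the example (shown in Fig. 1), so the exact positions of the optima will differ from the figures:

```python
import math
import random

random.seed(0)
# hypothetical stand-in data: 100 non-relevant documents plus the two relevant
# ones, a (highly relevant, grade 2) and b (less relevant, grade 1)
docs = [(random.random(), random.random(), 0) for _ in range(100)]
docs += [(0.9, 0.2, 2), (0.3, 0.8, 1)]          # (feature1, feature2, grade)

def ap_and_ndcg(weight, gain_a=3):
    """Score = feature1 + weight * feature2; return AP (on binarized
    relevance) and NDCG with gain(b) = 1 and gain(a) = gain_a."""
    grades = [g for _, _, g in sorted(docs, key=lambda d: -(d[0] + weight * d[1]))]
    hits, ap = 0, 0.0
    for i, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            ap += hits / i
    gain = {0: 0.0, 1: 1.0, 2: float(gain_a)}
    dcg = sum(gain[g] / math.log2(i + 1) for i, g in enumerate(grades, start=1))
    ideal = sorted((gain[g] for g in grades), reverse=True)
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return ap / 2, dcg / idcg

# scan the weight of feature 2, as in Fig. 2, to expose the flat steps and flips
for w in [x / 20 for x in range(81)]:
    print(w, *ap_and_ndcg(w))
```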

Fig. 1 Example 1: one query, 100 non-relevant and two relevant documents. Two features. The figure shows the distribution of documents by feature.

Fig. 2 Example 1: effectiveness as a function of the weight of feature 2.

Discussion of the theorem

Note that it is an empirically observed fact that the global optima of different measures differ. For example, in the TREC Robust Track (Voorhees, 2006), it has been found that methods designed to enhance MAP are not necessarily good for GMAP. The above result might therefore seem slightly surprising. At minimum, it means that a global optimum of one of the measures must also be a local optimum (at least) of the other (MAP and GMAP at least agree precisely on which flips to respond to).

The reality is likely to be that there are many different local optima, in a given training collection with multiple queries. A measure that responds to many flips will have many optima; a measure that responds to fewer flips will have fewer optima, but larger flat areas (large areas of the parameter space that it cannot distinguish).

However, the shape of the measure surface in parameter space will depend a great deal on the sample size (size of the training set). The larger the training set, the more local optima there will be, but probably the less significant each optimum will be. As we enlarge the training set, we may imagine the surface on which we are trying to find an optimum showing finer and finer grain variations. At the same time, larger training sets appear to smooth things out, if we ignore the very small-scale variations. We may reach a point where a grid-based exploration will simply not see these small variations.

This paper includes no analysis of such large-sample behaviour. There are many interesting questions here. Given a large enough sample, the apparent smoothness might allow an approximate gradient analysis based on observing empirical gradients over finite grid steps. However, the underlying small-scale variation prohibits the use of gradient functions derived by differentiation on the rank-based measures.

Some alternative approaches

Score-based measures

In the machine learning community, there are well-understood criteria for the kind of objective function that optimizers would like to be given. Ideally, the objective function should be a continuous, differentiable, convex function of the parameters. Differentiability allows a gradient-descent type of procedure; convexity ensures that there is exactly one (global) optimum. It is very clear that none of the above rank-based measures comes near to satisfying these requirements.

Could we nevertheless define an effectiveness measure for IR which has these properties? One possible route would be to define the measure in terms of scores rather than ranks (since we would expect the scoring function to be continuous in the parameters at least). There has been a scattering of proposals for such measures, beginning with Swets (1963) in the 1960s.

There is a problem in principle with such measures. If we assume that the only user-visible output of the system is the ranking, then this output is independent of any monotonic transformation of the scores. An effectiveness measure defined on scores will in general be affected by any such transformation, and an optimum parameter set defined by optimizing such a measure will also generally be affected. It might be argued that the system can and should reveal scores as well as ranks, or even (Cooper et al., 1994) that scores should be calibrated into realistic probabilities and then shown to users. Nevertheless, the likely outcome is that the user primarily responds to the ranking, and only secondarily to the scores. Thus, an optimum that changes with monotonic transformations of the score seems perverse.

Nevertheless, measures based on scores might be candidates for gradient descent optimization, on the grounds that we may be able to discover (if not an optimum) at least a reasonably good parameter set for the rank-based measure that we really want to optimize. An early contribution along these lines is presented by Bartell (1994), and a similar approach is taken more recently in RankNet (Burges et al., 2005). In both cases the measure is based on scores, specifically on score differences between pairs of documents. This has the advantage of giving very limited monotonic-transformation independence (independence of zero-shifts), though also the disadvantage associated with bpref, that the pairs have to be chosen very carefully. Bartell's (1994) measure is intended to estimate a rank correlation; Burges et al.'s (2005) measure is tested against NDCG.
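As a generic illustration (not the exact formulation of either cited approach), a smooth pairwise objective on score differences might look like the sketch below; it is differentiable in the parameters whenever the scoring function is, and it is invariant to shifting all scores, though not to general monotonic transformations:

```python
import math

def pairwise_logistic_loss(score_fn, params, query, pairs):
    """Average logistic loss on s(d_pos) - s(d_neg) over a chosen set of
    (more relevant, less relevant) document pairs; `score_fn(params, query, doc)`
    is the parameterized scoring function."""
    loss = 0.0
    for d_pos, d_neg in pairs:
        diff = score_fn(params, query, d_pos) - score_fn(params, query, d_neg)
        loss += math.log(1.0 + math.exp(-diff))   # small when the pair is ordered correctly
    return loss / len(pairs)
```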

There are some optimization methods and related techniques now being developed at the interface of machine learning and information retrieval. In the RankNet work, for example, the validation set heuristic is used not just to prevent overfitting the model to the training data, but also to help overcome the fact that the measure used in training is not the one we really care about. Some further developments are explored in Burges (2005). Other methods have been proposed in the general area of ranking optimization (for example in Freund et al., 2003; Herbrich et al., 2000).

Such methods may give reasonable practical solutions. However, they leave one wondering if another method would have found something closer to a real global optimum. They render it difficult to reason about or have confidence in the result. There is a well-established principle in optimization that one should optimize the objective function that one is really interested in. We seem to be being forced away from this principle.

Optimization without trajectories

So far we have discussed some characteristics of IR performance measures as we move the parameters along some trajectory across the parameter space in an optimization procedure, and how this may affect the optimization of parameters with gradient descent methods. There exist, however, some optimization methods that do not create a trajectory in parameter space and do not use gradient information. We briefly discuss the most important of these methods here and argue that they do not provide an effective alternative for parameter optimization in today's high-dimensional IR ranking functions.

Non-gradient methods treat the function as a black box and require only that it can be evaluated at any given point of its domain. In one dimension, Brent's method (Press et al., 2002) is guaranteed to find a local minimum relatively fast; this is done by bracketing the minimum and using a parabolic approximation of the function within the bracket to move quickly towards the minimum point. In higher dimensions, direction-set methods (Press et al., 2002) (e.g., Powell's method) can be used; these apply Brent's method iteratively along a sequence of directions to find the minimum. However, all of these methods require very large numbers of function evaluations, each of which is extremely costly in our case, since it requires ranking the collection with respect to every query, sorting, and evaluating the performance measure. Furthermore, the storage requirements of these methods grow quadratically with the number of dimensions, which can be a problem when many hundreds of dimensions are used. In practice, such methods are only useful in IR for ranking functions of 20 parameters or fewer. Finally, all these methods rely on bracketing the minimum and will converge to a local minimum; to have a chance of finding the global minimum, the minimization procedure needs to be run multiple times from random starting points.
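For illustration, a sketch of how a direction-set optimizer might be applied in this setting, using SciPy's Powell implementation with random restarts and the same hypothetical evaluate function as before (negated, because the library minimizes):

```python
import numpy as np
from scipy.optimize import minimize

def powell_with_restarts(evaluate, x0, restarts=5, scale=1.0, seed=0):
    """Minimize the negated effectiveness measure with Powell's direction-set
    method, restarting from several random points because each run converges
    only to a local optimum. `evaluate(params)` is assumed to rank the training
    collection and return the chosen measure (higher is better)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        start = np.asarray(x0, dtype=float) + scale * rng.standard_normal(len(x0))
        result = minimize(lambda p: -evaluate(p), start, method="Powell")
        if best is None or result.fun < best.fun:
            best = result
    return best.x, -best.fun
```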

The Genetic Algorithms approach (Mitchell, 1996) is an alternative to bracketing methods; here a population of possible solutions (leading to low values of the function) is perturbed in different ways to find, with high probability, the regions of the function domain with lowest values. Because there is a non-zero probability of visiting any domain region, there is a probabilistic guarantee that we will find the global (as well as the local) minima at some point, although this may require an unreasonable amount of time. More interestingly, the algorithm does not get stuck on local minima and iteratively improves the solutions. However, it may take a very long time to find reasonably good solutions and some heuristic needs to be designed to determine convergence. For these reasons, genetic algorithms are only used when other minimization methods (such as gradient-based or direction-set methods) cannot be.

Discussion and conclusions

Optimization over a substantial parameter space has become a significant issue for many current approaches to search. The measures that we use in evaluating search systems are central to optimization, and the nature and characteristics of the measures may make for difficulties in the choice or design of optimization methods.

In this paper, we have shown how a careful investigation of the way a measure might vary across a parameter space yields insights into the problem. In particular, it is clear that for a given training set, not only do the different IR effectiveness measures have different global optima, but they also probably have very many local optima. If we consider any one of these measures as a function in parameter space, it looks like a very lumpy function, making many abrupt steps up or down as we move around the parameter space. Many different measures share the same lumps, even if they differ substantially on the global optimum. We have expressed this result in a way which is independent of the dimensionality of the parameter space, the performance measure and the ranking method used, and which in particular applies to ranking algorithms with many parameters.

We regard this insight as valuable in its own right, as part of our general understanding of the rank-based measures and of the optimization problem. But the insight also leads to various suggestions concerning implications and future work:

  • There is scope for further investigation of smooth cost functions which nevertheless approximate or predict the rank-based functions.

  • There is scope for the better understanding of the statistical behaviour of the rank-based functions under discrete steps involving many flips.

  • When choosing an optimization method, one should seek a method that avoids getting stuck in local optima.

The first suggestion is already the subject of some work, mostly using smooth cost functions based on scores. However, as noted in the previous section, it is hard to see how such a function can avoid the problem concerning monotonic transformations of the scores. The second looks more difficult to get a handle on, but goes to the heart of the problem, in the sense that we are not interested in the training set per se, but in generalizing from it. As indicated in Section 4.2, there are optimization methods which target the third item. It is not obvious that existing methods are particularly suited to this task, but at least there is scope for developing them.