Experiments with Kemeny ranking: What works when?

https://doi.org/10.1016/j.mathsocsci.2011.08.008

Abstract

This paper performs a comparison of several methods for Kemeny rank aggregation (104 algorithms and combinations thereof in total) originating in social choice theory, machine learning, and theoretical computer science, with the goal of establishing the best trade-offs between search time and performance. We find that, for this theoretically NP-hard task, in practice the problems span three regimes: strong consensus, weak consensus, and no consensus. We make specific recommendations for each, and propose a computationally fast test to distinguish between the regimes.

In spite of the great variety of algorithms, there are few classes that are consistently Pareto optimal. In the most interesting regime, the integer program exact formulation, local search algorithms and the approximate version of a theoretically exact branch and bound algorithm arise as strong contenders.

Highlights

► We compare 104 algorithms for Kemeny rank aggregation across several datasets.
► We find that the problem instances span 3 regimes: strong, weak, and no consensus.
► We make specific algorithmic recommendations for each regime.
► We also propose a computationally fast test to distinguish between the 3 regimes.
► The integer program, local search, and branch and bound algorithms perform well.

Introduction

Preference aggregation has been extensively studied by economists under social choice theory. Arrow discussed certain desirable properties that a ranking rule must have and showed that no rule can simultaneously satisfy them all (Arrow, 1963). Thus, a variety of models of preference aggregation have been proposed, each of which satisfy subsets of properties deemed desirable. In theoretical computer science, too, many applications of preference aggregation exist, including merging the results of various search engines (Cohen et al., 1998, Dwork et al., 2001), collaborative filtering (Pennock et al., 2000), and multiagent planning (Ephrati and Rosenschein, 1993).

The Kemeny ranking rule (Kemeny and Snell, 1962) is widely used in both of these areas. From Arrow’s Axioms’ point of view, the Kemeny ranking is the unique rule satisfying the independence of irrelevant alternatives and the reinforcement axiom (Young and Levenglick, 1978). More recently, it has been found to have a natural interpretation as the maximum likelihood ranking under a very simple noise model proposed by Condorcet (Young, 1995). The same noise model was proposed independently in statistics by Mallows (1957) and refined by Fligner and Verducci (1986), under the name Mallows’ model. Further extensions that go beyond the scope of this paper but that are relevant to modeling preferences have been proposed by Mao and Lebanon (2008) and Meilă and Bao (2010).
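The noise model mentioned above places probability on rankings that decays exponentially with their Kendall distance from a central ranking. As an illustrative sketch (not the paper's code; the dispersion parameter theta and the brute-force normalization over all permutations are our own simplifications, feasible only for small n):

```python
import math
from itertools import permutations

def kendall_distance(a, b):
    """Minimum number of adjacent transpositions turning ranking a into b."""
    pos = {item: i for i, item in enumerate(b)}
    seq = [pos[item] for item in a]
    # Count inversions; O(n^2) is fine at illustrative sizes.
    return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
               if seq[i] > seq[j])

def mallows_pmf(center, theta, items):
    """P(pi) proportional to exp(-theta * d(pi, center)), Mallows' model."""
    perms = list(permutations(items))
    weights = [math.exp(-theta * kendall_distance(p, center)) for p in perms]
    z = sum(weights)  # normalizing constant, computed by brute force
    return {p: w / z for p, w in zip(perms, weights)}

pmf = mallows_pmf(center=(1, 2, 3), theta=1.0, items=(1, 2, 3))
best = max(pmf, key=pmf.get)  # the central ranking is the mode
```

Under this model, the Kemeny ranking of a sample of votes coincides with the maximum likelihood estimate of the central ranking, which is what gives the rule its statistical interpretation.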

Finding the Kemeny ranking is unfortunately a computational challenge, since the problem is NP-hard even for only four votes (Bartholdi et al., 1989, Cohen et al., 1998). Since the problem is important across a variety of fields, many researchers across these fields have converged on finding good, practical algorithms for its solution. There are formulations of the problem that lead to exact algorithms, of course without polynomial running time guarantees, and we present two of these in Section 3.1. There are also a large number of heuristic and approximate algorithms, and we enumerate several classes of these algorithms in Section 3.2.

Surveying the various approaches is only the first step, since we are interested in finding which algorithms (or combinations of algorithms) are likely to perform well in practice. For this, we perform a systematic comparison, running all algorithms on a batch of real-world and synthetic Kemeny ranking tasks.

Since we are attacking an intractable task, what we can examine is the trade-off between computational effort and quality of the obtained solution. We will examine which algorithms, if any, are systematically achieving optimal trade-offs. We hope thereby to be able to recommend reliable methods to practitioners.

Such an analysis was recently undertaken by Schalekamp and van Zuylen (2009), henceforth abbreviated as SvZ, with whose work several parallels exist. While our experimental setup adds little to theirs, we differ fundamentally in the conclusions we draw. That is because we factor in the difficulty of the actual problem, something that SvZ do not consider. Since both the performance and the running time of an algorithm can change with the difficulty of the problem, we expect that the “best” algorithm to use will differ depending on the operating regime.

Hence, we propose several natural measures of difficulty for the Kemeny ranking problem, and re-examine the algorithms’ performance results from this perspective.

The rest of the paper is structured as follows. The next section formally defines the Kemeny ranking problem and introduces the relevant notation. Sections 3.1 Exact algorithms, 3.2 Approximate algorithms review the algorithms, while Section 4 presents the datasets we used for experimental evaluation and how they were obtained. The experimental results follow in Section 5; this section already gives some insights into what algorithms are promising, and into the role of the problem difficulty in shaping the performance landscape. We further examine these findings in Section 6. Section 7 discusses the different regimes of difficulty of the Kemeny ranking problem and proposes a simple heuristic to distinguish between them in practice. The paper concludes with Section 8.

Section snippets

The Kemeny ranking problem

Given a profile of $N$ rankings $\pi_1, \pi_2, \ldots, \pi_N$ over $n$ alternatives, the Kemeny ranking problem is the problem of finding the ranking $\pi_0 = \arg\min_{\pi} \frac{1}{N}\sum_{i=1}^{N} d(\pi_i, \pi)$. In the above, $d(\pi, \pi')$ represents the Kendall distance between two permutations $\pi, \pi'$, defined as the minimum number of adjacent transpositions needed to turn $\pi$ into $\pi'$. The ranking $\pi_0$ is the ranking that minimizes the total number of disagreements with the
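The objective above can be evaluated directly by exhaustive search. A minimal sketch (our own illustration, feasible only for small n since it enumerates all n! rankings):

```python
from itertools import permutations

def kendall_distance(a, b):
    """Minimum number of adjacent transpositions turning ranking a into b."""
    pos = {item: i for i, item in enumerate(b)}
    seq = [pos[item] for item in a]
    # Number of inversions of a relative to b.
    return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
               if seq[i] > seq[j])

def kemeny_brute_force(profile):
    """Return a ranking minimizing the total Kendall distance to the profile."""
    items = profile[0]
    def total_distance(candidate):
        return sum(kendall_distance(pi, candidate) for pi in profile)
    return min(permutations(items), key=total_distance)

profile = [(1, 2, 3), (1, 2, 3), (2, 1, 3)]
kemeny = kemeny_brute_force(profile)  # -> (1, 2, 3)
```

The exponential search space is exactly why the exact and approximate algorithms surveyed in Section 3 are needed for realistic n.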

Algorithms

We aimed for an ensemble of algorithms that would allow for various trade-offs between running time and accuracy. We first discuss the algorithms we experimented with that return an exact solution to the Kemeny ranking problem. Then, we discuss algorithms that return an approximate solution.

Datasets

In this section, we discuss the datasets that we used for our experiments. In collecting these datasets, we aimed for a variety of lengths (n), number of rankings (N), and degrees of consensus. To this end, we used real-world and synthetic datasets. We first describe the real-world and then move on to the synthetic datasets.

Methodology

We ran each algorithm described above on each data set (i.e. 104 algorithms × 49 data sets) and recorded the running time, resulting ranking, and other variables.

As a measure of algorithm accuracy we use the number of pairwise disagreements between the algorithm’s output ranking and all the data, i.e. $\mathrm{cost}(\pi_0) = \sum_{i=1}^{N} d(\pi_i, \pi_0)$. The above is nothing else than the rescaled r.h.s. of Eq. (1), so it reflects how well the algorithm has optimized the desired objective. We call this objective the cost of
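The cost need not be computed by re-scanning all N votes for each candidate ranking. A common device (our own sketch, not necessarily the paper's implementation) is to precompute a pairwise tally Q, where Q[(a, b)] counts votes placing a before b; the cost of any ranking is then the number of vote-pairs it contradicts:

```python
def pairwise_tally(profile):
    """Q[(a, b)] = number of rankings in the profile placing a before b."""
    q = {}
    for pi in profile:
        for i, a in enumerate(pi):
            for b in pi[i + 1:]:
                q[(a, b)] = q.get((a, b), 0) + 1
    return q

def kemeny_cost(candidate, q):
    """Total pairwise disagreements with the profile, sum_i d(pi_i, candidate)."""
    cost = 0
    for i, a in enumerate(candidate):
        for b in candidate[i + 1:]:
            # Every vote preferring b over a disagrees with placing a first.
            cost += q.get((b, a), 0)
    return cost

profile = [(1, 2, 3), (1, 2, 3), (2, 1, 3)]
q = pairwise_tally(profile)
kemeny_cost((1, 2, 3), q)  # -> 1: only the third vote prefers 2 before 1
```

After the one-time O(N n^2) tally, each candidate ranking is scored in O(n^2) independent of N, which matters when many candidates are evaluated, as in local search.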

The Pareto curves and general observations

Across all experiments described here we observe the following. First, no algorithm reaches both the lowest cost and lowest runtime simultaneously over all experiments. This is why in our experimental results, we traced the Pareto boundary, i.e. the set of algorithms that cannot be bettered in both cost and running time by another algorithm.
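Tracing the Pareto boundary is itself a simple computation. A sketch with made-up (cost, runtime) numbers; the algorithm names are placeholders, not results from the paper:

```python
def pareto_boundary(results):
    """Keep entries not dominated in both cost and running time.

    `results` maps algorithm name -> (cost, runtime); lower is better for both.
    An entry is dominated if some other entry is at least as good on both
    criteria and strictly better on one.
    """
    frontier = {}
    for name, (cost, time) in results.items():
        dominated = any(
            c <= cost and t <= time and (c < cost or t < time)
            for other, (c, t) in results.items() if other != name
        )
        if not dominated:
            frontier[name] = (cost, time)
    return frontier

# Hypothetical numbers for illustration only.
results = {
    "exact_IP": (100, 60.0),       # best cost, slowest
    "local_search": (102, 0.5),
    "greedy": (110, 0.4),          # fastest
    "slow_heuristic": (108, 2.0),  # dominated by local_search
}
front = pareto_boundary(results)
```

With these numbers, `slow_heuristic` drops out because `local_search` beats it on both axes, while the other three each offer a trade-off no competitor dominates.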

The shape of the Pareto boundary tells us about the possible trade-offs between run time and accuracy. It is easy to observe that in practically all

Regimes of difficulty: strong, weak, and no consensus

In the tasks above, we presented the algorithms with difficulties ranging from hard to easy. The reader has probably noted that the success of the algorithms (and implicitly the difficulty of the problem) did not depend on n alone, but also on the data distribution. In particular, the concentration of the data around the Kemeny ranking had a strong influence on the algorithms’ ability to optimize the cost, and for some algorithms (like B&B) it also influenced the running time. We now discuss
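One natural way to quantify the concentration of the data is the average pairwise Kendall distance among the votes themselves, normalized by its maximum. This is a generic illustration of such a measure, not the specific test the paper proposes in Section 7:

```python
from itertools import combinations

def kendall_distance(a, b):
    """Minimum number of adjacent transpositions turning ranking a into b."""
    pos = {item: i for i, item in enumerate(b)}
    seq = [pos[item] for item in a]
    return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
               if seq[i] > seq[j])

def dispersion(profile):
    """Average pairwise Kendall distance between votes, normalized to [0, 1].

    Values near 0 indicate votes clustered tightly around a common ranking
    (strong consensus); larger values indicate weak or no consensus.
    """
    n = len(profile[0])
    max_d = n * (n - 1) / 2  # distance between a ranking and its reversal
    pairs = list(combinations(profile, 2))
    return sum(kendall_distance(a, b) for a, b in pairs) / (len(pairs) * max_d)
```

For instance, a profile of identical votes has dispersion 0, while a profile containing a ranking and its exact reversal has dispersion 1.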

Conclusions

We have performed an extensive comparison of algorithms for the Kemeny rank aggregation problem, originating in social choice theory, machine learning, and theoretical computer science. The problem being NP-hard, the focus has been on establishing the best trade-offs between search time and performance.

Acknowledgments

We thank the anonymous reviewers for the constructive feedback that significantly improved this paper. MM acknowledges partial support from NSF award IIS-0535100.

References (31)

  • N. Betzler et al., Fixed-parameter algorithms for Kemeny rankings. Theoretical Computer Science (2009)
  • N. Ailon et al., Aggregating inconsistent information: ranking and clustering.
  • K.J. Arrow, Social Choice and Individual Values (1963)
  • J. Bartholdi et al., Voting schemes for which it can be difficult to tell who won. Social Choice and Welfare (1989)
  • N. Betzler et al., Partial kernelization for rank aggregation: theory and experiments.
  • J. Borda, Memoire sur les Elections au Scrutin (1781)
  • S. Chanas et al., A new heuristic algorithm solving the linear ordering problem. Computational Optimization and Applications (1996)
  • W.W. Cohen et al., Learning to order things. Journal of Artificial Intelligence Research (1998)
  • Coleman, T., Wirth, A., 2008. Ranking tournaments: local search and a new algorithm. In: Proceedings of the Workshop on...
  • Conitzer, V., Davenport, A., Kalagnanam, J., 2006. Improved bounds for computing Kemeny rankings. In: Proceedings of...
  • Copeland, A., 1951. A ‘reasonable’ social welfare function. Technical Report. Seminar on Applications of Mathematics to...
  • Davenport, A., Kalagnanam, J., 2004. A computational study of the Kemeny rule for preference aggregation. In:...
  • P. Diaconis et al., Spearman’s footrule as a measure of disarray. Journal of the Royal Statistical Society, Series B (1977)
  • Dwork, C., Kumar, R., Naor, M., Sivakumar, D., 2001. Rank aggregation methods for the web. In: Proceedings of the 10th...
  • B. Efron et al., An Introduction to the Bootstrap (1993)