New measurements for search engine evaluation proposed and tested

https://doi.org/10.1016/S0306-4573(03)00043-8

Abstract

A set of measurements is proposed for evaluating Web search engine performance. Some measurements are adapted from the concepts of recall and precision, which are commonly used in evaluating traditional information retrieval systems. Others are newly developed to evaluate search engine stability, an issue unique to Web information retrieval systems. An experiment was conducted to test these new measurements by applying them to a performance comparison of three commercial search engines: Google, AltaVista, and Teoma. Twenty-four subjects ranked four sets of Web pages and their rankings were used as benchmarks against which to compare search engine performance. Results show that the proposed measurements are able to distinguish search engine performance very well.

Introduction

The astonishing growth of the Web has propelled the rapid development of Web search engines. However, the evaluation of these search engines has not kept pace with their development. Such evaluation is significant in two ways: it helps Web users choose among search engines, and it informs the development of search algorithms and search engines.

Decades of research, from the classic Cranfield experiments to the ongoing TREC, have established a set of standard measurements for the evaluation of information retrieval systems. Among the most commonly used criteria are recall and precision. Precision is the proportion of retrieved documents that are relevant, while recall is the proportion of relevant documents that are retrieved (Voorhees & Harman, 2001, p. 5). However, it is very difficult, if not impossible, to apply these measurements directly to the evaluation of Web information retrieval systems because of the unique nature of the Web. Variations of these measurements have been proposed and used in earlier studies, as discussed under “Related studies” below. Most of these modified measures still rely on binary relevance decisions (relevant vs. non-relevant) or multi-level discrete relevance judgements (e.g. relevant, partially relevant, non-relevant). This study proposes a set of measurements based on a continuous ranking (from the most relevant to the least relevant) of the set of experiment documents (or Web pages). A previous study showed that human subjects are able to make relevance judgements on a continuous scale (Greisdorf & Spink, 2001). The main justification for using a continuous ranking rather than discrete relevance judgements is that retrieval results from Web search engines are typically ranked. Measurements based on ranking therefore provide a better “match” with the system being evaluated.
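For reference, the standard definitions quoted above (Voorhees & Harman, 2001) can be written set-theoretically; here R denotes the set of documents relevant to a query and A the set actually retrieved (notation introduced only for illustration):

    \[
      \text{precision} = \frac{|R \cap A|}{|A|},
      \qquad
      \text{recall} = \frac{|R \cap A|}{|R|}
    \]

Because the full set R of relevant pages on the Web cannot be enumerated, recall in particular cannot be computed directly, which is part of why these measures transfer poorly to Web retrieval.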

Studies on the evaluation of Web search engines have proposed various measures, ranging from coverage to response time (e.g. Chu & Rosenthal, 1996; Gwizdka & Chignell, 1999). However, none has recommended performance stability as an evaluation criterion. While traditional information retrieval systems such as DIALOG return very stable results for a given query executed multiple times, Web search results can be very unstable because of the unique environment in which Web search engines operate. For example, Web search engines may truncate results to improve response times during peak periods of activity. Multiple databases or multiple indexes, which are not always identical, may be used by the same search engine to respond to user queries (Mettrop & Nieuwenhuysen, 2001, pp. 641–642). It is therefore very important to include performance stability in any Web search engine evaluation. If a search engine is not stable, the results obtained from it for evaluation purposes may be a fluke and may not represent its general performance. Another reason for testing search engine stability is to provide information to researchers who use Web search engines to collect data, e.g. in Webometrics research. These researchers need to know how stable a search engine is in order to gauge the reliability of the data they collect. This study therefore proposes a set of measurements to evaluate performance stability.
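The specific stability measurements are defined in the full paper; the passage above only motivates them. As a rough, hypothetical illustration of the kind of week-to-week fluctuation such measures capture, the Python sketch below compares the URL sets returned for the same query in two weekly runs using a simple Jaccard overlap (an assumed example, not one of the paper's three measures).

    # Illustrative only: quantify week-to-week fluctuation of a result set as the
    # Jaccard overlap of the URL sets returned for the same query in two runs.
    def jaccard_overlap(results_a, results_b):
        """Proportion of URLs shared between two result lists for the same query."""
        a, b = set(results_a), set(results_b)
        if not a and not b:
            return 1.0  # two empty result sets are trivially identical
        return len(a & b) / len(a | b)

    # Hypothetical weekly snapshots of the top results for one query.
    week1 = ["example.com/a", "example.com/b", "example.com/c"]
    week2 = ["example.com/a", "example.com/c", "example.com/d"]
    print(f"Week-to-week overlap: {jaccard_overlap(week1, week2):.2f}")  # 0.50

A perfectly stable engine would score 1.0 across all of its weekly runs; lower values indicate the kind of fluctuation in document accessibility described by Mettrop and Nieuwenhuysen (2001).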

An experiment was conducted to test the proposed measurements by applying them to the comparison of three Web search engines. Four sets of Web pages corresponding to four queries were retrieved from the search engines, and the ranking of these pages by each engine was recorded. Twenty-four human subjects ranked the four sets of Web pages, and their rankings were used as the benchmark against which the ranking results of the search engines were compared. A search engine whose ranking was closer to the human ranking was considered better. To assess the stability of the search results, the queries were run on each search engine once a week over a 10-week period.


Related studies

Many publications compare or evaluate Web search engines (e.g. Notess, 2000). Perhaps the best known of these are Search Engine Watch (http://www.searchenginewatch.com) and Search Engine Showdown (http://www.searchengineshowdown.com). However, many of these publications did not employ formal evaluation procedures with rigorous methodology. Some papers that describe advances in search algorithms gave anecdotal evidence instead of formal evaluations (e.g. Brin & Page, 1998). Only studies that

Proposed measurements

Two measurements are proposed as counterparts of traditional recall and precision. In contrast to the calculations of recall and precision, which are based on binary relevance judgements, the proposed measurements are calculated based on a continuous relevance ranking (from most relevant to least relevant) by human subjects. In addition, a set of three measurements is proposed to evaluate the stability of search engine performance.

Experiment to test the proposed measurements

An experiment was carried out to test the proposed measurements by applying them in the comparison of three search engines. The design of the experiment is detailed below.

Quality of result ranking

The quality of result ranking was measured by the correlation between the search engine ranking and the human ranking. The Spearman correlation coefficient was calculated for each query and for each search engine. The results are summarized in Table 1. A higher correlation coefficient indicates better ranking quality by the search engine. Coefficients that are statistically significant at the 0.05 level are marked with a “*”. Each search engine's performance over the four queries was
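A minimal sketch of the Spearman calculation described above, assuming hypothetical rank data and using scipy.stats.spearmanr in place of whatever statistical package was actually used: the rank order a search engine assigned to one query's result set is correlated with the benchmark order derived from the human subjects' rankings.

    from scipy.stats import spearmanr

    # Hypothetical ranks for one query: the position each page received from the
    # search engine versus the benchmark position derived from the human rankings.
    engine_rank = [1, 2, 3, 4, 5, 6, 7, 8]
    human_rank = [2, 1, 4, 3, 6, 5, 8, 7]

    rho, p_value = spearmanr(engine_rank, human_rank)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

A coefficient closer to 1 means the engine's ordering agrees more closely with the human benchmark; as in Table 1, coefficients significant at the 0.05 level would be flagged with a “*”.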

Discussion and conclusions

Two measurements are proposed as counterparts of traditional recall and precision: the quality of result ranking and the ability to retrieve top-ranked pages. The main difference between these measurements and those used in earlier studies is that the new measures are based on a continuous ranking of test documents (from the most relevant to the least relevant) rather than on the discrete relevance judgements (e.g. relevant, partially relevant, irrelevant) used in previous studies. It is

Acknowledgements

I am very grateful to all the students who participated in the study by giving permission for me to use their ranking data. The study would have been impossible without their support. I also thank the two anonymous referees for their very helpful comments and suggestions.

References (35)

  • Courtois, M.P., et al. (1999). Results ranking in Web search engines. Online.
  • Ding, W., et al. A comparative study of Web search service performance.
  • Gwizdka, J., & Chignell, M. (1999). Towards information retrieval measures for evaluation of Web search engines....
  • Harter, S. (1996). Variation in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science.
  • Hawking, D., et al. (2001). Measuring search engine quality. Information Retrieval.
  • Maglaughlin, K.L., et al. (2002). User perspectives on relevance criteria: a comparison among relevant, partially relevant, and not-relevant judgments. Journal of the American Society for Information Science and Technology.
  • Mettrop, W., et al. (2001). Internet search engines – fluctuations in document accessibility. Journal of Documentation.