New measurements for search engine evaluation proposed and tested
Introduction
The astonishing growth of the Web propelled the rapid development of Web search engines. However, the evaluation of these search engines has not kept pace with their development. The significance of evaluation is twofold: to help Web users choose among search engines and to inform the development of search algorithms and search engines.
Decades of research, from the classic Cranfield experiments to the ongoing TREC, have established a set of standard measurements for the evaluation of information retrieval systems. Among the most commonly used criteria are recall and precision. Precision is the proportion of retrieved documents that are relevant, while recall is the proportion of relevant documents that are retrieved (Voorhees & Harman, 2001, p. 5). However, it is very difficult, if not impossible, to apply these measurements directly to the evaluation of Web information retrieval systems due to the unique nature of the Web. Variations of these measurements have been proposed and used in earlier studies, as discussed in the "related studies" section below. Most of these modified measures still rely on binary relevance decisions (relevant vs. non-relevant) or multi-level discrete relevance judgements (e.g. relevant, partially relevant, non-relevant). This study proposes a set of measurements based on a continuous ranking (from the most relevant to the least relevant) of the set of test documents (or Web pages). A previous study showed that human subjects are able to make relevance judgements on a continuous scale (Greisdorf & Spink, 2001). The main justification for using a continuous ranking rather than discrete relevance judgements is that retrieval results from Web search engines are typically ranked. Measurements based on ranking will therefore provide a better "match" with the system being evaluated.
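To make the contrast concrete, the traditional binary-relevance definitions of precision and recall quoted above can be computed as simple set operations. This is an illustrative sketch only; the document identifiers are hypothetical.

```python
# Illustrative computation of the classic binary-relevance measures:
# precision = |retrieved AND relevant| / |retrieved|
# recall    = |retrieved AND relevant| / |relevant|

def precision(retrieved, relevant):
    """Proportion of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Proportion of relevant documents that are retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4"]  # documents the engine returned
relevant = ["d2", "d4", "d5"]         # documents judged relevant

print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant are retrieved
```

Note that both measures discard the order of the retrieved list entirely, which is exactly the information the ranking-based measures proposed in this study are designed to use.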
Studies on the evaluation of Web search engines have proposed various measures ranging from coverage to response time (e.g. Chu & Rosenthal, 1996; Gwizdka & Chignell, 1999). However, none has recommended performance stability as an evaluation criterion. While traditional information retrieval systems such as DIALOG provide very stable search results for a given query executed multiple times, Web search results can be very unstable due to the unique environment in which Web search engines operate. For example, Web search engines can truncate results to improve response times during peak periods of activity. Multiple databases or multiple indexes, which are not always identical, may be used by the same search engine to respond to user queries (Mettrop & Nieuwenhuysen, 2001, pp. 641–642). It is therefore very important to include performance stability in any Web search engine evaluation. If a search engine is not stable, then the results obtained from it for evaluation purposes may be a fluke and may not represent the engine's general performance. Another reason for testing search engine stability is to provide information to researchers who use Web search engines to collect data, e.g. in Webometrics research. These researchers need to know the stability of a search engine to gauge the reliability of the data collected. This study therefore proposes a set of measurements to evaluate performance stability.
An experiment was conducted to test the proposed measurements by applying them to the comparison of three Web search engines. Four sets of Web pages corresponding to four queries were retrieved from the search engines, and the ranking of these pages by each engine was recorded. Twenty-four human subjects ranked these four sets of Web pages, and their rankings were used as the benchmark against which the ranking results of different search engines were compared. A search engine that generated a ranking closer to the human ranking was considered better. To assess the stability of the search results, the queries were performed on each search engine once a week over a 10-week period.
Section snippets
Related studies
Many publications compare or evaluate Web search engines (e.g. Notess, 2000). Perhaps the best known of these are Search Engine Watch (http://www.searchenginewatch.com) and Search Engine Showdown (http://www.searchengineshowdown.com). However, many of these publications did not employ formal evaluation procedures with rigorous methodology. Some papers that describe advances in search algorithms gave anecdotal evidence instead of formal evaluations (e.g. Brin & Page, 1998). Only studies that
Proposed measurements
Two measurements are proposed as counterparts of traditional recall and precision. In contrast to the calculations of recall and precision, which are based on binary relevance judgements, the proposed measurements are calculated based on a continuous relevance ranking (from most relevant to least relevant) by human subjects. In addition, a set of three measurements is proposed to evaluate the stability of search engine performance.
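The paper's exact three stability measures are not reproduced in this excerpt, so the following is only a plausible sketch of how stability across repeated runs might be quantified: the average pairwise Jaccard overlap of the URL sets returned for the same query in successive weeks. The function names and weekly result sets are hypothetical.

```python
# Hedged sketch, NOT the paper's actual stability measures: quantify stability
# as the mean pairwise Jaccard overlap of the result sets returned for the
# same query on successive weekly runs (1.0 = perfectly stable results).
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two result sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def stability(weekly_results):
    """Mean pairwise Jaccard overlap across all weekly runs of one query."""
    pairs = list(combinations(weekly_results, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical result sets for one query over three weeks
week1 = ["u1", "u2", "u3", "u4"]
week2 = ["u1", "u2", "u3", "u5"]
week3 = ["u1", "u2", "u4", "u5"]
print(stability([week1, week2, week3]))  # each pair overlaps 3/5 -> 0.6
```

A set-overlap measure like this captures churn in which pages are returned; the ranking-based measures described above would additionally capture changes in the order of the pages that persist.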
Experiment to test the proposed measurements
An experiment was carried out to test the proposed measurements by applying them in the comparison of three search engines. The design of the experiment is detailed below.
Quality of result ranking
The quality of result ranking was measured by the correlation between search engine ranking and human ranking. The Spearman correlation coefficient was calculated for each query and for each search engine. The results are summarized in Table 1. A higher correlation coefficient indicates better quality of ranking by the search engine. Coefficients that are statistically significant at 0.05 level are indicated by a “*” sign beside them. Each search engine's performance over the four queries was
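The evaluation step described above, correlating an engine's result order with the benchmark human ranking, can be sketched with a small stdlib-only implementation of Spearman's coefficient. The rankings below are hypothetical, and the shortcut formula assumes no tied ranks.

```python
# Sketch of the evaluation step: Spearman's rank correlation between a search
# engine's result order and the benchmark human ranking of the same pages.
# Assumes no tied ranks, so the shortcut formula applies:
#   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),  d_i = difference in ranks.

def spearman(rank_a, rank_b):
    """Spearman rank correlation for two untied rankings of the same items."""
    n = len(rank_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

engine_rank = [1, 2, 3, 4, 5]  # positions assigned by the engine
human_rank = [2, 1, 3, 5, 4]   # benchmark positions from human subjects

print(spearman(engine_rank, human_rank))  # sum(d^2) = 4, n = 5 -> 0.8
```

In practice a library routine such as scipy.stats.spearmanr would also handle ties and report the significance level used in Table 1; the hand-rolled version above is only meant to show what the coefficient measures.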
Discussion and conclusions
Two measurements are proposed as counterparts of traditional recall and precision: the quality of result ranking and the ability to retrieve top-ranked pages. The main difference between these measurements and those used in earlier studies is that the new measures are based on a continuous ranking of test documents (from the most relevant to the least relevant) rather than the discrete relevance judgements (e.g. relevant, partially relevant, irrelevant) used in previous studies. It is
Acknowledgements
I am very grateful to all the students who participated in the study by giving permission for me to use their ranking data. The study would have been impossible without their support. I also thank the two anonymous referees for their very helpful comments and suggestions.
References (35)
Some thoughts on the reported results of TREC. Information Processing & Management (2002)
The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems (1998)
Finding information on the World Wide Web: The retrieval effectiveness of search engines. Information Processing & Management (1999)
Median measure: An approach to IR systems evaluation. Information Processing & Management (2001)
Results and challenges in Web search evaluation. Computer Networks (1999)
Letters to the editor. Information Processing & Management (2003)
Search engine results over time: A case study on search engine stability. Cybermetrics (1998/99)
Evaluating the stability of the search tools HotBot and Snap: A case study. Online Information Review (2000)
Methods for assessing search engine performance over time. Journal of the American Society for Information Science and Technology (2002)
Search engines for the World Wide Web: A comparative study and evaluation methodology
Results ranking in Web search engines. Online
A comparative study of Web search service performance
Variation in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science
Measuring search engine quality. Information Retrieval
User perspectives on relevance criteria: A comparison among relevant, partially relevant, and not-relevant judgments. Journal of the American Society for Information Science and Technology
Internet search engines: Fluctuations in document accessibility. Journal of Documentation