
Repeatable evaluation of search services in dynamic environments

Published: 01 November 2007

Abstract

In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as those used in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice, it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on their conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests to determine the query sample sizes required to ensure this generalization, finding that they are much larger than those required for static collections. To reduce this effort, we propose a semiautomatic evaluation framework and validate it against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting the manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chance of missing a correct pairwise conclusion and the chance of finding an errant conclusion by approximately 50%.
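To make the abstract's central statistical device concrete, the sketch below illustrates one way a bootstrap estimate of reproducibility probability can be computed: resample the query set with replacement, re-run a paired significance test on each resample, and report the fraction of resamples in which the test still rejects. This is a minimal illustration under stated assumptions, not the paper's implementation; the use of per-query precision scores, the choice of the Wilcoxon signed-rank test, and all names (bootstrap_reproducibility, scores_a, scores_b) are hypothetical.

```python
import numpy as np
from scipy import stats

def bootstrap_reproducibility(scores_a, scores_b, alpha=0.05,
                              n_boot=2000, seed=0):
    """Estimate the probability that a paired significance test comparing
    two engines would reject again on a fresh query sample of the same
    size, by resampling queries with replacement (a sketch, not the
    paper's code)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)  # per-query deltas
    n = len(diffs)
    rejections = 0
    for _ in range(n_boot):
        # Resample the query set with replacement to simulate repeating
        # the evaluation on a new sample from the same query population.
        sample = diffs[rng.integers(0, n, size=n)]
        if np.any(sample != 0):  # Wilcoxon is undefined on all-zero deltas
            _, p = stats.wilcoxon(sample)
            if p < alpha:
                rejections += 1
    return rejections / n_boot

# Hypothetical usage with synthetic precision@10 scores for two engines.
rng = np.random.default_rng(1)
a = rng.uniform(0.2, 0.8, 100)                      # engine A
b = np.clip(a - rng.normal(0.03, 0.10, 100), 0, 1)  # engine B, slightly worse
print(f"Estimated reproducibility: {bootstrap_reproducibility(a, b):.2f}")
```

Under this reading, the required query sample size is the smallest number of queries for which the estimated reproducibility exceeds a chosen threshold (e.g., 0.95); the paper's finding is that, for live Web engines, this number is much larger than the sample sizes that suffice for static collections.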




Published In

ACM Transactions on Information Systems, Volume 26, Issue 1
November 2007
164 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1292591

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Evaluation
2. Web search

Cited By
• (2024) Evaluation of Temporal Change in IR Test Collections. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 3-13. DOI: 10.1145/3664190.3672530
• (2024) Replicability Measures for Longitudinal Information Retrieval Evaluation. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 215-226. DOI: 10.1007/978-3-031-71736-9_16
• (2021) Towards the Evaluation of Information Retrieval Systems on Evolving Datasets with Pivot Systems. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 91-102. DOI: 10.1007/978-3-030-85251-1_8
• (2018) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 1265-1268. DOI: 10.1007/978-1-4614-8265-9_477
• (2017) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 1-3. DOI: 10.1007/978-1-4899-7993-3_477-3
• (2016) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 1-3. DOI: 10.1007/978-1-4899-7993-3_477-2
• (2015) A Family of Rank Similarity Measures Based on Maximized Effectiveness Difference. IEEE Transactions on Knowledge and Data Engineering 27(11), 2865-2877. DOI: 10.1109/TKDE.2015.2448541
• (2012) UWIRS-REC: integrating web information retrieval with recommendation services. International Journal of Web Information Systems 8(2), 181-211. DOI: 10.1108/17440081211241950
• (2010) Measuring the reusability of test collections. Proceedings of the third ACM international conference on Web search and data mining, 231-240. DOI: 10.1145/1718487.1718516
• (2009) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 961-963. DOI: 10.1007/978-0-387-39940-9_477
