
Repeatable evaluation of search services in dynamic environments

Published: 01 November 2007

Abstract

In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as those used in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice, it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on their conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests to determine the query sample sizes required to ensure this generalization, finding that they are much larger than those required for static collections. To reduce this effort, we propose a semiautomatic evaluation framework and validate it against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting the manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chance of missing a correct pairwise conclusion and the chance of finding an errant conclusion by approximately 50%.
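To make the abstract's central statistical device concrete, the sketch below illustrates one way a bootstrap estimate of reproducibility probability can be computed: resample the query set with replacement, re-run a paired significance test on each resample, and report the fraction of resamples in which the test still rejects. This is a minimal illustration under stated assumptions, not the paper's implementation; the use of per-query precision scores, the choice of the Wilcoxon signed-rank test, and all names (bootstrap_reproducibility, scores_a, scores_b) are hypothetical.

```python
import numpy as np
from scipy import stats

def bootstrap_reproducibility(scores_a, scores_b, alpha=0.05,
                              n_boot=2000, seed=0):
    """Estimate the probability that a paired significance test comparing
    two engines would reject again on a fresh query sample of the same
    size, by resampling queries with replacement (a sketch, not the
    paper's code)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)  # per-query deltas
    n = len(diffs)
    rejections = 0
    for _ in range(n_boot):
        # Resample the query set with replacement to simulate repeating
        # the evaluation on a new sample from the same query population.
        sample = diffs[rng.integers(0, n, size=n)]
        if np.any(sample != 0):  # Wilcoxon is undefined on all-zero deltas
            _, p = stats.wilcoxon(sample)
            if p < alpha:
                rejections += 1
    return rejections / n_boot

# Hypothetical usage with synthetic precision@10 scores for two engines.
rng = np.random.default_rng(1)
a = rng.uniform(0.2, 0.8, 100)                      # engine A
b = np.clip(a - rng.normal(0.03, 0.10, 100), 0, 1)  # engine B, slightly worse
print(f"Estimated reproducibility: {bootstrap_reproducibility(a, b):.2f}")
```

Under this reading, the required query sample size is the smallest number of queries for which the estimated reproducibility exceeds a chosen threshold (e.g., 0.95); the paper's finding is that, for live Web engines, this number is much larger than the sample sizes that suffice for static collections.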




Published In

ACM Transactions on Information Systems, Volume 26, Issue 1
November 2007
164 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1292591

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Evaluation
2. Web search

Cited By
• (2024) Evaluation of Temporal Change in IR Test Collections. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 3-13. DOI: 10.1145/3664190.3672530
• (2024) Replicability Measures for Longitudinal Information Retrieval Evaluation. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 215-226. DOI: 10.1007/978-3-031-71736-9_16
• (2021) Towards the Evaluation of Information Retrieval Systems on Evolving Datasets with Pivot Systems. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 91-102. DOI: 10.1007/978-3-030-85251-1_8
• (2018) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 1265-1268. DOI: 10.1007/978-1-4614-8265-9_477
• (2017) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 1-3. DOI: 10.1007/978-1-4899-7993-3_477-3
• (2016) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 1-3. DOI: 10.1007/978-1-4899-7993-3_477-2
• (2015) A Family of Rank Similarity Measures Based on Maximized Effectiveness Difference. IEEE Transactions on Knowledge and Data Engineering 27(11), 2865-2877. DOI: 10.1109/TKDE.2015.2448541
• (2012) UWIRS-REC: integrating web information retrieval with recommendation services. International Journal of Web Information Systems 8(2), 181-211. DOI: 10.1108/17440081211241950
• (2010) Measuring the reusability of test collections. Proceedings of the third ACM international conference on Web search and data mining, 231-240. DOI: 10.1145/1718487.1718516
• (2009) Effectiveness Involving Multiple Queries. Encyclopedia of Database Systems, 961-963. DOI: 10.1007/978-0-387-39940-9_477
