
Evaluating Web Archive Search Systems

  • Conference paper
Web Information Systems Engineering - WISE 2012 (WISE 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7651)


Abstract

The information published on the web, a representation of our collective memory, is rapidly vanishing. At least 77 web archives have been developed to cope with the web’s transience problem, but although their technology has reached a good level of maturity, the search services they provide still deliver unsatisfactory retrieval effectiveness. In this work, we propose an evaluation methodology for web archive search systems based on a list of requirements compiled from previous characterizations of web archives and their users. The methodology includes the design of a test collection and the selection of evaluation measures to support realistic and reproducible experiments. The test collection made it possible, for the first time, to measure the effectiveness of the state-of-the-art IR technology employed in web archives. The results confirm the poor quality of the search results retrieved with such technology. However, we show how to combine temporal features with the regular topical features to improve search effectiveness on web archives. The test collection is available to the research community.




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Costa, M., Silva, M.J. (2012). Evaluating Web Archive Search Systems. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds) Web Information Systems Engineering - WISE 2012. WISE 2012. Lecture Notes in Computer Science, vol 7651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35063-4_32


  • DOI: https://doi.org/10.1007/978-3-642-35063-4_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35062-7

  • Online ISBN: 978-3-642-35063-4

  • eBook Packages: Computer Science, Computer Science (R0)
