A Corpus of Realistic Known-Item Topics with Associated Web Pages in the ClueWeb09

  • Conference paper
Advances in Information Retrieval (ECIR 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9022)

Abstract

Known-item finding is the task of finding a previously seen item. Such items range from visited websites and received emails to books read or movies seen. Most research on known-item finding focuses on web or email retrieval and relies on proprietary corpora that are not publicly available. Public corpora tend to be rather artificial: they contain automatically generated known-item queries or queries formulated by humans who are actually looking at the known item.

In this paper, we study original known-item information needs mined from questions on the popular Yahoo! Answers Q&A service. By carefully sampling only questions with a related known-item web page in the ClueWeb09 corpus, we provide an environment for repeatable, realistic studies of known-item information needs and of how a retrieval system could react to them. In particular, our own study sheds first light on false memories within the known-item questions that users articulate. Our main finding is that false memories often involve mixed-up names. This suggests that a search engine that retrieves no results for a known-item query could avoid returning a zero-result list by ignoring or replacing the names in the query.

Our publicly available corpus of 2,755 known-item questions mapped to web pages in the ClueWeb09 includes 240 questions with annotated and corrected false memories.
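
The abstract's suggestion of ignoring names when a known-item query retrieves nothing can be illustrated with a minimal sketch. This is not the authors' system: the toy inverted index, the function names `search` and `search_with_name_fallback`, and the assumption that the caller already knows which query terms are names are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from typing import Iterable


def search(index: dict[str, set[str]], terms: Iterable[str]) -> set[str]:
    """Conjunctive (AND) retrieval over a toy inverted index (hypothetical)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    # A document matches only if it contains every query term.
    return set.intersection(*postings)


def search_with_name_fallback(
    index: dict[str, set[str]],
    terms: Iterable[str],
    name_terms: Iterable[str],
) -> tuple[set[str], list[str]]:
    """Retry a zero-result query with the name terms ignored.

    `name_terms` is assumed to be supplied by the caller, for example
    from a name tagger; detecting names is outside this sketch.
    Returns the matching document ids and the terms actually used.
    """
    terms = list(terms)
    results = search(index, terms)
    if results:
        return results, terms

    names = {name.lower() for name in name_terms}
    relaxed = [term for term in terms if term.lower() not in names]
    if not relaxed or len(relaxed) == len(terms):
        # Either only names were given or there is nothing to drop.
        return results, terms
    return search(index, relaxed), relaxed


# Toy data: the user misremembers the artist, so the full query fails.
toy_index = {
    "song": {"d1", "d2"},
    "rain": {"d1"},
    "prince": {"d1"},
    "madonna": {"d2"},
}
hits, used_terms = search_with_name_fallback(
    toy_index, ["song", "rain", "madonna"], name_terms=["madonna"]
)
```

Dropping the name is the simpler of the two options the abstract mentions; replacing a name with a likely alternative would additionally need a source of candidate names, which this sketch does not attempt.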

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hagen, M., Wägner, D., Stein, B. (2015). A Corpus of Realistic Known-Item Topics with Associated Web Pages in the ClueWeb09. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_57

  • DOI: https://doi.org/10.1007/978-3-319-16354-3_57

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16353-6

  • Online ISBN: 978-3-319-16354-3

  • eBook Packages: Computer Science, Computer Science (R0)
