Thumbnail Summarization Techniques for Web Archives

  • Conference paper
Advances in Information Retrieval (ECIR 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Abstract

Thumbnails of archived web pages, rendered as they appear in common browsers such as Firefox or Chrome, can convey the nature of a web page and how it has changed over time. However, creating thumbnails for every archived page is not feasible for large collections, in terms of both the time to create them and the space to store them. Furthermore, at least for initial exploration and collection understanding, people will likely need only a few dozen thumbnails, not thousands. In this paper, we develop algorithms, based on information retrieval techniques, that optimize thumbnail creation for web archives. We study features of the HTML text that correlate with changes in the rendered thumbnails, so that we can know in advance which archived pages to use for thumbnails. We find that SimHash correlates with changes in the thumbnails (ρ = 0.59, p < 0.005). We propose different algorithms for thumbnail creation suited to different applications, reducing the number of thumbnails to be generated to 9%–27% of the total.
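
To make the SimHash idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes word-level tokens, MD5 as the per-token hash, a 64-bit fingerprint, and a purely hypothetical selection rule (`select_for_thumbnails`, with an arbitrary Hamming-distance threshold) that keeps a page only when its fingerprint differs enough from the last kept page. The abstract does not specify any of these choices.

```python
import hashlib
import re
from typing import List


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (assumes bits <= 128, a multiple of 8)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token and keep the first `bits` bits of the MD5 digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Vote +1 if bit i is set in the token hash, otherwise -1.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(pages: List[str], threshold: int = 8) -> List[int]:
    """Hypothetical rule: keep a page when its fingerprint is more than
    `threshold` bits away from the most recently kept page."""
    selected: List[int] = []
    last = None
    for index, html in enumerate(pages):
        fp = simhash(html)
        if last is None or hamming_distance(fp, last) > threshold:
            selected.append(index)
            last = fp
    return selected
```

Under these assumptions, a run of identical captures yields a single selected index, so only one thumbnail would be rendered for that run. The threshold would need tuning against real thumbnail differences.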

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

AlSum, A., Nelson, M.L. (2014). Thumbnail Summarization Techniques for Web Archives. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_25

  • DOI: https://doi.org/10.1007/978-3-319-06028-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06027-9

  • Online ISBN: 978-3-319-06028-6

  • eBook Packages: Computer Science, Computer Science (R0)
