Skip to main content
Log in

Detecting off-topic pages within TimeMaps in Web archives

  • Published:
International Journal on Digital Libraries Aims and scope Submit manuscript

Abstract

Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy is a precondition for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of the Web archive captures. In this paper, we address the problems of detecting when a particular page in a Web archive collection has gone off-topic relative to its first archived copy. We do not delete off-topic pages (they remain part of the collection), but they are flagged as off-topic so they can be excluded for consideration for downstream services, such as collection summarization and thumbnail generation. We propose different methods (cosine similarity, Jaccard similarity, intersection of the 20 most frequent terms, Web-based kernel function, and the change in size using the number of words and content length) to detect when a page has gone off-topic. Those predicted off-topic pages will be presented to the collection’s curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 and change in size using word count at threshold −0.85 performs the best with accuracy = 0.987, \(F_{1}\) score = 0.906, and AUC \(=\) 0.968. We evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting off-topic pages in the collections is 0.89.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. https://archive-it.org/.

  2. https://archive-it.org/collections/192.

  3. https://archive-it.org/collections/2950.

  4. https://archive-it.org/collections/3010.

  5. https://archive-it.org/collections/1068.

  6. http://www.niod.nl/en.

  7. http://archive.org/web/researcher/cdx_file_format.php.

  8. https://archive-it.org/collections/4887/.

  9. https://archive-it.org/collections/5810/.

  10. https://archive-it.org/collections/3497/.

  11. https://archive-it.org/collections/5947/.

References

  1. AlNoamany, Y.: Using Web Archives to Enrich the Live Web Experience Through Storytelling. Dissertation, Old Dominion University (2016)

  2. AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Characteristics of Social Media Stories. In: Proceedings of the 19th International Conference on Theory and Practice of Digital Libraries, TPDL ’15, pp. 267–279 (2015). doi:10.1007/978-3-319-24592-8_20

  3. AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Detecting Off-Topic Pages in Web Archives. In: Proceedings of the 19th International Conference on Theory and Practice of Digital Libraries, TPDL ’15, pp. 225–237. Springer International Publishing (2015). doi:10.1007/978-3-319-24592-8_17

  4. AlSum, A., Nelson, M.L.: ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’13, pp. 377–378. ACM Press (2013). doi:10.1145/2467696.2467751

  5. AlSum, A., Nelson, M.L.: ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph. Tech. Rep. (2013). arXiv:1305.5959

  6. AlSum, A., Nelson, M.L.: Thumbnail Summarization Techniques for Web Archives. In: Proceedings of the 36th European Conference on Information Retrieval, ECIR 2014, pp. 299–310 (2014). doi:10.1007/978-3-319-06028-6_25

  7. Arms, W.Y., Aya, S., Dmitriev, P., Kot, B.J., Mitchell, R., Walle, L.: Building a Research Library for the History of the Web. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’06, pp. 95–102 (2006). doi:10.1145/1141753.1141771

  8. Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay. In: WWW ’04: Proceedings of the 13th international conference on World Wide Web, pp. 328–337. ACM Press (2004). doi:10.1145/988672.988716

  9. Bergmark, D., Lagoze, C., Sbityakov, A.: Focused crawls, tunneling, and digital libraries. In: Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, ECDL ’02, pp. 91–106. Springer-Verlag (2002)

  10. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)

    Article  MathSciNet  Google Scholar 

  11. Brewington, B., Cybenko, G.: Keeping up with the changing web. Computer 33(5), 52–58 (2000). doi:10.1109/2.841784

    Article  Google Scholar 

  12. Buckley, C., Salton, G., Allan, J., Singhal, A.: Automatic Query Expansion Using SMART: TREC 3. Overview of the Third Text REtrieval Conference (TREC-3) pp. 69–80 (1995)

  13. Capra, R.G., Lee, C.A., Marchionini, G., Russell, T., Shah, C., Stutzman, F.: Selection and context scoping for digital video collections: an investigation of youtube and blogs. In: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’08, pp. 211–220. ACM (2008). doi:10.1145/1378889.1378925

  14. Chakrabarti, S., Van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Comput. Netw. 31(11), 1623–1640 (1999). doi:10.1016/S1389-1286(99)00052-3

    Article  Google Scholar 

  15. Cho, J., Garcia-Molina, H.: Estimating frequency of change. ACM Trans. Internet Technol. 3(3), 256–290 (2003). doi:10.1145/857166.857170

    Article  Google Scholar 

  16. Cho, J., Garcia-Molina, H., Page, L.: Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30(1–7), 161–172 (1998). doi:10.1016/S0169-7552(98)00108-1

    Article  Google Scholar 

  17. Farag, M.M.G., Fox, E.A.: Intelligent Event Focused Crawling. In: Proceedings of the 11th International ISCRAM Conference, pp. 18–21 (2014)

  18. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006). doi:10.1016/j.patrec.2005.10.010

    Article  MathSciNet  Google Scholar 

  19. Foot, K., Schneider, S.: Web Campaigning (Acting with Technology). The MIT Press, Cambridge (2006)

    Google Scholar 

  20. ISO 28500:2009—Information and documentation–WARC file format. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717 (2009)

  21. Jatowt, A., Kawai, Y., Tanaka, K.: Detecting Age of Page Content. In: Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management, WIDM ’07, pp. 137–144 (2007)

  22. Jatowt, A., Kawai, Y., Tanaka, K.: Page history explorer: visualizing and comparing page histories. IEICE Trans. Inf. Syst. 94(3), 564–577 (2011)

    Article  Google Scholar 

  23. Jatowt, A., Tanaka, K.: Towards mining past content of Web pages. New Rev. Hypermed. Multimed. 13(1), 77–86 (2007). doi:10.1080/13614560701478897

    Article  Google Scholar 

  24. Kahle, B.: Preserving the internet. Sci. Am. 276(3), 82–83 (1997)

    Article  Google Scholar 

  25. Kahle, B.: Wayback Machine Hits 400,000,000,000! http://blog.archive.org/2014/05/09/wayback-machine-hits-400000000000 (2014)

  26. Klein, M., Nelson, M.L.: Find, new, copy, web, page-tagging for the (re-)discovery of web pages. In: Proceedings of the 15th International Conference on Theory and Practice of Digital Libraries, TPDL’11, vol. 6966, pp. 27–39. Springer, Berlin Heidelberg (2011). doi:10.1007/978-3-642-24469-8_5

  27. Klein, M., Shipman, J., Nelson, M.L.: Is this a good title? In: Proceedings of the 21st ACM conference on Hypertext and Hypermedia, HT ’10, pp. 3–12. ACM (2010). doi:10.1145/1810617.1810621

  28. Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K., Tobin, R.: Scholarly context not found: one in five articles suffers from reference rot. PloS One 9(12), e115,253 (2014). doi:10.1371/journal.pone.0115253

  29. Klein, M., Ware, J., Nelson, M.L.: Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures. In: Proceedings of the 11th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’11, pp. 137–140. ACM Press (2011). doi:10.1145/1998076.1998101

  30. Koehler, W.: Web page change and persistence—a four-year longitudinal study. J. Am. Soc. Inf. Sci. Technol. 53(2), 162–171 (2002)

    Article  Google Scholar 

  31. Koehler, W.: A longitudinal study of web pages continued: a consideration of document persistence. Inf. Res. 9(2), 2–9 (2004)

    Google Scholar 

  32. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate Detection Using Shallow Text Features. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, pp. 441–450. ACM (2010). doi:10.1145/1718487.1718542

  33. Kosala, R., Blockeel, H.: Web mining research: a survey. SIGKDD Explor. Newslett. 2(1), 1–15 (2000). doi:10.1145/360402.360406

    Article  Google Scholar 

  34. Lawrence, S., Pennock, D.M., Flake, G.W., Krovetz, R., Coetzee, F.M., Glover, E., Nielsen, F.A., Kruger, A., Giles, C.L.: Persistence of web references in scientific research. Computer 34(2), 26–31 (2001). doi:10.1109/2.901164

    Article  Google Scholar 

  35. Manning, C.D., Raghavan, P., Schütze, H., Schutze, H.: Introduction to information retrieval. Cambridge University Press (2008). doi:10.1017/CBO9780511809071

  36. Marchionini, G., Shah, C., Lee, C.A., Capra, R.: Query parameters for harvesting digital video and associated contextual information. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’09, pp. 77–86. ACM (2009). doi:10.1145/1555400.1555414

  37. Marshall, C., McCown, F., Nelson, M.: Evaluating Personal Archiving Strategies for Internet-based Information. In: Proceedings of Archiving 2007, vol. 2007, pp. 151–156 (2007)

  38. Masanès, J.: Web Archiving. Springer, Cham (2006)

    Book  Google Scholar 

  39. Mohr, G., Stack, M., Ranitovic, I., Avery, D., Kimpton, M.: An Introduction to Heritrix An open source archival quality web crawler. In: Proceedings of the 4th International Web Archiving Workshop, IWAW ’04, pp. 43–49. http://iwaw.europarchive.org/04/Mohr.pdf (2004)

  40. Negulescu, K.C.: Web Archiving @ the Internet Archive. Presentation at the 2010 Digital Preservation Partners Meeting. http://www.digitalpreservation.gov/meetings/documents/ndiipp10/NDIIPP072110FinalIA.ppt (2010)

  41. Nelson, M.L.: A Plan For Curating “Obsolete Data or Resources”. Tech. Rep. (2012). arXiv:1209.2664

  42. Odijk, D., Grbacea, C., Schoegje, T., Hollink, L., de Boer, V., Ribbens, K., van Ossenbruggen, J.: Supporting exploration of historical perspectives across collections. In: Proceedings of the 19th International Conference on Theory and Practice of Digital Libraries. Lecture Notes in Computer Science, vol. 9316, pp. 238–251. Springer-Verlag (2015). doi:10.1007/978-3-319-24592-8_18

  43. Olston, C., Pandey, S.: Recrawl scheduling based on information longevity. In: Proceeding of the 17th International World Wide Web Conference, WWW ’08, p. 437. ACM Press (2008). doi:10.1145/1367497.1367557

  44. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  45. Reilly, B., Palaima, C., Norsworthy, K., Myrick, L., Tuchel, G., Simon, J.: Political Communications Web Archiving: Addressing Typology and Timing for Selection, Preservation and Access. In: Proceedings of the 3rd Workshop on Web Archives (2003)

  46. Saad, M., Gançarski, S.: Archiving the Web using Page Changes Patterns: A Case Study. In: Proceedings of the 11th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’11, pp. 113–122 (2012). doi:10.1145/1998076.1998098

  47. Sahami, M., Heilman, T.D.: A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 377–386. ACM (2006). doi:10.1145/1135777.1135834

  48. SalahEldeen, H.M., Nelson, M.L.: Carbon Dating The Web: Estimating the Age of Web Resources. In: Proceedings of 3rd Temporal Web Analytics Workshop, TempWeb ’13, pp. 1075–1082 (2013)

  49. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975). doi:10.1145/361219.361220

    Article  MATH  Google Scholar 

  50. Schneider, S.M., Foot, K., Kimpton, M., Jones, G.: Building Thematic Web Collections: Challenges and Experiences from the September 11 Web Archive and the Election 2002 Web Archive. In: Proceedings of the 3rd Workshop on Web Archives (2003)

  51. Singhal, A.: Modern information retrieval: a brief overview. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 24(4), 35–42 (2001)

    Google Scholar 

  52. Spaniol, M., Weikum, G.: Tracking Entities in Web Archives: The LAWA Project. In: Proceedings of the 21st International Conference Companion on World Wide Web, WWW ’12 Companion, pp. 287–290. ACM (2012). doi:10.1145/2187980.2188030

  53. Teevan, J., Dumais, S.T., Liebling, D.J.: A longitudinal study of how highlighting web content change affects people’s web interactions. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pp. 1353–1356. ACM (2010). doi:10.1145/1753326.1753530

  54. Teevan, J., Dumais, S.T., Liebling, D.J., Hughes, R.L.: Changing how people view changes on the web. In: Proceedings of the 22Nd Annual ACM Symposium on User Interface Software and Technology, UIST ’09, pp. 237–246. ACM (2009). doi:10.1145/1622176.1622221

  55. Van de Sompel, H., Nelson, M.L., Sanderson, R.: RFC 7089—HTTP framework for time-based access to resource states—Memento. http://tools.ietf.org/html/rfc7089 (2013)

  56. Yin, Z., Shokouhi, M., Craswell, N.: Query expansion using external evidence. In: Advances in Information Retrieval, pp. 362–374. Springer (2009)

Download references

Acknowledgments

This work was supported in part by the AMF and the IMLS LG-71-15-0077-15. We thank Kristine Hanna from the Internet Archive for help in obtaining the data set. We also thank the anonymous reviewers for their insights regarding future directions to this work.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yasmin AlNoamany.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

AlNoamany, Y., Weigle, M.C. & Nelson, M.L. Detecting off-topic pages within TimeMaps in Web archives. Int J Digit Libr 17, 203–221 (2016). https://doi.org/10.1007/s00799-016-0183-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00799-016-0183-5

Keywords

Navigation