Detecting off-topic pages within TimeMaps in Web archives

AlNoamany, Yasmin; Weigle, Michele C.; Nelson, Michael L.

doi:10.1007/s00799-016-0183-5

Published: 18 July 2016

International Journal on Digital Libraries, volume 17, pages 203–221 (2016)

Abstract

Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy are preconditions for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of Web archive captures. In this paper, we address the problem of detecting when a particular page in a Web archive collection has gone off-topic relative to its first archived copy. We do not delete off-topic pages (they remain part of the collection), but we flag them as off-topic so they can be excluded from consideration by downstream services, such as collection summarization and thumbnail generation. We propose several methods (cosine similarity, Jaccard similarity, intersection of the 20 most frequent terms, a Web-based kernel function, and the change in size measured by word count and content length) to detect when a page has gone off-topic. The pages predicted to be off-topic will be presented to the collection’s curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 with change in size using word count at threshold −0.85 performs best, with accuracy = 0.987, \(F_{1}\) score = 0.906, and AUC = 0.968. We then evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting off-topic pages in these collections is 0.89.
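
The best-performing combination named in the abstract can be illustrated with a minimal sketch. Only the two thresholds (0.10 and −0.85) come from the abstract. Everything else is an assumption made for illustration: the tokenizer, the use of raw term counts rather than any particular term weighting, the definition of word-count change as a relative difference from the first archived copy, and the choice to flag a page when either measure falls below its threshold. The function names are hypothetical and do not reproduce the authors' implementation.

```python
import math
import re
from collections import Counter

COSINE_THRESHOLD = 0.10        # threshold reported in the abstract
WORD_COUNT_THRESHOLD = -0.85   # threshold reported in the abstract


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between raw term-count vectors (assumed weighting)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_count_change(first_tokens, later_tokens):
    """Relative change in word count versus the first copy (assumed definition)."""
    if not first_tokens:
        return 0.0
    return (len(later_tokens) - len(first_tokens)) / len(first_tokens)


def is_off_topic(first_text, later_text):
    """Flag a later copy if either measure drops below its threshold (assumed OR rule)."""
    first, later = tokenize(first_text), tokenize(later_text)
    return (
        cosine_similarity(first, later) < COSINE_THRESHOLD
        or word_count_change(first, later) < WORD_COUNT_THRESHOLD
    )


if __name__ == "__main__":
    first_copy = "election campaign news candidates debate election results voters polls"
    later_copy = "election results candidates voters turnout news"
    parked_copy = "domain for sale buy this domain now"
    print(is_off_topic(first_copy, later_copy))   # shares most terms with the first copy
    print(is_off_topic(first_copy, parked_copy))  # shares no terms with the first copy
```

In this sketch every later copy in a TimeMap would be compared against the first archived copy, consistent with the abstract's definition of off-topic as relative to that first copy. Flagged copies are only marked, not deleted.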

Acknowledgments

This work was supported in part by the AMF and the IMLS (LG-71-15-0077-15). We thank Kristine Hanna from the Internet Archive for help in obtaining the data set. We also thank the anonymous reviewers for their insights regarding future directions for this work.

Author information

Authors and Affiliations

Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, USA
Yasmin AlNoamany, Michele C. Weigle & Michael L. Nelson

Corresponding author

Correspondence to Yasmin AlNoamany.

About this article

Cite this article

AlNoamany, Y., Weigle, M.C. & Nelson, M.L. Detecting off-topic pages within TimeMaps in Web archives. Int J Digit Libr 17, 203–221 (2016). https://doi.org/10.1007/s00799-016-0183-5

Received: 10 January 2016
Revised: 24 June 2016
Accepted: 04 July 2016
Published: 18 July 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s00799-016-0183-5
