
Where Did the Web Archive Go?

  • Conference paper
  • Linking Theory and Practice of Digital Libraries (TPDL 2021)

Abstract

To conduct a longitudinal investigation of web archives and detect variations and changes when replaying individual archived pages, or mementos, we created a sample of 16,627 mementos from 17 public web archives. Over the course of our 14-month study (November 2017–January 2019), we found that four web archives changed their base URIs without leaving a machine-readable method of locating the new base URIs, necessitating manual rediscovery. Of the 1,981 mementos in our sample from these four web archives, 537 were impacted: 517 mementos were rediscovered but with changes in their time of archiving (Memento-Datetime), HTTP status code, or the string comprising their original URI (URI-R), and 20 mementos could not be found at all.
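The comparison described above (checking a rediscovered memento against its earlier observation for changes in Memento-Datetime, HTTP status code, or URI-R) can be illustrated with a minimal sketch. It is not the tooling used in the study: the `MementoRecord` type, its field names, the `diff_memento` helper, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MementoRecord:
    """Hypothetical record of one replayed memento (field names are assumptions)."""
    uri_r: str             # original URI (URI-R)
    memento_datetime: str  # value of the Memento-Datetime response header
    status: int            # HTTP status code returned on replay


def diff_memento(old: MementoRecord, new: Optional[MementoRecord]) -> List[str]:
    """List the attributes that differ; ["missing"] if nothing was rediscovered."""
    if new is None:
        return ["missing"]
    changes = []
    if old.memento_datetime != new.memento_datetime:
        changes.append("Memento-Datetime")
    if old.status != new.status:
        changes.append("status")
    if old.uri_r != new.uri_r:
        changes.append("URI-R")
    return changes


# Illustrative values only; these are not mementos from the study's sample.
before = MementoRecord("http://example.com/", "Sat, 11 Nov 2017 00:00:00 GMT", 200)
after = MementoRecord("http://example.com", "Sat, 11 Nov 2017 00:00:05 GMT", 200)

print(diff_memento(before, after))  # ['Memento-Datetime', 'URI-R']
print(diff_memento(before, None))   # ['missing']
```

The sketch compares URI-R values as plain strings, matching the abstract's phrase "the string comprising their original URI"; a stricter analysis might normalize URIs before comparing them.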


Notes

  1. Although this is outside of our 14-month study, this effectively means that all 351 Library and Archives Canada (LAC) mementos are currently missing.


Author information

Corresponding author

Correspondence to Mohamed Aturban.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Aturban, M., Nelson, M.L., Weigle, M.C. (2021). Where Did the Web Archive Go? In: Berget, G., Hall, M.M., Brenn, D., Kumpulainen, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2021. Lecture Notes in Computer Science, vol 12866. Springer, Cham. https://doi.org/10.1007/978-3-030-86324-1_9

  • DOI: https://doi.org/10.1007/978-3-030-86324-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86323-4

  • Online ISBN: 978-3-030-86324-1

  • eBook Packages: Computer Science, Computer Science (R0)
