DOI: 10.1145/2700171.2791044
research-article

Only One Out of Five Archived Web Pages Existed as Presented

Published: 24 August 2015

Abstract

When a user retrieves a page from a web archive, the page is marked with the acquisition datetime of the root resource, which effectively asserts "this is how the page looked at that datetime." However, embedded resources, such as images, are often archived at different datetimes than the main page. The presentation appears temporally coherent, but is composed from resources acquired over a wide range of datetimes. We examine the completeness and temporal coherence of composite archived resources (composite mementos) under two selection heuristics. The completeness and temporal coherence achieved using a single archive were compared to the results achieved using multiple archives. We found that at most 38.7% of composite mementos are temporally coherent and that at most 17.9% (roughly 1 in 5) are both temporally coherent and 100% complete. Using multiple archives increases mean completeness by 3.1-4.1% but also reduces temporal coherence.
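
To make the two quantities in the abstract concrete, the following is a minimal sketch in Python. It is an illustration only, not the paper's evaluation framework: the `EmbeddedMemento` type, the function names, and the simplified definitions (completeness as the fraction of embedded resources with an archived copy, temporal spread as the gap between the earliest and latest acquisition datetimes) are assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class EmbeddedMemento:
    """Hypothetical record for one embedded resource of an archived page."""
    uri: str
    # Acquisition datetime of the archived copy; None means no copy was found.
    memento_datetime: Optional[datetime]


def completeness(embedded: List[EmbeddedMemento]) -> float:
    """Fraction of embedded resources for which an archived copy exists.

    Simplifying assumption: a page with no embedded resources counts as complete.
    """
    if not embedded:
        return 1.0
    found = sum(1 for e in embedded if e.memento_datetime is not None)
    return found / len(embedded)


def temporal_spread(root_datetime: datetime,
                    embedded: List[EmbeddedMemento]) -> timedelta:
    """Gap between the earliest and latest acquisition datetimes, root included.

    A large spread means the displayed page mixes resources captured far apart,
    even though it is labeled with the root's datetime alone.
    """
    stamps = [root_datetime]
    stamps += [e.memento_datetime for e in embedded
               if e.memento_datetime is not None]
    return max(stamps) - min(stamps)


# Made-up example data: one root page and three embedded resources.
root = datetime(2014, 7, 17, 12, 0)
embedded = [
    EmbeddedMemento("http://example.com/logo.png", datetime(2014, 7, 10, 8, 30)),
    EmbeddedMemento("http://example.com/style.css", datetime(2014, 8, 2, 12, 0)),
    EmbeddedMemento("http://example.com/photo.jpg", None),  # never archived
]

print(f"completeness: {completeness(embedded):.1%}")
print(f"temporal spread: {temporal_spread(root, embedded)}")
```

In this made-up example the page would be presented with the single datetime 17 July 2014, although one image is missing and the remaining resources were captured more than three weeks apart.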


Published In

HT '15: Proceedings of the 26th ACM Conference on Hypertext & Social Media
August 2015
360 pages
ISBN:9781450333955
DOI:10.1145/2700171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. digital preservation
  2. http
  3. memento
  4. resource versioning
  5. rfc 7089
  6. temporal coherence
  7. web architecture
  8. web archiving

Qualifiers

  • Research-article

Conference

HT '15: 26th ACM Conference on Hypertext and Social Media
September 1 - 4, 2015
Guzelyurt, Northern Cyprus

Acceptance Rates

HT '15 Paper Acceptance Rate 24 of 60 submissions, 40%;
Overall Acceptance Rate 378 of 1,158 submissions, 33%

Cited By

  • (2024) Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering. Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries (82-92). DOI: 10.1109/JCDL57899.2023.00022. Online publication date: 26-Jun-2024
  • (2023) Hashes are not suitable to verify fixity of the public archived web. PLOS ONE 18:6 (e0286879). DOI: 10.1371/journal.pone.0286879. Online publication date: 9-Jun-2023
  • (2023) To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages. ACM Transactions on the Web 17:4 (1-49). DOI: 10.1145/3589206. Online publication date: 11-Jul-2023
  • (2023) Sorting URLs out: seeing the web through infrastructural inversion of archival crawling. Internet Histories 7:4 (386-401). DOI: 10.1080/24701475.2023.2258697. Online publication date: 16-Sep-2023
  • (2023) Challenges in replaying archived Twitter pages. International Journal on Digital Libraries 25:2 (217-236). DOI: 10.1007/s00799-023-00379-w. Online publication date: 26-Aug-2023
  • (2022) Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests. From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (329-344). DOI: 10.1007/978-3-031-21756-2_26. Online publication date: 7-Dec-2022
  • (2022) A Chromium-Based Memento-Aware Web Browser. Linking Theory and Practice of Digital Libraries (147-160). DOI: 10.1007/978-3-031-16802-4_12. Online publication date: 15-Sep-2022
  • (2021) Ethical product havens in the global diamond trade: Using the Wayback Machine to evaluate ethical market outcomes. Environment and Planning A: Economy and Space 55:5 (1131-1149). DOI: 10.1177/0308518X211029661. Online publication date: 13-Jul-2021
  • (2021) Automatically Selecting Striking Images for Social Cards. Proceedings of the 13th ACM Web Science Conference 2021 (36-45). DOI: 10.1145/3447535.3462505. Online publication date: 21-Jun-2021
  • (2021) Replaying Archived Twitter: When your bird is broken, will it bring you down? 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (160-169). DOI: 10.1109/JCDL52503.2021.00028. Online publication date: Sep-2021
