Visual Web Archive Quality Assessment

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2022)

Abstract

The large size of today’s web archives makes it impossible to manually assess the quality of each archived web page, i.e., to check whether a page can be reproduced faithfully from an archive. For automated web archive quality assessment, previous work proposed to measure the pixel difference between a screenshot of the original page and a screenshot of the same page when reproduced from the archive. However, when categorizing types of reproduction errors (we introduce a respective taxonomy in this paper) one finds that some errors cause high pixel differences between the screenshots, but lead to only a negligible degradation in the user experience of the reproduced web page. Therefore, we propose to visually align page segments in such cases before measuring the pixel differences. Since the diversity of reproduction error types precludes a one-size-fits-all solution for visual alignment, we focus on one common type (translated segments) and investigate the usefulness of video compression algorithms for this task.
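The idea of aligning a translated segment before measuring pixel differences can be illustrated with a minimal sketch. This is not the method from the paper, which investigates video compression algorithms for the alignment; the sketch swaps in a plain exhaustive offset search, loosely comparable to block matching in video codecs. All function names are hypothetical, images are assumed to be equally sized grayscale grids, and the default search window (5 pixels vertically, 8 horizontally) is borrowed from the small-shift threshold given in the notes purely as an assumption.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel intensities


def pixel_difference(a: Image, b: Image) -> float:
    """Fraction of pixel positions at which two equally sized images differ."""
    height, width = len(a), len(a[0])
    differing = sum(
        1 for y in range(height) for x in range(width) if a[y][x] != b[y][x]
    )
    return differing / (height * width)


def shift(image: Image, dy: int, dx: int, fill: int = 0) -> Image:
    """Translate an image by (dy, dx); uncovered pixels are set to `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = image[y][x]
    return out


def best_translation(
    original: Image, reproduced: Image, max_dy: int = 5, max_dx: int = 8
) -> Tuple[int, int, float]:
    """Search all offsets in the window for the shift of `reproduced`
    that minimizes the pixel difference to `original` (hypothetical helper)."""
    best = (0, 0, pixel_difference(original, reproduced))
    for dy in range(-max_dy, max_dy + 1):
        for dx in range(-max_dx, max_dx + 1):
            diff = pixel_difference(original, shift(reproduced, dy, dx))
            if diff < best[2]:
                best = (dy, dx, diff)
    return best


# Toy example: a 2x2 bright block that the "archive" renders 2 px lower
# and 1 px further right than the original page.
original = [[0] * 6 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2):
        original[y][x] = 255
reproduced = shift(original, 2, 1)

raw_difference = pixel_difference(original, reproduced)
dy, dx, aligned_difference = best_translation(original, reproduced)
```

In this toy case the unaligned screenshots differ in 8 of 36 pixels even though the content is identical, while the difference vanishes once the segment is shifted back. A real page would need the search restricted to individual segments rather than the whole screenshot.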

Notes

  1. Code https://github.com/webis-de/TPDL-22.

  2. Data https://zenodo.org/record/6881335.

  3. The fraction of screenshots without pixel differences (no reproduction errors) in our dataset is 12.9% (845/6531).

  4. Based on a frequency analysis, we set the threshold for small to be 5 or fewer pixels vertically and 8 or fewer pixels horizontally.

References

  1. Internet archive: quality assurance overview (2022). https://support.archive-it.org/hc/en-us/articles/208333833-Quality-Assurance-Overview

  2. Internet archive: wayback machine size as displayed on its front page (2022). https://web.archive.org/web/20220531094827/

  3. Kiesel, J., Kneist, F., Alshomary, M., Stein, B., Hagen, M., Potthast, M.: Reproducible web corpora: interactive archiving with automatic quality assessment. J. Data Inf. Qual. (JDIQ) 10(4), 17:1–17:25 (2018). https://doi.org/10.1145/3239574

  4. Reyes Ayala, B., Phillips, M., Ko, L.: Current quality assurance practices in web archiving. UNT Digital Library, pp. 1–34, August 2014

  5. Ayala, B.R.: Correspondence as the primary measure of quality for web archives: a grounded theory study. In: Hall, M., Merčun, T., Risse, T., Duchateau, F. (eds.) TPDL 2020. LNCS, vol. 12246, pp. 73–86. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-54956-5_6

  6. Ayala, B.R., Hitchcock, E., Sun, J.: Using image similarity metrics to measure visual quality in web archives. In: JCDL 2019: Web Archiving and Digital Libraries (WADL) Workshop, pp. 11–13. ACM (2019). https://doi.org/10.7939/r3-yh2n-rx10

  7. Tomar, S.: Converting video formats with FFmpeg. Linux J. 2006(146), 10 (2006)

Author information

Correspondence to Theresa Elstner.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Elstner, T. et al. (2022). Visual Web Archive Quality Assessment. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_34

  • DOI: https://doi.org/10.1007/978-3-031-16802-4_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16801-7

  • Online ISBN: 978-3-031-16802-4

  • eBook Packages: Computer Science (R0)
