Abstract
The large size of today’s web archives makes it impossible to manually assess the quality of each archived web page, i.e., to check whether a page can be reproduced faithfully from an archive. For automated web archive quality assessment, previous work proposed to measure the pixel difference between a screenshot of the original page and a screenshot of the same page when reproduced from the archive. However, when categorizing types of reproduction errors (this paper introduces a taxonomy for them), one finds that some errors cause large pixel differences between the screenshots but degrade the user experience of the reproduced web page only negligibly. Therefore, we propose to visually align page segments in such cases before measuring the pixel differences. Since the diversity of reproduction error types precludes a one-size-fits-all solution for visual alignment, we focus on one common type (translated segments) and investigate the usefulness of video compression algorithms for this task.
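As a rough illustration of why alignment matters before measuring pixel differences, the following minimal sketch compares two small grayscale "screenshots" with and without compensating for a translation. It is not the method evaluated in the paper: the exhaustive shift search merely stands in for the motion estimation that video compression algorithms perform, and the names `pixel_difference` and `best_translation`, the list-of-rows image representation, and the `max_shift` parameter are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale screenshot as rows of pixel intensities


def pixel_difference(a: Image, b: Image, dx: int = 0, dy: int = 0) -> float:
    """Fraction of differing pixels when a[y][x] is compared with b[y + dy][x + dx].

    Both images are assumed to have the same dimensions. Only positions that
    remain inside the image after the shift are compared.
    """
    height, width = len(a), len(a[0])
    differing = compared = 0
    for y in range(height):
        for x in range(width):
            sy, sx = y + dy, x + dx
            if 0 <= sy < height and 0 <= sx < width:
                compared += 1
                if a[y][x] != b[sy][sx]:
                    differing += 1
    # If nothing overlaps, treat the images as completely different.
    return differing / compared if compared else 1.0


def best_translation(a: Image, b: Image, max_shift: int = 2) -> Tuple[int, int, float]:
    """Try every shift up to max_shift pixels and return (dx, dy, difference) of the best."""
    best = (0, 0, pixel_difference(a, b))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            diff = pixel_difference(a, b, dx, dy)
            if diff < best[2]:
                best = (dx, dy, diff)
    return best


# Hypothetical example: the reproduced page is the original moved down by one row.
original = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 0, 0]]
reproduced = [[0, 0, 0], [1, 2, 3], [4, 5, 6], [7, 8, 9]]

unaligned = pixel_difference(original, reproduced)
dx, dy, aligned = best_translation(original, reproduced, max_shift=1)
```

In this toy case every pixel differs without alignment, while the difference vanishes once the one-row vertical shift is compensated, which mirrors the situation the abstract describes for translated segments.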
Notes
- 1.
- 2.
- 3. The fraction of screenshots without pixel differences (no reproduction errors) in our dataset is 12.9% (845/6531).
- 4. Based on a frequency analysis, we set the threshold for small to be 5 or fewer pixels vertically and 8 or fewer pixels horizontally.
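The threshold in the last note can be written as a simple predicate. This is a sketch under assumptions: the note does not say whether both limits must hold at once or whether shifts in either direction count, so the example requires both and uses absolute values. The function and constant names are hypothetical.

```python
MAX_SMALL_VERTICAL = 5    # pixels, taken from the note above
MAX_SMALL_HORIZONTAL = 8  # pixels, taken from the note above


def is_small_translation(dx: int, dy: int) -> bool:
    """Return True if a shift of (dx, dy) pixels counts as "small".

    Assumption: both limits must be satisfied, and the sign of the shift
    does not matter.
    """
    return abs(dx) <= MAX_SMALL_HORIZONTAL and abs(dy) <= MAX_SMALL_VERTICAL
```

For instance, a shift of 8 pixels horizontally and 5 pixels vertically would still count as small under this reading, while 6 pixels vertically would not.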
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Elstner, T. et al. (2022). Visual Web Archive Quality Assessment. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_34
DOI: https://doi.org/10.1007/978-3-031-16802-4_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)