Abstract
Large web archival collections are often opaque about their holdings. We created an open-source tool, CDX Summary, that generates statistical reports based on the URIs, hosts, TLDs, paths, query parameters, status codes, media types, dates and times, etc. present in the CDX index of a collection of WARC files. The tool also surfaces a configurable number of potentially good random memento samples from the collection for visual inspection, quality assurance, representative thumbnail generation, etc. It generates both human-readable and machine-readable reports with varying levels of detail for different use cases. Furthermore, we implemented a Web Component that can render the generated JSON summaries in HTML documents. Early exploration of CDX insights on Wayback Machine collections helped us improve our crawl operations.
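To make the idea of summarizing a CDX index concrete, the following is a minimal sketch and not the CDX Summary implementation. It assumes space-separated CDX lines whose first five fields are the URL key, the 14-digit timestamp, the original URI, the media type, and the status code (the common 11-field layout); the function name `summarize_cdx` and the sample records are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_cdx(lines):
    """Tally a few per-capture statistics from space-separated CDX lines.

    Assumed field order (an assumption of this sketch):
    urlkey, timestamp, original URI, media type, status code, ...
    """
    stats = {
        "status": Counter(),
        "mime": Counter(),
        "tld": Counter(),
        "year": Counter(),
    }
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header line that names the fields.
        if not line or line.startswith("CDX"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore malformed records in this sketch
        _, timestamp, original, mime, status = fields[:5]
        host = urlsplit(original).hostname or ""
        stats["status"][status] += 1
        stats["mime"][mime] += 1
        # The last dot-separated label of the host stands in for the TLD.
        stats["tld"][host.rsplit(".", 1)[-1]] += 1
        stats["year"][timestamp[:4]] += 1
    return stats


# Hypothetical sample records, for illustration only.
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20220101000000 http://example.com/ text/html 200 AAAA - - 1234 0 a.warc.gz",
    "org,example)/logo.png 20210615120000 https://example.org/logo.png image/png 200 BBBB - - 567 1234 a.warc.gz",
    "com,example)/old 20220102000000 http://example.com/old text/html 404 CCCC - - 321 1801 a.warc.gz",
]
report = summarize_cdx(sample)
print(report["status"].most_common())
```

Counters keep the sketch short and convert easily to JSON-friendly dictionaries via `dict(report["status"])`; a real index would need streaming input and more careful handling of revisit records and malformed lines.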
Acknowledgements
We thank the following IA staff members for their help: Brewster Kahle for testing the CLI, Kenji Nagahashi for feedback on the web UI text, Brenton Cheng and Isa Herico Velasco for main site integration, Jason Buckner for Web Component help, and Jim Shelton for UX feedback.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Alam, S., Graham, M. (2022). CDX Summary: Web Archival Collection Insights. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_25
DOI: https://doi.org/10.1007/978-3-031-16802-4_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)