Abstract
Background. Web archives store born-digital documents, which are usually collected from the Internet by crawlers and stored in the Web Archive (WARC) format. The trustworthiness and integrity of web archives are still an open challenge, especially in the news portal domain, which faces the additional challenge of censorship even in democratic societies.
Aim. The aim of this paper is to present a lightweight, blockchain-based solution for web archive validation that allows the authenticity of crawled documents to be verified for many years to come.
Method. We developed our archive validation solution as a continuation of our work on web crawler development, which mainly targets news portals. The system is designed as an overlay on a blockchain with a proof-of-stake (PoS) distributed consensus algorithm. PoS was chosen for its lower ecological footprint compared to proof-of-work solutions (e.g., Bitcoin) and its lower expected investment in computing infrastructure.
Results. We implemented a prototype of the proposed solution in Python and C#. The prototype was tested on web archive content crawled from Hungarian news portals at two different timestamps, consisting of 1 million articles in total.
Conclusions. We concluded that the proposed solution is accessible, can be used by different stakeholders to validate crawled content, can be deployed on cheap commodity hardware, tackles the archive integrity challenge, and is capable of efficiently managing duplicate documents.
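The validation idea summarised above can be illustrated with a minimal Python sketch. It is not the authors' prototype: the names `digest`, `Block`, and `Ledger`, the use of SHA-256, and the block layout are all assumptions made for illustration, and the proof-of-stake consensus layer is omitted entirely. The sketch only shows how document digests chained into blocks could support integrity checks and the detection of duplicates across two crawls.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder previous-hash for the first block (assumption)


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest; the hash function is an assumption."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Block:
    index: int
    prev_hash: str
    doc_hashes: list  # digests of the documents recorded in this block

    def block_hash(self) -> str:
        # Hash a canonical JSON encoding of the block contents.
        payload = json.dumps(
            {"index": self.index, "prev": self.prev_hash, "docs": self.doc_hashes},
            sort_keys=True,
        ).encode("utf-8")
        return digest(payload)


@dataclass
class Ledger:
    blocks: list = field(default_factory=list)
    seen: set = field(default_factory=set)  # digests already recorded

    def add_crawl(self, documents) -> int:
        """Record one crawl as a block; return the number of duplicates skipped."""
        new_hashes, duplicates = [], 0
        for doc in documents:
            h = digest(doc)
            if h in self.seen:
                duplicates += 1  # already archived in an earlier crawl
            else:
                self.seen.add(h)
                new_hashes.append(h)
        prev = self.blocks[-1].block_hash() if self.blocks else GENESIS
        self.blocks.append(Block(len(self.blocks), prev, new_hashes))
        return duplicates

    def chain_is_intact(self) -> bool:
        """Check that every block points at the hash of its predecessor."""
        expected = GENESIS
        for block in self.blocks:
            if block.prev_hash != expected:
                return False
            expected = block.block_hash()
        return True

    def is_authentic(self, document: bytes) -> bool:
        """A document validates if the chain is intact and its digest is recorded."""
        h = digest(document)
        return self.chain_is_intact() and any(h in b.doc_hashes for b in self.blocks)
```

In this sketch, a second crawl that contains an unchanged article adds no new digest for it, and any later edit to an earlier block breaks the link stored in the following block. A bare hash chain cannot detect tampering with its most recent block; in a real deployment that guarantee would have to come from the distributed consensus layer, which is left out here.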
Notes
- 1. http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml, last accessed 2020/05/15.
- 2. Internet Archive, https://archive.org/.
- 3. https://tools.ietf.org/html/rfc7089, last accessed 2020/05/15.
- 4. https://tools.ietf.org/html/rfc6962, last accessed 2020/05/15.
- 5. https://github.com/WICG/webpackage, last accessed 2020/07/08.
- 6. http://blog.archive.org/2018/04/24/addressing-recent-claimsof-manipulated-blog-posts-in-the-wayback-machine, last accessed 2020/07/08.
- 7. https://www.w3.org/TR/prov-overview, last accessed 2020/07/08.
- 8. https://medium.com/coinmonks/implementing-proof-of-stake-part-2-748156d5c85e, last accessed 2020/07/08.
- 9. https://www.gov.uk/government/news/distributed-ledger-technology-beyond-block-chain, last accessed 2020/07/08.
- 10. https://medium.com/chainsafe-systems/ethereum-2-0-a-complete-guide-casper-and-the-beacon-chain-be95129fc6c1, last accessed 2020/07/08.
- 11. Zenodo, https://zenodo.org/.
- 15. https://medium.com/chainsafe-systems/ethereum-2-0-a-complete-guide-casper-and-the-beacon-chain-be95129fc6c1, last accessed 2020/07/08.
Acknowledgement
This research was supported by the Institutional Excellence Program for Higher Education (FIKP) of the Republic of Hungary.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Lendák, I., Indig, B., Palkó, G. (2021). WARChain: Blockchain-Based Validation of Web Archives. In: Groß, T., Viganò, L. (eds) Socio-Technical Aspects in Security and Trust. STAST 2020. Lecture Notes in Computer Science, vol 12812. Springer, Cham. https://doi.org/10.1007/978-3-030-79318-0_7
DOI: https://doi.org/10.1007/978-3-030-79318-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79317-3
Online ISBN: 978-3-030-79318-0
eBook Packages: Computer Science, Computer Science (R0)