WARChain: Blockchain-Based Validation of Web Archives

  • Conference paper
Socio-Technical Aspects in Security and Trust (STAST 2020)

Abstract

Background. Web archives store born-digital documents, which are usually collected from the Internet by crawlers and stored in the Web ARChive (WARC) format. The trustworthiness and integrity of web archives remain an open challenge, especially in the news portal domain, which faces the additional challenge of censorship, even in democratic societies.

Aim. The aim of this paper is to present a lightweight, blockchain-based solution for web archive validation that ensures crawled documents remain verifiably authentic for many years to come.

Method. We developed our archive validation solution as an extension and continuation of our work in web crawler development, mainly targeting news portals. The system is designed as an overlay over a blockchain with a proof-of-stake (PoS) distributed consensus algorithm. PoS was chosen due to its lower ecological footprint compared to proof-of-work solutions (e.g. Bitcoin) and lower expected investment in computing infrastructure.
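As an illustration of the proof-of-stake idea mentioned above, the following minimal Python sketch picks the next block proposer with probability proportional to stake. It is not the consensus algorithm of the paper: the function name `select_validator`, the validator names, and the use of the previous block hash as a seed are all assumptions made for this example.

```python
import hashlib
from typing import Dict


def select_validator(stakes: Dict[str, int], seed: bytes) -> str:
    """Pick a validator with probability proportional to its stake.

    Hypothetical helper: the draw is derived deterministically from
    `seed` (assumed here to be something all nodes share, such as the
    previous block hash), so every node computes the same winner.
    """
    total = sum(stakes.values())
    if total <= 0:
        raise ValueError("total stake must be positive")

    # Map the seed to an integer "ticket" in the range [0, total).
    ticket = int.from_bytes(hashlib.sha256(seed).digest(), "big") % total

    # Walk the validators in a fixed (sorted) order and return the one
    # whose cumulative stake interval contains the ticket.
    cumulative = 0
    for name in sorted(stakes):
        cumulative += stakes[name]
        if ticket < cumulative:
            return name
    raise AssertionError("unreachable: ticket is always below total")


# Example call with made-up archive operators and stakes.
proposer = select_validator(
    {"archive_a": 10, "archive_b": 30, "archive_c": 60},
    seed=b"previous-block-hash",
)
```

Unlike proof-of-work, this selection needs only one hash computation per round, which is the intuition behind the lower expected hardware investment.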

Results. We implemented a prototype of the proposed solution in Python and C#. The prototype was tested on web archive content crawled from Hungarian news portals at two different points in time, comprising 1 million articles in total.

Conclusions. We concluded that the proposed solution is accessible, usable by different stakeholders to validate crawled content, deployable on cheap commodity hardware, addresses the archive integrity challenge, and is capable of efficiently managing duplicate documents.
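The integrity and duplicate-handling claims can be illustrated with a small hash-chain sketch in Python. This is a simplified stand-in under stated assumptions, not the WARChain prototype: the class `DigestLedger`, its methods, the block fields, and the choice of SHA-256 are hypothetical, and a real deployment would replicate the chain across nodes rather than keep it in one in-memory list.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a document payload."""
    return hashlib.sha256(payload).hexdigest()


class DigestLedger:
    """Hypothetical in-memory chain of document digests."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, str]] = []
        self._seen: Dict[str, int] = {}  # content digest -> block index

    @staticmethod
    def _block_hash(block: Dict[str, str]) -> str:
        # Hash every field except the block's own hash.
        fields = {k: block[k] for k in ("prev", "url", "crawled_at", "content")}
        return digest(json.dumps(fields, sort_keys=True).encode("utf-8"))

    def register(self, url: str, crawled_at: str, payload: bytes) -> Optional[int]:
        """Append a block for new content; return None for a duplicate."""
        content = digest(payload)
        if content in self._seen:
            return None  # identical content was already registered
        prev = self.blocks[-1]["hash"] if self.blocks else GENESIS
        block = {"prev": prev, "url": url, "crawled_at": crawled_at, "content": content}
        block["hash"] = self._block_hash(block)
        self.blocks.append(block)
        self._seen[content] = len(self.blocks) - 1
        return self._seen[content]

    def chain_is_intact(self) -> bool:
        """Recompute every block hash and check the links between blocks."""
        prev = GENESIS
        for block in self.blocks:
            if block["prev"] != prev or block["hash"] != self._block_hash(block):
                return False
            prev = block["hash"]
        return True

    def is_authentic(self, payload: bytes) -> bool:
        """A payload is accepted if its digest is registered and the chain is intact."""
        return digest(payload) in self._seen and self.chain_is_intact()
```

Only digests are stored, so the ledger stays small regardless of article size, and an article that is unchanged between two crawls maps to the same digest and is registered once.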

Notes

  1. http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml, last accessed 2020/05/15.

  2. Internet Archive, https://archive.org/.

  3. https://tools.ietf.org/html/rfc7089, last accessed 2020/05/15.

  4. https://tools.ietf.org/html/rfc6962, last accessed 2020/05/15.

  5. https://github.com/WICG/webpackage, last accessed 2020/07/08.

  6. http://blog.archive.org/2018/04/24/addressing-recent-claimsof-manipulated-blog-posts-in-the-wayback-machine, last accessed 2020/07/08.

  7. https://www.w3.org/TR/prov-overview, last accessed 2020/07/08.

  8. https://medium.com/coinmonks/implementing-proof-of-stake-part-2-748156d5c85e, last accessed 2020/07/08.

  9. https://www.gov.uk/government/news/distributed-ledger-technology-beyond-block-chain, last accessed 2020/07/08.

  10. https://medium.com/chainsafe-systems/ethereum-2-0-a-complete-guide-casper-and-the-beacon-chain-be95129fc6c1, last accessed 2020/07/08.

  11. Zenodo, https://zenodo.org/.

  12. https://www.blockchain.com/charts/blocks-size.

  13. https://www.worldwidewebsize.com.

  14. https://github.com/lendak/warchain.git.

  15. https://medium.com/chainsafe-systems/ethereum-2-0-a-complete-guide-casper-and-the-beacon-chain-be95129fc6c1, last accessed 2020/07/08.

  16. https://github.com/webrecorder/warcio.

  17. https://www.crummy.com/software/BeautifulSoup/.

  18. https://www.blockchain.com/charts/median-confirmation-time.

  19. https://ethgasstation.info/blog/ethereum-transaction-how-long.

Acknowledgement

This research was supported by the Institutional Excellence Program for Higher Education (FIKP) of the Republic of Hungary.

Author information

Corresponding author

Correspondence to Imre Lendák.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Lendák, I., Indig, B., Palkó, G. (2021). WARChain: Blockchain-Based Validation of Web Archives. In: Groß, T., Viganò, L. (eds) Socio-Technical Aspects in Security and Trust. STAST 2020. Lecture Notes in Computer Science, vol 12812. Springer, Cham. https://doi.org/10.1007/978-3-030-79318-0_7

  • DOI: https://doi.org/10.1007/978-3-030-79318-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-79317-3

  • Online ISBN: 978-3-030-79318-0

  • eBook Packages: Computer Science, Computer Science (R0)
