Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives

Published: 31 July 2017


Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions.



Published In

Journal on Computing and Cultural Heritage   Volume 10, Issue 4
October 2017
126 pages
Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017
Accepted: 01 December 2016
Revised: 01 October 2016
Received: 01 June 2016
Published in JOCCH Volume 10, Issue 4


Author Tags

  ARC
  Apache HBase
  Apache Hadoop
  Apache Spark
  Big data
  WARC


Funding Sources

  Natural Sciences and Engineering Research Council of Canada
  Social Sciences and Humanities Research Council of Canada
  Compute Canada, both through their digital humanities cloud service and a Research Platforms and Portals
  Ontario Ministry of Research and Innovation's Early Researcher Award
  Columbia University's Web Archiving Incentive Program
  U.S. NSF


