Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives

Published: 31 July 2017

Abstract

Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions.
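The four steps of the FAAV cycle can be sketched on a toy collection. This is only an illustration, not Warcbase's domain-specific language or API: it is plain Python, and the sample records, the function names (`filter_step`, `analyze_step`, `aggregate_step`, `visualize_step`), and the choice of word counts per domain as the analysis are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for archived captures:
# (crawl date as YYYYMMDD, URL, extracted plain text).
CAPTURES = [
    ("20090115", "http://www.example.org/news/a.html", "election results and policy debate"),
    ("20090220", "http://www.example.org/news/b.html", "policy platform announced"),
    ("20100305", "http://blog.example.net/post/1", "budget debate continues"),
    ("20100412", "http://www.example.org/about.html", "about this site"),
]


def filter_step(captures, year):
    """Filter: keep only captures crawled in the given year."""
    return [c for c in captures if c[0].startswith(year)]


def analyze_step(captures):
    """Analyze: reduce each capture to a (domain, word count) pair."""
    return [(urlparse(url).netloc, len(text.split())) for _, url, text in captures]


def aggregate_step(pairs):
    """Aggregate: sum the word counts per domain."""
    totals = Counter()
    for domain, count in pairs:
        totals[domain] += count
    return totals


def visualize_step(totals):
    """Visualize: render a plain-text bar chart, one row per domain."""
    return "\n".join(
        f"{domain:<20} {'#' * count}" for domain, count in sorted(totals.items())
    )


# One pass through the cycle, starting from the question
# "which domains contributed text in 2010?"
totals_2010 = aggregate_step(analyze_step(filter_step(CAPTURES, "2010")))
print(visualize_step(totals_2010))
```

Because the cycle is iterative, a scholar would typically inspect the output, refine the question, and run the steps again with a different filter (for example, another year) or a different analysis.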

Published In

Journal on Computing and Cultural Heritage   Volume 10, Issue 4
October 2017
126 pages
ISSN:1556-4673
EISSN:1556-4711
DOI:10.1145/3129537

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017
Accepted: 01 December 2016
Revised: 01 October 2016
Received: 01 June 2016
Published in JOCCH Volume 10, Issue 4

Author Tags

  1. ARC
  2. Apache HBase
  3. Apache Hadoop
  4. Apache Spark
  5. Big data
  6. WARC

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Natural Sciences and Engineering Research Council of Canada
  • Social Sciences and Humanities Research Council of Canada
  • Compute Canada, both through their digital humanities cloud service and a Research Platforms and Portals
  • Ontario Ministry of Research and Innovation's Early Researcher Award
  • Columbia University's Web Archiving Incentive Program, U.S. NSF
