research-article

Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives

Authors:

Alice Zhou

Journal on Computing and Cultural Heritage (JOCCH), Volume 10, Issue 4

Article No.: 22, Pages 1 - 30

https://doi.org/10.1145/3097570

Published: 31 July 2017

Abstract

Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions.
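The four steps of the FAAV cycle can be pictured as a small pipeline. The sketch below is only an illustration in plain Python, not Warcbase's domain-specific language: the record layout, the `RECORDS` sample data, and all function names are hypothetical stand-ins chosen for this example.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-in for archived captures: (crawl_date, url, outgoing_links).
# Real Web archives store far richer records; this layout is an assumption.
RECORDS = [
    ("2015-01", "http://example.org/a", ["http://example.com/x", "http://example.net/y"]),
    ("2015-01", "http://example.org/b", ["http://example.com/z"]),
    ("2015-02", "http://other.test/c", ["http://example.com/x"]),
]


def filter_records(records, domain):
    """Filter: keep only captures whose URL host equals the given domain."""
    return [r for r in records if urlparse(r[1]).netloc == domain]


def analyze(records):
    """Analyze: extract the target host of every outgoing link."""
    return [urlparse(link).netloc for _, _, links in records for link in links]


def aggregate(hosts):
    """Aggregate: count how often each target host is linked."""
    return Counter(hosts)


def visualize(counts):
    """Visualize: render the counts as a plain-text bar chart."""
    return "\n".join(f"{host:<12} {'#' * n}" for host, n in counts.most_common())


def faav(records, domain):
    """Run one pass of filter, analyze, aggregate, and visualize."""
    return visualize(aggregate(analyze(filter_records(records, domain))))


if __name__ == "__main__":
    print(faav(RECORDS, "example.org"))
```

In practice a scholar would inspect the output, refine the question, and run the cycle again with a different filter or analysis, which is what makes the process iterative.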

Index Terms

Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives
1. Applied computing
  1. Arts and humanities
2. Information systems
  1. Information systems applications
    1. Computing platforms
    2. Digital libraries and archives
  2. World Wide Web

Information

Published In

Journal on Computing and Cultural Heritage Volume 10, Issue 4

October 2017

126 pages

ISSN:1556-4673

EISSN:1556-4711

DOI:10.1145/3129537

Editor:
Roberto Scopigno
CNR-ISTI, Italy

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017

Accepted: 01 December 2016

Revised: 01 October 2016

Received: 01 June 2016

Published in JOCCH Volume 10, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Natural Sciences and Engineering Research Council of Canada
Social Sciences and Humanities Research Council of Canada
Compute Canada, both through their digital humanities cloud service and a Research Platforms and Portals
Ontario Ministry of Research and Innovation's Early Researcher Award
Columbia University's Web Archiving Incentive Program, U.S. NSF
