research-article

Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives

Published: 31 July 2017

Abstract

Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions.
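The four steps of the FAAV cycle can be illustrated with a minimal sketch. This is plain Python over a handful of in-memory toy records, not Warcbase's domain-specific language or its Spark-based implementation; the record fields (`url`, `crawl_date`, `links`), the `faav` function, and the example domains are all hypothetical names chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical toy records standing in for archived page captures.
# The field names are assumptions for illustration, not a Warcbase schema.
captures = [
    {"url": "http://example.org/a", "crawl_date": "2015-01",
     "links": ["http://example.com/x", "http://example.net/y"]},
    {"url": "http://example.org/b", "crawl_date": "2015-02",
     "links": ["http://example.com/z"]},
    {"url": "http://other.test/c", "crawl_date": "2015-01",
     "links": ["http://example.com/x"]},
]


def domain(url):
    """Return the host part of a URL."""
    return urlparse(url).netloc


def faav(records, source_domain):
    """Run one pass of filter, analyze, aggregate, visualize."""
    # Filter: keep only captures from the domain under study.
    filtered = [r for r in records if domain(r["url"]) == source_domain]
    # Analyze: extract the target domain of every outgoing link.
    targets = [domain(link) for r in filtered for link in r["links"]]
    # Aggregate: count how often each target domain is linked to.
    counts = Counter(targets)
    # Visualize: render the counts as a simple text bar chart.
    chart = "\n".join(
        f"{d:<12} {'#' * n} ({n})" for d, n in counts.most_common()
    )
    return counts, chart


counts, chart = faav(captures, "example.org")
print(chart)
```

In practice a scholar would inspect the output, refine the question, and run the cycle again with a different filter or analysis, which is the iterative aspect the process model describes.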


          • Published in

            Journal on Computing and Cultural Heritage   Volume 10, Issue 4
            October 2017
            126 pages
            ISSN:1556-4673
            EISSN:1556-4711
            DOI:10.1145/3129537

            Copyright © 2017 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 31 July 2017
            • Accepted: 1 December 2016
            • Revised: 1 October 2016
            • Received: 1 June 2016
            Published in jocch Volume 10, Issue 4

            Qualifiers

            • research-article
            • Research
            • Refereed
