Abstract
Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions.
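The four steps of the FAAV cycle can be illustrated with a minimal sketch. This is not Warcbase's domain-specific language or API: it is plain Python over a handful of made-up capture records, and every name in it (`CAPTURES`, `filter_step`, `analyze_step`, `aggregate_step`, `visualize_step`, `faav`) is hypothetical and chosen only for illustration. It assumes a scholar's question of the form "how often does a term appear, per domain, in a given crawl year?"

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical capture records: (url, crawl date as YYYYMMDD, extracted text).
# Illustrative data only; not drawn from any real archive.
CAPTURES = [
    ("http://example.org/news/a", "20150301", "election results and analysis"),
    ("http://example.org/news/b", "20150415", "election debate coverage"),
    ("http://example.com/blog/c", "20150420", "recipes for spring"),
    ("http://example.net/d", "20160102", "election retrospective"),
]


def filter_step(captures, year):
    """Filter: keep only captures whose crawl date falls in the given year."""
    return [c for c in captures if c[1].startswith(year)]


def analyze_step(captures, term):
    """Analyze: derive (domain, count of exact-word matches of term) per capture."""
    return [
        (urlparse(url).netloc, text.split().count(term))
        for url, _, text in captures
    ]


def aggregate_step(pairs):
    """Aggregate: sum the per-capture counts for each domain."""
    totals = Counter()
    for domain, n in pairs:
        totals[domain] += n
    return totals


def visualize_step(totals):
    """Visualize: render a plain-text bar chart, one row per domain."""
    return "\n".join(
        f"{domain:<12} {'#' * n} ({n})" for domain, n in sorted(totals.items())
    )


def faav(captures, year, term):
    """Run one pass of the cycle: filter, analyze, aggregate, visualize."""
    filtered = filter_step(captures, year)
    analyzed = analyze_step(filtered, term)
    aggregated = aggregate_step(analyzed)
    return visualize_step(aggregated)


if __name__ == "__main__":
    print(faav(CAPTURES, "2015", "election"))
```

In practice the cycle is iterative: the output of the visualize step typically prompts a refined question, which changes the filter or the analysis and starts another pass.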
Index Terms
- Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives