DOI: 10.1145/2247596.2247598

Inside "Big Data management": ogres, onions, or parfaits?

Published: 27 March 2012

Abstract

In this paper we review the history of systems for managing "Big Data" as well as today's activities and architectures from the (perhaps biased) perspective of three "database guys" who have been watching this space for a number of years and are currently working together on "Big Data" problems. Our focus is on architectural issues, and particularly on the components and layers that have been developed recently (in open source and elsewhere) and on how they are being used (or abused) to tackle challenges posed by today's notion of "Big Data". Also covered is the approach we are taking in the ASTERIX project at UC Irvine, where we are developing our own set of answers to the questions of the "right" components and the "right" set of layers for taming the "Big Data" beast. We close by sharing our opinions on what some of the important open questions are in this area as well as our thoughts on how the data-intensive computing community might best seek out answers.


Published In

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
March 2012
643 pages
ISBN:9781450307901
DOI:10.1145/2247596

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

EDBT '12

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
