Big Data 2.0 Processing Systems: Taxonomy and Open Challenges

Bajaber, Fuad; Elshawi, Radwa; Batarfi, Omar; Altalhi, Abdulrahman; Barnawi, Ahmed; Sakr, Sherif

doi:10.1007/s10723-016-9371-1

Published: 24 June 2016

Volume 14, pages 379–405, (2016)
Journal of Grid Computing

Fuad Bajaber¹,
Radwa Elshawi²,
Omar Batarfi¹,
Abdulrahman Altalhi¹,
Ahmed Barnawi¹ &
Sherif Sakr³,⁴

941 Accesses
40 Citations

Abstract

Data is a key resource in the modern world. Big data has become a popular term used to describe the exponential growth and availability of data. In practice, the growing demand for large-scale data processing and data analysis applications has spurred the development of novel solutions from both industry and academia. For a decade, the MapReduce framework and its open source realization, Hadoop, have been highly successful, creating so much momentum in both the research and industrial communities that they have become the de facto standard for big data processing platforms. In recent years, however, academia and industry have started to recognize the limitations of the Hadoop framework in several application domains and big data processing scenarios, such as large-scale processing of structured data, graph data, and streaming data. Thus, we have witnessed unprecedented interest in tackling these challenges with new solutions, which constitute a new wave of mostly domain-specific, optimized big data processing platforms. In this article, we refer to this new wave of systems as Big Data 2.0 processing systems. To better understand the latest developments in the world of big data processing systems, we provide a taxonomy and detailed analysis of the state of the art in this domain. In addition, we identify a set of current open research challenges and discuss some promising directions for future research.

Author information

Authors and Affiliations

King Abdulaziz University, Jeddah, Saudi Arabia
Fuad Bajaber, Omar Batarfi, Abdulrahman Altalhi & Ahmed Barnawi
Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
Radwa Elshawi
University of New South Wales, Sydney, NSW, Australia
Sherif Sakr
King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
Sherif Sakr

Corresponding author

Correspondence to Sherif Sakr.

About this article

Cite this article

Bajaber, F., Elshawi, R., Batarfi, O. et al. Big Data 2.0 Processing Systems: Taxonomy and Open Challenges. J Grid Computing 14, 379–405 (2016). https://doi.org/10.1007/s10723-016-9371-1

Received: 16 July 2015
Accepted: 14 June 2016
Published: 24 June 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s10723-016-9371-1
