Big Data 2.0 Processing Systems: Taxonomy and Open Challenges

Journal of Grid Computing

Abstract

Data is a key resource in the modern world. Big data has become a popular term used to describe the exponential growth and availability of data. In practice, the growing demand for large-scale data processing and data analysis applications has spurred the development of novel solutions from both industry and academia. For a decade, the MapReduce framework and its open source realization, Hadoop, have been highly successful and have created so much momentum in both the research and industrial communities that they became the de facto standard for big data processing platforms. However, in recent years, academia and industry have started to recognize the limitations of the Hadoop framework in several application domains and big data processing scenarios, such as large-scale processing of structured data, graph data, and streaming data. Thus, we have witnessed unprecedented interest in tackling these challenges with new solutions, which constitute a new wave of mostly domain-specific, optimized big data processing platforms. In this article, we refer to this new wave of systems as Big Data 2.0 processing systems. To better understand the latest ongoing developments in the world of big data processing systems, we provide a taxonomy and detailed analysis of the state of the art in this domain. In addition, we identify a set of current open research challenges and discuss some promising directions for future research.

Author information

Corresponding author

Correspondence to Sherif Sakr.

Cite this article

Bajaber, F., Elshawi, R., Batarfi, O. et al. Big Data 2.0 Processing Systems: Taxonomy and Open Challenges. J Grid Computing 14, 379–405 (2016). https://doi.org/10.1007/s10723-016-9371-1

  • DOI: https://doi.org/10.1007/s10723-016-9371-1
