Classifying Big Data Analytic Approaches: A Generic Architecture

  • Conference paper
Software Technologies (ICSOFT 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 868)

Abstract

The explosive growth in the volume of generated data that applications must analyze has driven the current Big Data boom, which in turn has produced a vast landscape of architectural solutions. Non-expert users who must decide which analytical solutions are most appropriate for their particular constraints and specific requirements in a Big Data context are currently lost, faced with a panoply of disparate and diverse solutions. To support users in this difficult selection task, in a previous work we proposed a generic architecture to classify Big Data analytical approaches, together with a set of comparison and evaluation criteria. In this paper, we extend our classification architecture to cover more types of Big Data analytic tools and approaches, and we improve the list of criteria used to evaluate them. We classify different existing Big Data analytics solutions according to our proposed generic architecture and qualitatively evaluate them in terms of the comparison criteria. Additionally, we propose a preliminary design of a decision support system intended to generate suggestions for users based on this classification and on a qualitative evaluation in terms of previous users' experiences, users' requirements, the nature of the analysis they need, and the set of evaluation criteria.

Notes

  1. http://hadoop.apache.org

  2. http://storm.apache.org

  3. https://flink.apache.org

  4. http://www.pentaho.com/product/data-integration

  5. http://www.talend.com

  6. http://mahout.apache.org/

  7. https://spark.apache.org/mllib/

  8. http://www.cs.waikato.ac.nz/ml/index.html

  9. https://www.h2o.ai/h2o/

  10. http://oryx.io

  11. https://samoa.incubator.apache.org

  12. http://www.cascading.org

  13. https://www.elastic.co/fr/products/elasticsearch

  14. http://lucene.apache.org/solr/

  15. http://kylin.apache.org

  16. https://www.cloudera.com/products/open-source/apache-hadoop/impala.html

  17. http://www.microsoft.com/azure

  18. http://www.asterdata.com/

  19. http://www.actian.com

  20. http://nanocubes.net

  21. https://help.sap.com/viewer/product/SAP_HANA_PLATFORM/2.0.00/en-US

  22. Apache Giraph: http://giraph.apache.org

  23. Graph Engine: https://www.graphengine.io

  24. https://ignite.apache.org

Author information

Corresponding author

Correspondence to Yudith Cardinale.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Cardinale, Y., Guehis, S., Rukoz, M. (2018). Classifying Big Data Analytic Approaches: A Generic Architecture. In: Cabello, E., Cardoso, J., Maciaszek, L., van Sinderen, M. (eds) Software Technologies. ICSOFT 2017. Communications in Computer and Information Science, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-319-93641-3_13

  • DOI: https://doi.org/10.1007/978-3-319-93641-3_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93640-6

  • Online ISBN: 978-3-319-93641-3

  • eBook Packages: Computer Science (R0)
