Classifying Big Data Analytic Approaches: A Generic Architecture

Cardinale, Yudith; Guehis, Sonia; Rukoz, Marta

doi:10.1007/978-3-319-93641-3_13

Part of the book series: Communications in Computer and Information Science (CCIS, volume 868)

Included in the following conference series:

International Conference on Software Technologies

Abstract

The explosive growth in the volume of data that applications must analyze has driven the current Big Data boom, which in turn has produced a vast landscape of architectural solutions. Non-expert users who must decide which analytical solutions best fit their particular constraints and requirements in a Big Data context are often lost when faced with this panoply of disparate solutions. To support users in this difficult selection task, in previous work we proposed a generic architecture for classifying Big Data analytical approaches, together with a set of comparison and evaluation criteria. In this paper, we extend our classification architecture to cover more types of Big Data analytic tools and approaches, and we refine the list of criteria used to evaluate them. We classify existing Big Data analytics solutions according to the proposed generic architecture and evaluate them qualitatively against the comparison criteria. Additionally, we propose a preliminary design of a decision support system intended to generate suggestions for users based on this classification and on a qualitative evaluation in terms of previous users' experiences, users' requirements, the nature of the analysis they need, and the set of evaluation criteria.
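As an illustration only, the kind of suggestion step described in the abstract could be sketched as a weighted ranking over qualitative criterion ratings. This is not the authors' design: the `Solution` record, the `recommend` function, the category labels, the criterion names, and the 0 to 3 rating scale are all hypothetical, and the tool names are placeholders rather than evaluations of real products.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """A candidate analytics solution (all fields are hypothetical)."""
    name: str
    category: str            # assumed class label within a classification architecture
    scores: dict[str, int]   # assumed qualitative ratings, 0 (poor) to 3 (good)


def recommend(solutions, weights, category=None, top_k=3):
    """Rank solutions by the weighted sum of their criterion ratings.

    `weights` maps a criterion name to the importance the user gives it.
    A criterion missing from a solution's ratings counts as 0.
    """
    candidates = [
        s for s in solutions
        if category is None or s.category == category
    ]

    def total(solution):
        return sum(w * solution.scores.get(c, 0) for c, w in weights.items())

    # sorted() is stable, so ties keep their original catalog order.
    return sorted(candidates, key=total, reverse=True)[:top_k]


# Placeholder catalog: names and ratings are invented for the example.
catalog = [
    Solution("tool_a", "batch", {"scalability": 3, "ease_of_use": 1}),
    Solution("tool_b", "batch", {"scalability": 2, "ease_of_use": 3}),
    Solution("tool_c", "streaming", {"scalability": 3, "ease_of_use": 3}),
]

# A user who values ease of use twice as much as scalability.
user_weights = {"scalability": 1, "ease_of_use": 2}
for s in recommend(catalog, user_weights, category="batch"):
    print(s.name)
```

A weighted sum is only one possible aggregation. Feedback from previous users, which the abstract also mentions, could enter such a sketch as an additional rated criterion.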

Author information

Authors and Affiliations

Dpto. de Computación y TI, Universidad Simón Bolívar, Caracas, 1080-A, Venezuela
Yudith Cardinale
Université Paris Nanterre, 92001, Nanterre, France
Sonia Guehis & Marta Rukoz
Université Paris Dauphine, PSL Research University, CNRS, UMR 7243, LAMSADE, 75016, Paris, France
Sonia Guehis & Marta Rukoz

Corresponding author

Correspondence to Yudith Cardinale.

Editor information

Editors and Affiliations

King Juan Carlos University, Madrid, Spain
Enrique Cabello
University of Coimbra, Coimbra, Portugal
Jorge Cardoso
Wroclaw University of Economics, Wroclaw, Poland
Leszek A. Maciaszek
Computer Science, University of Twente, Enschede, The Netherlands
Marten van Sinderen

About this paper

Cite this paper

Cardinale, Y., Guehis, S., Rukoz, M. (2018). Classifying Big Data Analytic Approaches: A Generic Architecture. In: Cabello, E., Cardoso, J., Maciaszek, L., van Sinderen, M. (eds) Software Technologies. ICSOFT 2017. Communications in Computer and Information Science, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-319-93641-3_13

DOI: https://doi.org/10.1007/978-3-319-93641-3_13
Published: 08 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93640-6
Online ISBN: 978-3-319-93641-3
eBook Packages: Computer Science, Computer Science (R0)
