Scalability and Realtime on Big Data, MapReduce, NoSQL and Spark

Furtado, Pedro

doi:10.1007/978-3-319-61164-8_4

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 280))

Included in the following conference series:

European Business Intelligence Summer School

Abstract

Big data platforms strive to achieve scalability and realtime for query processing and complex analytics over “big” and/or “fast” data. In this context, big data warehouses are huge repositories of data to be used in analytics and machine learning. This work discusses models, concepts and approaches for reaching scalability and realtime in big data processing and big data warehouses. The main concepts of NoSQL, Parallel Data Management Systems (PDBMS), MapReduce and Spark are reviewed in the context of scalability: the first two offer data management, while the last two add flexible and scalable processing capacities. We also turn our attention to realtime data processing, the lambda architecture and its relation to scalability, and we revisit our own recent research on the issue. Three approaches directly related to realtime and scalability are included: the use of a realtime component in a data warehouse, parallelized de-normalization for scalability, and execution tree sharing for scaling to simultaneous sessions. With these models and technologies we revisit some of the major current solutions for data management and data processing with scalability and realtime capacities.

Author information

Authors and Affiliations

Departamento de Engenharia Informatica and Centro de Informatica e Sistemas da Universidade de Coimbra, Universidade de Coimbra, Polo II, 3030-290, Coimbra, Portugal
Pedro Furtado

Corresponding author

Correspondence to Pedro Furtado.

Editor information

Editors and Affiliations

University of Tours, Tours, France
Patrick Marcel
Dept. of Computer and Decision Engg. (CoDE), Universite Libre de Bruxelles, Brussels, Belgium
Esteban Zimányi

Cite this paper

Furtado, P. (2017). Scalability and Realtime on Big Data, MapReduce, NoSQL and Spark. In: Marcel, P., Zimányi, E. (eds) Business Intelligence. eBISS 2016. Lecture Notes in Business Information Processing, vol 280. Springer, Cham. https://doi.org/10.1007/978-3-319-61164-8_4

DOI: https://doi.org/10.1007/978-3-319-61164-8_4
Published: 04 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61163-1
Online ISBN: 978-3-319-61164-8
eBook Packages: Business and Management; Business and Management (R0)
