Abstract
Big data platforms strive to achieve scalability and realtime performance for query processing and complex analytics over “big” and/or “fast” data. In this context, big data warehouses are huge repositories of data used for analytics and machine learning. This work discusses models, concepts and approaches for achieving scalability and realtime operation in big data processing and big data warehouses. The main concepts of NoSQL, Parallel Data Management Systems (PDBMS), MapReduce and Spark are reviewed in the context of scalability: the first two offer data management, while the last two add flexible and scalable processing capacities. We also turn our attention to realtime data processing, the lambda architecture and its relation to scalability, and we revisit our own recent research on the issue. Three approaches are included that relate directly to realtime and scalability: the use of a realtime component in a data warehouse, parallelized de-normalization for scalability, and execution tree sharing for scaling to simultaneous sessions. With these models and technologies we revisit some of the major current solutions for data management and data processing with scalability and realtime capacities.
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Furtado, P. (2017). Scalability and Realtime on Big Data, MapReduce, NoSQL and Spark. In: Marcel, P., Zimányi, E. (eds) Business Intelligence. eBISS 2016. Lecture Notes in Business Information Processing, vol 280. Springer, Cham. https://doi.org/10.1007/978-3-319-61164-8_4
DOI: https://doi.org/10.1007/978-3-319-61164-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61163-1
Online ISBN: 978-3-319-61164-8
eBook Packages: Business and Management (R0)