Skip to main content

Scalability and Realtime on Big Data, MapReduce, NoSQL and Spark

  • Conference paper
  • First Online:
Business Intelligence (eBISS 2016)

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 280))

Included in the following conference series:

  • 2212 Accesses

Abstract

Big data platforms strive to achieve scalability and realtime for query processing and complex analytics over “big” and/or “fast” data. In this context, big data warehouses are huge repositories of data to be used in analytics and machine learning. This work discusses models, concepts and approaches to reach scalability and realtime in big data processing and big data warehouses. The main concepts of NoSQL, Parallel Data Management Systems (PDBMS), MapReduce and Spark are reviewed in the context of scalability. The first two offering data management, the last two adding flexible and scalable processing capacities. We also turn our attention to realtime data processing, lambda architecture and its relation with scalability, and we revisit our own recent research on the issue. Three approaches are included that are directly related to realtime and scalability: the use of a realtime component in a data warehouse, parallelized de-normalization for scalability and execution tree sharing for scaling to simultaneous sessions. With these models and technologies we revisit some of the major current solutions for data management and data processing with scalability and realtime capacities.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A. and Zaharia, M.: Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 (2015)

    Google Scholar 

  2. Chiba, T., Onodera, T.: Workload characterization and optimization of TPC-H queries on Apache Spark, Research Report RT0968, IBM Research – Tokyo, 16 October 2015

    Google Scholar 

  3. Chiba, T., Onodera, T.: Workload characterization and optimization of TPC-H queries on Apache Spark. In: 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2016)

    Google Scholar 

  4. Costa, J., Cecílio, J., Martins, P., Furtado, P.: ONE: predictable and scalable DW model. In: International Conference on Big Data Analytics and Knowledge Discovery (2011)

    Google Scholar 

  5. Costa, J.P.: Massively scalable data warehouses with performance predictability, PhD thesis, University of Coimbra, July 2015

    Google Scholar 

  6. Costa, J.P., Furtado, P.: Data warehouse processing scale-up for massive concurrent queries with SPIN. In: Hameurlain, A., Küng, J., Wagner, R., Bellatreche, L., Mohania, M. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XVII. LNCS, vol. 8970, pp. 1–23. Springer, Heidelberg (2015). doi:10.1007/978-3-662-46335-2_1

    Google Scholar 

  7. Costa, J.P., Furtado, P.: Improving the processing of DW star-queries under concurrent query workloads. In: Bellatreche, L., Mohania, Mukesh K. (eds.) DaWaK 2014. LNCS, vol. 8646, pp. 245–253. Springer, Cham (2014). doi:10.1007/978-3-319-10160-6_22

    Google Scholar 

  8. Costa, J.P., Furtado, P.: SPIN: concurrent workload scaling over data warehouses. In: Bellatreche, L., Mohania, Mukesh K. (eds.) DaWaK 2013. LNCS, vol. 8057, pp. 60–71. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40131-2_6

    Chapter  Google Scholar 

  9. Costa, J.P., Martins, P., Cecilio, J., Furtado, P.: Providing timely results with an elastic parallel DW. In: Chen, L., Felfernig, A., Liu, J., Raś, Z.W. (eds.) ISMIS 2012. LNCS, vol. 7661, pp. 415–424. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  10. Costa, J.P., Martins, P., Cecilio, J., Furtado, P.: TEEPA: a timely-aware elastic parallel architecture. In: Proceedings of the 16th International Database Engineering & Applications Symposium, IDEAS 2012, Prague, Czech Republic (2012)

    Google Scholar 

  11. Costa, J.P., Cecílio, J., Martins, P., Furtado, P.: Overcoming the scalability limitations of parallel star schema data warehouses. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds.) ICA3PP 2012. LNCS, vol. 7439, pp. 473–486. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33078-0_34

    Chapter  Google Scholar 

  12. Ferreira, N., Furtado, P.: Near real-time with traditional data warehouse architectures: factors and how-to. In: 17th International Database Engineering and Applications Symposium (2013)

    Google Scholar 

  13. Ferreira, N., Furtado, P.: Real-time data warehouse: a solution and evaluation. Int. J. Bus. Intell. Data Min. 8(3), 244–263 (2014)

    Article  Google Scholar 

  14. Furtado, P.: Experimental evidence on partitioning in parallel data warehouses. In: Proceedings of the ACM DOLAP 2004 - Workshop of the International Conference on Information and Knowledge Management, Washington USA, November 2004

    Google Scholar 

  15. Furtado, P.: Workload-based placement and join processing in node-partitioned data warehouses. In: Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, Zaragoza, Spain, pp. 38–47, September 2004

    Google Scholar 

  16. Furtado, P.: Efficient and robust node-partitioned data warehouses. In: Wrembel, R., Koncilia, C. (eds.) Data Warehouses and OLAP: Concepts, Architectures and Solutions, Chap. IX, pp. 203–229. Ideas Group, Inc. ISBN 1-59904365-3

    Google Scholar 

  17. Furtado, P.: A survey of parallel and distributed data warehouses. Int. J. Data Warehous. Min. 5(2), 57 (2009)

    Article  Google Scholar 

  18. Furtado, P.: Replication in node-partitioned data warehouses. In: DDIDR2005 Workshop of International Conference on Very Large Databases, VLDB (2005)

    Google Scholar 

  19. Furtado, P.: Efficiently processing query-intensive databases over a non-dedicated local network. In: Proceedings of the 19th International Parallel and Distributed Processing Symposium, Denver, Colorado, USA, May 2005

    Google Scholar 

  20. Furtado, P.: Model and procedure for performance and availability-wise parallel warehouses. Distrib. Parallel Databases 25(1), 71 (2009)

    Article  Google Scholar 

  21. Furtado, P.: Scalability and Realtime for Data Warehouses and Big data, Paperback, 11 September 2015

    Google Scholar 

  22. Martins, P.: Elastic ETL+Q for any data-warehouse using time bounds. PhD thesis, University of Coimbra, February 2016

    Google Scholar 

  23. Martins, P., Abbasi, M., Furtado, P.: Data-warehouse ETL+Q auto-scale framework. Int. J. Bus. Intell. Syst. Eng. 1(1), 49–76 (2015)

    Google Scholar 

  24. Martins, P., Abbasi, M., Furtado, P.: AutoScale: automatic ETL scale process. In: 19th East European Conference on Advances in Databases and Information Systems (2015)

    Google Scholar 

  25. Martins, P., Abbasi, M., Furtado, P.: Preparing a full auto- scale framework for data-warehouse ETL+Q. In: IEEE Big data Congress 2015, New York, USA (2015)

    Google Scholar 

  26. Martins, P., Abbasi, M., Furtado, P.: AScale: automatically scaling the ETL+Q process for performance. Int. J. Bus. Process Integr. Manage. 7(4), 300–313 (2015)

    Article  Google Scholar 

  27. Marz, N., Warren, J.: Big Data: principles and best practices of scalable realtime data systems, 1st Manning Publications Co. Greenwich, CT, USA ©2015 (2015), ISBN:1617290343 9781617290343

    Google Scholar 

  28. O’Neil, P., O’Neil, E., Chen, X.: Star schema benchmark - revision 3. Technical report, UMass/Boston (2009)

    Google Scholar 

  29. Waas, F., Wrembel, R., Freudenreich, T., Thiele, M., Koncilia, C., Furtado, P.: On-demand ELT architecture for right-time BI: extending the vision. Int. J. Data Warehous. Mining 9(2), 21–38 (2013)

    Article  Google Scholar 

  30. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI 2012). USENIX Association, Berkeley, CA, USA, p. 2 (2012)

    Google Scholar 

  31. Zhao, J., Pjesivac-Grbovic, J.: MapReduce, the programming model and practice. In: Sigmetrics/Performance 2009, Tutorials, 19 June 2009

    Google Scholar 

  32. Spark Homepage. http://spark.apache.org/. Accessed Jul 2016

  33. Spark SQL homepage. http://spark.apache.org/sql/. Accessed Jul 2016

  34. Parquet File Format. https://parquet.apache.org/. Accessed Jul 2016

  35. Spark Streaming. http://spark.apache.org/streaming/. Accessed Jul 2016

  36. Kafka homepage http://kafka.apache.org/. Accessed Jul 2016

  37. CassandraTM Homepage. http://cassandra.apache.org/. Accessed Jul 2016

  38. TCP Council homepage. www.tpc.org. Accessed Jul 2016

  39. Snijders, C., Matzat, U., Reips, U.-D.: ‘Big Data’: Big gaps of knowledge in the field of Internet. Int. J. Internet Sci. 7, 1–5 (2012)

    Google Scholar 

  40. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U.: The rise big data on cloud computing: review and open research issues. Information Systems 47, 98–115 (2015)

    Article  Google Scholar 

  41. “Data, data everywhere”. The Economist. 25 February 2010. Retrieved 2 December 2016

    Google Scholar 

  42. “Supercomputing the Climate: NASA’s Big Data Mission”. CSC World. Computer Sciences Corporation. Retrieved 2 December 2016

    Google Scholar 

  43. “DNAstack tackles massive, complex DNA datasets with Google Genomics”. Google Cloud Platform. Retrieved 20 December 2016

    Google Scholar 

  44. Mirkes, E.M., Coats, T.J., Levesley, J., Gorban, A.N.: Handling missing data in large healthcare dataset: a case study of unknown trauma outcomes. Comput. Biol. Med. 75, 203–216 (2016)

    Article  Google Scholar 

  45. Brewer, E.: CAP twelve years later: how the “rules” have changed. Computer 45(2), 23–29 (2012)

    Article  Google Scholar 

  46. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data, OSDI 2006: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006

    Google Scholar 

  47. Google File System and BigTable, Radar (World Wide Web log), Database War Stories (7). O’Reilly, May 2006

    Google Scholar 

  48. Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40(1), 1 (2008)

    Article  Google Scholar 

  49. Robinson, I., Webber, J., Eifrem, E.: Graph Databases. 2nd edn. O’Reilly Media (2015)

    Google Scholar 

  50. Cattell, R.: Scalable SQL and NoSQL data stores. SIGMOD Rec. 39(4), 12–27 (2010)

    Article  Google Scholar 

  51. Han, J., et al.: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications (ICPCA). IEEE (2011)

    Google Scholar 

  52. Grolinger, K., et al.: Data management in cloud environments: NoSQL and NewSQL data stores. J. Cloud Comput. Adv. Syst. Appl. 2(1), 1 (2013)

    Article  Google Scholar 

  53. Graefe, G.: Volcano - an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. 6(1), 120–135 (1994)

    Article  Google Scholar 

  54. Kossmann, D.: Distributed query processing approaches. In: ACM Computing Surveys (CSUR) (2000)

    Google Scholar 

  55. Deshpande, A., Ives, Z., Raman, V.: Adaptive query processing. Found. Trends Databases 1(1), 1–140 (2007)

    Article  Google Scholar 

  56. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  57. Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Pedro Furtado .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Furtado, P. (2017). Scalability and Realtime on Big Data, MapReduce, NoSQL and Spark. In: Marcel, P., Zimányi, E. (eds) Business Intelligence. eBISS 2016. Lecture Notes in Business Information Processing, vol 280. Springer, Cham. https://doi.org/10.1007/978-3-319-61164-8_4

Download citation

Publish with us

Policies and ethics