A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration

The Journal of Supercomputing

Abstract

This paper proposes and experimentally assesses a rewrite/merge approach for supporting real-time data warehousing via lightweight data integration. Real-time data warehouses are becoming increasingly relevant, owing to emerging research challenges such as Big Data and Cloud Computing. Our contribution addresses the limitations of current data warehousing architectures, which are not suited to performing classical operations (e.g., loading, aggregation, indexing, OLAP query answering, and so forth) under real-time constraints. The proposed approach is based on the intelligent manipulation of the SQL statements of input queries, which are decomposed into suitable sub-queries (the rewrite phase). These sub-queries are then submitted as the final input queries to an ad hoc component responsible for cooperative query answering via a method inspired by parallel query processing (the merge phase). This method yields a novel data warehousing framework in which the static phase is separated from the dynamic phase in order to achieve real-time processing. We complete our analytical contributions with an extensive experimental campaign in which we stress the performance of the proposed real-time data warehousing framework against a popular data warehouse benchmark and compare it with traditional architectures; the results confirm the benefits of our proposal.
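
To make the rewrite and merge phases concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that the static and dynamic parts of the warehouse are two tables with the same schema (the hypothetical `sales_static` and `sales_dynamic`), that the input query is a simple grouped SUM/COUNT aggregate, and that an in-memory SQLite database stands in for the warehouse. All function and table names are invented for this example.

```python
import sqlite3

# Hypothetical layout (an assumption, not taken from the paper): one fact
# table split into a static part (bulk-loaded history) and a dynamic part
# (recent real-time inserts), both with the same columns.
STATIC_TABLE = "sales_static"
DYNAMIC_TABLE = "sales_dynamic"


def rewrite(group_col, measure_col):
    """Rewrite phase: emit one sub-query per part, each returning partial aggregates."""
    template = (
        "SELECT {g}, SUM({m}) AS partial_sum, COUNT(*) AS partial_count "
        "FROM {t} GROUP BY {g}"
    )
    # Identifiers are interpolated directly for brevity; real code should
    # validate them against a whitelist of known column and table names.
    return [
        template.format(g=group_col, m=measure_col, t=table)
        for table in (STATIC_TABLE, DYNAMIC_TABLE)
    ]


def merge(partial_results):
    """Merge phase: combine per-part SUM/COUNT pairs into one total per group."""
    merged = {}
    for rows in partial_results:
        for key, partial_sum, partial_count in rows:
            total, count = merged.get(key, (0, 0))
            merged[key] = (total + partial_sum, count + partial_count)
    return merged


def answer(conn, group_col, measure_col):
    """Run every sub-query, then merge the partial answers."""
    sub_queries = rewrite(group_col, measure_col)
    partials = [conn.execute(query).fetchall() for query in sub_queries]
    return merge(partials)


def demo():
    """Build a tiny in-memory warehouse and answer one aggregate query."""
    conn = sqlite3.connect(":memory:")
    for table in (STATIC_TABLE, DYNAMIC_TABLE):
        conn.execute(f"CREATE TABLE {table} (region TEXT, amount INTEGER)")
    conn.executemany(
        f"INSERT INTO {STATIC_TABLE} VALUES (?, ?)",
        [("north", 10), ("south", 20), ("north", 5)],
    )
    conn.executemany(
        f"INSERT INTO {DYNAMIC_TABLE} VALUES (?, ?)",
        [("north", 1), ("east", 7)],
    )
    result = answer(conn, "region", "amount")
    conn.close()
    return result
```

SUM and COUNT are used here because their partial results can be combined by simple addition; an average would be derived after merging as total divided by count rather than by averaging the partial averages.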

Author information

Corresponding author

Correspondence to Alfredo Cuzzocrea.

About this article

Cite this article

Cuzzocrea, A., Ferreira, N. & Furtado, P. A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration. J Supercomput 76, 3898–3922 (2020). https://doi.org/10.1007/s11227-018-2707-9

  • DOI: https://doi.org/10.1007/s11227-018-2707-9
