Abstract
This paper proposes and experimentally assesses a rewrite/merge approach for supporting real-time data warehousing via lightweight data integration. Real-time data warehouses are becoming increasingly relevant, driven by emerging research challenges such as Big Data and Cloud Computing. Our contribution addresses the limitations of current data warehousing architectures, which are not suitable for performing classical operations (e.g., loading, aggregation, indexing, and OLAP query answering) under real-time constraints. The proposed approach is based on the intelligent manipulation of the SQL statements of input queries, which are decomposed into suitable sub-queries (the rewrite phase). These sub-queries are then submitted as the final input queries to an ad hoc component responsible for cooperative query answering via a method inspired by parallel query processing (the merge phase). This method results in a novel data warehousing framework in which the static phase is separated from the dynamic phase in order to achieve real-time processing. We complete our analytical contributions with an extensive experimental campaign in which we stress the performance of the proposed real-time data warehousing framework against a popular data warehouse benchmark and compare it with traditional architectures; the results confirm the benefits of our proposal.
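The abstract does not spell out the decomposition rules, so the following is only a minimal sketch of the general rewrite/merge idea, not the authors' implementation. It assumes two in-memory SQLite stores standing in for the static and dynamic parts of the warehouse, a hypothetical `sales(region, amount)` table, and an average query that is rewritten into SUM/COUNT sub-queries whose partial results can be merged. All function and table names are illustrative.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_store(rows):
    """Build a hypothetical store; one per warehouse component."""
    # check_same_thread=False lets a worker thread use this connection.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def rewrite(group_col, measure_col, table):
    """Rewrite phase (assumed form): AVG cannot be merged directly,
    so each sub-query returns SUM and COUNT partials per group."""
    return (
        f"SELECT {group_col}, SUM({measure_col}), COUNT({measure_col}) "
        f"FROM {table} GROUP BY {group_col}"
    )


def merge(partials):
    """Merge phase: combine per-store partials into final averages."""
    totals = {}
    for rows in partials:
        for key, part_sum, part_count in rows:
            acc = totals.setdefault(key, [0.0, 0])
            acc[0] += part_sum
            acc[1] += part_count
    return {key: s / c for key, (s, c) in totals.items()}


def answer_avg(stores, group_col, measure_col, table):
    """Run the same sub-query on every store in parallel, then merge."""
    sub_query = rewrite(group_col, measure_col, table)
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        partials = list(
            pool.map(lambda conn: conn.execute(sub_query).fetchall(), stores)
        )
    return merge(partials)


# Static store: historical data. Dynamic store: recently arrived data.
static_store = make_store([("north", 10.0), ("north", 20.0), ("south", 5.0)])
dynamic_store = make_store([("north", 30.0), ("south", 15.0)])

result = answer_avg([static_store, dynamic_store], "region", "amount", "sales")
print(result)
```

In this sketch the merged answer equals what a single AVG query would return over the union of both stores, which is the property a merge step of this kind has to preserve.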

Cite this article
Cuzzocrea, A., Ferreira, N. & Furtado, P. A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration. J Supercomput 76, 3898–3922 (2020). https://doi.org/10.1007/s11227-018-2707-9