Abstract
To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions occur; in a stream of products sold, for example, certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements under typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN, taking several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
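To give a feel for why data skew leaves room for improvement, the following minimal sketch estimates what share of stream tuples would match a small set of master-data records held permanently in memory. It is an illustration only, not the X-HYBRIDJOIN algorithm or its cost model: the Zipf-shaped key frequencies, the exponent values, the key counts, and the function names `zipf_weights` and `hot_hit_fraction` are all assumptions introduced here.

```python
def zipf_weights(n_keys, exponent=1.0):
    """Normalized Zipf probabilities for keys ranked 1..n_keys (assumed distribution)."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def hot_hit_fraction(n_keys, n_hot, exponent=1.0):
    """Expected share of stream tuples whose join key is among the
    n_hot most frequent master-data keys (the part assumed to stay in memory)."""
    weights = zipf_weights(n_keys, exponent)
    return sum(weights[:n_hot])


# Hypothetical sizes: 10,000 master-data keys, 100 of them kept in memory.
uniform = hot_hit_fraction(n_keys=10_000, n_hot=100, exponent=0.0)  # no skew
skewed = hot_hit_fraction(n_keys=10_000, n_hot=100, exponent=1.0)   # Zipf skew

print(f"uniform stream: {uniform:.2%} of tuples hit the in-memory part")
print(f"skewed stream:  {skewed:.2%} of tuples hit the in-memory part")
```

Under these assumed parameters, a uniform stream sends only 1% of its tuples to the in-memory part, while the skewed stream sends more than half of them there. Those tuples need no disk access at all, which is the kind of saving a non-adaptive join cannot exploit.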
About this article
Cite this article
Naeem, M.A., Dobbie, G. & Weber, G. Efficient processing of streaming updates with archived master data in near-real-time data warehousing. Knowl Inf Syst 40, 615–637 (2014). https://doi.org/10.1007/s10115-013-0653-7
DOI: https://doi.org/10.1007/s10115-013-0653-7