Efficient processing of streaming updates with archived master data in near-real-time data warehousing

Naeem, M. Asif; Dobbie, Gillian; Weber, Gerald

doi:10.1007/s10115-013-0653-7

Regular Paper
Published: 05 May 2013

Volume 40, pages 615–637, (2014)

Knowledge and Information Systems

M. Asif Naeem¹,
Gillian Dobbie² &
Gerald Weber²

Abstract

To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, such as a stream of products sold, where certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper, we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN, taking several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
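The idea in the abstract, joining a skewed update stream with disk-based master data while keeping part of that master data in memory, can be illustrated with a toy sketch. This is not the authors' X-HYBRIDJOIN or MESHJOIN implementation: the function `semi_stream_join`, its parameters `hot_keys`, `page_size` and `batch_size`, and the sample data are all hypothetical, and the "disk" is simulated by a list of key pages whose reads are merely counted.

```python
def semi_stream_join(stream, master, hot_keys=(), page_size=2, batch_size=4):
    """Toy semi-stream join; returns (joined_rows, disk_page_reads)."""
    # Hypothetical "pinned" part of the master data, held in memory permanently.
    pinned = {k: master[k] for k in hot_keys}
    # Everything else stands in for disk-resident master data, split into pages.
    cold = sorted(k for k in master if k not in pinned)
    pages = [cold[i:i + page_size] for i in range(0, len(cold), page_size)]

    joined, pending, page_reads = [], [], 0

    def flush():
        nonlocal page_reads
        # One scan of the simulated disk pages serves the whole batch of
        # pending tuples, spreading the read cost over many stream tuples.
        for page in pages:
            page_reads += 1
            page_set = set(page)
            joined.extend((k, master[k]) for k in pending if k in page_set)
        pending.clear()

    for key in stream:
        if key in pinned:
            joined.append((key, pinned[key]))  # answered without touching disk
        else:
            pending.append(key)
            if len(pending) == batch_size:
                flush()
    if pending:
        flush()
    return joined, page_reads


# Skewed stream (made-up data): product 1 is sold far more often than the rest.
master = {k: f"product-{k}" for k in range(1, 9)}
stream = [1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 5, 1, 1, 6, 1, 7, 1, 8, 1, 1]

rows_plain, reads_plain = semi_stream_join(stream, master)
rows_skew, reads_skew = semi_stream_join(stream, master, hot_keys=[1])
print("page reads without pinning:", reads_plain)
print("page reads with key 1 pinned:", reads_skew)
```

In this sketch both runs produce the same join rows, but the run that pins the most frequent key sends fewer tuples to the simulated disk scan. How much a real system gains depends on the actual skew and memory budget, which is what the paper's experiments and cost model address.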

Author information

Authors and Affiliations

School of Computing and Mathematical Sciences, Auckland University of Technology, Private Bag 92006, Auckland, New Zealand
M. Asif Naeem
Department of Computer Science, The University of Auckland, Private Bag 92019, Auckland, New Zealand
Gillian Dobbie & Gerald Weber

Corresponding author

Correspondence to M. Asif Naeem.

About this article

Cite this article

Naeem, M.A., Dobbie, G. & Weber, G. Efficient processing of streaming updates with archived master data in near-real-time data warehousing. Knowl Inf Syst 40, 615–637 (2014). https://doi.org/10.1007/s10115-013-0653-7

Received: 20 November 2011
Revised: 15 February 2013
Accepted: 14 April 2013
Published: 05 May 2013
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10115-013-0653-7
