
Efficient processing of streaming updates with archived master data in near-real-time data warehousing

  • Regular Paper
Knowledge and Information Systems

Abstract

To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions are common, such as a stream of products sold in which certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements under typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, significantly reducing the disk access overhead. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN, taking several precautions to ensure that the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be achieved by non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
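The intuition behind keeping part of the master data in memory can be illustrated with a small simulation. The sketch below is not an implementation of MESHJOIN or X-HYBRIDJOIN; it is a toy model under stated assumptions: the stream keys follow a Zipf-like distribution, key 0 is the most frequent, every lookup outside the in-memory part counts as one disk access, and all names (`zipf_stream`, `join_with_hot_cache`, `hot_keys`) are hypothetical.

```python
import random


def zipf_stream(num_keys, length, exponent=1.0, seed=0):
    """Return `length` join keys in 0..num_keys-1 with Zipf-like skew.

    Assumption: key 0 is the most frequent, key 1 the next, and so on.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=length)


def join_with_hot_cache(stream, master, hot_keys):
    """Join each stream key with its master row, counting simulated disk accesses.

    Rows whose keys are in `hot_keys` are held in memory permanently;
    every other lookup is counted as one disk access (a simplification).
    """
    hot_part = {k: master[k] for k in hot_keys}
    disk_accesses = 0
    output = []
    for key in stream:
        if key in hot_part:
            row = hot_part[key]      # served from memory
        else:
            disk_accesses += 1       # would require reading from disk
            row = master[key]
        output.append((key, row))
    return output, disk_accesses


# Hypothetical master data: 1000 products, stream of 10000 sales.
master = {k: f"product-{k}" for k in range(1000)}
stream = zipf_stream(num_keys=1000, length=10000)

# Baseline: nothing is kept in memory.
_, no_cache = join_with_hot_cache(stream, master, hot_keys=[])

# Keep the 50 most frequent keys (5% of the master data) in memory.
output, with_cache = join_with_hot_cache(stream, master, hot_keys=range(50))

print("disk accesses without in-memory part:", no_cache)
print("disk accesses with in-memory part:   ", with_cache)
```

Under these assumptions, holding a small fraction of the master data in memory removes a disproportionately large share of the disk lookups, because the skewed stream hits the same few keys repeatedly. With a uniform stream (for example `exponent=0`) the saving shrinks to roughly the cached fraction.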

Author information

Corresponding author

Correspondence to M. Asif Naeem.

About this article

Cite this article

Naeem, M.A., Dobbie, G. & Weber, G. Efficient processing of streaming updates with archived master data in near-real-time data warehousing. Knowl Inf Syst 40, 615–637 (2014). https://doi.org/10.1007/s10115-013-0653-7

  • DOI: https://doi.org/10.1007/s10115-013-0653-7
