R-MESHJOIN for near-real-time data warehousing

Published: 30 October 2010
DOI:10.1145/1871940.1871952

Abstract

To meet the growing business demand for up-to-date information, data integration approaches are moving towards real-time updates. An important element of real-time data integration is the join of a continuous incoming data stream with a disk-based relation. In this paper we investigate a stream-based join algorithm called mesh join (MESHJOIN) and propose an improved version called reduced MESHJOIN (R-MESHJOIN). Both algorithms tune memory by allocating parts of it to key components. In MESHJOIN there is a dependency between the size of the partitions in an internal queue for the stream data and the number of iterations required to bring the disk-based relation into memory. This dependency hampers the optimal distribution of memory among the join components. In particular, the size of the disk buffer varies with the size of the disk-based relation, which is unnecessary. R-MESHJOIN removes this dependency, which enables an optimal distribution of the available memory among the join components: a change in the size of the disk-based relation does not affect the size of the disk buffer. An experimental study is conducted to validate these arguments.
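
The coupling described in the abstract can be illustrated with a minimal, simplified sketch of a mesh-style join. It is not the authors' implementation: the function name `mesh_join`, its parameters, and the in-memory list standing in for the disk-based relation are all assumptions made for illustration. The relation is scanned cyclically in chunks of `disk_buffer_size` rows, and each batch of stream tuples stays in the queue for exactly `k` iterations, where `k` is the number of chunks needed to cover the relation.

```python
from collections import defaultdict, deque
from itertools import islice


def mesh_join(stream, relation, disk_buffer_size, batch_size):
    """Join (key, payload) stream tuples with (key, value) relation rows.

    Simplified illustration only: `relation` is a list standing in for
    the disk-based relation, and all names here are hypothetical.
    """
    if not relation:
        return []
    # One chunk is what fits in the disk buffer during one iteration.
    chunks = [relation[i:i + disk_buffer_size]
              for i in range(0, len(relation), disk_buffer_size)]
    k = len(chunks)              # iterations needed to see the whole relation
    queue = deque()              # at most k batches of stream tuples
    table = defaultdict(list)    # hash table over the queued stream tuples
    results = []
    stream_iter = iter(stream)
    iteration = 0
    while True:
        # A batch that has met all k chunks is expired from queue and table.
        if len(queue) == k:
            for key, payload in queue.popleft():
                table[key].remove(payload)
        batch = list(islice(stream_iter, batch_size))
        # Stop once the stream is exhausted and no tuples are still queued.
        if not batch and not any(queue):
            break
        queue.append(batch)
        for key, payload in batch:
            table[key].append(payload)
        # Load the next chunk cyclically and probe the hash table with it.
        for key, value in chunks[iteration % k]:
            for payload in table.get(key, ()):
                results.append((key, payload, value))
        iteration += 1
    return results
```

In this sketch the queue always holds `k` partitions, and `k` is fixed by the relation size divided by `disk_buffer_size`. With a fixed memory budget for the queue, changing the relation size or the disk buffer therefore changes the partition size, which is one way to read the dependency that the abstract says R-MESHJOIN removes.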

Published In

DOLAP '10: Proceedings of the ACM 13th international workshop on Data warehousing and OLAP
October 2010
112 pages
ISBN:9781450303835
DOI:10.1145/1871940

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data warehousing foundations and architectures
  2. performance optimization and tuning
  3. real-time data warehouses

Qualifiers

  • Research-article

Conference

CIKM '10

Acceptance Rates

Overall Acceptance Rate 29 of 79 submissions, 37%

Cited By

  • (2021) An Efficient Data Access Approach With Queue and Stack in Optimized Hybrid Join. IEEE Access, 9 (41261-41274). DOI:10.1109/ACCESS.2021.3064202. Online publication date: 2021
  • (2020) Optimizing Semi-Stream CACHEJOIN for Near-Real-Time Data Warehousing. Journal of Database Management, 31:1 (20-37). DOI:10.4018/JDM.2020010102. Online publication date: 1-Jan-2020
  • (2020) Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse. Electronics, 9:8 (1299). DOI:10.3390/electronics9081299. Online publication date: 12-Aug-2020
  • (2020) Big Data Velocity Management–From Stream to Warehouse via High Performance Memory Optimized Index Join. IEEE Access, 8 (195370-195384). DOI:10.1109/ACCESS.2020.3033464. Online publication date: 2020
  • (2017) Optimization of cache-based semi-stream joins. 2017 IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) (76-81). DOI:10.1109/ICCCBDA.2017.7951887. Online publication date: Apr-2017
  • (2017) Skewed distributions in semi-stream joins. Information Systems, 64:C (63-74). DOI:10.1016/j.is.2016.09.007. Online publication date: 1-Mar-2017
  • (2017) A Multi-way Semi-stream Join for a Near-Real-Time Data Warehouse. Databases Theory and Applications (59-70). DOI:10.1007/978-3-319-68155-9_5. Online publication date: 20-Sep-2017
  • (2016) Business Intelligence Indicators. International Journal of Data Warehousing and Mining, 12:4 (75-98). DOI:10.4018/IJDWM.2016100104. Online publication date: 1-Oct-2016
  • (2016) A Cached-Based Stream-Relation Join Operator for Semi-Stream Data Processing. International Journal of Data Warehousing and Mining, 12:3 (14-31). DOI:10.4018/IJDWM.2016070102. Online publication date: Jul-2016
  • (2016) Optimising Queue-Based Semi-stream Joins by Introducing a Queue of Frequent Pages. Databases Theory and Applications (407-418). DOI:10.1007/978-3-319-46922-5_32. Online publication date: 21-Sep-2016
