Optimizing Queue-Based Semi-Stream Joins with Indexed Master Data

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8646)

Abstract

In Data Stream Management Systems (DSMS), semi-stream processing has become a popular area of research due to the high demand of applications for up-to-date information (e.g., in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as a semi-stream join. This join typically works under the constraint of limited main memory, which is generally not large enough to hold all of the disk-based master data. Many semi-stream joins use a queue of stream tuples to amortize the disk access to the master data, and use an index to allow directed access to master data, which avoids loading unnecessary master data. In such a situation, the question arises which master data partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the master data index. Existing algorithms use simple strategies that are safe and correct, but they are not optimal in the sense of maximizing the join service rate. In this paper we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data. We show that a good selection strategy can improve the performance of a semi-stream join significantly, both for synthetic and real data sets with known skewed distributions.
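
The lookup-element question raised in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm from the paper. It assumes, purely for illustration, that stream tuples are integer join keys, that master data is split into fixed-size key-range partitions (PARTITION_SIZE, partition_of), and that one index probe loads exactly one partition. The two strategies shown (pick_oldest and pick_most_matching) are hypothetical examples of a simple choice and a skew-aware choice.

```python
from collections import Counter, deque

# Assumption for illustration: each master data partition holds this many keys.
PARTITION_SIZE = 4


def partition_of(key):
    """Map a join key to the id of the master data partition that holds it."""
    return key // PARTITION_SIZE


def pick_oldest(queue):
    """Simple strategy: use the oldest queued tuple as the lookup element."""
    return queue[0]


def pick_most_matching(queue):
    """Skew-aware strategy: use the first tuple of the partition that
    matches the largest number of queued tuples."""
    counts = Counter(partition_of(key) for key in queue)
    best_partition = max(counts, key=counts.get)
    return next(key for key in queue if partition_of(key) == best_partition)


def probe(queue, lookup_key):
    """Simulate one indexed disk access: load the partition of lookup_key
    and join every queued tuple that falls into it.

    Returns the joined keys and the tuples that stay in the queue.
    """
    target = partition_of(lookup_key)
    joined = [key for key in queue if partition_of(key) == target]
    remaining = deque(key for key in queue if partition_of(key) != target)
    return joined, remaining


# A skewed queue: four tuples hit partition 2, two tuples hit partition 0.
queue = deque([1, 9, 10, 8, 11, 2])

joined_oldest, _ = probe(queue, pick_oldest(queue))
joined_greedy, _ = probe(queue, pick_most_matching(queue))

print(len(joined_oldest))  # tuples served by one probe, oldest-first
print(len(joined_greedy))  # tuples served by one probe, skew-aware
```

In this toy queue, the oldest-first choice serves two tuples with one disk access, while the skew-aware choice serves four. The sketch ignores the cost of counting and the risk that old tuples wait indefinitely, both of which a real selection strategy would have to address.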

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Asif Naeem, M., Weber, G., Lutteroth, C., Dobbie, G. (2014). Optimizing Queue-Based Semi-Stream Joins with Indexed Master Data. In: Bellatreche, L., Mohania, M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2014. Lecture Notes in Computer Science, vol 8646. Springer, Cham. https://doi.org/10.1007/978-3-319-10160-6_16

  • DOI: https://doi.org/10.1007/978-3-319-10160-6_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10159-0

  • Online ISBN: 978-3-319-10160-6

  • eBook Packages: Computer Science, Computer Science (R0)
