TinyLFU-based semi-stream cache join for near-real-time data warehousing

  • Focus
  • Journal: Soft Computing

Abstract

Semi-stream join is an emerging research problem in near-real-time data warehousing. A semi-stream join is a join between a fast stream (S) and a slow disk-based relation (R). Huge volumes of data are now generated every day and must be analyzed immediately to support business decisions. To address this need, a well-known algorithm called CACHEJOIN (Cache Join) was proposed. The limitation of CACHEJOIN is that it does not deal efficiently with frequently changing trends in the stream data. To overcome this limitation, we propose TinyLFU-CACHEJOIN, a modified version of the original CACHEJOIN algorithm designed to improve its performance. TinyLFU-CACHEJOIN keeps in the cache only those records of R that have a high hit rate in S, which allows it to cope with sudden and abrupt trend changes in S. We developed a cost model for TinyLFU-CACHEJOIN and validated it empirically. We also compared the performance of TinyLFU-CACHEJOIN with that of the existing CACHEJOIN algorithm on a skewed synthetic dataset. In these experiments, TinyLFU-CACHEJOIN significantly outperformed CACHEJOIN.
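
The idea in the abstract, keeping in the cache only those records of R that are hit often by S, can be illustrated with a minimal Python sketch. It is not the authors' implementation, which is not available on this page. It assumes a simplified setting: R is a plain dictionary standing in for the disk-based relation, each stream tuple is a (key, payload) pair, and admission follows the general TinyLFU rule of admitting a new record only when its estimated frequency exceeds that of the eviction victim. The names FrequencySketch and semi_stream_join, and the parameters width, depth and sample_size, are hypothetical.

```python
import hashlib
from collections import OrderedDict


class FrequencySketch:
    """Approximate per-key counts (count-min style) with periodic halving.

    Halving ages old counts so that a change of trend in the stream can
    displace records that were popular earlier.
    """

    def __init__(self, width=256, depth=4, sample_size=1000):
        self.width = width
        self.depth = depth
        self.sample_size = sample_size  # assumed aging period, in recorded keys
        self.rows = [[0] * width for _ in range(depth)]
        self.additions = 0

    def _slots(self, key):
        # One 4-byte chunk of the digest selects the column in each row.
        digest = hashlib.blake2b(
            repr(key).encode(), digest_size=4 * self.depth
        ).digest()
        for row in range(self.depth):
            chunk = digest[4 * row:4 * row + 4]
            yield row, int.from_bytes(chunk, "big") % self.width

    def record(self, key):
        for row, col in self._slots(key):
            self.rows[row][col] += 1
        self.additions += 1
        if self.additions >= self.sample_size:
            # Aging step: halve every counter.
            self.rows = [[c // 2 for c in row] for row in self.rows]
            self.additions //= 2

    def estimate(self, key):
        return min(self.rows[row][col] for row, col in self._slots(key))


def semi_stream_join(stream, relation, cache_capacity, sketch=None):
    """Join stream tuples with `relation`, caching frequently hit records.

    `relation` is a dict standing in for the disk-based relation R
    (an assumption of this sketch). Returns (results, stats).
    """
    sketch = sketch if sketch is not None else FrequencySketch()
    cache = OrderedDict()  # key -> record, least recently used first
    results = []
    stats = {"hits": 0, "misses": 0}

    for key, payload in stream:
        sketch.record(key)
        if key in cache:
            stats["hits"] += 1
            cache.move_to_end(key)
            results.append((key, payload, cache[key]))
            continue

        stats["misses"] += 1
        record = relation.get(key)  # stands in for a disk probe of R
        if record is None:
            continue  # no matching record in R
        results.append((key, payload, record))

        if cache_capacity <= 0:
            continue
        if len(cache) < cache_capacity:
            cache[key] = record
        else:
            victim = next(iter(cache))  # least recently used entry
            # Admission rule: replace the victim only if the new key
            # is estimated to be more frequent in the stream.
            if sketch.estimate(key) > sketch.estimate(victim):
                del cache[victim]
                cache[key] = record

    return results, stats
```

In this sketch a key that appears once in the stream cannot evict a record that has been matched many times, while the halving step lets formerly popular records lose their advantage after the trend in S shifts. The choice of a least-recently-used victim is an assumption made here for brevity.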

Data availability

Enquiries about data availability should be directed to the authors.

Funding

This research received no funding.

Author information

Contributions

MAN led this research: he proposed the idea and designed the architecture of the proposed approach. WW, as a master's student, implemented the algorithm and produced the initial performance results. FM contributed to performance tuning and proofread the paper. AT helped write up the paper.

Corresponding author

Correspondence to M. Asif Naeem.

Ethics declarations

Conflict of interest

The authors have no conflict of interest with any editorial member of the journal.

Ethics approval

The research presented in the paper has no human involvement, and therefore no ethical approval is required.

Consent for publication

The authors consent to the publication of their work in this journal.

Additional information

Communicated by Sara Shahzad.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Naeem, M.A., Waqar, W., Mirza, F. et al. TinyLFU-based semi-stream cache join for near-real-time data warehousing. Soft Comput 26, 11091–11103 (2022). https://doi.org/10.1007/s00500-022-07475-0

  • DOI: https://doi.org/10.1007/s00500-022-07475-0

Keywords