Abstract
Semi-stream join is an emerging research problem in near-real-time data warehousing. A semi-stream join is a join between a fast stream (S) and a slow disk-based relation (R). Huge amounts of data are now generated every day and must be analyzed immediately to support successful business decisions. With this in mind, a well-known algorithm called CACHEJOIN (Cache Join) was proposed. The limitation of CACHEJOIN is that it does not deal efficiently with frequently changing trends in stream data. To overcome this limitation, in this paper we propose TinyLFU-CACHEJOIN, a modified version of the original CACHEJOIN algorithm designed to improve its performance. TinyLFU-CACHEJOIN employs an intelligent strategy that keeps in the cache only those records of R that have a high hit rate in S. This mechanism allows TinyLFU-CACHEJOIN to cope with sudden and abrupt trend changes in S. We developed a cost model for TinyLFU-CACHEJOIN and validated it empirically. We also compared the performance of TinyLFU-CACHEJOIN with that of the existing CACHEJOIN algorithm on a skewed synthetic dataset. The experiments show that TinyLFU-CACHEJOIN significantly outperforms CACHEJOIN.
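The caching idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the general TinyLFU principle of admitting a record into a full cache when its estimated stream frequency exceeds that of the eviction victim. The names FrequencySketch, AdmissionCache and semi_stream_join, the sketch parameters, the LRU choice of victim, and the in-memory dict standing in for the disk-based relation R are all assumptions made for illustration. The disk phase of CACHEJOIN is reduced here to a direct lookup.

```python
import hashlib
from collections import OrderedDict


class FrequencySketch:
    """Small count-min sketch with periodic halving (hypothetical parameters)."""

    def __init__(self, width=256, depth=4, sample_size=1000):
        self.width = width
        self.depth = depth              # at most 8 with a 32-byte digest
        self.sample_size = sample_size
        self.rows = [[0] * width for _ in range(depth)]
        self.seen = 0

    def _slots(self, key):
        # Derive one column per row from disjoint 4-byte slices of a digest.
        digest = hashlib.sha256(repr(key).encode()).digest()
        for row in range(self.depth):
            chunk = digest[4 * row: 4 * row + 4]
            yield row, int.from_bytes(chunk, "big") % self.width

    def record(self, key):
        for row, col in self._slots(key):
            self.rows[row][col] += 1
        self.seen += 1
        if self.seen >= self.sample_size:
            # Aging: halve every counter so stale trends fade out.
            self.rows = [[c // 2 for c in r] for r in self.rows]
            self.seen //= 2

    def estimate(self, key):
        # Count-min estimate: never below the true count since the last halving.
        return min(self.rows[row][col] for row, col in self._slots(key))


class AdmissionCache:
    """Bounded cache of R records with a frequency-based admission test."""

    def __init__(self, capacity, sketch):
        self.capacity = capacity
        self.sketch = sketch
        self.store = OrderedDict()      # join key -> R record, in LRU order

    def get(self, key):
        self.sketch.record(key)         # every stream tuple updates the sketch
        if key in self.store:
            self.store.move_to_end(key)
            return self.store[key]
        return None

    def offer(self, key, record):
        if key in self.store or len(self.store) < self.capacity:
            self.store[key] = record
            return True
        victim = next(iter(self.store))  # least recently used key
        if self.sketch.estimate(key) > self.sketch.estimate(victim):
            del self.store[victim]
            self.store[key] = record
            return True
        return False                     # candidate is colder than the victim


def semi_stream_join(stream, relation, cache):
    """Join (key, payload) stream tuples with R; `relation` is a dict stand-in."""
    for key, payload in stream:
        record = cache.get(key)
        if record is None:
            record = relation.get(key)   # stands in for the disk lookup
            if record is None:
                continue                 # no matching record in R
            cache.offer(key, record)
        yield key, payload, record
```

In this sketch, a key that appears only once in S cannot displace a cached record that has matched several stream tuples, while the periodic halving lets a newly popular key overtake an old one after a trend change.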









Data availability
Enquiries about data availability should be directed to the authors.
Funding
This research received no funding.
Author information
Contributions
MAN had the leading role in this research: he presented the idea and designed the architecture of the proposed approach. WW, as a master's student, implemented the algorithm and produced the initial performance results. FM contributed to performance tuning and to proofreading the paper. AT helped write the paper.
Ethics declarations
Conflict of interest
The authors have no conflict of interest with any editorial member of the journal.
Ethics approval
The research presented in the paper has no human involvement, and therefore no ethical approval is required.
Consent for publication
The authors consent to the publication of their work in this journal.
Additional information
Communicated by Sara Shahzad.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Naeem, M.A., Waqar, W., Mirza, F. et al. TinyLFU-based semi-stream cache join for near-real-time data warehousing. Soft Comput 26, 11091–11103 (2022). https://doi.org/10.1007/s00500-022-07475-0
DOI: https://doi.org/10.1007/s00500-022-07475-0