Abstract
Smart meters generate a large number of electricity usage records every day. Traditional architectures may not properly handle such increasingly dynamic data, which demand flexibility. Effective storage and analytics require an efficient architecture that accommodates much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop is used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka is used as the input source for Spark, which streams data to Apache HBase, to ensure the integrity of the streaming data. To integrate the data, Hive and HBase are combined following the data lake principle, with Apache Impala and Apache Phoenix used as the query engines for Hive and HBase, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a comparison of usage predictions is presented using the Holt-Winters algorithm.
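As an illustration of the usage prediction step mentioned above, the following is a minimal sketch of additive Holt-Winters (triple exponential smoothing) in plain Python. The abstract does not say which variant (additive or multiplicative), which smoothing parameters, or which library the authors used, so the function name `holt_winters_additive`, its parameters, and the simple initialization scheme are all assumptions made for illustration only.

```python
from typing import List, Sequence


def holt_winters_additive(
    series: Sequence[float],
    season_len: int,
    alpha: float,
    beta: float,
    gamma: float,
    horizon: int,
) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters.

    Hypothetical helper for illustration; not the authors' implementation.
    alpha, beta and gamma smooth the level, trend and seasonal terms.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    # Initial level: mean of the first season.
    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    # Initial trend: average per-step change between the first two seasons.
    trend = (sum(second) - sum(first)) / (season_len ** 2)
    # Initial seasonal terms: deviation of each point from the first-season mean.
    seasonal = [x - level for x in first]

    # Update level, trend and seasonal terms over the remaining observations.
    for t in range(season_len, len(series)):
        value = series[t]
        idx = t % season_len
        prev_level = level
        level = alpha * (value - seasonal[idx]) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[idx] = gamma * (value - level) + (1 - gamma) * seasonal[idx]

    # Extrapolate: level plus h steps of trend plus the matching seasonal term.
    n = len(series)
    return [
        level + h * trend + seasonal[(n + h - 1) % season_len]
        for h in range(1, horizon + 1)
    ]


if __name__ == "__main__":
    # Made-up example data: three repetitions of a four-step usage pattern.
    usage = [10.0, 20.0, 30.0, 40.0] * 3
    print(holt_winters_additive(usage, season_len=4, alpha=0.5, beta=0.5,
                                gamma=0.5, horizon=4))
```

For meter data the season length would typically match the dominant cycle of the readings (for example, 24 for hourly data with a daily pattern), and the smoothing parameters would normally be fitted by minimizing forecast error rather than fixed by hand.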
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This document is the result of a research project funded by the Ministry of Science and Technology (MOST), Taiwan R.O.C., under Grant Numbers 109-2221-E-029-020-, 109-2621-M-029-002- and 109-2119-M-029-001-A.
About this article
Cite this article
Yang, CT., Chen, TY., Kristiani, E. et al. The implementation of data storage and analytics platform for big data lake of electricity usage with spark. J Supercomput 77, 5934–5959 (2021). https://doi.org/10.1007/s11227-020-03505-6
DOI: https://doi.org/10.1007/s11227-020-03505-6