The implementation of data storage and analytics platform for big data lake of electricity usage with spark

Yang, Chao-Tung; Chen, Tzu-Yang; Kristiani, Endah; Wu, Shyhtsun Felix

doi:10.1007/s11227-020-03505-6

Published: 13 November 2020

Volume 77, pages 5934–5959, (2021)

The Journal of Supercomputing

Chao-Tung Yang^1,2,3 (ORCID: orcid.org/0000-0002-9579-4426),
Tzu-Yang Chen^1,
Endah Kristiani^4,5 &
Shyhtsun Felix Wu^6

1363 Accesses
18 Citations

Abstract

Smart meters generate a large number of electricity records every day. Traditional architectures may not properly handle such increasingly dynamic data, which demand flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage built on Spark. Apache Sqoop is used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka serves as the input source for Spark, which streams data to Apache HBase to ensure the integrity of the streaming data. To integrate the data, we apply the Data Lake principle to Hive and HBase; as search engines for Hive and HBase, Apache Impala and Apache Phoenix are used, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
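
To illustrate the kind of forecasting the abstract refers to, the following is a minimal sketch of additive Holt-Winters (triple exponential smoothing) in plain Python. It is not the authors' implementation, which this page does not show. The function name `holt_winters_additive`, its parameters, the initialization scheme, and the sample readings are all assumptions made for illustration.

```python
from typing import List, Sequence


def holt_winters_additive(
    series: Sequence[float],
    season_len: int,
    alpha: float,
    beta: float,
    gamma: float,
    horizon: int,
) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    Hypothetical helper for illustration only. alpha, beta and gamma are the
    smoothing factors for level, trend and seasonality (each in [0, 1]).
    """
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Simple initialization (an assumption; other schemes exist):
    # level = mean of the first season,
    # trend = average per-step change between the first two seasons,
    # seasonal offsets = deviation of the first season from its mean.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [series[i] - level for i in range(m)]

    # Update the three components for every observation after the first season.
    for t in range(m, len(series)):
        y = series[t]
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y - level) + (1 - gamma) * s

    # Project level and trend forward and add the matching seasonal offset.
    n = len(series)
    return [
        level + h * trend + seasonal[(n - 1 + h) % m]
        for h in range(1, horizon + 1)
    ]


if __name__ == "__main__":
    # Made-up readings with a repeating four-step pattern.
    usage = [10.0, 20.0, 30.0, 40.0] * 3
    print(holt_winters_additive(usage, 4, 0.5, 0.5, 0.5, 4))
```

In practice the smoothing factors would be fitted to the data rather than fixed by hand, and a library implementation would normally be preferred over this sketch.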


Author information

Authors and Affiliations

Department of Computer Science, Tunghai University, Taichung City, 407224, Taiwan, ROC
Chao-Tung Yang & Tzu-Yang Chen
Research Center for Smart Sustainable Circular Economy, Tunghai University, No. 1727, Sec.4, Taiwan Boulevard, Taichung City, 407224, Taiwan, ROC
Chao-Tung Yang
Research Center for Nanotechnology, Tunghai University, No. 1727, Sec.4, Taiwan Boulevard, Taichung City, 407224, Taiwan, ROC
Chao-Tung Yang
Department of Industrial Engineering and Enterprise Information, Tunghai University, Taichung City, 407224, Taiwan, ROC
Endah Kristiani
Department of Informatics, Krida Wacana Christian University, Jakarta, 11470, Indonesia
Endah Kristiani
Department of Computer Science, University of California, Davis, CA, 95616, USA
Shyhtsun Felix Wu

Corresponding author

Correspondence to Chao-Tung Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This document is the result of a research project funded by the Ministry of Science and Technology (MOST), Taiwan R.O.C., under Grant Numbers 109-2221-E-029-020-, 109-2621-M-029-002- and 109-2119-M-029-001-A.

About this article

Cite this article

Yang, CT., Chen, TY., Kristiani, E. et al. The implementation of data storage and analytics platform for big data lake of electricity usage with spark. J Supercomput 77, 5934–5959 (2021). https://doi.org/10.1007/s11227-020-03505-6

Accepted: 29 October 2020
Published: 13 November 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s11227-020-03505-6
