
The implementation of data storage and analytics platform for big data lake of electricity usage with spark

The Journal of Supercomputing

Abstract

Smart meters generate a large number of electricity usage records every day. A traditional architecture may not properly handle such increasingly dynamic data, which demands flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop is used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka is used as the input source for Spark, which streams data to Apache HBase to ensure the integrity of the streaming data. To integrate the data, we apply the Data Lake principle to Hive and HBase; as search engines for Hive and HBase, Apache Impala and Apache Phoenix are used, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a comparison of usage predictions is presented using the Holt-Winters algorithm.
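
The abstract names the Holt-Winters algorithm for usage prediction but does not give its formulation. As an illustration only, the following is a minimal sketch of the additive variant (triple exponential smoothing) in plain Python. The function name, the parameter names, the initialization scheme, and the sample readings are assumptions made for this example; they are not taken from the paper and do not represent the authors' implementation.

```python
from typing import List, Sequence


def holt_winters_additive(
    series: Sequence[float],
    season_len: int,
    alpha: float,
    beta: float,
    gamma: float,
    horizon: int,
) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    alpha, beta and gamma are the smoothing factors (each in [0, 1]) for
    the level, trend and seasonal components respectively.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    # Assumed initialization: level = mean of the first season.
    level = sum(series[:season_len]) / season_len
    # Trend = average per-step change between the first two season means.
    second_mean = sum(series[season_len:2 * season_len]) / season_len
    trend = (second_mean - level) / season_len
    # Seasonal offsets = deviation of each first-season point from the level.
    seasonal = [series[i] - level for i in range(season_len)]

    # The first season was used for initialization; smooth over the rest.
    for t in range(season_len, len(series)):
        value = series[t]
        idx = t % season_len
        last_level = level
        level = alpha * (value - seasonal[idx]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[idx] = gamma * (value - level) + (1 - gamma) * seasonal[idx]

    # Forecast: extrapolate level and trend, then add the seasonal offset
    # that belongs to each future time step.
    n = len(series)
    return [
        level + h * trend + seasonal[(n - 1 + h) % season_len]
        for h in range(1, horizon + 1)
    ]


# Hypothetical readings with a repeating four-step usage pattern.
readings = [10.0, 20.0, 30.0, 20.0] * 3
forecast = holt_winters_additive(readings, season_len=4,
                                 alpha=0.5, beta=0.5, gamma=0.5, horizon=4)
```

For a series that repeats exactly and has no trend, as in the hypothetical readings above, this sketch reproduces the seasonal pattern in its forecast. Real meter data would need smoothing factors fitted to the data, for example by minimizing the forecast error on a held-out period.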





Author information


Corresponding author

Correspondence to Chao-Tung Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is the result of a research project funded by the Ministry of Science and Technology (MOST), Taiwan R.O.C., under Grant Numbers 109-2221-E-029-020-, 109-2621-M-029-002- and 109-2119-M-029-001-A.


About this article


Cite this article

Yang, CT., Chen, TY., Kristiani, E. et al. The implementation of data storage and analytics platform for big data lake of electricity usage with spark. J Supercomput 77, 5934–5959 (2021). https://doi.org/10.1007/s11227-020-03505-6


  • DOI: https://doi.org/10.1007/s11227-020-03505-6
