TY - JOUR
T1 - The implementation of data storage and analytics platform for big data lake of electricity usage with spark
AU - Yang, Chao Tung
AU - Chen, Tzu Yang
AU - Kristiani, Endah
AU - Wu, Shyhtsun Felix
N1 - Publisher Copyright: © 2020, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2021/6
Y1 - 2021/6
N2 - Smart meters generate a large number of electricity usage records every day. Traditional architectures may not properly handle such increasingly dynamic data, which require flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop was used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka was used as the input source for Spark to stream data to Apache HBase, to ensure the integrity of the streaming data. To integrate the data, we apply the data lake principle to Hive and HBase; as search engines for Hive and HBase, Apache Impala and Apache Phoenix are used, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
AB - Smart meters generate a large number of electricity usage records every day. Traditional architectures may not properly handle such increasingly dynamic data, which require flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop was used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka was used as the input source for Spark to stream data to Apache HBase, to ensure the integrity of the streaming data. To integrate the data, we apply the data lake principle to Hive and HBase; as search engines for Hive and HBase, Apache Impala and Apache Phoenix are used, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85095992049&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85095992049&partnerID=8YFLogxK
U2 - 10.1007/s11227-020-03505-6
DO - 10.1007/s11227-020-03505-6
M3 - Article
AN - SCOPUS:85095992049
SN - 0920-8542
VL - 77
SP - 5934
EP - 5959
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 6
ER -