Abstract
In recent years, the rapid development of the Internet of Things and sensor networks has caused explosive growth in time series data. OpenTSDB and other emerging systems have begun to use Hadoop and HBase to store massive time series data, and how to query and mine time series data on these platforms has become a research hotspot. As a typical distance measure for time series, the correlation coefficient is widely used in various applications. However, computing the correlation coefficient of long time series on HBase in real time requires a large amount of I/O and network transmission, so it cannot be applied to interactive queries. To address this problem, we present two methods for estimating the correlation coefficient of two sequences on HBase. We first propose DCE, a fast algorithm for estimating the upper and lower bounds of the correlation coefficient. To further reduce the I/O cost, we extend DCE into the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly calculate the correlation coefficient of long time series.
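To make the I/O cost mentioned above concrete, the following is a minimal sketch of how the exact Pearson correlation coefficient can be computed from six per-segment sums that merge by simple addition. It is not the DCE or ADCE algorithm, whose details are not given in the abstract; it only illustrates the baseline quantity those algorithms estimate. The names `Stats`, `segment_stats`, and `pearson` are hypothetical, and the code assumes two aligned sequences held in memory rather than rows read from HBase.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for Pearson correlation over one segment (hypothetical helper)."""
    n: int = 0
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x*x
    syy: float = 0.0  # sum of y*y
    sxy: float = 0.0  # sum of x*y

    def merge(self, other: "Stats") -> "Stats":
        # Statistics of adjacent segments combine by addition,
        # so each segment needs to be scanned only once.
        return Stats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.syy + other.syy,
            self.sxy + other.sxy,
        )


def segment_stats(xs, ys) -> Stats:
    """Scan one aligned segment of both sequences and accumulate the six sums."""
    if len(xs) != len(ys):
        raise ValueError("segments must have equal length")
    s = Stats()
    for x, y in zip(xs, ys):
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.syy += y * y
        s.sxy += x * y
    return s


def pearson(s: Stats) -> float:
    """Exact Pearson correlation coefficient from merged statistics."""
    var_x = s.n * s.sxx - s.sx * s.sx
    var_y = s.n * s.syy - s.sy * s.sy
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (s.n * s.sxy - s.sx * s.sy) / math.sqrt(var_x * var_y)
```

Merging `segment_stats([1, 2], [1, 3])` with `segment_stats([3, 4], [2, 4])` and passing the result to `pearson` gives the same value as a single pass over the full sequences. Even so, every raw point still has to be read once, which is the cost that motivates estimating bounds instead of computing the exact value.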
Acknowledgements
This work was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01B09).
Author information
Additional information
Wen Liu received the BS degree in computer science from Xinjiang Normal University, China in 2004 and the MS degree in computer science from Dalian University of Technology, China in 2009. He is currently a PhD student in computer science at Dalian University of Technology, China. His research interests include databases, stream data processing, and cloud computing.
Tuqian Zhang received the BS degree in electronic information science from West Anhui University, China in 2008 and the MS degree in information systems from Xinjiang Agricultural University, China in 2012. He is currently pursuing a second MS degree in computer science at Dalian University of Technology, China. His research interests include databases and cloud computing.
Yanming Shen received the BS degree in automation from Tsinghua University, China in 2000 and the PhD degree from the Department of Electrical and Computer Engineering at the NYU Polytechnic School of Engineering in 2007. He is a professor with the School of Computer Science and Technology, Dalian University of Technology, China. His general research interests include packet switch design, data center networks, cloud computing, and distributed systems. He is a recipient of the 2011 Best Paper Award for Multimedia Communications (awarded by the IEEE Communications Society).
Peng Wang received the BS degree in mathematics from Nankai University, China in 2001 and the PhD degree in computer science from Fudan University, China in 2007. He is currently an associate professor in the School of Computer Science, Fudan University, China. His research interests include databases and stream data processing.
Cite this article
Liu, W., Zhang, T., Shen, Y. et al. Fast correlation coefficient estimation algorithm for HBase-based massive time series data. Front. Comput. Sci. 13, 864–878 (2019). https://doi.org/10.1007/s11704-018-6308-9
DOI: https://doi.org/10.1007/s11704-018-6308-9