Fast correlation coefficient estimation algorithm for HBase-based massive time series data

Liu, Wen; Zhang, Tuqian; Shen, Yanming; Wang, Peng

doi:10.1007/s11704-018-6308-9

Research Article
Published: 18 June 2019

Volume 13, pages 864–878, (2019)
Frontiers of Computer Science

Wen Liu^1,2, Tuqian Zhang^2, Yanming Shen^2 & Peng Wang^3


Abstract

In recent years, the rapid development of the Internet of Things and sensor networks has led to explosive growth in time series data. OpenTSDB and other emerging systems have begun to use Hadoop and HBase to store massive time series data, and how to query and mine time series data on these platforms has become an active research topic. As a typical distance measure for time series, the correlation coefficient is widely used in various applications. However, computing the correlation coefficient of long time series on HBase in real time requires a large amount of I/O and network transmission, so it cannot be applied to interactive queries. To address this problem, we present two methods to estimate the correlation coefficient of two sequences on HBase. We first propose DCE, a fast algorithm for estimating the upper and lower bounds of the correlation coefficient. To further reduce the I/O cost, we extend DCE into the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly calculate the correlation coefficient of long time series.
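
For orientation, the sketch below shows the exact computation that the abstract describes as expensive: the Pearson correlation coefficient accumulated chunk by chunk from running sums. It is not the DCE or ADCE algorithm, whose details are not given on this page. The names `RunningStats` and `chunked_correlation` are hypothetical, and in-memory lists stand in for rows that would be scanned from HBase. Every point of both series has to be read once, which is the I/O cost the estimation methods aim to avoid.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Sufficient statistics for the Pearson correlation (hypothetical helper)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    syy: float = 0.0
    sxy: float = 0.0

    def add_chunk(self, xs, ys):
        # Fold one aligned chunk of both series into the running sums.
        if len(xs) != len(ys):
            raise ValueError("chunks must have equal length")
        self.n += len(xs)
        self.sx += sum(xs)
        self.sy += sum(ys)
        self.sxx += sum(x * x for x in xs)
        self.syy += sum(y * y for y in ys)
        self.sxy += sum(x * y for x, y in zip(xs, ys))

    def correlation(self):
        # r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        if var_x <= 0 or var_y <= 0:
            raise ValueError("correlation is undefined for a constant series")
        return cov / math.sqrt(var_x * var_y)


def chunked_correlation(xs, ys, chunk_size):
    """Exact correlation of two equal-length series, read chunk_size points at a time."""
    stats = RunningStats()
    for start in range(0, len(xs), chunk_size):
        stats.add_chunk(xs[start:start + chunk_size], ys[start:start + chunk_size])
    return stats.correlation()
```

Because the running sums are additive, chunks can be processed in any order or on different nodes and then combined. The result is exact, but only after the full length of both series has been read.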



Acknowledgements

This work was supported by Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01B09).

Author information

Authors and Affiliations

Department of Electrical and Information Engineering, Xinjiang Institute of Engineering, Urumqi, 830091, China
Wen Liu
School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
Wen Liu, Tuqian Zhang & Yanming Shen
School of Computer Science, Fudan University, Shanghai, 201203, China
Peng Wang


Corresponding author

Correspondence to Yanming Shen.

Additional information

Wen Liu received the BS degree in computer science from Xinjiang Normal University, China in 2004 and the MS degree in computer science from Dalian University of Technology, China in 2009. He is currently a PhD student in computer science at Dalian University of Technology, China. His research interests include databases, stream data processing, and cloud computing.

Tuqian Zhang received the BS degree in electronic information science from West Anhui University, China in 2008 and the MS degree in information systems from Xinjiang Agricultural University, China in 2012. He is currently pursuing a second MS degree in computer science at Dalian University of Technology, China. His research interests include databases and cloud computing.

Yanming Shen received the BS degree in automation from Tsinghua University, China in 2000 and the PhD degree from the Department of Electrical and Computer Engineering at the NYU Polytechnic School of Engineering in 2007. He is a professor with the School of Computer Science and Technology, Dalian University of Technology, China. His general research interests include packet switch design, data center networks, cloud computing, and distributed systems. He is a recipient of the 2011 Best Paper Award for Multimedia Communications (awarded by the IEEE Communications Society).

Peng Wang received the BS degree in mathematics from Nankai University, China in 2001 and the PhD degree in computer science from Fudan University, China in 2007. He is currently an associate professor in the School of Computer Science, Fudan University, China. His research interests include databases and stream data processing.

Electronic supplementary material

Supplementary material, approximately 250 KB.


About this article

Cite this article

Liu, W., Zhang, T., Shen, Y. et al. Fast correlation coefficient estimation algorithm for HBase-based massive time series data. Front. Comput. Sci. 13, 864–878 (2019). https://doi.org/10.1007/s11704-018-6308-9

Received: 13 June 2016
Accepted: 07 August 2017
Published: 18 June 2019
Issue Date: August 2019
DOI: https://doi.org/10.1007/s11704-018-6308-9
