Fast correlation coefficient estimation algorithm for HBase-based massive time series data

  • Research Article
  • Frontiers of Computer Science

Abstract

In recent years, the rapid development of the Internet of Things and sensor networks has led to explosive growth in time series data. OpenTSDB and other emerging systems have begun to use Hadoop and HBase to store massive time series data, and how to query and mine time series data on these platforms has become an active research topic. The correlation coefficient, a typical distance measure for time series, is widely used in various applications. However, computing the correlation coefficient of long time series on HBase in real time requires a large amount of I/O and network transmission, so it cannot be applied to interactive queries. To address this problem, we present two methods for estimating the correlation coefficient of two sequences on HBase. We first propose DCE, a fast algorithm for estimating the upper and lower bounds of the correlation coefficient. To further reduce the I/O cost, we extend DCE into the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly calculate the correlation coefficient of long time series.
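
The abstract does not spell out how DCE or ADCE work, so the following is only a generic illustration of the idea of bounding a correlation coefficient without reading every point; it is not the authors' algorithm. It assumes that small per-block summaries (count, sum, sum of squares) are stored alongside the raw data, reads the raw points of only the first few blocks, and bounds the contribution of the unread blocks with the Cauchy-Schwarz inequality. The names `block_summaries`, `centered_energy`, and `correlation_bounds` are hypothetical, and plain Python lists stand in for HBase rows.

```python
import math


def block_summaries(series, block_size):
    """Per-block (count, sum, sum of squares).

    Assumption: in a real store these would be precomputed and kept next
    to the raw data; here they are derived on the fly for illustration.
    """
    out = []
    for start in range(0, len(series), block_size):
        block = series[start:start + block_size]
        out.append((len(block), sum(block), sum(v * v for v in block)))
    return out


def centered_energy(summary, mean):
    """Sum of (v - mean)^2 over one block, using only its summary."""
    count, total, total_sq = summary
    # Guard against tiny negative values caused by rounding.
    return max(total_sq - 2.0 * mean * total + count * mean * mean, 0.0)


def clamp(value):
    """Restrict a value to the valid correlation range [-1, 1]."""
    return max(-1.0, min(1.0, value))


def correlation_bounds(x, y, block_size, blocks_to_read):
    """Lower and upper bounds on Pearson's r after reading raw points
    from only the first `blocks_to_read` blocks of x and y."""
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    sx = block_summaries(x, block_size)
    sy = block_summaries(y, block_size)
    n = len(x)
    mean_x = sum(s[1] for s in sx) / n
    mean_y = sum(s[1] for s in sy) / n
    ex = [centered_energy(s, mean_x) for s in sx]
    ey = [centered_energy(s, mean_y) for s in sy]
    denom = math.sqrt(sum(ex) * sum(ey))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant series")

    known = 0.0  # exact centered cross-product from blocks that were read
    slack = 0.0  # Cauchy-Schwarz bound on the blocks that were not read
    for b in range(len(sx)):
        if b < blocks_to_read:
            lo, hi = b * block_size, (b + 1) * block_size
            known += sum((xv - mean_x) * (yv - mean_y)
                         for xv, yv in zip(x[lo:hi], y[lo:hi]))
        else:
            slack += math.sqrt(ex[b] * ey[b])
    return clamp((known - slack) / denom), clamp((known + slack) / denom)


if __name__ == "__main__":
    xs = [math.sin(0.1 * i) for i in range(200)]
    ys = [math.sin(0.1 * i + 0.5) + 0.01 * i for i in range(200)]
    for k in range(0, 9, 2):
        print(k, correlation_bounds(xs, ys, 25, k))
```

Calling `correlation_bounds` with a growing `blocks_to_read` mimics an iterative refinement: each additional block replaces a worst-case bound with an exact partial sum, and once every block has been read the two bounds coincide with the exact coefficient. An interactive query could stop as soon as the interval is narrow enough for its purpose.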

Acknowledgements

This work was supported by Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01B09).

Author information

Corresponding author

Correspondence to Yanming Shen.

Additional information

Wen Liu received the BS degree in computer science from Xinjiang Normal University, China in 2004 and the MS degree in computer science from Dalian University of Technology, China in 2009. He is currently a PhD student in computer science at Dalian University of Technology, China. His research interests include databases, stream data processing, and cloud computing.

Tuqian Zhang received the BS degree in electronic information science from West Anhui University, China in 2008 and the MS degree in information systems from Xinjiang Agricultural University, China in 2012. He is currently pursuing a second MS degree in computer science at Dalian University of Technology, China. His research interests include databases and cloud computing.

Yanming Shen received the BS degree in automation from Tsinghua University, China in 2000 and the PhD degree from the Department of Electrical and Computer Engineering at the NYU Polytechnic School of Engineering in 2007. He is a professor with the School of Computer Science and Technology, Dalian University of Technology, China. His general research interests include packet switch design, data center networks, cloud computing, and distributed systems. He is a recipient of the 2011 Best Paper Awards for Multimedia Communications (awarded by IEEE Communications Society).

Peng Wang received the BS degree in mathematics from Nankai University, China in 2001 and the PhD degree in computer science from Fudan University, China in 2007. He is currently an associate professor in the School of Computer Science, Fudan University, China. His research interests include databases and stream data processing.

Cite this article

Liu, W., Zhang, T., Shen, Y. et al. Fast correlation coefficient estimation algorithm for HBase-based massive time series data. Front. Comput. Sci. 13, 864–878 (2019). https://doi.org/10.1007/s11704-018-6308-9

  • DOI: https://doi.org/10.1007/s11704-018-6308-9
