Correlation analysis techniques for uncertain time series

  • Regular Paper
Knowledge and Information Systems

Abstract

Many applications, such as location-based services and wireless sensor networks, generate and deal with uncertain time series (UTS), where the “exact” value at each timestamp is unknown. Traditional correlation analysis and search techniques developed for standard time series are inadequate for the UTS data analysis required in such applications. Motivated by this need, we propose suitable concepts and techniques for UTS correlation analysis. We formalize the notions of normalization and correlation for UTS in two general settings based on the information available at each timestamp: (1) PDF-based UTS (having a probability density function) and (2) multiset-based UTS (having a multiset of observed values). For each case, we formulate correlation as a random variable and develop techniques to determine the underlying probability density function. For setup (2), we also present probabilistic pruning and sampling techniques to speed up the search process. We conducted numerous experiments to evaluate the performance of the proposed techniques under different configurations using the UCR benchmark datasets. Our results indicate the effectiveness of the proposed techniques. For setup (2), in particular, our results show significant improvement in space utilization and computation time. We believe the proposed ideas and solutions lend themselves to powerful tools for UTS analysis and search tasks.
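To make the idea of correlation as a random variable concrete for the multiset-based setting, the following is a minimal Python sketch. It is not the technique developed in the paper: it uses naive Monte Carlo sampling and assumes that the value at each timestamp is drawn uniformly and independently from that timestamp's multiset. The function names (pearson, sample_correlations, prob_correlation_at_least) and all parameters are hypothetical and introduced only for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def sample_correlations(uts_x, uts_y, num_samples=1000, seed=0):
    """Draw correlation values between two multiset-based UTS.

    Each UTS is a list with one multiset (a list of observed values)
    per timestamp. ASSUMPTION: one value per timestamp is picked
    uniformly and independently; this is a naive illustration, not
    the paper's method.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        x = [rng.choice(observations) for observations in uts_x]
        y = [rng.choice(observations) for observations in uts_y]
        samples.append(pearson(x, y))
    return samples


def prob_correlation_at_least(samples, threshold):
    """Empirical estimate of P(correlation >= threshold)."""
    return sum(s >= threshold for s in samples) / len(samples)
```

Given two UTS of equal length, sample_correlations(uts_x, uts_y) returns an empirical sample of the correlation random variable, and prob_correlation_at_least(samples, c) estimates the probability that the correlation reaches a threshold c. The sketch raises an error if a sampled series happens to be constant, since the correlation is then undefined.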

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments, which helped improve the manuscript. This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and by Concordia University.

Author information

Corresponding author

Correspondence to Mahsa Orang.

Appendices

Appendix 1

Table 1 provides a glossary for the notations used in this paper.

Table 1 Main notations used in this paper

Appendix 2

For each dataset, Table 2 shows the percentage improvement for the PDF-based model, defined as the difference between the \(F_{1}\) score of the probabilistic queries and that of the deterministic queries, divided by the \(F_{1}\) score of the deterministic queries. Table 2 shows that the percentage improvement is positive for every dataset on which it is defined, so probabilistic queries outperformed deterministic queries on these datasets. For the datasets Beef and Trace, the \(F_{1}\) score of the deterministic queries was 0, while it was nonzero for the probabilistic queries. The improvement percentage is therefore undefined for these two datasets (the denominator is zero), although the probabilistic queries still performed better.

Table 3 reports the percentage improvement in the \(F_{1}\) score of the probabilistic queries for the multiset-based model on all datasets. As in Table 2, the probabilistic queries outperform the deterministic queries. Moreover, in both tables, the higher the uncertainty level (i.e., SDR), the higher the improvement percentage of the \(F_{1}\) score. This suggests that probabilistic queries are more resilient to uncertainty than deterministic queries.
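The percentage improvement reported in Tables 2 and 3 can be sketched in Python as follows. The function name improvement_percentage is hypothetical, and the multiplication by 100 is an assumption based on the word "percentage"; the undefined case mirrors the Beef and Trace datasets, where the deterministic \(F_{1}\) score is 0.

```python
from typing import Optional


def improvement_percentage(f1_prob: float, f1_det: float) -> Optional[float]:
    """Relative F1 improvement of probabilistic over deterministic queries.

    Returns None when the deterministic F1 score is 0, because the
    ratio is then undefined. ASSUMPTION: the result is scaled by 100
    to express it as a percentage.
    """
    if f1_det == 0:
        return None
    return 100.0 * (f1_prob - f1_det) / f1_det
```

For example, with made-up scores of 0.6 for the probabilistic queries and 0.5 for the deterministic queries, the function returns an improvement of about 20 percent, while a deterministic score of 0 yields None.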

Table 2 Improvement percentage for different UCR datasets for the PDF-based model
Table 3 Improvement percentage of UCR datasets for the multiset-based model for number of observed values less than 6

Cite this article

Orang, M., Shiri, N. Correlation analysis techniques for uncertain time series. Knowl Inf Syst 50, 79–116 (2017). https://doi.org/10.1007/s10115-016-0939-7

  • DOI: https://doi.org/10.1007/s10115-016-0939-7
