Abstract
Many applications, such as location-based services and wireless sensor networks, generate and process uncertain time series (UTS), where the “exact” value at each timestamp is unknown. Traditional correlation analysis and search techniques developed for standard time series are inadequate for the UTS data analysis required in such applications. Motivated by this need, we propose suitable concepts and techniques for UTS correlation analysis. We formalize the notions of normalization and correlation for UTS in two general settings based on the information available at each timestamp: (1) PDF-based UTS (a probability density function is available) and (2) multiset-based UTS (a multiset of observed values is available). For each setting, we formulate correlation as a random variable and develop techniques to determine its underlying probability density function. For setting (2), we also present probabilistic pruning and sampling techniques to speed up the search process. We conducted numerous experiments to evaluate the performance of the proposed techniques under different configurations using the UCR benchmark datasets. Our results indicate the effectiveness of the proposed techniques. For setting (2), in particular, our results show significant improvements in space utilization and computation time. We believe the proposed ideas and solutions lend themselves to powerful tools for UTS analysis and search tasks.























Acknowledgments
The authors would like to thank the anonymous reviewers for their comments, which helped improve the manuscript. This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and by Concordia University.
Appendices
Appendix 1
Table 1 provides a glossary for the notations used in this paper.
Appendix 2
For each dataset, Table 2 shows the percentage improvement for the PDF-based model, defined as the \(F_{1}\) score of the probabilistic queries minus that of the deterministic queries, divided by the \(F_{1}\) score of the deterministic queries. Table 2 illustrates that the percentage improvement is positive for every dataset on which it is defined. This shows that probabilistic queries outperformed deterministic queries on these datasets. We noted that for the datasets Beef and Trace, the \(F_{1}\) score of the deterministic queries was 0, while it was nonzero for the probabilistic queries. Thus, for these two datasets, the percentage improvement is undefined, although the probabilistic queries still performed better.
Table 3 reports the percentage improvement in the \(F_{1}\) score of the probabilistic queries for the multiset-based model for all datasets. As in Table 2, the probabilistic queries outperform the deterministic queries. Moreover, in both tables, the higher the uncertainty level (i.e., SDR), the higher the percentage improvement in the \(F_{1}\) score. This implies that, compared to deterministic queries, probabilistic queries are more resilient to increasing uncertainty.
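The percentage improvement used in Tables 2 and 3 can be sketched as follows. This is a minimal illustration of the definition given in this appendix, not the authors' code; the function name `f1_improvement_pct` and the choice to return `None` for the undefined case are assumptions made here for clarity.

```python
from typing import Optional


def f1_improvement_pct(f1_prob: float, f1_det: float) -> Optional[float]:
    """Relative F1 improvement of probabilistic over deterministic queries, in percent.

    Hypothetical helper: returns None when the deterministic F1 score is 0,
    because the ratio is undefined in that case (as for Beef and Trace).
    """
    if f1_det == 0:
        return None
    return 100.0 * (f1_prob - f1_det) / f1_det


# Illustrative values only; these are not results from the paper.
print(f1_improvement_pct(0.75, 0.5))  # 50.0
print(f1_improvement_pct(0.3, 0.0))   # None (undefined improvement)
```

Returning `None` rather than infinity keeps the undefined cases explicit, so they can be reported separately instead of being averaged with the finite values.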
Cite this article
Orang, M., Shiri, N. Correlation analysis techniques for uncertain time series. Knowl Inf Syst 50, 79–116 (2017). https://doi.org/10.1007/s10115-016-0939-7
DOI: https://doi.org/10.1007/s10115-016-0939-7