TS-stream: clustering time series on data streams

Journal of Intelligent Information Systems

Abstract

The current ability to produce massive amounts of data, together with the impossibility of storing all of it, has motivated the development of data stream mining strategies. Although many techniques have been proposed, this research area still lacks approaches for mining data streams composed of multiple time series, a setting with applications in finance, medicine and science. Most current techniques for clustering streaming time series have a serious limitation in their similarity measure, which is based on the Pearson correlation. In this paper, we show that the Pearson correlation is not capable of detecting similarities even for classic time series models, such as those by Box and Jenkins. This limitation motivated our proposal to cluster streaming time series based on their generating functions, which is achieved by considering features obtained using descriptive measures, such as Auto Mutual Information, the Hurst Exponent and several others. We present a new tree-based clustering algorithm, entitled TS-Stream, which uses the extracted features to produce partitions in better accordance with the time series generating functions. Experiments with synthetic data sets confirm that TS-Stream outperforms ODAC, currently the most popular technique, in terms of clustering quality. Using real financial time series from the NYSE and NASDAQ, we conducted stock trading simulations employing TS-Stream to support the creation of diversified investment portfolios. Results confirmed that TS-Stream increased the monetary returns by several orders of magnitude when compared to trading strategies based simply on the Moving Average Convergence Divergence financial indicator.
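The limitation of the Pearson correlation described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes two independent realizations of the same AR(1) process (a Box and Jenkins model) with a hypothetical coefficient of 0.8, and it uses the lag-1 autocorrelation as a stand-in for the descriptive features the paper extracts. The helper names `ar1` and `lag1_autocorr` are illustrative only.

```python
import numpy as np

def ar1(phi, n, rng):
    """Simulate an AR(1) series x[t] = phi * x[t-1] + e[t] with Gaussian noise."""
    x = np.zeros(n)
    noise = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation, used here as a simple descriptive feature."""
    centered = x - x.mean()
    return float(np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered))

rng = np.random.default_rng(0)

# Two series from the SAME generating function, driven by independent noise.
# phi = 0.8 and n = 5000 are assumptions chosen for illustration.
a = ar1(0.8, 5000, rng)
b = ar1(0.8, 5000, rng)

# Pearson correlation compares the series point by point.
pearson = float(np.corrcoef(a, b)[0, 1])

# A feature of the generating process is compared instead of raw values.
feat_a, feat_b = lag1_autocorr(a), lag1_autocorr(b)

print(f"Pearson correlation between a and b: {pearson:.3f}")
print(f"lag-1 autocorrelation of a: {feat_a:.3f}, of b: {feat_b:.3f}")
```

Under these assumptions the Pearson correlation between the two series stays close to zero, even though both come from the same model, while the feature values are nearly identical. This is the intuition behind clustering on features of the generating function rather than on pointwise similarity.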

Notes

  1. Determinism is a measure based on Recurrence Quantification Analysis (Marwan et al. 2007).

  2. Defined as \(\sigma^{2}(X) = \sum_{i=1}^{n} p_{i} \cdot (x_{i} - \mu)^{2}\) for a series \(X = (x_{1}, x_{2}, \ldots, x_{n})\), in which \(p_{i}\) is the probability of observation \(x_{i}\) and \(\mu\) is the mean of the sequence.

  3. A deterministic, chaotic, nonlinear model.

  4. In this case the outliers are simply exceptional gains, but plotting them would hinder the visualization of the distribution.
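The variance definition in Note 2 can be sketched as follows. The note does not say how the probabilities \(p_{i}\) are estimated, so this sketch assumes a uniform \(p_{i} = 1/n\) when none are given, and it computes \(\mu\) as the probability-weighted mean. The function name `weighted_variance` is hypothetical.

```python
def weighted_variance(xs, probs=None):
    """Probability-weighted variance: sum_i p_i * (x_i - mu)^2.

    Assumption: when probs is None, every observation gets p_i = 1/n,
    which reduces to the usual population variance.
    """
    n = len(xs)
    if n == 0:
        raise ValueError("xs must not be empty")
    if probs is None:
        probs = [1.0 / n] * n
    if len(probs) != n:
        raise ValueError("xs and probs must have the same length")
    # Assumption: mu is the probability-weighted mean of the sequence.
    mu = sum(p * x for p, x in zip(probs, xs))
    return sum(p * (x - mu) ** 2 for p, x in zip(probs, xs))

series = [1.0, 2.0, 3.0, 4.0]
print(weighted_variance(series))
```

With uniform probabilities the result matches the population variance of the series; passing explicit `probs` lets a different estimate of the observation probabilities be plugged in.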

References

  • Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2003). A framework for clustering evolving data streams. In VLDB ’03: Proceedings of the 29th international conference on very large data bases (pp. 81–92). VLDB Endowment.

  • Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2004). A framework for projected clustering of high dimensional data streams. In VLDB ’04: Proceedings of the 30th international conference on very large data bases (pp. 852–863). VLDB Endowment.

  • Ahmed, N., Natarajan, T., Rao, K.R. (1974). Discrete cosine transform. IEEE Transactions on Computers, 23, 90–93.

  • Appel, G. (2005). Technical analysis: power tools for active investors, 1st edn, FT Press.

  • Ardia, D., Boudt, K., Carl, P., Mullen, K.M., Peterson, B.G. (2011). Differential Evolution with DEoptim: An application to non-convex portfolio optimization. The R Journal, 3(1), 27–34.

  • Athanassiou, P. (2012). Research handbook on hedge funds, private equity and alternative investments, Edward Elgar Pub.

  • Bélisle, C. (1992). Convergence theorems for a class of simulated annealing algorithms on \(\mathbb{R}^{d}\). Journal of Applied Probability, 885–895.

  • Beringer, J., & Hüllermeier, E. (2006). Online clustering of parallel data streams. Data & Knowledge Engineering, 58, 180–204. doi:10.1016/j.datak.2005.05.009.

  • Bifet, A., & Kirby, R. (2009). Data stream mining: A practical approach. Technical report, The University of Waikato.

  • Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The Journal of Political Economy, 637–654.

  • Box, G., & Jenkins, G. (1994). Time series analysis: forecasting and control, Prentice Hall PTR.

  • Cao, F. (2006). Density-based clustering over an evolving data stream with noise. In Proceedings of the 6th SIAM international conference data mining.

  • Chaovalit, P. (2009). Clustering transient data streams by example and by variable. PhD thesis, University of Maryland.

  • Chatfield, C. (2003). The analysis of time series: an introduction (Vol. 59). CRC press.

  • Daubechies, I. (1992). Ten lectures on wavelets. Society for industrial and applied mathematics, Philadelphia.

  • Díaz, S.P., & Vilar, J.A. (2010). Comparing several parametric and nonparametric approaches to time series clustering: a simulation study. Journal of Classification, 27, 333–362. doi:10.1007/s00357-010-9064-6.

  • D’haeseleer, P., et al. (2005). How does gene expression clustering work? Nature biotechnology, 23(12), 1499–1502.

  • Fourier, J. (1888). Théorie analytique de la chaleur (Vol. 1). Gauthier-Villars et fils.

  • Fu, T.C. (2011). A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1), 164–181. doi:10.1016/j.engappai.2010.09.007, http://www.sciencedirect.com/science/article/B6V2M-516KF3X-1/2/f93f19227049b30e34b3de788e9e2b7f.

  • Fujita, A., Sato, J., Demasi, M., Sogayar, M., Ferreira, C., Miyano, S. (2009). Comparing Pearson, Spearman and Hoeffding’s D measure for gene expression association analysis. Journal of Bioinformatics and Computational Biology, 7(4), 663–684.

  • Gama, J. (2010). Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC.

  • Hilbert, D. (1912). Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. BG Teubner.

  • Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 13–30.

  • Huang, N., Shen, Z., Long, S., Wu, M., Shih, H., Zheng, Q., Yen, N., Tung, C., Liu, H. (1998). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London Series A: Mathematical, Physical and Engineering Sciences, 454(1971), 903.

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. doi:10.1007/BF01908075.

  • Ishii, R.P., Rios, R.A., de Mello, R.F. (2011). Classification of time series generation processes using experimental tools: a survey and proposal of an automatic and systematic approach. International Journal of Computational Science and Engineering, 1, 1–21.

  • Kantz, H., & Schreiber, T. (1997). Nonlinear time series analysis. New York: Cambridge University Press.

  • Keogh, E., Lin, J., Truppel, W. (2003). Clustering of time series subsequences is meaningless: implications for previous and future research. In ICDM ’03: Proceedings of the 3rd IEEE international conference on data mining, IEEE computer society (pp. 115–). Washington, DC. http://dl.acm.org/citation.cfm?id=951949.952156.

  • Kontaki, M., Papadopoulos, A.N., Manolopoulos, Y. (2008). Continuous subspace clustering in streaming time series. Information Systems, 33(2), 240–260.

  • Kranen, P., Assent, I., Baldauf, C., Seidl, T. (2011). The clustree: indexing micro-clusters for anytime stream mining. Knowledge and Information Systems, 29(2), 249–272. doi:10.1007/s10115-010-0342-8.

  • Lorenz, E.N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2), 130–141.

  • MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.

  • Marwan, N., Romano, M.C., Thiel, M., Kurths, J. (2007). Recurrence plots for the analysis of complex systems. Physics Reports, 438(5-6), 237–329. doi:10.1016/j.physrep.2006.11.001.

  • Peng, C., Buldyrev, S., Havlin, S., Simons, M., Stanley, H., Goldberger, A. (1994). On the mosaic organization of DNA sequences. Physical Review E, 49, 1685–1689.

  • Pompe, B. (1993). Measuring statistical dependences in a time series. Journal of Statistical Physics, 73(3), 587–610.

  • Quinlan, J. (1986). Induction of decision trees. Machine learning, 1(1), 81–106.

  • Quinlan, J. (1993). C4.5: programs for machine learning, Morgan Kaufmann.

  • R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org, ISBN 3-900051-07-0.

  • Ren, J., Cai, B., Hu, C. (2011). Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 6(1).

  • Rodrigues, P.P., Gama, J., Pedroso, J. (2008). Hierarchical clustering of time-series data streams. IEEE Transactions on Knowledge and Data Engineering, 20, 615–627. doi:10.1109/TKDE.2007.190727.

  • Sandri, M. (1996). Numerical calculation of Lyapunov exponents. The Mathematica Journal, 6(3), 78–84.

  • Skiena, S.S. (1998). The algorithm design manual. New York: Springer.

  • Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980 (pp. 366–381).

  • Tan, P.N., Steinbach, M., Kumar, V. (2005). Introduction to Data Mining. Boston: Addison-Wesley Longman.

  • Tang, L.A., Zheng, Y., Yuan, J., Han, J., Leung, A., Hung, C.C., Peng, W.C. (2012). On discovery of traveling companions from streaming trajectories. In 2012 IEEE 28th International Conference on Data Engineering (ICDE) (pp. 186–197). IEEE.

  • Vinh, N.X., Epps, J., Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML ’09 (pp. 1073–1080). New York: ACM. doi:10.1145/1553374.1553511.

  • Walpole, R., Myers, R., Myers, S., Ye, K. (1998). Probability and statistics for engineers and scientists. Upper Saddle River: Prentice Hall.

  • Wan, L., Ng, W.K., Dang, X.H., Yu, P.S., Zhang, K. (2009). Density-based clustering of data streams at multiple resolutions. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–28. doi:10.1145/1552303.1552307.

  • Whitney, H. (1936). Differentiable manifolds. Annals of Mathematics, 37(3), 645–680.

  • Widiputra, H., Pears, R., Kasabov, N. (2011). Multiple time-series prediction through multiple time-series relationships profiling and clustered recurring trends. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 161–172). Berlin, Heidelberg: Springer-Verlag.

  • Yang, Y., & Chen, K. (2011). Temporal data clustering via weighted clustering ensemble with different representations. IEEE Transactions on Knowledge and Data Engineering, 23, 307–320.

  • Zheng, K., Zheng, Y., Yuan, N.J., Shang, S. (2013). On discovery of gathering patterns from trajectories. In IEEE international conference on data engineering, ICDE.

Acknowledgments

This paper is based upon work supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo)—Brazil under grant #2010/05062-6. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of FAPESP. We also thank the anonymous reviewers who suggested improvements to the presentation of this paper.

Author information

Corresponding author

Correspondence to Cássio M. M. Pereira.

About this article

Cite this article

Pereira, C.M.M., de Mello, R.F. TS-stream: clustering time series on data streams. J Intell Inf Syst 42, 531–566 (2014). https://doi.org/10.1007/s10844-013-0290-3

  • DOI: https://doi.org/10.1007/s10844-013-0290-3
