Abstract
The current ability to produce massive amounts of data and the impossibility in storing it motivated the development of data stream mining strategies. Despite the proposal of many techniques, this research area still lacks in approaches to mine data streams composed of multiple time series, which has applications in finance, medicine and science. Most of the current techniques for clustering streaming time series have a serious limitation in their similarity measure, which are based on the Pearson correlation. In this paper, we show the Pearson correlation is not capable of detecting similarities even for classic time series models, such as those by Box and Jenkins. This limitation motivated our proposal to cluster streaming time series based on their generating functions, which is achieved by considering features obtained using descriptive measures, such as Auto Mutual Information, the Hurst Exponent and several others. We present a new tree-based clustering algorithm, entitled TS-Stream, which uses the extracted features to produce partitions in better accordance to the time series generating functions. Experiments with synthetic data sets confirm TS-Stream outperforms ODAC, currently the most popular technique, in terms of clustering quality. Using real financial time series from the NYSE and NASDAQ, we conducted stock trading simulations employing TS-Stream to support the creation of diversified investment portfolios. Results confirmed TS-Stream increased the monetary returns in several orders of magnitude when compared to trading strategies simply based on the Moving Average Convergence Divergence financial indicator.
Similar content being viewed by others
Notes
Determinism is a measure based on Recurrence Quantification Analysis (Marwan et al. 2007).
Defined as \(\sigma ^{2}(X)~=~\sum _{i=1}^{n} p_{i}~\cdot ~(x_{i} - \mu )^{2}\) for a series X = (x 1, x 2, … , x n ), in which p i is the probability of observation x i and μ is the mean of the sequence.
A deterministic, chaotic, nonlinear model.
In this case outliers are simply exceptional gains, but when plotted hinder a better visualization of the distribution.
References
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2003). A framework for clustering evolving data streams. In: VLDB ’2003: Proceedings of the 29th international conference on very large data bases (pp. 81–92). VLDB Endowment.
Aggarwal, C.C., Han, J.,Wang, J., Yu, P.S. (2004). A framework for projected clustering of high dimensional data streams. In VLDB ’04: Proceedings of the 30th international conference on very large data bases (pp. 852–863). VLDB Endowment.
Ahmed, N., Natarajan, T., Rao, K.R. (1974). Discrete cosine transfom. IEEE Transactions on Computers, 23, 90–93.
Appel, G. (2005). Technical analysis: power tools for active investors, 1st edn, FT Press.
Ardia, D., Boudt, K., Carl, P., Mullen, K.M., Peterson, B.G. (2011). Differential Evolution with DEoptim: An application to non-convex portfolio optimization. The Royal Journal, 3(1), 27–34.
Athanassioum, P. (2012). Research handbook on hedge funds, private equity and alternative investments, Edward Elgar Pub.
Bélisle, C. (1992). Convergence theorems for a class of simulated annealing algorithms on rd. Journal of Applied Probability, 885–895.
Beringer, J., & Hüllermeier, E. (2006). Online clustering of parallel data streams. Data Knowledge Engineering, 58, 180–204. doi:10.1016/j.datak.2005.05.009.
Bifet, A., & Kirby, R. (2009). Data stream mining: A practical approach. Technical report, The University of Waikato.
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The journal of political economy, 637–654.
Box, G., & Jenkins, G. (1994). Time series analysis: forecasting and control, Prentice Hall PTR.
Cao, F. (2006). Density-based clustering over an evolving data stream with noise. In Proceedings of the 6th SIAM international conference data mining.
Chaovalit, P. (2009). Clustering transient data streams by example and by variable. PhD thesis, University of Maryland.
Chatfield, C. (2003). The analysis of time series: an introduction (Vol. 59). CRC press.
Daubechies, I. (1992). Ten lectures on wavelets. Society for industrial and applied mathematics, Philadelphia.
Díaz, S.P., & Vilar, J.A. (2010). Comparing several parametric and nonparametric approaches to time series clustering: a simulation study. Journal of Classification, 27, 333–362. doi:10.1007/s00357-010-9064-6.
D’haeseleer, P., et al. (2005). How does gene expression clustering work? Nature biotechnology, 23(12), 1499–1502.
Fourier, J. (1888). Théorie analytique de la chaleur (Vol. 1). Gauthier-Villars et fils.
Fu, T.C. (2011). A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1), 164–181. doi:10.1016/j.engappai.2010.09.007, http://www.sciencedirect.com/science/article/B6V2M-516KF3X-1/2/f93f19227049b30e34b3de788e9e2b7f.
Fujita, A., Sato, J., Demasi, M., Sogayar, M., Ferreira, C., Miyano, S. (2009). Comparing pearson, spearman and hoeffding’s d measure for gene expression association analysis. Journal of Bioinformatics and Computational Biology, 7(4), 663–84.
Gama, J. (2010). Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC.
Hilbert, D. (1912). Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. BG Teubner.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 13–30.
Huang, N., Shen, Z., Long, S., Wu, M., Shih, H., Zheng, Q., Yen, N., Tung, C., Liu, H. (1998). The empirical mode decomposition and the hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London Series A: Mathematical, Physical and Engineering Sciences, 454(1971), 903.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. doi:10.1007/BF01908075.
Ishii, R.P., Rios, R.A., de Mello, R.F. (2011). Classification of time series generation processes using experimental tools: a survey and proposal of an automatic and systematic approach. International Journal of Computational Science and Engineering, 1, 1–21.
Kantz, H., & Schreiber, T. (1997). Nonlinear time series analysis. New York: Cambridge University Press.
Keogh, E., Lin, J., Truppel, W. (2003). Clustering of time series subsequences is meaningless: implications for previous and future research. In ICDM ’03: Proceedings of the 3rd IEEE international conference on data mining, IEEE computer society (pp. 115–). Washington, DC. http://dl.acm.org/citation.cfm?id=951949.952156.
Kontaki, M., Papadopoulos, A.N., Manolopoulos, Y. (2008). Continuous subspace clustering in streaming time series. Information Systems, 33(2), 240–260.
Kranen, P., Assent, I., Baldauf, C., Seidl, T. (2011). The clustree: indexing micro-clusters for anytime stream mining. Knowledge and Information Systems, 29(2), 249–272. doi:10.1007/s10115-010-0342-8.
Lorenz, E.N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2), 130–141.
MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Marwan, N., Romano, M.C., Thiel, M., Kurths, J. (2007). Recurrence plots for the analysis of complex systems. Physics Reports, 438(5-6), 237–329. doi:10.1016/j.physrep.2006.11.001.
Peng, C., Buldyrev, S., Havlin, S., Simons, M., Stanley, H., Goldberger, A. (1994). On the mosaic organization of DNA sequences. Physical Review E, 49, 1685–1689.
Pompe, B. (1993). Measuring statistical dependences in a time series. Journal of Statistical Physics, 73(3), 587–610.
Quinlan, J. (1986). Induction of decision trees. Machine learning, 1(1), 81–106.
Quinlan, J. (1993). C4. 5: programs for machine learning, Morgan Kaufmann.
R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org, ISBN 3-900051-07-0.
Ren, J., Cai, B., Hu, C. (2011). Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 6(1).
Rodrigues, P.P., Gama, J., Pedroso, J. (2008). Hierarchical clustering of time-series data streams. IEEE Transactions on Knowledge and Data Engineering, 20, 615–627. doi:10.1109/TKDE.2007.190727.
Sandri, M. (1996). Numerical calculation of lyapunov exponents. The Mathematical Journal, 6(3), 78–84.
Skiena, S.S. (1998). The algorithm design manual. New York: Springer.
Takens, F. (1981). Detecting strange attractors in turbulence. Dynamical systems and turbulence, (pp. 366–381). Warwick 1980.
Tan, P.N., Steinbach, M., Kumar, V. (2005). Introduction to Data Mining. Boston: Addison-Wesley Longman.
Tang, L.A., Zheng, Y., Yuan, J., Han, J., Leung, A., Hung, C.C., Peng, W.C. (2012). On discovery of traveling companions from streaming trajectories. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), (pp. 186-197). IEEE.
Vinh, N.X., Epps, J., Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML ’09 (pp 1073–1080). New York: ACM. doi:10.1145/1553374.1553511.
Walpole, R., Myers, R., Myers, S., Ye, K. (1998). Probability and statistics for engineers and scientists. Upper Saddle River: Prentice Hall.
Wan, L., Ng, W.K., Dang, X.H., Yu, P.S., Zhang, K. (2009). Density-based clustering of data streams at multiple resolutions. ACM Transactions on Knowledge and Discovery Data, 3(3), 1–28. doi:10.1145/1552303.1552307.
Whitney, H. (1936). Differentiable manifolds. Annals of Mathematics, 37(3), 645–680.
Widiputra, H., Pears, R., Kasabov, N. (2011). Multiple time-series prediction through multiple time-series relationships profiling and clustered recurring trends. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 161–172). Berlin, Heidelberg: Springer-Verlag.
Yang, Y., & Chen, K. (2011). Temporal data clustering via weighted clustering ensemble with different representations. IEEE Transactions on Knowledge and Data Engineering, 23, 307–320.
Zheng, K., Zheng, Y., Yuan, N.J., Shang, S. (2013). On discovery of gathering patterns from trajectories. In IEEE international conference on data engineering, ICDE.
Acknowledgments
This paper is based upon work supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo)—Brazil under grant #2010/05062-6. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of FAPESP. We also thank the anonymous reviewers who suggested improvements to the presentation of this paper.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Pereira, C.M.M., de Mello, R.F. TS-stream: clustering time series on data streams. J Intell Inf Syst 42, 531–566 (2014). https://doi.org/10.1007/s10844-013-0290-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10844-013-0290-3