Abstract
Big data analytics is one of the most relevant and promising research areas in the field of Big Data. Tools and solutions designed for this purpose analyse very large sets of data to extract relevant or valuable information. Along this line, this paper addresses the problem of sequentially analysing big streams of data to inspect them for changes. This problem, which has been extensively studied for scalar and multivariate datastreams, has been left mostly unaddressed in the Big Data scenario. More specifically, the aim of this paper is to introduce a change detection test (CDT) able to detect changes in datastreams characterized by very large dimensions (up to 1000). The proposed test, based on a change-point method (CPM), is nonparametric (in the sense that it does not require any a priori information about the system under inspection or the possible changes) and is designed to detect changes in the mean vector of the datastreams. The effectiveness and efficiency of the proposed change detection test have been evaluated on both synthetic and real datasets.
Notes
1. The Matlab Demo Toolbox of the proposed CDT can be found at the following URL: http://roveri.faculty.polimi.it/software-and-datasets/.
2. The computing platform is based on a 1.7 GHz Intel Core i5 with 4 GB of 1333 MHz DDR3 memory.
Appendix
As described in Sect. 4, the thresholds \(h_{p,\gamma,\alpha}\) have been computed through simulations, since an analytical derivation for the test statistic in (4) is hard to obtain. The thresholds have been computed as follows: for each value of p, \(\alpha\) and \(\gamma\), we ran 10,000 simulated experiments, in each of which data were randomly generated from a p-variate normal distribution with random mean vector and covariance matrix. The threshold \(h_{p,\gamma,\alpha}\) is set so that the empirical probability of a false positive detection by the proposed CDT on a fixed data sequence of length \(2\gamma p\) equals the confidence parameter \(\alpha\). The computed thresholds for different values of p and \(\alpha\), with \(\gamma = 1.5\), are shown in Table 3. Further values of \(h_{p,\gamma,\alpha}\) for different configurations of p, \(\alpha\) and \(\gamma\) can be found at the following URL: http://roveri.faculty.polimi.it/software-and-datasets/.
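The simulation procedure above can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the statistic in (4) is not reproduced here, so `stand_in_statistic` is a hypothetical placeholder (the largest weighted squared distance between the sample means before and after a candidate change point) and is not claimed to share the properties of (4). The way the random mean vector and covariance matrix are drawn, the `margin` parameter and all function names are likewise assumptions made for illustration, and the run uses a small p and few experiments to stay fast.

```python
import math
import random


def simulate_sequence(n, p, rng):
    """Draw n samples from a p-variate normal with random mean and covariance.

    Assumption: the covariance is A @ A.T for a random lower-triangular
    factor A; the source does not state how the covariance is randomized.
    """
    mean = [rng.uniform(-1.0, 1.0) for _ in range(p)]
    factor = [[rng.uniform(0.5, 1.5) if i == j else rng.uniform(-1.0, 1.0)
               for j in range(i + 1)] for i in range(p)]
    seq = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        seq.append([mean[i] + sum(a * zj for a, zj in zip(factor[i], z))
                    for i in range(p)])
    return seq


def stand_in_statistic(seq, margin=2):
    """Hypothetical placeholder for the statistic in (4).

    Scans every split point t that leaves at least `margin` samples on each
    side and returns the largest weighted squared distance between the mean
    of the first t samples and the mean of the remaining n - t samples.
    """
    n, p = len(seq), len(seq[0])
    total = [sum(col) for col in zip(*seq)]
    left = [0.0] * p
    best = 0.0
    for t in range(1, n):
        left = [l + x for l, x in zip(left, seq[t - 1])]
        if t < margin or n - t < margin:
            continue
        weight = t * (n - t) / n
        gap = sum((l / t - (s - l) / (n - t)) ** 2
                  for l, s in zip(left, total))
        best = max(best, weight * gap)
    return best


def null_statistics(p, gamma, n_runs, seed=0):
    """Statistic values on change-free sequences of length 2 * gamma * p."""
    rng = random.Random(seed)
    n = int(2 * gamma * p)
    return [stand_in_statistic(simulate_sequence(n, p, rng))
            for _ in range(n_runs)]


def empirical_threshold(stats, alpha):
    """Smallest observed value exceeded by at most a fraction alpha of runs."""
    ordered = sorted(stats)
    k = min(len(ordered) - 1, math.ceil((1.0 - alpha) * len(ordered)) - 1)
    return ordered[max(k, 0)]


# Small illustrative run (the paper uses 10,000 experiments and p up to 1000).
stats = null_statistics(p=3, gamma=1.5, n_runs=200, seed=1)
h = empirical_threshold(stats, alpha=0.1)
false_positive_rate = sum(s > h for s in stats) / len(stats)
print(h, false_positive_rate)
```

By construction, the fraction of change-free sequences whose statistic exceeds the returned threshold is at most \(\alpha\), which mirrors the false positive requirement stated above.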
To further clarify the effect of \(h_{p,\gamma,\alpha}\) in the sequential scenario, Table 4 details the empirically estimated average run length (ARL) for \(\alpha = 0.1\), \(\gamma = 1.5\) and p ranging from 100 to 1000.
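An empirical ARL estimate of this kind can be sketched as follows: a detector is run on change-free data until it raises its first (false) alarm, and the resulting run lengths are averaged over many repetitions. The sketch below is only illustrative. The function names are hypothetical, and the proposed CDT is replaced by a toy detector that raises an alarm at each step with a constant probability `q`, an assumption chosen because its run length is geometric with mean 1/q, which makes the estimate easy to check.

```python
import random


def run_length(alarm, max_steps):
    """Number of observations processed until the first alarm (capped)."""
    for t in range(1, max_steps + 1):
        if alarm(t):
            return t
    return max_steps


def estimate_arl(make_alarm, n_runs, max_steps=100_000):
    """Average run length over n_runs independent change-free runs."""
    lengths = [run_length(make_alarm(r), max_steps) for r in range(n_runs)]
    return sum(lengths) / n_runs


def toy_alarm_factory(q, seed=0):
    """Toy stand-in for a detector: alarms with probability q at every step.

    Assumption for illustration only; the proposed CDT is not implemented.
    """
    rng = random.Random(seed)

    def make_alarm(_run_index):
        return lambda _t: rng.random() < q

    return make_alarm


arl = estimate_arl(toy_alarm_factory(q=0.1), n_runs=5000)
print(arl)  # close to 1 / q = 10 for this toy detector
```

To estimate the ARL of an actual test, `make_alarm` would return a function that feeds the next change-free observation to the detector and reports whether the threshold was exceeded.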
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Tacconelli, G., Roveri, M. (2017). A CPM-Based Change Detection Test for Big Data. In: Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., Vellasco, M. (eds) Advances in Big Data. INNS 2016. Advances in Intelligent Systems and Computing, vol 529. Springer, Cham. https://doi.org/10.1007/978-3-319-47898-2_11
DOI: https://doi.org/10.1007/978-3-319-47898-2_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47897-5
Online ISBN: 978-3-319-47898-2
eBook Packages: Engineering, Engineering (R0)