A CPM-Based Change Detection Test for Big Data

  • Conference paper
Advances in Big Data (INNS 2016)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 529))

Abstract

Big data analytics is currently one of the most relevant and promising research activities in the field of Big Data. Tools and solutions designed for this purpose are meant to analyse very large sets of data to extract relevant and valuable information. In this direction, this paper addresses the problem of sequentially analysing big streams of data to inspect them for changes. This problem, which has been extensively studied for scalar or multivariate datastreams, has been left mostly unattended in the Big Data scenario. More specifically, the aim of this paper is to introduce a change detection test able to detect changes in datastreams characterized by very large dimensions (up to 1000). The proposed test, based on a change-point method, is nonparametric (in the sense that it does not require any a priori information about the system under inspection or the possible changes) and is designed to detect changes in the mean vector of the datastreams. The effectiveness and the efficiency of the proposed change detection test have been evaluated on both synthetic and real datasets.

Notes

  1. The Matlab Demo Toolbox of the proposed CDT can be found at the following URL: http://roveri.faculty.polimi.it/software-and-datasets/.

  2. The computing platform is based on a 1.7 GHz Intel Core i5 with 4 GB 1333 MHz DDR3.

Author information

Corresponding author

Correspondence to Manuel Roveri.

Appendix

As described in Sect. 4, the thresholds \(h_{p,\gamma ,\alpha }\) have been computed through simulations, since an analytical derivation for the test statistic in (4) is hard to obtain. Thresholds have been computed as follows: for each combination of p, \(\alpha \) and \(\gamma \), we simulated 10,000 experiments, in each of which we randomly generated data from a p-variate normal distribution with random mean vector and covariance matrix. The threshold \(h_{p,\gamma ,\alpha }\) is set so that the empirical probability of a false positive detection by the proposed CDT on a fixed data sequence of length \(2 \gamma p\) equals the confidence parameter \(\alpha \). Computed thresholds for different values of p and \(\alpha \), with \(\gamma =1.5\), are shown in Table 3. Further values of \(h_{p,\gamma ,\alpha }\) for other configurations of p, \(\alpha \) and \(\gamma \) can be found at the following URL: http://roveri.faculty.polimi.it/software-and-datasets/.

Table 3. Thresholds \(h_{p,\gamma ,\alpha }\) for different values of p and \(\alpha \), with \(\gamma =1.5\).
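
The simulation-based calibration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `cpm_statistic` is a hypothetical stand-in for the test statistic in (4), which is not reproduced here; the covariance matrix is assumed diagonal for brevity; and the number of trials is far smaller than the 10,000 used in the paper. All function names are illustrative.

```python
import math
import random


def cpm_statistic(seq):
    """Hypothetical stand-in for statistic (4), NOT the paper's statistic.

    Each dimension is standardized over the whole sequence; the result is the
    maximum, over all split points, of the scaled squared distance between
    the mean vectors before and after the split.
    """
    n, p = len(seq), len(seq[0])
    cols = list(zip(*seq))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    z = [[(x[j] - means[j]) / stds[j] for j in range(p)] for x in seq]
    left = [0.0] * p  # running sums of the first t standardized samples
    best = 0.0
    for t in range(1, n):  # candidate split after the t-th sample
        for j in range(p):
            left[j] += z[t - 1][j]
        # Columns sum to zero after centring, so the scaled squared mean
        # difference reduces to n / (t * (n - t)) * ||left||^2.
        best = max(best, n / (t * (n - t)) * sum(s * s for s in left))
    return best


def calibrate_threshold(p, gamma, alpha, n_trials=200, seed=0):
    """Empirical (1 - alpha) quantile of the statistic on change-free data."""
    rng = random.Random(seed)
    n = int(2 * gamma * p)  # fixed sequence length 2 * gamma * p
    stats = []
    for _ in range(n_trials):
        # Assumption: random mean vector and random diagonal covariance.
        mean = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        std = [rng.uniform(0.5, 2.0) for _ in range(p)]
        seq = [[rng.gauss(mean[j], std[j]) for j in range(p)]
               for _ in range(n)]
        stats.append(cpm_statistic(seq))
    stats.sort()
    # At most alpha * n_trials simulated statistics exceed the returned value.
    k = min(n_trials - 1, math.ceil((1 - alpha) * n_trials) - 1)
    return stats[k]


h = calibrate_threshold(p=4, gamma=1.5, alpha=0.1)
```

With this choice of quantile index, the fraction of change-free simulations whose statistic exceeds `h` is at most \(\alpha \), which mirrors the false positive requirement stated above. A faithful reproduction would replace `cpm_statistic` with statistic (4) and use full random covariance matrices.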

To further clarify the effects of \(h_{p,\gamma ,\alpha }\) in the sequential scenario, in Table 4 we detail the empirically estimated ARL for \(\alpha =0.1\), \(\gamma =1.5\) and p ranging from 100 to 1000.

Table 4. ARL for different values of p, \(\alpha =0.1\) and \(\gamma =1.5\).
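
An empirical ARL estimate of the kind reported in Table 4 can be sketched as below, assuming that ARL denotes the average number of samples processed before a detection is raised on a change-free stream. The detector used here is a deliberately simple, hypothetical scalar rule (it is not the proposed CDT), chosen because its expected run length is known in closed form; `estimate_arl` and `make_toy_detector` are illustrative names.

```python
import random


def estimate_arl(make_detector, draw_sample, n_runs=5000,
                 max_len=100_000, seed=0):
    """Average number of samples until the first detection on a stationary stream."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        detect = make_detector()  # fresh detector for every run
        run_length = 0
        while run_length < max_len:  # cap guards against runs that never alarm
            run_length += 1
            if detect(draw_sample(rng)):
                break
        total += run_length
    return total / n_runs


def make_toy_detector(limit=1.6449):
    """Hypothetical detector: flags a standard-normal sample when |x| > limit.

    With limit = 1.6449 each sample is flagged with probability about 0.1,
    so the run length is geometric with mean about 1 / 0.1 = 10.
    """
    return lambda x: abs(x) > limit


arl = estimate_arl(make_toy_detector, lambda rng: rng.gauss(0.0, 1.0))
```

To estimate the ARL of an actual sequential test, `make_detector` would return a stateful object that ingests one p-dimensional sample per call and reports whether the threshold \(h_{p,\gamma ,\alpha }\) has been exceeded.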

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Tacconelli, G., Roveri, M. (2017). A CPM-Based Change Detection Test for Big Data. In: Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., Vellasco, M. (eds) Advances in Big Data. INNS 2016. Advances in Intelligent Systems and Computing, vol 529. Springer, Cham. https://doi.org/10.1007/978-3-319-47898-2_11

  • DOI: https://doi.org/10.1007/978-3-319-47898-2_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47897-5

  • Online ISBN: 978-3-319-47898-2

  • eBook Packages: Engineering (R0)
