ABSTRACT
This paper addresses the detection of distributional change in multi-dimensional data sets. Given a baseline data set and a set of newly observed data points, we define a statistical test, called the density test, for deciding whether the observed data points were sampled from the same underlying distribution that produced the baseline data set. The test statistic we define is strictly distribution-free under the null hypothesis. Our experimental results show that the density test has substantially more power than the two existing methods for multi-dimensional change detection.
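The abstract does not spell out the density test's statistic, so the following is only a generic illustration of the problem setting, not the paper's method. It is a minimal sketch that assumes a fixed-bandwidth Gaussian kernel density estimate fitted on half of the baseline, and it swaps in a permutation calibration in place of the paper's distribution-free statistic. The function names, the default bandwidth, and the half split are all hypothetical choices made for this example.

```python
import math
import random


def kde_logpdf(train, x, bandwidth):
    """Log density at x of an isotropic Gaussian KDE fitted to `train`."""
    d = len(x)
    # Kernel exponents, combined with log-sum-exp so that points far from
    # every training sample do not underflow to log(0).
    exps = [-sum((a - b) ** 2 for a, b in zip(x, t)) / (2.0 * bandwidth ** 2)
            for t in train]
    m = max(exps)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exps))
    log_norm = (-0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
                - math.log(len(train)))
    return log_norm + log_sum


def density_change_pvalue(baseline, observed, bandwidth=0.5, n_perm=200, seed=0):
    """Permutation p-value for 'observed comes from the baseline distribution'.

    Illustrative only: half of the baseline fits the KDE, the other half is a
    held-out reference whose log-likelihoods are compared with the observed ones.
    """
    rng = random.Random(seed)
    shuffled = list(baseline)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    fit, reference = shuffled[:half], shuffled[half:]

    # Score every held-out and observed point once under the fitted density.
    ref_scores = [kde_logpdf(fit, p, bandwidth) for p in reference]
    obs_scores = [kde_logpdf(fit, p, bandwidth) for p in observed]

    def stat(ref, obs):
        # Absolute gap between the mean log-likelihoods of the two groups.
        return abs(sum(ref) / len(ref) - sum(obs) / len(obs))

    actual = stat(ref_scores, obs_scores)
    pooled = ref_scores + obs_scores
    n_ref = len(ref_scores)
    exceed = 0
    for _ in range(n_perm):
        # Under the null the scores are exchangeable, so relabel at random.
        rng.shuffle(pooled)
        if stat(pooled[:n_ref], pooled[n_ref:]) >= actual:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)


if __name__ == "__main__":
    rng = random.Random(42)
    baseline = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    drifted = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(50)]
    print("p-value for drifted sample:", density_change_pvalue(baseline, drifted))
```

Holding out half of the baseline keeps the reference scores and the observed scores exchangeable under the null hypothesis, which is what makes the permutation p-value valid in this sketch. The price is that only half of the baseline informs the density estimate.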
Index Terms
- Statistical change detection for multi-dimensional data