Article

Statistical change detection for multi-dimensional data

Authors: Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka (all University of Florida)

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 667–676. https://doi.org/10.1145/1281192.1281264

Published: 12 August 2007

ABSTRACT

This paper deals with detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, we define a statistical test called the density test for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. We define a test statistic that is strictly distribution-free under the null hypothesis. Our experimental results show that the density test has substantially more power than the two existing methods for multi-dimensional change detection.
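
The general idea of testing new points against a density estimated from the baseline can be sketched as follows. This is not the paper's density test or its test statistic: it is a generic illustration that fits a Gaussian kernel density estimate on half of the baseline, scores the held-out half and the new points by log-density, and obtains a p-value by permutation. The function names, the fixed bandwidth of 0.5, and the even split are all assumptions made for the example.

```python
import math
import random


def log_kde(point, sample, bandwidth):
    """Log of a Gaussian product-kernel density estimate at `point`."""
    d = len(point)
    log_norm = -d * math.log(bandwidth * math.sqrt(2.0 * math.pi))
    exponents = [
        -sum((p - s) ** 2 for p, s in zip(point, row)) / (2.0 * bandwidth ** 2)
        for row in sample
    ]
    top = max(exponents)  # log-sum-exp keeps far-away points from underflowing
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    return log_norm + log_sum - math.log(len(sample))


def mean_gap(xs, ys):
    """Absolute difference between the means of two score lists."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


def change_p_value(baseline, observed, bandwidth=0.5, n_perm=500, seed=0):
    """Permutation p-value for 'observed comes from the baseline distribution'.

    Assumption: bandwidth is fixed by hand here; a real analysis would
    select it from the data.
    """
    rng = random.Random(seed)
    rows = list(baseline)
    rng.shuffle(rows)
    half = len(rows) // 2
    fit, held_out = rows[:half], rows[half:]

    # Score held-out baseline points and new points under the same estimate.
    base_scores = [log_kde(x, fit, bandwidth) for x in held_out]
    new_scores = [log_kde(x, fit, bandwidth) for x in observed]
    actual = mean_gap(base_scores, new_scores)

    # Under the null the two score lists are exchangeable, so relabel them.
    pooled = base_scores + new_scores
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        k = len(base_scores)
        if mean_gap(pooled[:k], pooled[k:]) >= actual:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Calling `change_p_value(baseline, observed)` with two lists of equal-length numeric tuples returns a value in (0, 1]; small values suggest the new points do not follow the baseline distribution. Holding out half of the baseline keeps the scored baseline points independent of the fitted estimate, which is what makes the permutation step reasonable in this sketch.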

Index Terms

Statistical change detection for multi-dimensional data
1. Information systems
  1. Information systems applications
2. Mathematics of computing
  1. Probability and statistics

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007, 1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
change detection
density test
kernel density estimation
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%