Detecting anomalies in cross-classified streams: a Bayesian approach

Agarwal, Deepak

doi:10.1007/s10115-006-0036-4

Regular Paper
Published: 03 October 2006

Volume 11, pages 29–44 (2007)
Knowledge and Information Systems

Deepak Agarwal

Abstract

We consider the problem of detecting anomalies in data that arise as multidimensional arrays, with each dimension corresponding to the levels of a categorical variable. In typical data mining applications, the number of cells in such arrays is usually large. Our primary focus is detecting anomalies by comparing information at the current time to historical data. Naive approaches advocated in the process control literature do not work well in this scenario because of the multiple testing problem: performing multiple statistical tests on the same data produces an excessive number of false positives. We use an empirical Bayes method that works by fitting a two-component Gaussian mixture to the deviations at the current time. The approach is scalable to problems that involve monitoring a massive number of cells and is fast enough to be potentially useful in many streaming scenarios. We show the superiority of the method relative to a naive “per component error rate” procedure through simulation. A novel feature of our technique is the ability to suppress deviations that are merely the consequence of sharp changes in the marginal distributions. This research was motivated by the need to extract critical application information and business intelligence from the daily logs that accompany large-scale spoken dialog systems. We illustrate our method on one such system.
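
The contrast drawn in the abstract, between testing each cell separately and fitting a two-component Gaussian mixture to all deviations at once, can be illustrated with a small simulation. The sketch below is not the paper's implementation; the abstract does not specify the mixture's parameterization, so every modeling choice here is an assumption made for illustration: deviations are taken to be standardized, the "normal" component is fixed at N(0, 1), the "anomalous" component is N(0, var_alt) with var_alt estimated by EM, and a cell is flagged when its posterior probability of being anomalous exceeds 0.5. The function names, the simulated data, and the 1.96 cutoff for the naive rule are likewise hypothetical.

```python
import numpy as np


def log_normal_pdf(z, var):
    """Log density of N(0, var) evaluated at z."""
    return -0.5 * (np.log(2.0 * np.pi * var) + z ** 2 / var)


def posterior_anomalous(z, p, var_alt):
    """Posterior probability that each deviation comes from the wide component."""
    log_alt = np.log(p) + log_normal_pdf(z, var_alt)
    log_null = np.log1p(-p) + log_normal_pdf(z, 1.0)
    return np.exp(log_alt - np.logaddexp(log_alt, log_null))


def fit_mixture(z, n_iter=200, p_init=0.05, var_init=4.0):
    """EM for (1 - p) * N(0, 1) + p * N(0, var_alt); an assumed model."""
    p, var_alt = p_init, var_init
    for _ in range(n_iter):
        post = posterior_anomalous(z, p, var_alt)           # E-step
        p = float(np.clip(post.mean(), 1e-6, 1.0 - 1e-6))   # M-step: weight
        var_alt = max(float(np.sum(post * z ** 2) / np.sum(post)), 1.0 + 1e-6)
    return p, var_alt


# Simulated deviations for 20,000 cells, 200 of which are truly anomalous.
rng = np.random.default_rng(0)
n_cells, n_anomalous = 20_000, 200
is_anomalous = np.zeros(n_cells, dtype=bool)
is_anomalous[:n_anomalous] = True
z = rng.normal(0.0, 1.0, n_cells)
z[is_anomalous] *= 5.0  # anomalous cells have inflated spread

# Naive rule: test every cell separately at the 5% level.
naive_flags = np.abs(z) > 1.96

# Empirical Bayes rule: estimate the mixture from all cells, then threshold.
p_hat, var_hat = fit_mixture(z)
bayes_flags = posterior_anomalous(z, p_hat, var_hat) > 0.5

fp_naive = int(np.sum(naive_flags & ~is_anomalous))
fp_bayes = int(np.sum(bayes_flags & ~is_anomalous))
print(f"estimated anomaly fraction: {p_hat:.4f}, variance: {var_hat:.2f}")
print(f"false positives, naive rule: {fp_naive}, mixture rule: {fp_bayes}")
```

Because the mixture weight is estimated from all cells jointly, the flagging threshold adapts to how rare anomalies appear to be, which is what keeps the false positive count down when most cells are unremarkable. The per-cell rule has no such adjustment and flags roughly 5% of normal cells regardless.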

Author information

Authors and Affiliations

Yahoo! Research, 2821 Mission College Blvd, Santa Clara, CA, USA
Deepak Agarwal

Corresponding author

Correspondence to Deepak Agarwal.

Additional information

Deepak Agarwal received his Ph.D. in statistics in 2001 from the University of Connecticut, Storrs. He was a research staff member at AT&T Research Labs from 2001 to 2005 and is currently a Senior Research Scientist at Yahoo! Research. His main research interests are in the areas of time series analysis, anomaly detection, social networks, and hierarchical Bayesian models. He received the best application paper award at the SIAM Data Mining Conference in 2004 and has served on several program committees and panels. He has published several papers in both statistics and data mining.

About this article

Cite this article

Agarwal, D. Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst 11, 29–44 (2007). https://doi.org/10.1007/s10115-006-0036-4

Received: 30 November 2005
Revised: 06 January 2006
Accepted: 20 February 2006
Published: 03 October 2006
Issue Date: January 2007
DOI: https://doi.org/10.1007/s10115-006-0036-4
