Abstract
The data deluge has created a great challenge for data mining applications, in which rare topics of interest are often buried in the flood of major headlines. We identify and formulate a novel problem: cross-channel anomaly detection from multiple data channels. Cross-channel anomalies are those common to the anomalies of the individual channels, and they are often a portent of significant events. Central to this new problem is the development of a theoretical foundation and methodology. Using the spectral approach, we propose a two-stage detection method: anomaly detection at the single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single-channel anomalies. We also derive an extension of the proposed detection method to an online setting, which automatically adapts to changes in the data over time at low computational complexity using incremental algorithms. Our mathematical analysis establishes theoretical results on the reduction of an impurity index, which indicate that our method is likely to reduce the false alarm rate. We demonstrate our method in two applications: document understanding with multiple text corpora and detection of repeated anomalies in large-scale video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-the-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself to large-scale data stream analysis.
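The two-stage idea described above can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the robust median/MAD threshold, and the parameters `k`, `z`, and `min_channels` are all assumptions made for illustration. Stage one scores each time step of a channel by its residual energy outside the top-`k` principal subspace; stage two is replaced here by a simple co-occurrence count across channels, standing in for the paper's spectral second stage. The online, incremental extension is not covered.

```python
import numpy as np


def residual_scores(X, k):
    """Residual energy of each column of X outside the top-k principal subspace.

    X is a (features, time) matrix for one channel (assumed layout).
    """
    Xc = X - X.mean(axis=1, keepdims=True)       # center each feature over time
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Uk = U[:, :k]                                # basis of the "normal" subspace
    R = Xc - Uk @ (Uk.T @ Xc)                    # part not explained by that subspace
    return np.sum(R ** 2, axis=0)


def single_channel_anomalies(X, k, z=3.0):
    """Stage one: flag time steps whose residual score is unusually large.

    The median/MAD threshold is an illustrative choice, not the paper's rule.
    """
    s = residual_scores(X, k)
    med = np.median(s)
    mad = np.median(np.abs(s - med)) + 1e-12     # guard against a zero spread
    return s > med + z * 1.4826 * mad


def cross_channel_anomalies(channels, k, z=3.0, min_channels=2):
    """Stage two (simplified): keep time steps flagged in at least min_channels channels."""
    flags = np.vstack([single_channel_anomalies(X, k, z) for X in channels])
    return flags.sum(axis=0) >= min_channels
```

Calling `cross_channel_anomalies(channels, k=2)` on a list of equally long channel matrices returns one boolean per time step. Requiring agreement across channels discards anomalies that appear in a single channel only, which is the intuition behind the reduced false alarm rate claimed in the abstract.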
Cite this article
Pham, DS., Saha, B., Phung, D.Q. et al. Detection of cross-channel anomalies. Knowl Inf Syst 35, 33–59 (2013). https://doi.org/10.1007/s10115-012-0509-6
DOI: https://doi.org/10.1007/s10115-012-0509-6