Abstract
Rare-category detection helps discover new rare classes in an unlabeled data set by selecting candidate data examples for labeling. Most existing approaches to rare-category detection require prior information about the data set and are not applicable without it. Prior-free algorithms address this problem without such prior information; however, the cost is high time complexity, which is not lower than \(O(dN^2)\), where \(N\) is the number of data examples in the data set and \(d\) is the data set dimension. In this paper, we propose CLOVER, a prior-free algorithm that introduces a novel rare-category criterion known as local variation degree (LVD). LVD utilizes the characteristics of rare classes to distinguish rare-class data examples from other types of data examples, and CLOVER selects the data examples with maximum LVD values for labeling. A remarkable improvement is that CLOVER’s time complexity is \(O(dN^{2-1/d})\) for \(d > 1\) or \(O(N\log N)\) for \(d = 1\). Extensive experimental results on real data sets demonstrate the effectiveness and efficiency of our method in terms of discovering new rare classes and lower time complexity.
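To make the gap between the two bounds concrete, the following worked comparison divides the \(O(dN^2)\) lower bound quoted for existing prior-free algorithms by CLOVER’s bound for \(d > 1\). The values \(N = 10^6\) and \(d = 2\) are illustrative assumptions, not figures from the experiments, and constant factors are ignored.

\[
\frac{dN^{2}}{dN^{2-1/d}} = N^{1/d}, \qquad \text{e.g. } N = 10^{6},\ d = 2 \ \Rightarrow\ N^{1/d} = 10^{3}.
\]

Under these assumptions the asymptotic gain is a factor of \(N^{1/d}\), which shrinks toward 1 as \(d\) grows.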







Notes
Each class only forms one cluster of data examples [34].
Bins with an over-large bandwidth excessively shorten the density differentials and result in an over-smoothed histogram estimate, whereas bins with an over-small bandwidth usually divide the local regions where data examples have similar local densities into many small pieces and result in an under-smoothed histogram estimate [38].
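The effect of bandwidth described in this note can be sketched with a plain equal-width histogram density estimate. This is a minimal illustration and not the estimator used in the article: the function name `histogram_density`, the synthetic one-dimensional data, and the three bandwidths are all assumptions chosen for the example.

```python
import math
import random


def histogram_density(samples, bin_width):
    """Equal-width histogram density estimate for 1-D samples (illustrative helper)."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        raise ValueError("samples must span a non-zero range")
    # Round the bin count up, then recompute the width so the bins tile [lo, hi] exactly.
    n_bins = max(1, math.ceil((hi - lo) / bin_width))
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(samples)
    density = [c / (total * width) for c in counts]
    return density, width


# Hypothetical data: a broad majority class plus a small, compact rare class.
rng = random.Random(0)
majority = [rng.gauss(0.0, 1.0) for _ in range(2000)]
rare = [rng.gauss(3.0, 0.05) for _ in range(40)]
samples = majority + rare

for requested in (2.0, 0.2, 0.002):
    density, width = histogram_density(samples, requested)
    empty = sum(1 for v in density if v == 0)
    print(f"requested width={requested}: bins={len(density)}, "
          f"max density={max(density):.3f}, empty bins={empty}")
```

With the widest bins the compact rare class is averaged into its neighbourhood, while with the narrowest bins most bins are empty and regions of similar density are fragmented, which is the trade-off the note describes.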
References
Agarwal D (2006) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst 11(1):29–44
Aggarwal CC, Yu PS (2008) Outlier detection with uncertain data. In: Proceedings of the 2008 SIAM international conference on data mining (SDM ’08), April 24–26, Atlanta, pp 483–493
Aggarwal CC, Yu PS (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
Ando S (2007) Clustering needles in a haystack: An information theoretic analysis of minority and outlier detection. In: Proceedings of the 7th IEEE international conference on data mining (ICDM ’07), October 28–31, Omaha, pp 13–22
Bay S, Kumaraswamy K, Anderle M, Kumar R, Steier D (2006) Large scale detection of irregularities in accounting data. In: Proceedings of the 6th IEEE international conference on data mining (ICDM ’06), December 18–22, Hong Kong, pp 75–86
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th annual conference on learning theory (COLT ’98), July 24–26, Madison, pp 92–100
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM international conference on data mining (SDM ’08), April 24–26, Atlanta, pp 243–254
Branch JW, Giannella C, Szymanski B, Wolff R, Kargupta H (2012) In-network outlier detection in wireless sensor networks. Knowl Inf Syst. doi:10.1007/s10115-011-0474-5
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, May 16–18, Dallas, pp 93–104
Calderara S, Heinemann U, Prati A, Cucchiara R, Tishby N (2011) Detecting anomalies in people’s trajectories using spectral graph analysis. Comput Vis Image Underst 115(8):1099–1111
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):1–58
Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: Proceedings of the 25th international conference on machine learning (ICML ’08), July 5–9, Helsinki, pp 208–215
de Vries T, Chawla S, Houle M (2011) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst. doi:10.1007/s10115-011-0430-4
Dutta H, Giannella C, Borne K, Kargupta H (2007) Distributed top-k outlier detection in astronomy catalogs using the DEMAC system. In: Proceedings of the 2007 SIAM international conference on data mining (SDM ’07), April 26–28, Minneapolis, pp 208–215
Fine S, Mansour Y (2006) Active sampling for multiple output identification. In: Proceedings of the 19th annual conference on learning theory (COLT ’06), June 22–25, Pittsburgh, pp 620–634
Foss A, Zaïane OR (2011) Class separation through variance: a new application of outlier detection. Knowl Inf Syst 29(3):565–596
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Gao Y, Zheng B, Chen G, Li Q (2009) On efficient mutual nearest neighbor query processing in spatial databases. Data Knowl Eng 68(8):705–727
He J, Carbonell J (2007) Nearest-neighbor-based active learning for rare category detection. In: Advances in neural information processing systems (NIPS ’07), vol 20, December 3–6, Vancouver, pp 633–640
He J, Carbonell J (2009) Prior-free rare category detection. In: Proceedings of the 2009 SIAM international conference on data mining (SDM ’09), April 30–May 2, Sparks, pp 155–163
He J, Liu Y, Lawrence R (2008) Graph-based rare category detection. In: Proceedings of the 8th IEEE international conference on data mining (ICDM ’08), December 15–19, Pisa, pp 833–838
He J, Tong H, Carbonell J (2010) Rare category characterization. In: Proceedings of the 10th IEEE international conference on data mining (ICDM ’10), December 14–17, Sydney, pp 226–235
He Z, Deng S, Xu X, Huang JZ (2006) A fast greedy algorithm for outlier mining. In: Advances in knowledge discovery and data mining (PAKDD ’06), vol LNCS 3918, April 9–12, Singapore, pp 567–576
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9–10): 1641–1650
Hospedales T, Gong S, Xiang T (2011) Finding rare classes: adapting generative and discriminative models in active learning. In: Advances in knowledge discovery and data mining (PAKDD ’11), vol LNAI 6635, May 24–27, Shenzhen, pp 296–308
Huang H, He Q, He J, Ma L (2011) Radar: rare category detection via computation of boundary degree. In: Advances in knowledge discovery and data mining (PAKDD ’11), vol LNCS 6635, May 24–27, Shenzhen, pp 258–269
Jain P, Kapoor A (2009) Active learning for large multi-class problems. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR ’09), June 20–25, Miami Beach, pp 762–769
Jolliffe I (2002) Principal component analysis, 2nd edn. Springer, Heidelberg
Leung Y, Zhang JS, Xu ZB (2000) Clustering by scale-space filtering. IEEE Trans Pattern Anal Mach Intell 22(12):1396–1410
Linda O, Vollmer T, Manic M (2009) Neural network based intrusion detection system for critical infrastructures. In: Proceedings of the 2009 international joint conference on neural networks (IJCNN ’09), June 14–19, Atlanta, pp 1827–1834
Moore A (1991) A tutorial on kd-trees. University of Cambridge Computer Laboratory Technical Report
Moshtaghi M, Havens T, Bezdek J, Park L, Leckie C, Rajasegarar S, Keller J, Palaniswami M (2011) Clustering ellipses for anomaly detection. Pattern Recogn 44(1):55–69
Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Mining Knowl Discov 12(2–3):203–228
Pelleg D, Moore A (2004) Active learning for anomaly and rare-category detection. In: Advances in neural information processing systems (NIPS ’04), vol 18, December 13–18, Vancouver, pp 1073–1080
Porter R, Hush D, Harvey N, Theiler J (2010) Toward interactive search in remote sensing imagery. In: Proceedings of SPIE—the international society for optical engineering, vol 7709, April 5, Orlando
Rice JA (2006) Mathematical statistics and data analysis, 3rd edn. Duxbury Press, California
Roweis S (1998) EM algorithm for PCA and SPCA. In: Advances in neural information processing systems (NIPS ’98), November 30–December 5, Denver, pp 626–632
Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New York
Scott DW (2010) Histogram. WIREs Comput Stat 2(1):44–48
Settles B (2010) Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison
Sotiris VA, Tse PW, Pecht MG (2010) Anomaly detection through a Bayesian support vector machine. IEEE Trans Reliab 59(2):277–286
Tandon G, Chan P (2007) Weighting versus pruning in rule validation for detecting network and host anomalies. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, August 12–15, San Jose, pp 697–706
Vatturi P, Wong W-K (2009) Category detection using hierarchical mean shift. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, June 28–July 1, Paris, pp 847–856
Wang W, Zhou ZH (2010) A new analysis of co-training. In: Proceedings of the 27th international conference on machine learning (ICML ’10), June 21–24, Haifa, pp 1135–1142
Wu J, Xiong H, Wu P, Chen J (2007) Local decomposition for rare class analysis. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, August 12–15, San Jose, pp 814–823
Acknowledgments
This research was partly supported by the Ministry of Education-Intel IT Special Research Foundation under grant No. MOE-INTEL-11-06, in which the work of Kevin Chiew was partly supported by National Natural Science Foundation of China under Grant No. 60970081. The authors would like to thank Dr. Yunjun Gao from Zhejiang University for his advice and input during the preparation of this article.
Cite this article
Huang, H., He, Q., Chiew, K. et al. CLOVER: a faster prior-free approach to rare-category detection. Knowl Inf Syst 35, 713–736 (2013). https://doi.org/10.1007/s10115-012-0530-9