CLOVER: a faster prior-free approach to rare-category detection

Huang, Hao; He, Qinming; Chiew, Kevin; Qian, Feng; Ma, Lianhang

doi:10.1007/s10115-012-0530-9

Regular Paper
Published: 21 August 2012

Knowledge and Information Systems, Volume 35, pages 713–736 (2013)

Hao Huang¹, Qinming He¹, Kevin Chiew², Feng Qian¹ & Lianhang Ma¹

Abstract

Rare-category detection helps discover new rare classes in an unlabeled data set by selecting candidate data examples from those classes for labeling. Most existing approaches to rare-category detection require prior information about the data set and are not applicable without it. Prior-free algorithms address this problem without such prior information; however, the price is high time complexity, which is no lower than $O(dN^2)$, where $N$ is the number of data examples in the data set and $d$ is its dimension. In this paper, we propose CLOVER, a prior-free algorithm that introduces a novel rare-category criterion known as local variation degree (LVD). LVD utilizes the characteristics of rare classes to distinguish rare-class data examples from other types of data examples, and the data examples with maximum LVD values are selected for labeling. A notable improvement is that CLOVER’s time complexity is $O(dN^{2-1/d})$ for $d > 1$ or $O(N\log N)$ for $d = 1$. Extensive experimental results on real data sets demonstrate the effectiveness and efficiency of our method in discovering new rare classes with lower time complexity.
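
To put the two stated bounds side by side, the worked comparison below divides the $O(dN^2)$ lower bound of earlier prior-free methods by the bound given for CLOVER when $d > 1$. Constant factors hidden by the $O(\cdot)$ notation are ignored, and the values $N = 10^6$ and $d = 2, 3, 6$ are illustrative assumptions, not figures taken from the paper.

```latex
\[
\frac{dN^{2}}{dN^{2-1/d}} = N^{1/d},
\qquad
N = 10^{6}:\quad
N^{1/2} = 10^{3},\;\;
N^{1/3} = 10^{2},\;\;
N^{1/6} = 10 .
\]
```

Under these assumptions the asymptotic gain shrinks as the dimension grows, since $N^{1/d} \to 1$ as $d \to \infty$.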

Notes

Each class only forms one cluster of data examples [34].
Bins with an over-large bandwidth excessively shorten the density differentials and result in an over-smoothed histogram estimate, whereas bins with an over-small bandwidth usually divide the local regions where data examples have similar local densities into many small pieces and result in an under-smoothed histogram estimate [38].
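
The effect of bin bandwidth described in the second note can be illustrated with a minimal sketch. It is not the paper's LVD computation. It assumes synthetic one-dimensional data (a uniform majority class plus a compact rare cluster), and the helper names `histogram_density`, `peak_contrast`, and `roughness` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def histogram_density(x, bin_width):
    """Histogram density estimate of 1-D data with (at most) the given bin width."""
    lo, hi = x.min(), x.max()
    n_bins = max(1, int(np.ceil((hi - lo) / bin_width)))
    counts, edges = np.histogram(x, bins=n_bins, range=(lo, hi))
    density = counts / (counts.sum() * np.diff(edges))
    return counts, density

def peak_contrast(x, bin_width):
    """Peak density divided by the median density over non-empty bins.
    Values near 1 mean the rare cluster is smoothed away."""
    _, density = histogram_density(x, bin_width)
    nonempty = density[density > 0]
    return nonempty.max() / np.median(nonempty)

def roughness(x, bin_width):
    """Coefficient of variation of bin counts.
    Large values mean regions of similar density are fragmented by noise."""
    counts, _ = histogram_density(x, bin_width)
    return counts.std() / counts.mean()

# Synthetic data (an assumption for illustration): 5000 majority points spread
# uniformly over [0, 10) and 100 rare-class points packed tightly around 5.1.
rng = np.random.default_rng(0)
majority = rng.uniform(0.0, 10.0, size=5000)
rare = rng.normal(5.1, 0.03, size=100)
x = np.concatenate([majority, rare])

# Over-large, moderate, and over-small bandwidths.
for width in (5.0, 0.2, 0.001):
    print(f"width={width}: contrast={peak_contrast(x, width):.2f}, "
          f"roughness={roughness(x, width):.2f}")
```

With the over-large width the rare cluster barely changes the peak-to-median contrast; with the over-small width the uniform region itself breaks into many near-empty bins, so the roughness measure rises sharply.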

References

Agarwal D (2006) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst 11(1):29–44
Aggarwal CC, Yu PS (2008) Outlier detection with uncertain data. In: Proceedings of the 2008 SIAM international conference on data mining (SDM ’08), April 24–26, Atlanta, pp 483–493
Aggarwal CC, Yu PS (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
Ando S (2007) Clustering needles in a haystack: An information theoretic analysis of minority and outlier detection. In: Proceedings of the 7th IEEE international conference on data mining (ICDM ’07), October 28–31, Omaha, pp 13–22
Bay S, Kumaraswamy K, Anderle M, Kumar R, Steier D (2006) Large scale detection of irregularities in accounting data. In: Proceedings of the 6th IEEE international conference on data mining (ICDM ’06), December 18–22, Hong Kong, pp 75–86
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th annual conference on learning theory (COLT ’98), July 24–26, Madison, pp 92–100
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM international conference on data mining (SDM ’08), April 24–26, Atlanta, pp 243–254
Branch JW, Giannella C, Szymanski B, Wolff R, Kargupta H (2012) In-network outlier detection in wireless sensor networks. Knowl Inf Syst. doi:10.1007/s10115-011-0474-5
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, May 16–18, Dallas, pp 93–104
Calderara S, Heinemann U, Prati A, Cucchiara R, Tishby N (2011) Detecting anomalies in people’s trajectories using spectral graph analysis. Comput Vis Image Underst 115(8):1099–1111
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):1–58
Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: Proceedings of the 25th international conference on machine learning (ICML ’08), July 5–9, Helsinki, pp 208–215
de Vries T, Chawla S, Houle M (2011) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst. doi:10.1007/s10115-011-0430-4
Dutta H, Giannella C, Borne K, Kargupta H (2007) Distributed top-k outlier detection in astronomy catalogs using the DEMAC system. In: Proceedings of the 2007 SIAM international conference on data mining (SDM ’07), April 26–28, Minneapolis, pp 208–215
Fine S, Mansour Y (2006) Active sampling for multiple output identification. In: Proceedings of the 19th annual conference on learning theory (COLT ’06), June 22–25, Pittsburgh, pp 620–634
Foss A, Zaïane OR (2011) Class separation through variance: a new application of outlier detection. Knowl Inf Syst 29(3):565–596
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Gao Y, Zheng B, Chen G, Li Q (2009) On efficient mutual nearest neighbor query processing in spatial databases. Data Knowl Eng 68(8):705–727
He J, Carbonell J (2007) Nearest-neighbor-based active learning for rare category detection. In: Advances in neural information processing systems (NIPS ’07), vol 20, December 3–6, Vancouver, pp 633–640
He J, Carbonell J (2009) Prior-free rare category detection. In: Proceedings of the 2009 SIAM international conference on data mining (SDM ’09), April 30–May 2, Sparks, pp 155–163
He J, Liu Y, Lawrence R (2008) Graph-based rare category detection. In: Proceedings of the 8th IEEE international conference on data mining (ICDM ’08), December 15–19, Pisa, pp 833–838
He J, Tong H, Carbonell J (2010) Rare category characterization. In: Proceedings of the 10th IEEE international conference on data mining (ICDM ’10), December 14–17, Sydney, pp 226–235
He Z, Deng S, Xu X, Huang JZ (2006) A fast greedy algorithm for outlier mining. In: Advances in knowledge discovery and data mining (PAKDD ’06), vol LNCS 3918, April 9–12, Singapore, pp 567–576
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9–10):1641–1650
Hospedales T, Gong S, Xiang T (2011) Finding rare classes: adapting generative and discriminative models in active learning. In: Advances in knowledge discovery and data mining (PAKDD ’11), vol LNAI 6635, May 24–27, Shenzhen, pp 296–308
Huang H, He Q, He J, Ma L (2011) Radar: rare category detection via computation of boundary degree. In: Advances in knowledge discovery and data mining (PAKDD ’11), vol LNCS 6635, May 24–27, Shenzhen, pp 258–269
Jian P, Kapoor A (2009) Active learning for large multi-class problems. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR ’09), June 20–25, Miami Beach, pp 762–769
Jolliffe I (2002) Principal component analysis, 2nd edn. Springer, Heidelberg
Leung Y, Zhang JS, Xu ZB (2000) Clustering by scale-space filtering. IEEE Trans Pattern Anal Mach Intell 22(12):1396–1410
Linda O, Vollmer T, Manic M (2009) Neural network based intrusion detection system for critical infrastructures. In: Proceedings of the 2009 international joint conference on neural networks (IJCNN ’09), June 14–19, Atlanta, pp 1827–1834
Moore A (1991) A tutorial on kd-trees. Technical Report, University of Cambridge Computer Laboratory
Moshtaghi M, Havens T, Bezdek J, Park L, Leckie C, Rajasegarar S, Keller J, Palaniswami M (2011) Clustering ellipses for anomaly detection. Pattern Recogn 44(1):55–69
Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Mining Knowl Discov 12(2–3):203–228
Pelleg D, Moore A (2004) Active learning for anomaly and rare-category detection. In: Advances in neural information processing systems (NIPS ’04), vol 18, December 13–18, Vancouver, pp 1073–1080
Porter R, Hush D, Harvey N, Theiler J (2010) Toward interactive search in remote sensing imagery. In: Proceedings of SPIE—the international society for optical engineering, vol 7709, April 5 Orlando
Rice JA (2006) Mathematical statistics and data analysis, 3rd edn. Duxbury Press, California
Roweis S (1998) EM algorithm for PCA and SPCA. In: Advances in neural information processing systems (NIPS ’98), November 30–December 5, Denver, pp 626–632
Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New York
Scott DW (2010) Histogram. WIREs Comput Stat 2(1):44–48
Settles B (2010) Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison
Sotiris VA, Tse PW, Pecht MG (2010) Anomaly detection through a Bayesian support vector machine. IEEE Trans Reliab 59(2):277–286
Tandon G, Chan P (2007) Weighting versus pruning in rule validation for detecting network and host anomalies. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, August 12–15, San Jose, pp 697–706
Vatturi P, Wong W-K (2009) Category detection using hierarchical mean shift. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, June 28–July 1, Paris, pp 847–856
Wang W, Zhou ZH (2010) A new analysis of co-training. In: Proceedings of the 27th international conference on machine learning (ICML ’10), June 21–24, Haifa, pp 1135–1142
Wu J, Xiong H, Wu P, Chen J (2007) Local decomposition for rare class analysis. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, August 12–15, San Jose, pp 814–823

Acknowledgments

This research was partly supported by the Ministry of Education-Intel IT Special Research Foundation under grant No. MOE-INTEL-11-06, in which the work of Kevin Chiew was partly supported by National Natural Science Foundation of China under Grant No. 60970081. The authors would like to thank Dr. Yunjun Gao from Zhejiang University for his advice and input during the preparation of this article.

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou, People’s Republic of China
Hao Huang, Qinming He, Feng Qian & Lianhang Ma
School of Engineering, Tan Tao University, Duc Hoa District, Long An Province, Vietnam
Kevin Chiew

Corresponding author

Correspondence to Kevin Chiew.

About this article

Cite this article

Huang, H., He, Q., Chiew, K. et al. CLOVER: a faster prior-free approach to rare-category detection. Knowl Inf Syst 35, 713–736 (2013). https://doi.org/10.1007/s10115-012-0530-9

Received: 23 August 2011
Revised: 10 June 2012
Accepted: 28 July 2012
Published: 21 August 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10115-012-0530-9
