CLOVER: a faster prior-free approach to rare-category detection

  • Regular Paper
Knowledge and Information Systems

Abstract

Rare-category detection helps discover new rare classes in an unlabeled data set by selecting candidate data examples for labeling. Most existing approaches to rare-category detection require prior information about the data set and are not applicable without it. Prior-free algorithms address this problem without such information, but at the cost of high time complexity, which is no lower than \(O(dN^2)\), where \(N\) is the number of data examples in the data set and \(d\) is the dimension of the data set. In this paper, we propose CLOVER, a prior-free algorithm built on a novel rare-category criterion called local variation degree (LVD). LVD exploits the characteristics of rare classes to distinguish rare-class data examples from other types of data examples, and the data examples with maximum LVD values are passed to CLOVER for labeling. A notable improvement is that CLOVER’s time complexity is \(O(dN^{2-1/d})\) for \(d > 1\) or \(O(N\log N)\) for \(d = 1\). Extensive experiments on real data sets demonstrate the effectiveness and efficiency of our method in discovering new rare classes with lower time complexity.
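
To get a feel for the two bounds quoted in the abstract, the following sketch evaluates them with all constant factors ignored. The function names are illustrative and not taken from the paper, and the numbers are only proxies for operation counts, not measured running times.

```python
import math

def quadratic_bound(n: int, d: int) -> float:
    """Proxy for the O(d N^2) lower bound quoted for earlier prior-free algorithms."""
    return d * n ** 2

def clover_bound(n: int, d: int) -> float:
    """Proxy for the stated CLOVER bound: N log N for d = 1, d N^(2 - 1/d) for d > 1."""
    if d == 1:
        return n * math.log(n)
    return d * n ** (2 - 1 / d)

def bound_ratio(n: int, d: int) -> float:
    """How many times larger the quadratic proxy is than the CLOVER proxy."""
    return quadratic_bound(n, d) / clover_bound(n, d)

# Compare the two proxies for a fixed N and a few dimensions.
for d in (1, 2, 4, 16):
    print(f"d={d:2d}  ratio={bound_ratio(10_000, d):.1f}")
```

For \(d > 1\) the ratio of the two proxies is \(N^{1/d}\), so under these bounds the asymptotic gain is largest in low dimensions and shrinks toward 1 as \(d\) grows.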

Notes

  1. Each class only forms one cluster of data examples [34].

  2. Bins with an over-large bandwidth excessively shrink the density differentials and result in an over-smoothed histogram estimate, whereas bins with an over-small bandwidth usually divide local regions where data examples have similar local densities into many small pieces and result in an under-smoothed histogram estimate [38].
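
The bandwidth effect described in note 2 can be illustrated with a small sketch. The data below are hypothetical (an evenly spread majority class plus one compact minority cluster), and the helper is a plain equal-width histogram density estimate written for illustration; it is not the estimation procedure used in the article.

```python
def histogram_density(samples, n_bins, lo, hi):
    """Equal-width histogram density estimate on [lo, hi); bin width = (hi - lo) / n_bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
    return [c / (len(samples) * width) for c in counts]

# Hypothetical 1-D data: 1000 majority points spread evenly over [0, 10)
# and a compact minority cluster of 50 points inside [4.0, 4.2).
majority = [i / 100 for i in range(1000)]
minority = [4.0 + j * 0.004 for j in range(50)]
data = majority + minority

for n_bins in (2, 50, 5000):
    density = histogram_density(data, n_bins, 0.0, 10.0)
    empty = sum(1 for v in density if v == 0) / n_bins
    print(f"bins={n_bins:5d}  width={10 / n_bins:.3f}  "
          f"max={max(density):.3f}  min={min(density):.3f}  empty={empty:.2f}")
```

With two wide bins the minority cluster barely changes the estimate, with 50 bins it stands out as a clear peak, and with 5000 narrow bins most bins are empty and the estimate breaks into isolated fragments.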

References

  1. Agarwal D (2006) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inf Syst 11(1):29–44

  2. Aggarwal CC, Yu PS (2008) Outlier detection with uncertain data. In: Proceedings of the 2008 SIAM international conference on data mining (SDM ’08), April 24–26, Atlanta, pp 483–493

  3. Aggarwal CC, Yu PS (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196

  4. Ando S (2007) Clustering needles in a haystack: An information theoretic analysis of minority and outlier detection. In: Proceedings of the 7th IEEE international conference on data mining (ICDM ’07), October 28–31, Omaha, pp 13–22

  5. Bay S, Kumaraswamy K, Anderle M, Kumar R, Steier D (2006) Large scale detection of irregularities in accounting data. In: Proceedings of the 6th IEEE international conference on data mining (ICDM ’06), December 18–22, Hong Kong, pp 75–86

  6. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th annual conference on learning theory (COLT ’98), July 24–26, Madison, pp 92–100

  7. Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM international conference on data mining (SDM ’08), April 24–26, Atlanta, pp 243–254

  8. Branch JW, Giannella C, Szymanski B, Wolff R, Kargupta H (2012) In-network outlier detection in wireless sensor networks. Knowl Inf Syst. doi:10.1007/s10115-011-0474-5

  9. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, May 16–18, Dallas, pp 93–104

  10. Calderara S, Heinemann U, Prati A, Cucchiara R, Tishby N (2011) Detecting anomalies in people’s trajectories using spectral graph analysis. Comput Vis Image Underst 115(8):1099–1111

  11. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):1–58

  12. Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: Proceedings of the 25th international conference on machine learning (ICML ’08), July 5–9, Helsinki, pp 208–215

  13. de Vries T, Chawla S, Houle M (2011) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst. doi:10.1007/s10115-011-0430-4

  14. Dutta H, Giannella C, Borne K, Kargupta H (2007) Distributed top-k outlier detection in astronomy catalogs using the DEMAC system. In: Proceedings of the 2007 SIAM international conference on data mining (SDM ’07), April 26–28, Minneapolis, pp 208–215

  15. Fine S, Mansour Y (2006) Active sampling for multiple output identification. In: Proceedings of the 19th annual conference on learning theory (COLT ’06), June 22–25, Pittsburgh, pp 620–634

  16. Foss A, Zaïane OR (2011) Class separation through variance: a new application of outlier detection. Knowl Inf Syst 29(3):565–596

  17. Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml

  18. Gao Y, Zheng B, Chen G, Li Q (2009) On efficient mutual nearest neighbor query processing in spatial databases. Data Knowl Eng 68(8):705–727

  19. He J, Carbonell J (2007) Nearest-neighbor-based active learning for rare category detection. In: Advances in neural information processing systems (NIPS ’07), vol 20, December 3–6, Vancouver, pp 633–640

  20. He J, Carbonell J (2009) Prior-free rare category detection. In: Proceedings of the 2009 SIAM international conference on data mining (SDM ’09), April 30–May 2, Sparks, pp 155–163

  21. He J, Liu Y, Lawrence R (2008) Graph-based rare category detection. In: Proceedings of the 8th IEEE international conference on data mining (ICDM ’08), December 15–19, Pisa, pp 833–838

  22. He J, Tong H, Carbonell J (2010) Rare category characterization. In: Proceedings of the 10th IEEE international conference on data mining (ICDM ’10), December 14–17, Sydney, pp 226–235

  23. He Z, Deng S, Xu X, Huang JZ (2006) A fast greedy algorithm for outlier mining. In: Advances in knowledge discovery and data mining (PAKDD ’06), vol LNCS 3918, April 9–12, Singapore, pp 567–576

  24. He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9–10):1641–1650

  25. Hospedales T, Gong S, Xiang T (2011) Finding rare classes: adapting generative and discriminative models in active learning. In: Advances in knowledge discovery and data mining (PAKDD ’11), vol LNAI 6635, May 24–27, Shenzhen, pp 296–308

  26. Huang H, He Q, He J, Ma L (2011) Radar: rare category detection via computation of boundary degree. In: Advances in knowledge discovery and data mining (PAKDD ’11), vol LNCS 6635, May 24–27, Shenzhen, pp 258–269

  27. Jain P, Kapoor A (2009) Active learning for large multi-class problems. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR ’09), June 20–25, Miami Beach, pp 762–769

  28. Jolliffe I (2002) Principal component analysis, 2nd edn. Springer, Heidelberg

  29. Leung Y, Zhang JS, Xu ZB (2000) Clustering by scale-space filtering. IEEE Trans Pattern Anal Mach Intell 22(12):1396–1410

  30. Linda O, Vollmer T, Manic M (2009) Neural network based intrusion detection system for critical infrastructures. In: Proceedings of the 2009 international joint conference on neural networks (IJCNN ’09), June 14–19, Atlanta, pp 1827–1834

  31. Moore A (1991) A tutorial on kd-trees. University of Cambridge Computer Laboratory Technical Report

  32. Moshtaghi M, Havens T, Bezdek J, Park L, Leckie C, Rajasegarar S, Keller J, Palaniswami M (2011) Clustering ellipses for anomaly detection. Pattern Recogn 44(1):55–69

  33. Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Mining Knowl Discov 12(2–3):203–228

  34. Pelleg D, Moore A (2004) Active learning for anomaly and rare-category detection. In: Advances in neural information processing systems (NIPS ’04), vol 18, December 13–18, Vancouver, pp 1073–1080

  35. Porter R, Hush D, Harvey N, Theiler J (2010) Toward interactive search in remote sensing imagery. In: Proceedings of SPIE—the international society for optical engineering, vol 7709, April 5, Orlando

  36. Rice JA (2006) Mathematical statistics and data analysis, 3rd edn. Duxbury Press, California

  37. Roweis S (1998) EM algorithm for PCA and SPCA. In: Advances in neural information processing systems (NIPS ’98), November 30–December 5, Denver, pp 626–632

  38. Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New York

  39. Scott DW (2010) Histogram. WIREs Comput Stat 2(1):44–48

  40. Settles B (2010) Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison

  41. Sotiris VA, Tse PW, Pecht MG (2010) Anomaly detection through a Bayesian support vector machine. IEEE Trans Reliab 59(2):277–286

  42. Tandon G, Chan P (2007) Weighting versus pruning in rule validation for detecting network and host anomalies. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, August 12–15, San Jose, pp 697–706

  43. Vatturi P, Wong W-K (2009) Category detection using hierarchical mean shift. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, June 28–July 1, Paris, pp 847–856

  44. Wang W, Zhou ZH (2010) A new analysis of co-training. In: Proceedings of the 27th international conference on machine learning (ICML ’10), June 21–24, Haifa, pp 1135–1142

  45. Wu J, Xiong H, Wu P, Chen J (2007) Local decomposition for rare class analysis. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, August 12–15, San Jose, pp 814–823

Acknowledgments

This research was partly supported by the Ministry of Education-Intel IT Special Research Foundation under grant No. MOE-INTEL-11-06. The work of Kevin Chiew was partly supported by the National Natural Science Foundation of China under grant No. 60970081. The authors would like to thank Dr. Yunjun Gao from Zhejiang University for his advice and input during the preparation of this article.

Author information

Corresponding author

Correspondence to Kevin Chiew.

About this article

Cite this article

Huang, H., He, Q., Chiew, K. et al. CLOVER: a faster prior-free approach to rare-category detection. Knowl Inf Syst 35, 713–736 (2013). https://doi.org/10.1007/s10115-012-0530-9

  • DOI: https://doi.org/10.1007/s10115-012-0530-9
