
An effective framework for characterizing rare categories

  • Research Article
  • Frontiers of Computer Science

Abstract

Rare categories are becoming increasingly abundant, yet their characterization has received little attention thus far. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging because of high skewness and non-separability from the majority classes; for example, fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm, which exploits the compactness property of rare categories. The algorithm is semi-supervised in nature, since it uses both labeled and unlabeled data. It is based on an optimization framework that encloses the rare examples in a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn solved effectively in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.
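To make the "minimum-radius hyperball solved in the dual" idea concrete, the sketch below treats only the simplest building block: the classical minimum enclosing ball of a point set, whose dual is a concave quadratic over the probability simplex. This is not the RACH algorithm itself. It ignores the majority-class examples, the labeled/unlabeled distinction, and any slack terms, and because this simplified dual is smooth it uses projected gradient ascent rather than a subgradient step. The function names `project_simplex` and `min_enclosing_ball`, the step size, and the iteration count are all assumptions made for illustration.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1} (sort-based)."""
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the target mass
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index that stays positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def min_enclosing_ball(X, n_iter=2000):
    """Approximate the smallest ball enclosing the rows of X.

    Dual problem (illustrative, not the RACH objective):
        maximize  sum_i a_i K_ii - a^T K a   subject to  a on the simplex,
    with K = X X^T. The center is the a-weighted mean of the points.
    """
    K = X @ X.T
    diag = np.diag(K)
    n = X.shape[0]
    alpha = np.full(n, 1.0 / n)
    # The gradient is Lipschitz with constant 2 * lambda_max(K); use step 1/L.
    lipschitz = max(2.0 * np.linalg.eigvalsh(K)[-1], 1e-12)
    for _ in range(n_iter):
        grad = diag - 2.0 * (K @ alpha)
        alpha = project_simplex(alpha + grad / lipschitz)
    center = alpha @ X
    # Report the radius that actually covers every point for this center.
    radius = np.sqrt(np.max(np.sum((X - center) ** 2, axis=1)))
    return center, radius


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 1.0]])
    center, radius = min_enclosing_ball(points)
    print("center:", center, "radius:", radius)
```

For the three demo points, the ball should come out centered close to (2, 0) with radius close to 2. The update loop touches the data only through the Gram matrix `K`, which is the property that lets hyperball formulations of this kind be kernelized; the explicit `center` and `radius` lines would need to be rewritten in terms of `K` to do so.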



Author information

Correspondence to Jingrui He.

Additional information

Dr. Jingrui He is currently a research staff member at IBM T.J. Watson Research Center. She received her M.Sc. and Ph.D. from Carnegie Mellon University in 2008 and 2010, respectively, both in Machine Learning. Her research interests include developing scalable algorithms for rare category analysis, heterogeneous learning, and semi-supervised learning, with an emphasis on applications in social media analysis. She held an IBM Fellowship from 2008 to 2010 and won second place in the ICDM 2010 Data Mining Contest on traffic prediction (both Task 2 and Task 3). She has published over 30 refereed articles and has served on the organizing committees of ICML, KDD, etc.

Dr. Hanghang Tong is currently a research staff member at IBM T.J. Watson Research Center. Before that, he was a postdoctoral fellow at Carnegie Mellon University. He received his M.Sc. and Ph.D. from Carnegie Mellon University in 2008 and 2010, respectively, both in Machine Learning. His research interest is large-scale data mining for graphs and multimedia. He has received several awards, including the best research paper award at ICDM 2006 and the best paper award at SDM 2008. He has published over 40 refereed articles and has served as a program committee member of SIGKDD, PKDD, and WWW.

Dr. Jaime Carbonell is the Director of the Language Technologies Institute and Allen Newell Professor of Computer Science at Carnegie Mellon University. He received BS degrees in Physics and Mathematics from MIT, and M.Sc. and Ph.D. degrees in Computer Science from Yale University. His current research spans text mining, machine translation, automated summarization (where he invented the MMR search-diversity method), and computational proteomics. He is also an expert in Machine Learning, having edited 3 books and served as editor-in-chief of the Machine Learning Journal for 4 years. He recently invented Proactive Machine Learning, including its underlying decision-theoretic framework. Overall, he has published 300 articles and books. Dr. Carbonell has served on multiple governmental advisory committees, such as the Human Genome Committee of the National Institutes of Health, the Oak Ridge National Laboratory Scientific Advisory Board, the National Institute of Standards and Technology Interactive Systems Scientific Advisory Board, and the German Research Center for Artificial Intelligence (DFKI) Scientific Advisory Board.

About this article

Cite this article

He, J., Tong, H. & Carbonell, J. An effective framework for characterizing rare categories. Front. Comput. Sci. 6, 154–165 (2012). https://doi.org/10.1007/s11704-012-2861-9

  • DOI: https://doi.org/10.1007/s11704-012-2861-9
