An effective framework for characterizing rare categories

He, Jingrui; Tong, Hanghang; Carbonell, Jaime

doi:10.1007/s11704-012-2861-9

Research Article
Published: 31 March 2012

Volume 6, pages 154–165, (2012)
Frontiers of Computer Science

Jingrui He¹,
Hanghang Tong¹ &
Jaime Carbonell²

Abstract

Rare categories are becoming increasingly abundant, yet their characterization has received little attention thus far. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging because the classes are highly skewed and the rare examples are often not separable from the majority classes; for example, fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm, which exploits the compactness property of rare categories. The algorithm is semi-supervised in nature since it uses both labeled and unlabeled data. It is based on an optimization framework that encloses the rare examples in a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn solved effectively in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.
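
To make the idea of a minimum-radius hyperball solved in its dual form more concrete, the following is a minimal sketch of the classical minimum enclosing ball problem, not the RACH formulation itself, which the abstract does not spell out. The dual maximizes sum_i alpha_i k(x_i, x_i) - alpha' K alpha over the probability simplex. Because this simplified dual is smooth, the projected subgradient step reduces to a projected gradient step. All function names, the fixed step size, and the iteration count are assumptions chosen for illustration.

```python
import math


def linear_kernel(a, b):
    """Plain dot product; any positive semidefinite kernel could be swapped in."""
    return sum(p * q for p, q in zip(a, b))


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum(a) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # keep the threshold of the last valid index
    return [max(x - theta, 0.0) for x in v]


def fit_enclosing_ball(points, kernel=linear_kernel, step=0.05, iterations=2000):
    """Projected gradient ascent on the dual of the minimum enclosing ball.

    Returns the dual weights alpha and the ball radius. The fixed step is an
    illustrative choice for small toy data; it should stay below
    1 / (2 * largest eigenvalue of the Gram matrix).
    """
    n = len(points)
    gram = [[kernel(a, b) for b in points] for a in points]
    alpha = [1.0 / n] * n  # start at the simplex center
    for _ in range(iterations):
        k_alpha = [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        gradient = [gram[i][i] - 2.0 * k_alpha[i] for i in range(n)]
        alpha = project_to_simplex([a + step * g for a, g in zip(alpha, gradient)])
    k_alpha = [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    radius_sq = sum(alpha[i] * (gram[i][i] - k_alpha[i]) for i in range(n))
    return alpha, math.sqrt(max(radius_sq, 0.0))


def squared_distance_to_center(x, points, alpha, kernel=linear_kernel):
    """Squared feature-space distance from x to the center sum_i alpha_i phi(x_i)."""
    cross = sum(a * kernel(p, x) for a, p in zip(alpha, points))
    quad = sum(
        ai * aj * kernel(pi, pj)
        for ai, pi in zip(alpha, points)
        for aj, pj in zip(alpha, points)
    )
    return kernel(x, x) - 2.0 * cross + quad


if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5)]
    weights, radius = fit_enclosing_ball(toy_points)
    print(weights, radius)
```

Only kernel evaluations appear in the sketch, which is why a formulation of this kind can be kernelized by passing a different `kernel` function. A point can then be scored by comparing `squared_distance_to_center` against the squared radius.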

Author information

Authors and Affiliations

IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598, USA
Jingrui He & Hanghang Tong
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Jaime Carbonell

Corresponding author

Correspondence to Jingrui He.

Additional information

Dr. Jingrui He is currently a research staff member at IBM T.J. Watson Research Center. She received her M.Sc. and Ph.D. from Carnegie Mellon University in 2008 and 2010, respectively, both in Machine Learning. Her research interests include developing scalable algorithms for rare category analysis, heterogeneous learning, and semi-supervised learning, with an emphasis on applications in social media analysis. She was the recipient of an IBM Fellowship between 2008 and 2010. She also won second place in the ICDM 2010 Data Mining Contest on traffic prediction (both Task 2 and Task 3). She has published over 30 refereed articles and served on the organizing committees of ICML, KDD, etc.

Dr. Hanghang Tong is currently a research staff member at IBM T.J. Watson Research Center. Before that, he was a post-doctoral fellow at Carnegie Mellon University. He received his M.Sc. and Ph.D. from Carnegie Mellon University in 2008 and 2010, respectively, both in Machine Learning. His research interest is in large-scale data mining for graphs and multimedia. He has received several awards, including the best research paper award at ICDM 2006 and the best paper award at SDM 2008. He has published over 40 refereed articles and served as a program committee member of SIGKDD, PKDD, and WWW.

Dr. Jaime Carbonell is the Director of the Language Technologies Institute and Allen Newell Professor of Computer Science at Carnegie Mellon University. He received BS degrees in Physics and Mathematics from MIT, and M.Sc. and Ph.D. degrees in Computer Science from Yale University. His current research spans multiple areas, including text mining, machine translation, automated summarization (where he invented the MMR search-diversity method), and computational proteomics. He is also an expert in Machine Learning, having edited three books and served as editor-in-chief of the Machine Learning Journal for four years. He recently invented Proactive Machine Learning, including its underlying decision-theoretic framework. Overall, he has published 300 articles and books. Dr. Carbonell has served on multiple governmental advisory committees, such as the Human Genome Committee of the National Institutes of Health, the Oak Ridge National Laboratory Scientific Advisory Board, the National Institute of Standards and Technology Interactive Systems Scientific Advisory Board, and the German Research Center for Artificial Intelligence (DFKI) Scientific Advisory Board.

About this article

Cite this article

He, J., Tong, H. & Carbonell, J. An effective framework for characterizing rare categories. Front. Comput. Sci. 6, 154–165 (2012). https://doi.org/10.1007/s11704-012-2861-9

Received: 31 May 2011
Accepted: 06 November 2011
Published: 31 March 2012
Issue Date: April 2012
DOI: https://doi.org/10.1007/s11704-012-2861-9
