Abstract
Rare categories are becoming increasingly abundant, yet their characterization has received little attention thus far. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging because of high skewness and non-separability from the majority classes; for example, fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm, which exploits the compactness property of rare categories. The algorithm is semi-supervised in nature, since it uses both labeled and unlabeled data. It is based on an optimization framework that encloses the rare examples in a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn solved effectively in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.
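To make the optimization idea concrete, the following is a minimal sketch of the classical minimum enclosing ball problem solved in its dual form by projected gradient ascent over the probability simplex. It is not the RACH objective itself: the abstract does not state how labeled and unlabeled examples enter the formulation, so that part is omitted here. The function names `project_simplex` and `min_enclosing_ball_dual`, the fixed step size, and the iteration count are illustrative assumptions. Because the function takes only a kernel (Gram) matrix, any positive semidefinite kernel can be substituted for the linear one.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def min_enclosing_ball_dual(K, n_iter=500):
    """Hypothetical helper: maximize sum_i a_i K_ii - a^T K a over the simplex.

    K is a symmetric positive semidefinite kernel matrix. Returns the dual
    weights and the squared radius of the enclosing ball in feature space.
    This is the plain enclosing-ball dual, NOT the RACH formulation.
    """
    n = K.shape[0]
    diag = np.diag(K).copy()
    alpha = np.full(n, 1.0 / n)               # start from uniform weights
    # The gradient is Lipschitz with constant 2 * lambda_max(K).
    lipschitz = max(2.0 * np.linalg.eigvalsh(K)[-1], 1e-12)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = diag - 2.0 * K @ alpha         # gradient of the concave dual
        alpha = project_simplex(alpha + step * grad)
    radius_sq = float(alpha @ diag - alpha @ K @ alpha)
    return alpha, radius_sq
```

With a linear kernel `K = X @ X.T`, the center of the ball can be recovered as `alpha @ X`; with a nonlinear kernel the center exists only implicitly in feature space, and distances to it are computed through kernel evaluations.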
Author information
Dr. Jingrui He is currently a research staff member at IBM T.J. Watson Research Center. She received her M.Sc. and Ph.D. from Carnegie Mellon University in 2008 and 2010, respectively, both in Machine Learning. Her research interests include developing scalable algorithms for rare category analysis, heterogeneous learning, and semi-supervised learning, with an emphasis on applications in social media analysis. She held an IBM Fellowship from 2008 to 2010. She also won second place in the ICDM 2010 Data Mining Contest on traffic prediction (both Task 2 and Task 3). She has published over 30 refereed articles and served on the organizing committees of ICML, KDD, etc.
Dr. Hanghang Tong is currently a research staff member at IBM T.J. Watson Research Center. Before that, he was a postdoctoral fellow at Carnegie Mellon University. He received his M.Sc. and Ph.D. from Carnegie Mellon University in 2008 and 2010, respectively, both in Machine Learning. His research interest is in large-scale data mining for graphs and multimedia. He has received several awards, including the best research paper award at ICDM 2006 and the best paper award at SDM 2008. He has published over 40 refereed articles and served as a program committee member of SIGKDD, PKDD, and WWW.
Dr. Jaime Carbonell is the Director of the Language Technologies Institute and Allen Newell Professor of Computer Science at Carnegie Mellon University. He received BS degrees in Physics and Mathematics from MIT, and M.Sc. and Ph.D. degrees in Computer Science from Yale University. His current research spans text mining, machine translation, automated summarization (where he invented the MMR search-diversity method), and computational proteomics. He is also an expert in Machine Learning, having edited three books and served as editor-in-chief of the Machine Learning Journal for four years. He recently invented Proactive Machine Learning, including its underlying decision-theoretic framework. Overall, he has published 300 articles and books. Dr. Carbonell has served on multiple governmental advisory committees, such as the Human Genome Committee of the National Institutes of Health, the Oak Ridge National Laboratory Scientific Advisory Board, the National Institute of Standards and Technology Interactive Systems Scientific Advisory Board, and the German Research Center for Artificial Intelligence (DFKI) Scientific Advisory Board.
Cite this article
He, J., Tong, H. & Carbonell, J. An effective framework for characterizing rare categories. Front. Comput. Sci. 6, 154–165 (2012). https://doi.org/10.1007/s11704-012-2861-9
DOI: https://doi.org/10.1007/s11704-012-2861-9