DOI: 10.1145/1321440.1321461
research-article

Learning on the border: active learning in imbalanced data classification

Published: 06 November 2007

Abstract

This paper is concerned with the class imbalance problem, which is known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly fewer observations of the target concept than of the other classes. Various real-world classification tasks, such as medical diagnosis, text categorization, and fraud detection, suffer from this phenomenon. Standard machine learning algorithms yield better prediction performance with balanced datasets. In this paper, we demonstrate that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes. We also propose an efficient way of selecting informative instances from a smaller pool of samples for active learning, which does not necessitate a search through the entire dataset. The proposed method yields an efficient querying system and allows active learning to be applied to very large datasets. Our experimental results show that, with an early stopping criterion, active learning achieves a fast solution with competitive prediction performance in imbalanced data classification.
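The querying strategy outlined in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the hinge-loss SGD trainer, the synthetic two-dimensional data, the function names, the `pool_size` and `budget` values, and the stopping rule (stop once the closest candidate lies outside the margin) are all assumptions chosen to keep the example self-contained. At each step it draws a small random pool of unlabeled instances, queries the one closest to the current hyperplane, and retrains.

```python
import random

def decision(w, x):
    """Signed score w.x + b; the bias is the last weight (constant feature 1.0)."""
    return sum(wi * xi for wi, xi in zip(w, x + (1.0,)))

def train_linear_svm(data, epochs=50, lam=0.01, seed=0):
    """Pegasos-style SGD on the L2-regularised hinge loss; data is [(x, y)], y in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * (len(data[0][0]) + 1)
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # shuffled pass over the data
            t += 1
            eta = 1.0 / (lam * t)
            violated = y * decision(w, x) < 1.0
            w = [wi * (1.0 - eta * lam) for wi in w]  # shrink (regularisation)
            if violated:  # hinge-loss subgradient step
                w = [wi + eta * y * xi for wi, xi in zip(w, x + (1.0,))]
    return w

def make_data(n_neg=980, n_pos=20, seed=0):
    """Synthetic imbalanced data (assumed for illustration): two 2-D Gaussians."""
    rng = random.Random(seed)
    neg = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_neg)]
    pos = [(rng.gauss(2.5, 1.0), rng.gauss(2.5, 1.0)) for _ in range(n_pos)]
    return neg + pos, [-1] * n_neg + [1] * n_pos

def active_learn(points, labels, budget=40, pool_size=50, seed=0):
    """Query up to `budget` labels, each chosen from a small random pool."""
    rng = random.Random(seed)
    # Start from one labeled example per class (assumed to be available).
    labeled = [labels.index(1), labels.index(-1)]
    unlabeled = set(range(len(points))) - set(labeled)
    w = train_linear_svm([(points[i], labels[i]) for i in labeled])
    for _ in range(budget):
        if not unlabeled:
            break
        # Inspect only a small random pool, not the whole unlabeled set.
        pool = rng.sample(sorted(unlabeled), min(pool_size, len(unlabeled)))
        best = min(pool, key=lambda i: abs(decision(w, points[i])))
        # Simplified early stopping (an assumption): nothing left inside the margin.
        if abs(decision(w, points[best])) >= 1.0:
            break
        labeled.append(best)  # "query the oracle" for this instance's label
        unlabeled.remove(best)
        w = train_linear_svm([(points[i], labels[i]) for i in labeled])
    return w, labeled

if __name__ == "__main__":
    points, labels = make_data()
    w, labeled = active_learn(points, labels)
    overall = labels.count(1) / len(labels)
    queried = sum(labels[i] == 1 for i in labeled) / len(labeled)
    print(f"minority fraction: full data {overall:.3f}, labeled set {queried:.3f}")
    print(f"labels used: {len(labeled)} of {len(points)}")
```

Because each query inspects only `pool_size` candidates, the cost of one selection step does not grow with the size of the dataset, which is the property the abstract highlights for very large datasets.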



Published In

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
November 2007
1048 pages
ISBN:9781595938039
DOI:10.1145/1321440

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. active learning
  2. imbalanced data
  3. support vector machines

Qualifiers

  • Research-article

Conference

CIKM07

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 203
  • Downloads (Last 6 weeks): 24
Reflects downloads up to 20 Jan 2025

Cited By

  • (2025) Tsundoku: A Python toolkit for social network analysis. SoftwareX 29 (102008). DOI: 10.1016/j.softx.2024.102008. Online publication date: Feb-2025
  • (2025) Statistic Deviation Mode Balancer (SDMB): A novel sampling algorithm for imbalanced data. Neurocomputing (129484). DOI: 10.1016/j.neucom.2025.129484. Online publication date: Jan-2025
  • (2024) Evaluation of resampling techniques for deep learning based identification of promising genotypes in sugarcane varietal trials. Indian Journal of Genetics and Plant Breeding (The) 84:01 (92-98). DOI: 10.31742/ISGPB.84.1.8. Online publication date: 10-Apr-2024
  • (2024) Active Learning for Data Quality Control: A Survey. Journal of Data and Information Quality 16:2 (1-45). DOI: 10.1145/3663369. Online publication date: 11-May-2024
  • (2024) A Survey on Unbalanced Classification: How Can Evolutionary Computation Help? IEEE Transactions on Evolutionary Computation 28:2 (353-373). DOI: 10.1109/TEVC.2023.3257230. Online publication date: Apr-2024
  • (2024) BaSAL: Size-Balanced Warm Start Active Learning for LiDAR Semantic Segmentation. 2024 IEEE International Conference on Robotics and Automation (ICRA) (18258-18264). DOI: 10.1109/ICRA57147.2024.10611122. Online publication date: 13-May-2024
  • (2024) GMO-AC: Gaussian-based minority oversampling with adaptive outlier filtering and class overlap weighting. IEEE Access (1-1). DOI: 10.1109/ACCESS.2024.3518573. Online publication date: 2024
  • (2024) Data-centric challenges with the application and adoption of artificial intelligence for drug discovery. Expert Opinion on Drug Discovery 19:11 (1297-1307). DOI: 10.1080/17460441.2024.2403639. Online publication date: 24-Sep-2024
  • (2024) Cost aware LSTM model for predicting hard disk drive failures based on extremely imbalanced S.M.A.R.T. sensors data. Engineering Applications of Artificial Intelligence 127:PB. DOI: 10.1016/j.engappai.2023.107339. Online publication date: 1-Jan-2024
  • (2024) Label-free model evaluation and weighted uncertainty sample selection for domain adaptive instance segmentation. Engineering Applications of Artificial Intelligence 127 (107204). DOI: 10.1016/j.engappai.2023.107204. Online publication date: Jan-2024
