research-article

Learning on the border: active learning in imbalanced data classification

Authors:

Lee Giles

CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management

Pages 127 - 136

https://doi.org/10.1145/1321440.1321461

Published: 06 November 2007

Abstract

This paper addresses the class imbalance problem, which is known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly fewer observations of the target concept. Various real-world classification tasks, such as medical diagnosis, text categorization, and fraud detection, suffer from this phenomenon. Standard machine learning algorithms yield better prediction performance with balanced datasets. In this paper, we demonstrate that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes. We also propose an efficient way of selecting informative instances from a smaller pool of samples for active learning, which does not necessitate a search through the entire dataset. The proposed method yields an efficient querying system and allows active learning to be applied to very large datasets. Our experimental results show that, with an early stopping criterion, active learning achieves a fast solution with competitive prediction performance in imbalanced data classification.
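
The small-pool querying idea in the abstract can be illustrated with a minimal sketch. It assumes that the "informative" instance is the one closest to the current decision boundary, a common margin-based heuristic that the abstract itself does not spell out. The names `query_from_small_pool`, `margin`, and `pool_size` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random
from typing import Callable, Sequence


def query_from_small_pool(
    unlabeled: Sequence[int],
    margin: Callable[[int], float],
    pool_size: int,
    rng: random.Random,
) -> int:
    """Return the index of the sampled candidate closest to the boundary.

    Only `pool_size` randomly drawn candidates are scored, so the cost of one
    query does not grow with the size of the unlabeled set.
    """
    k = min(pool_size, len(unlabeled))
    candidates = rng.sample(list(unlabeled), k)
    # Assumption: smallest absolute margin = most informative instance.
    return min(candidates, key=lambda i: abs(margin(i)))


# Toy 1-D data; the hypothetical model's decision boundary sits at x = 0,
# so the signed margin of a point is simply its coordinate.
points = [-5.0, -3.0, -0.2, 0.1, 2.5, 4.0, 6.0, 8.0]
rng = random.Random(0)
picked = query_from_small_pool(range(len(points)), lambda i: points[i], 4, rng)
print(picked, points[picked])
```

In a full active learning loop, the returned instance would be labeled, added to the training set, and the model retrained before the next query. The stopping rule is left out here because the abstract does not describe it.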

Cited By

Graells-Garrido E, García N, Carvallo A (2025). Tsundoku: A Python toolkit for social network analysis. SoftwareX, 29 (102008). Online publication date: Feb-2025.
https://doi.org/10.1016/j.softx.2024.102008
Alimoradi M, Sadeghi R, Daliri A, Zabihimayvan M (2025). Statistic Deviation Mode Balancer (SDMB): A novel sampling algorithm for imbalanced data. Neurocomputing (129484). Online publication date: Jan-2025.
https://doi.org/10.1016/j.neucom.2025.129484
Hasan S, Baitha A, Gangwar L, Kumar S (2024). Evaluation of resampling techniques for deep learning based identification of promising genotypes in sugarcane varietal trials. Indian Journal of Genetics and Plant Breeding (The), 84:01 (92-98). Online publication date: 10-Apr-2024.
https://doi.org/10.31742/ISGPB.84.1.8

Index Terms

Learning on the border: active learning in imbalanced data classification
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Logical and relational learning
        Inductive logic learning
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management

November 2007

1048 pages

ISBN:9781595938039

DOI:10.1145/1321440

Co-chair: Alberto H. F. Laender
Conference Chairs: André O. Falcão (Universidade de Lisboa, Portugal), Øystein Haug Olsen
General Chair: Mário J. Silva (Universidade de Lisboa, Portugal)
Program Chairs: Ricardo Baeza-Yates, Deborah L. McGuinness, Bjorn Olstad

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM07

CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

Total Citations: 224
Total Downloads: 3,327
Downloads (Last 12 months): 203
Downloads (Last 6 weeks): 24

Reflects downloads up to 20 Jan 2025

Citations

Li N, Qi Y, Li C, Zhao Z (2024). Active Learning for Data Quality Control: A Survey. Journal of Data and Information Quality, 16:2 (1-45). Online publication date: 11-May-2024.
https://dl.acm.org/doi/10.1145/3663369
Pei W, Xue B, Zhang M, Shang L, Yao X, Zhang Q (2024). A Survey on Unbalanced Classification: How Can Evolutionary Computation Help? IEEE Transactions on Evolutionary Computation, 28:2 (353-373). Online publication date: Apr-2024.
https://doi.org/10.1109/TEVC.2023.3257230
Wei J, Lin Y, Caesar H (2024). BaSAL: Size-Balanced Warm Start Active Learning for LiDAR Semantic Segmentation. 2024 IEEE International Conference on Robotics and Automation (ICRA) (18258-18264). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICRA57147.2024.10611122
Yang S, Cha K (2024). GMO-AC: Gaussian-based minority oversampling with adaptive outlier filtering and class overlap weighting. IEEE Access (1-1). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3518573
Ghislat G, Hernandez-Hernandez S, Piyawajanusorn C, Ballester P (2024). Data-centric challenges with the application and adoption of artificial intelligence for drug discovery. Expert Opinion on Drug Discovery, 19:11 (1297-1307). Online publication date: 24-Sep-2024.
https://doi.org/10.1080/17460441.2024.2403639
Ahmed J, Green R (2024). Cost aware LSTM model for predicting hard disk drive failures based on extremely imbalanced S.M.A.R.T. sensors data. Engineering Applications of Artificial Intelligence, 127:PB. Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1016/j.engappai.2023.107339
Guan L, Yuan X (2024). Label-free model evaluation and weighted uncertainty sample selection for domain adaptive instance segmentation. Engineering Applications of Artificial Intelligence, 127 (107204). Online publication date: Jan-2024.
https://doi.org/10.1016/j.engappai.2023.107204
