DOI: 10.1145/1233341.1233378
Article

Classifying imbalanced data using a bagging ensemble variation (BEV)

Published: 23 March 2007

Abstract

In many applications, the collected data are highly skewed: data from one class clearly dominate data from the other classes. Most existing classification systems that perform well on balanced data perform very poorly on imbalanced data, especially on the minority class. Existing work on improving classification quality on imbalanced data includes over-sampling, under-sampling, and methods that modify existing classification systems. This paper discusses the BEV system for classifying imbalanced data. The system builds on ideas from the "Bagging" classification ensemble. The motivation behind the scheme is to make maximal use of the minority class data without creating synthetic data or changing existing classification systems. Experimental results on real-world imbalanced data show the effectiveness of the system.
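
The abstract does not spell out the procedure, so the following is only a minimal sketch of one way a bagging-style ensemble could use all minority data without synthesizing examples. It assumes (this is not confirmed by the abstract) that the majority class is split into disjoint chunks of roughly minority-class size, each chunk is paired with the full minority set, an unmodified base learner is trained on each pair, and predictions are combined by majority vote. The names `bev_training_sets`, `train_centroid`, and `bev_predict` are hypothetical, and the nearest-mean learner is a toy stand-in for an off-the-shelf classifier such as a decision tree or SVM.

```python
import random
from collections import Counter


def bev_training_sets(majority, minority, seed=0):
    """Assumed reading of BEV: split the majority class into disjoint chunks
    of roughly minority size and pair every chunk with the full minority set."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    n_sets = max(1, len(shuffled) // len(minority))
    # Stride slicing yields n_sets disjoint chunks that together cover the majority class.
    return [shuffled[i::n_sets] + minority for i in range(n_sets)]


def train_centroid(examples):
    """Toy stand-in for any unmodified base learner: nearest class mean on 1-D data."""
    sums, counts = Counter(), Counter()
    for x, label in examples:
        sums[label] += x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def bev_predict(models, x):
    """Combine the ensemble members by simple majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical data: 90 majority points (label 0) in [-1, 1), 10 minority points (label 1) near 5.
majority = [(i / 45.0 - 1.0, 0) for i in range(90)]
minority = [(5.0 + i / 10.0, 1) for i in range(10)]

training_sets = bev_training_sets(majority, minority)
models = [train_centroid(s) for s in training_sets]
```

With this toy data, each of the nine training sets is balanced (ten majority and ten minority examples), every majority example is used exactly once, and every minority example is seen by every ensemble member.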

Published In

ACMSE '07: Proceedings of the 45th annual ACM Southeast Conference
March 2007
574 pages
ISBN:9781595936295
DOI:10.1145/1233341

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. decision tree
  3. imbalanced data
  4. machine learning
  5. support vector machine

Qualifiers

  • Article

Conference

ACM SE07: ACM Southeast Regional Conference
March 23-24, 2007
Winston-Salem, North Carolina

Acceptance Rates

ACMSE '07 Paper Acceptance Rate 81 of 137 submissions, 59%;
Overall Acceptance Rate 502 of 1,023 submissions, 49%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 17
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 03 Mar 2025

Cited By

  • (2024) BENN: Balanced Ensemble Neural Network for Handling Class Imbalance in Big Data. Expert Systems. DOI: 10.1111/exsy.13754. Online publication date: 13-Oct-2024
  • (2024) An ensemble model for addressing class imbalance and class overlap in software defect prediction. International Journal of System Assurance Engineering and Management, 15:12 (5584-5603). DOI: 10.1007/s13198-024-02538-x. Online publication date: 9-Nov-2024
  • (2024) Handling class overlap and imbalance using overlap driven under-sampling with balanced random forest in software defect prediction. Innovations in Systems and Software Engineering. DOI: 10.1007/s11334-024-00571-4. Online publication date: 18-Jun-2024
  • (2024) The Effect of Imbalanced Data on Machine Learning Algorithms. Inventive Communication and Computational Technologies (887-897). DOI: 10.1007/978-981-97-7710-5_69. Online publication date: 15-Dec-2024
  • (2023) Majority re-sampling via sub-class clustering for imbalanced datasets. Journal of Experimental & Theoretical Artificial Intelligence, 36:8 (1581-1596). DOI: 10.1080/0952813X.2023.2165715. Online publication date: 10-Jan-2023
  • (2023) Improving the accuracy of k-nearest neighbor (k-NN) using Synthetic Minority Oversampling Technique (SMOTE) and Gain Ratio (GR) for imbalanced class data. 2nd International Conference on Advanced Information Scientific Development (ICAISD) 2021: Innovating Scientific Learning for Deep Communication (030012). DOI: 10.1063/5.0128413. Online publication date: 2023
  • (2023) Class attention to regions of lesion for imbalanced medical image recognition. Neurocomputing, 555:C. DOI: 10.1016/j.neucom.2023.126577. Online publication date: 28-Oct-2023
  • (2022) A Survey of Different Approaches for the Class Imbalance Problem in Software Defect Prediction. International Journal of Software Science and Computational Intelligence, 14:1 (1-26). DOI: 10.4018/IJSSCI.301268. Online publication date: 3-Jun-2022
  • (2021) Image Analysis of Cells Acute Lymphoblastic Leukemia Using Ensemble Learning of Deep Bagging. 2021 IEEE 7th International Conference on Computing, Engineering and Design (ICCED) (1-4). DOI: 10.1109/ICCED53389.2021.9664867. Online publication date: 5-Aug-2021
  • (2021) Review of Classification Methods on Unbalanced Data Sets. IEEE Access, 9 (64606-64628). DOI: 10.1109/ACCESS.2021.3074243. Online publication date: 2021
