DOI: 10.1145/1743384.1743408

AdaOUBoost: adaptive over-sampling and under-sampling to boost the concept learning in large scale imbalanced data sets

Published: 29 March 2010

Abstract

Automatic concept learning from large-scale imbalanced data sets is a key issue in video semantic analysis and retrieval: for each concept, the training data contain far more negative examples than positive ones. Existing methods generally under-sample the majority negative examples or over-sample the minority positive examples to balance the class distribution of the training data. These methods have two main drawbacks: (1) the degree of re-sampling, a key factor that strongly affects performance, must be fixed in advance in most existing methods and is generally not the optimal choice; and (2) under-sampling may discard many useful negative examples. In addition, some works focus only on improving computational speed rather than accuracy. To address these issues, we propose a new approach and algorithm named AdaOUBoost (Adaptive Over-sampling and Under-sampling Boost). Its novelty lies in adaptively over-sampling the minority positive examples and under-sampling the majority negative examples to form different sub-classifiers, and then combining these sub-classifiers according to their accuracy into a strong classifier, which aims to exploit the whole training data fully and improve the performance of the class-imbalance learning classifier. In AdaOUBoost, a clustering-based under-sampling method first divides the majority negative examples into disjoint subsets. Then, for each subset of negative examples, the borderline-SMOTE (synthetic minority over-sampling technique) algorithm over-samples the positive examples to different sizes; a sub-classifier is trained on each resulting set, and these sub-classifiers are fused with different weights into a classifier for that subset. Finally, the classifiers obtained from all subsets of negative examples are combined into a strong classifier.
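The training procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: plain k-means stands in for the paper's clustering-based under-sampling, a basic SMOTE-style interpolation stands in for borderline-SMOTE, and a toy nearest-centroid scorer stands in for the real base learner; all function names and parameters are assumptions for exposition.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, rng=None):
    # Plain k-means: stand-in for the clustering-based under-sampling
    # step that splits the majority negatives into disjoint subsets.
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def smote_like(X_pos, n_new, k=3, rng=None):
    # SMOTE-style synthesis: interpolate a seed positive toward one of
    # its k nearest positive neighbours (borderline-SMOTE additionally
    # restricts seeds to positives near the class boundary).
    rng = rng or np.random.default_rng(0)
    out = np.empty((n_new, X_pos.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_pos))
        nn = np.argsort(((X_pos - X_pos[i]) ** 2).sum(-1))[1:k + 1]
        j = rng.choice(nn)
        out[t] = X_pos[i] + rng.random() * (X_pos[j] - X_pos[i])
    return out

def centroid_scorer(X_pos, X_neg):
    # Toy base learner: score = distance to the negative centroid minus
    # distance to the positive centroid (higher means more positive).
    cp, cn = X_pos.mean(axis=0), X_neg.mean(axis=0)
    return lambda X: (np.linalg.norm(X - cn, axis=1)
                      - np.linalg.norm(X - cp, axis=1))

def ada_ou_boost_sketch(X_pos, X_neg, n_subsets=3, rng=None):
    rng = rng or np.random.default_rng(0)
    labels = kmeans_labels(X_neg, n_subsets, rng=rng)
    members = []
    for j in range(n_subsets):
        Xn = X_neg[labels == j]          # one under-sampled negative subset
        if len(Xn) == 0:
            continue
        # Over-sample the positives up to this subset's size ("adaptive":
        # the amount of over-sampling follows the subset, not a fixed rate).
        n_new = max(0, len(Xn) - len(X_pos))
        Xp = (np.vstack([X_pos, smote_like(X_pos, n_new, rng=rng)])
              if n_new else X_pos)
        scorer = centroid_scorer(Xp, Xn)
        # Weight each sub-classifier by its training accuracy.
        scores = np.concatenate([scorer(Xp), scorer(Xn)])
        truth = np.concatenate([np.ones(len(Xp)), -np.ones(len(Xn))])
        acc = float(np.mean(np.sign(scores) == truth))
        members.append((acc, scorer))
    total = sum(a for a, _ in members)
    return lambda X: sum(a * s(X) for a, s in members) / total
```

The accuracy-weighted fusion at the end mirrors the abstract's idea of combining sub-classifiers by their accuracy, so every negative example contributes to some sub-classifier instead of being discarded.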
We compare AdaOUBoost with state-of-the-art methods on the TRECVID 2008 benchmark over all 20 concepts, and the results show that AdaOUBoost achieves superior performance on large-scale imbalanced data sets.
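TRECVID high-level feature runs are conventionally scored per concept with average precision over a ranked list of shots (the benchmark itself uses an inferred-AP estimate over incomplete judgements; the plain, non-interpolated metric below is a simplified illustration of what "superior performance" is measured by):

```python
def average_precision(ranked_relevance):
    # Non-interpolated AP over a fully ranked list of 0/1 relevance
    # judgements (1 = the shot contains the concept). Because the whole
    # list is ranked, the number of hits equals the number of relevant
    # shots.
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# e.g. relevant shots at ranks 1, 2 and 4: AP = (1/1 + 2/2 + 3/4) / 3
```

Averaging this score over the 20 concepts gives the mean average precision used to rank systems.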



Published In

MIR '10: Proceedings of the international conference on Multimedia information retrieval
March 2010
600 pages
ISBN:9781605588155
DOI:10.1145/1743384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. adaouboost
  2. class-imbalance learning
  3. concept annotation
  4. large scale

Qualifiers

  • Research-article

Conference

MIR '10
Sponsor:
MIR '10: International Conference on Multimedia Information Retrieval
March 29 - 31, 2010
Philadelphia, Pennsylvania, USA


Cited By

View all
  • (2023)Prediction of Oncology Drug Targets Based on Ensemble Learning and Sample Weight Updating2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)10.1109/BIBM58861.2023.10385773(3602-3609)Online publication date: 5-Dec-2023
  • (2022)Random Walk-steered Majority Undersampling2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)10.1109/SMC53654.2022.9945451(530-537)Online publication date: 9-Oct-2022
  • (2019)A critique of imbalanced data learning approaches for big data analyticsInternational Journal of Business Intelligence and Data Mining10.1504/ijbidm.2019.09996114:4(419-457)Online publication date: 11-Apr-2019
  • (2019)Hybrid of Intelligent Minority Oversampling and PSO-Based Intelligent Majority Undersampling for Learning from Imbalanced DatasetsIntelligent Systems Design and Applications10.1007/978-3-030-16660-1_74(760-769)Online publication date: 14-Apr-2019
  • (2017)MACE prediction of acute coronary syndrome via boosted resampling classification using electronic medical recordsJournal of Biomedical Informatics10.1016/j.jbi.2017.01.00166:C(161-170)Online publication date: 1-Feb-2017
  • (2016)Boosted Near-miss Under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasetsNeurocomputing10.1016/j.neucom.2014.05.096172(198-206)Online publication date: Jan-2016
  • (2014)RWO-Sampling: A random walk over-sampling approach to imbalanced data classificationInformation Fusion10.1016/j.inffus.2013.12.00320(99-116)Online publication date: Nov-2014
  • (2014)Hybrid negative example selection using visual and conceptual featuresMultimedia Tools and Applications10.1007/s11042-011-0886-y71:3(967-989)Online publication date: 1-Aug-2014
  • (2011)Clustering-based binary-class classification for imbalanced data sets2011 IEEE International Conference on Information Reuse & Integration10.1109/IRI.2011.6009578(384-389)Online publication date: Aug-2011
  • (2011)A normal distribution-based over-sampling approach to imbalanced data classificationProceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part I10.1007/978-3-642-25853-4_7(83-96)Online publication date: 17-Dec-2011
  • Show More Cited By
