Boosting with Data Generation: Improving the Classification of Hard to Learn Examples

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3029)

Abstract

An ensemble of classifiers consists of a set of individually trained classifiers whose predictions are combined to classify new instances. In particular, boosting is an ensemble method in which the performance of weak classifiers is improved by focusing on "hard" examples that are difficult to classify. Recent studies have indicated that the boosting algorithm is applicable to a broad spectrum of problems with great success. However, boosting algorithms frequently suffer from over-emphasizing the hard examples, leading to poor training and test set accuracies. Moreover, the knowledge acquired from such hard examples alone may be insufficient to improve the overall accuracy of the ensemble. This paper describes a new algorithm, DataBoost, that addresses these problems through data generation. In the DataBoost method, hard examples are identified during each iteration of the boosting algorithm and are then used to generate synthetic training data. These synthetic examples are added to the original training set and used for further training. The paper reports results of this approach on ten data sets, using both decision trees and neural networks as base classifiers. The experiments show promising results in terms of overall accuracy.
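To make the loop concrete, the sketch below shows one way a DataBoost-style iteration could look in Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the re-weighting follows standard AdaBoost.M1, while the hard-example threshold (hard_quantile), the Gaussian-jitter synthetic generator, and the scikit-learn decision-tree base learner are illustrative choices introduced here.

    # A minimal DataBoost-style sketch (illustrative only, not the authors' code).
    # Assumptions: numeric features, AdaBoost.M1-style re-weighting, Gaussian
    # jitter around hard examples as the synthetic-data generator, and a shallow
    # scikit-learn decision tree as the weak base classifier.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def databoost_sketch(X, y, n_rounds=10, hard_quantile=0.9, n_synth=20, seed=0):
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        w = np.full(len(X), 1.0 / len(X))          # uniform initial weights
        classifiers, alphas = [], []
        for _ in range(n_rounds):
            clf = DecisionTreeClassifier(max_depth=3)
            clf.fit(X, y, sample_weight=w)
            miss = clf.predict(X) != y
            err = w[miss].sum() / w.sum()
            if err == 0 or err >= 0.5:             # weak-learner stopping rule
                break
            alpha = 0.5 * np.log((1.0 - err) / err)
            classifiers.append(clf)
            alphas.append(alpha)
            # Hard examples: misclassified points carrying the largest weights.
            hard = np.flatnonzero(miss & (w >= np.quantile(w, hard_quantile)))
            if hard.size:
                # Generate synthetic neighbours of the hard examples.
                seeds = rng.choice(hard, size=n_synth)
                jitter = rng.normal(0.0, 0.05 * X.std(axis=0) + 1e-12,
                                    size=(n_synth, X.shape[1]))
                X = np.vstack([X, X[seeds] + jitter])
                y = np.concatenate([y, y[seeds]])
                w = np.concatenate([w, np.full(n_synth, w.mean())])
            # Exponential re-weighting over the (now augmented) training set.
            miss = clf.predict(X) != y
            w *= np.exp(alpha * miss)
            w /= w.sum()
        return classifiers, alphas

    def ensemble_predict(classifiers, alphas, X):
        # Weighted vote of the boosted classifiers, as in standard AdaBoost.
        classes = classifiers[0].classes_
        scores = np.zeros((len(X), len(classes)))
        for clf, a in zip(classifiers, alphas):
            idx = np.searchsorted(classes, clf.predict(X))
            scores[np.arange(len(X)), idx] += a
        return classes[scores.argmax(axis=1)]

The paper's actual generation procedure is richer than the Gaussian jitter used here; the sketch only captures the loop structure the abstract describes: boost, flag hard examples, synthesize new data from them, and continue training on the augmented set.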





Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guo, H., Viktor, H.L. (2004). Boosting with Data Generation: Improving the Classification of Hard to Learn Examples. In: Orchard, B., Yang, C., Ali, M. (eds.) Innovations in Applied Artificial Intelligence. IEA/AIE 2004. Lecture Notes in Computer Science (LNAI), vol. 3029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24677-0_111

  • DOI: https://doi.org/10.1007/978-3-540-24677-0_111

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22007-7

  • Online ISBN: 978-3-540-24677-0

  • eBook Packages: Springer Book Archive
