Abstract
An ensemble of classifiers consists of a set of individually trained classifiers whose predictions are combined to classify new instances. In particular, boosting is an ensemble method in which the performance of weak classifiers is improved by focusing on “hard” examples that are difficult to classify. Recent studies indicate that boosting algorithms are applicable to a broad spectrum of problems with great success. However, boosting algorithms frequently over-emphasize the hard examples, leading to poor training and test set accuracies. Moreover, the knowledge acquired from such hard examples may be insufficient to improve the overall accuracy of the ensemble. This paper describes a new algorithm, DataBoost, that addresses these problems through data generation. In the DataBoost method, hard examples are identified during each iteration of the boosting algorithm and are then used to generate synthetic training data. These synthetic examples are added to the original training set and used for further training. The paper reports results on ten data sets, using both decision trees and neural networks as base classifiers. The experiments show promising results in terms of overall accuracy.
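The abstract outlines the loop (boost, identify hard examples, generate synthetic data, retrain) but not its details. The following is a minimal sketch of a DataBoost-style round, not the authors' exact algorithm: it assumes AdaBoost.M1-style weight updates, defines “hard” examples as those whose boosting weight falls in the top quantile, and generates synthetic examples by adding small Gaussian noise to the numeric attributes of hard examples while keeping their labels. The function name `databoost_sketch` and the parameters `hard_quantile`, `n_synth`, and the noise scale are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def databoost_sketch(X, y, T=10, hard_quantile=0.9, n_synth=1, seed=0):
    """Illustrative DataBoost-style boosting loop (a sketch, not the
    authors' specification). Assumes numeric attributes and an
    AdaBoost.M1-style weighting scheme."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w = np.full(len(X), 1.0 / len(X))            # uniform initial weights
    models, alphas = [], []
    for _ in range(T):
        clf = DecisionTreeClassifier(max_depth=3)  # weak base classifier
        clf.fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        if err == 0.0 or err >= 0.5:             # standard stopping rule
            break
        alpha = 0.5 * np.log((1.0 - err) / err)
        models.append(clf)
        alphas.append(alpha)
        # Multiplicative weight update: misclassified points gain weight.
        w *= np.exp(alpha * (pred != y))
        w /= w.sum()
        # DataBoost step (assumed mechanism): treat the highest-weight
        # points as "hard", generate perturbed copies, and append them
        # to the training set for subsequent rounds.
        hard = w >= np.quantile(w, hard_quantile)
        X_hard, y_hard = X[hard], y[hard]
        X_synth = np.repeat(X_hard, n_synth, axis=0)
        X_synth += rng.normal(0.0, 0.05 * X.std(axis=0), X_synth.shape)
        y_synth = np.repeat(y_hard, n_synth)
        X = np.vstack([X, X_synth])
        y = np.concatenate([y, y_synth])
        # Give synthetic points the average current weight, renormalize.
        w = np.concatenate([w, np.full(len(y_synth), w.mean())])
        w /= w.sum()
    return models, alphas
```

An ensemble prediction would then be the usual weighted vote over the returned classifiers, each model's vote scaled by its `alpha`; the quantile threshold and noise-based generation above stand in for whatever identification and generation scheme the paper actually uses.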
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Guo, H., Viktor, H.L. (2004). Boosting with Data Generation: Improving the Classification of Hard to Learn Examples. In: Orchard, B., Yang, C., Ali, M. (eds.) Innovations in Applied Artificial Intelligence. IEA/AIE 2004. Lecture Notes in Computer Science, vol. 3029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24677-0_111
Print ISBN: 978-3-540-22007-7
Online ISBN: 978-3-540-24677-0