Abstract
Ensembles are often capable of greater prediction accuracy than any of their individual members. As a consequence of the diversity between the individual base-learners, an ensemble is less prone to overfitting. On the other hand, in many cases we are dealing with imbalanced data, and a classifier built using all of the data tends to ignore the minority class. As a solution to this problem, we propose to consider a large number of relatively small and balanced subsets, where representatives of both classes are selected randomly. Using different pre-processing techniques combined with available background knowledge, which may involve subjective judgement, we can generate many secondary databases for training. The relevance of these databases may be tested with five-fold cross-validation (CV5). Further, we can use the CV5 results to optimise the blending structure. Note that it is appropriate to use different software for the CV5 evaluation and for the computation of the final solution. Our model was tested online during the International Carvana Data Mining Contest on the Kaggle platform. The contest was highly popular, attracting 582 actively participating teams, and our team was awarded the 2nd prize.
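The core idea of balanced random sets can be sketched as follows: from an imbalanced dataset, repeatedly draw small subsets containing equal numbers of randomly chosen examples from each class, then train one base-learner per subset. This is a minimal illustrative sketch, not the authors' actual implementation; the function name, the subset size, and the toy class ratio are assumptions.

```python
import random

def balanced_random_subset(labels, size_per_class, seed=0):
    """Draw indices for one balanced subset: an equal number of
    randomly sampled examples from each class.
    (Hypothetical helper for illustration only.)"""
    rng = random.Random(seed)
    # Group example indices by their class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Sample the same number of indices from every class.
    subset = []
    for idxs in by_class.values():
        subset.extend(rng.sample(idxs, size_per_class))
    rng.shuffle(subset)
    return subset

# Toy imbalanced labels: 90 majority-class (0) and 10 minority-class (1) examples.
labels = [0] * 90 + [1] * 10

# A large number of small, balanced subsets, each with its own random seed;
# in the paper's scheme, each such subset would train one base-learner.
subsets = [balanced_random_subset(labels, 8, seed=s) for s in range(50)]
```

Each subset here holds 8 examples per class, so the minority class carries the same weight as the majority class in every base-learner, while the diversity across subsets comes from the random sampling of the (much larger) majority class.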
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nikulin, V. (2012). On the Homogeneous Ensembling with Balanced Random Sets and Boosting. In: Yao, J., et al. (eds.) Rough Sets and Current Trends in Computing. RSCTC 2012. Lecture Notes in Computer Science, vol. 7413. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32115-3_21
DOI: https://doi.org/10.1007/978-3-642-32115-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32114-6
Online ISBN: 978-3-642-32115-3
eBook Packages: Computer Science (R0)