Abstract
A large amount of available information does not necessarily imply that induction algorithms must use all of it; samples often provide the same accuracy at a lower computational cost. We propose several effective techniques based on the idea of progressive sampling, in which progressively larger samples are used for training as long as model accuracy improves. Our sampling procedures combine all the models constructed on previously considered data samples. In addition to random sampling, we propose controllable sampling based on the boosting algorithm, where the models are combined using weighted voting. To further improve accuracy, an effective technique for pruning inaccurate models is also employed. Finally, a novel sampling procedure for spatial data domains is proposed, where data examples are drawn not only according to the performance of previous models, but also according to the spatial correlation of the data. Experiments performed on several data sets showed that the proposed sampling procedures outperformed standard progressive sampling in both the achieved accuracy and the level of data reduction.
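For concreteness, the random-sampling variant described above can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the base learner (a scikit-learn decision tree), the sample-growth schedule, the chance-level pruning threshold, and the use of validation accuracy as the vote weight are all assumptions made for this example; the paper's controllable variant additionally reweights the sampling distribution using boosting rather than sampling uniformly at random.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def weighted_vote(models, weights, X):
    """Combine the models' predictions by weighted majority voting."""
    preds = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_examples)
    classes = np.unique(preds)
    w = np.asarray(weights)[:, None]                   # one vote weight per model
    scores = np.stack([(w * (preds == c)).sum(axis=0) for c in classes])
    return classes[scores.argmax(axis=0)]              # class with highest weighted vote

def progressive_sampling(X, y, X_val, y_val, start=100, growth=2.0, tol=1e-3):
    rng = np.random.default_rng(0)
    models, weights, best = [], [], 0.0
    size = start
    while size <= len(X):
        idx = rng.choice(len(X), size=size, replace=False)    # draw a random sample
        m = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        acc = accuracy_score(y_val, m.predict(X_val))
        if acc > 1.0 / len(np.unique(y)):          # prune models no better than chance
            models.append(m)
            weights.append(acc)                    # vote weight = validation accuracy
            ens = accuracy_score(y_val, weighted_vote(models, weights, X_val))
            if ens <= best + tol:                  # stop once the ensemble stops improving
                return models, weights
            best = ens
        size = int(size * growth)                  # progressively larger samples
    return models, weights
```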
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Lazarevic, A., Obradovic, Z. (2001). Data Reduction Using Multiple Models Integration. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_25
DOI: https://doi.org/10.1007/3-540-44794-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive