Abstract
A large amount of available information does not necessarily imply that induction algorithms must use all of it; samples often provide the same accuracy at a lower computational cost. We propose several effective techniques based on the idea of progressive sampling, in which progressively larger samples are used for training as long as model accuracy improves. Our sampling procedures combine all the models constructed on previously considered data samples. In addition to random sampling, we propose controllable sampling based on the boosting algorithm, where the models are combined using weighted voting. To further improve accuracy, an effective technique for pruning inaccurate models is also employed. Finally, a novel sampling procedure for spatial data domains is proposed, where data examples are drawn not only according to the performance of previous models, but also according to the spatial correlation of the data. Experiments performed on several data sets showed that the proposed sampling procedures outperformed standard progressive sampling in both the achieved accuracy and the level of data reduction.
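For concreteness, the random-sampling variant described above can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the base learner (a scikit-learn decision tree), the sample-growth schedule, the chance-level pruning threshold, and the use of validation accuracy as the vote weight are all assumptions made for this example; the paper's controllable variant additionally reweights the sampling distribution using boosting rather than sampling uniformly at random.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def weighted_vote(models, weights, X):
    """Combine the models' predictions by weighted majority voting."""
    preds = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_examples)
    classes = np.unique(preds)
    w = np.asarray(weights)[:, None]                   # one vote weight per model
    scores = np.stack([(w * (preds == c)).sum(axis=0) for c in classes])
    return classes[scores.argmax(axis=0)]              # class with highest weighted vote

def progressive_sampling(X, y, X_val, y_val, start=100, growth=2.0, tol=1e-3):
    rng = np.random.default_rng(0)
    models, weights, best = [], [], 0.0
    size = start
    while size <= len(X):
        idx = rng.choice(len(X), size=size, replace=False)    # draw a random sample
        m = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        acc = accuracy_score(y_val, m.predict(X_val))
        if acc > 1.0 / len(np.unique(y)):          # prune models no better than chance
            models.append(m)
            weights.append(acc)                    # vote weight = validation accuracy
            ens = accuracy_score(y_val, weighted_vote(models, weights, X_val))
            if ens <= best + tol:                  # stop once the ensemble stops improving
                return models, weights
            best = ens
        size = int(size * growth)                  # progressively larger samples
    return models, weights
```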
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Lazarevic, A., Obradovic, Z. (2001). Data Reduction Using Multiple Models Integration. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_25
DOI: https://doi.org/10.1007/3-540-44794-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive