Abstract
Data scalability is a key issue in data mining and knowledge discovery. We address this problem by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method in the general category of ‘uncertainty sampling,’ obtained by adopting and extending the ‘query by bagging’ method proposed earlier by the authors as a query learning method. We empirically evaluate the effectiveness of the proposed method by comparing its performance against Breiman’s Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the proposed method compares favorably against Ivotes, both in the predictive accuracy achieved with a fixed amount of computation time and in the final accuracy attained. This advantage is especially pronounced when the data size approaches a million examples, a typical size encountered in real-world data mining applications. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at larger noise levels.
Supported in part by a Grant-in-Aid for Scientific Research on Priority Areas “Discovery Science” from the Ministry of Education, Science, Sports and Culture of Japan. This work was carried out while this author was with NEC and Tokyo Institute of Technology.
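To make the approach sketched in the abstract concrete, the following is a minimal illustration of committee-based uncertainty sampling in the spirit of query by bagging: a committee of classifiers is trained on bootstrap resamples of the labeled data, and the pool examples on which the committee disagrees most are selected next. This is only a sketch under stated assumptions (scikit-learn-style decision trees, a vote-entropy disagreement measure, and illustrative committee and batch sizes); it is not the authors' exact algorithm.

# Illustrative sketch of committee-based uncertainty sampling
# ("query by bagging" style). Committee size, disagreement measure,
# and batch size are assumptions, not the authors' exact algorithm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def select_uncertain(X_labeled, y_labeled, X_pool, n_committee=10, batch_size=100):
    """Return indices of the pool examples on which a bagged committee
    of classifiers disagrees most (highest vote entropy)."""
    votes = []
    for _ in range(n_committee):
        # Bootstrap-resample the labeled data and fit one committee member.
        Xb, yb = resample(X_labeled, y_labeled)
        votes.append(DecisionTreeClassifier().fit(Xb, yb).predict(X_pool))
    votes = np.asarray(votes)                      # shape (n_committee, n_pool)

    # Per-example distribution of committee votes over the known classes.
    classes = np.unique(y_labeled)
    counts = np.stack([(votes == c).sum(axis=0) for c in classes], axis=1)
    probs = counts / float(n_committee)

    # Vote entropy: high when the committee is split, zero when unanimous.
    safe = np.where(probs > 0, probs, 1.0)         # avoid log(0); term is 0 there
    entropy = -(probs * np.log(safe)).sum(axis=1)

    # The most-disputed examples are the ones selected for the next round.
    return np.argsort(entropy)[-batch_size:]

In a full selective-sampling loop, the selected examples would be added to the training set, the committee retrained, and the process repeated until a fixed computational budget is exhausted.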
References
N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. Proceedings of the Fifteenth International Conference on Machine Learning, 1–9, 1998.
R. Agrawal, T. Imielinski, and A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914–925, 1993.
L. Breiman. Bagging Predictors. Machine Learning, 24:123–140, 1996.
L. Breiman. Pasting Small Votes for Classification in Large Databases and On-line. Machine Learning, 36:85–103, 1999.
J. Catlett. Megainduction: A test flight. Proceedings of the Eighth International Workshop on Machine Learning, 596–599, 1991.
Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
J. Fürnkranz. Integrative windowing. Journal of Artificial Intelligence Research, 8:129–164, 1998.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT — Optimistic Decision Tree Construction. Proceedings of the ACM SIGMOD International Conference on Management of Data, 169–180, 1999.
D. Michie, D. Spiegelhalter, and C. Taylor (Editors). Machine Learning, Neural and Statistical Classification. Ellis Horwood, London, 1994.
F. Provost and V. Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery, 3(2):131–169, 1999.
J. R. Quinlan. Learning efficient classification procedures and their applications to chess endgames. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (Editors), Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann, San Francisco, 1983.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, 1993.
R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Proceedings of the 24th International Conference on Very Large Data Bases, 404–415, Morgan Kaufmann, New York, 1998.
H. S. Seung, M. Opper, and H. Sompolinsky. Query by Committee. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 287–294, ACM Press, New York, 1992.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, San Francisco, 1998.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Mamitsuka, H., Abe, N. (2002). Efficient Data Mining by Active Learning. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_17
DOI: https://doi.org/10.1007/3-540-45884-0_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43338-5
Online ISBN: 978-3-540-45884-5