Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Domingo, Carlos; Gavaldà, Ricard; Watanabe, Osamu

doi:10.1007/3-540-46846-3_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1721)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that can exploit large amounts of data. One possible approach to dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is hard to use because an appropriate sample size is difficult to determine. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling algorithm that solves a general problem covering many problems arising in applications of discovery science. The algorithm obtains examples sequentially in an on-line fashion, and it determines from the examples obtained so far whether it has already seen enough of them. Thus, the sample size is not fixed a priori; instead, it adapts to the situation. Owing to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with far fewer examples than are required in the worst case. To illustrate the generality of our approach, we also describe how different instantiations of it can be applied to scale up knowledge discovery problems that appear in several areas.
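
The idea of letting the data decide when to stop can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, whose details are not given in the abstract; it is a generic sequential test built on Hoeffding's inequality with a union bound over steps. The task (deciding whether the mean of 0/1 examples lies above or below a threshold), the function name `adaptive_threshold_test`, and the parameters `threshold`, `delta` and `max_samples` are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, Optional, Tuple


def adaptive_threshold_test(
    draw: Callable[[], float],
    threshold: float = 0.5,
    delta: float = 0.05,
    max_samples: int = 1_000_000,
) -> Tuple[Optional[bool], int]:
    """Draw examples in [0, 1] one at a time until the running mean is
    clearly separated from `threshold` (hypothetical illustration).

    Returns (decision, n): decision is True if the mean appears to be above
    the threshold, False if below, None if the cap was reached first;
    n is the number of examples consumed.
    """
    total = 0.0
    for t in range(1, max_samples + 1):
        total += draw()
        mean = total / t
        # Spend delta / (t * (t + 1)) of the failure budget at step t;
        # these terms sum to at most delta over all steps.
        delta_t = delta / (t * (t + 1))
        # Two-sided Hoeffding radius for t examples bounded in [0, 1].
        radius = math.sqrt(math.log(2.0 / delta_t) / (2.0 * t))
        if abs(mean - threshold) > radius:
            return mean > threshold, t
    return None, max_samples


def bernoulli(p: float, rng: random.Random) -> Callable[[], float]:
    """Return a sampler that yields 1.0 with probability p, else 0.0."""
    return lambda: 1.0 if rng.random() < p else 0.0


rng = random.Random(0)
easy = adaptive_threshold_test(bernoulli(0.9, rng))   # mean far from 0.5
hard = adaptive_threshold_test(bernoulli(0.55, rng))  # mean close to 0.5
print("easy case:", easy)
print("hard case:", hard)
```

In this sketch the number of examples consumed grows as the true mean approaches the threshold, so a source far from the threshold is resolved with far fewer examples than a fixed worst-case sample size would demand. The `max_samples` cap is a practical safeguard added here, since a mean exactly at the threshold would otherwise never trigger the stopping rule.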

Author information

Authors and Affiliations

Dept. of Math. and Comp. Science, Tokyo Institute of Technology, Tokyo, Japan
Carlos Domingo & Osamu Watanabe
Dept. of LSI, Universitat Politècnica de Catalunya, Barcelona, Spain
Ricard Gavaldà

Editor information

Editors and Affiliations

Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka, 812-8581, Japan
Setsuo Arikawa
Graduate School of Media and Governance, Keio University, 5322 Endoh, Fujisawa-shi, Kanagawa, 252-8520, Japan
Koichi Furukawa

About this paper

Cite this paper

Domingo, C., Gavaldà, R., Watanabe, O. (1999). Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms. In: Arikawa, S., Furukawa, K. (eds) Discovery Science. DS 1999. Lecture Notes in Computer Science, vol 1721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46846-3_16

DOI: https://doi.org/10.1007/3-540-46846-3_16
Published: 22 October 1999
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66713-1
Online ISBN: 978-3-540-46846-2
eBook Packages: Springer Book Archive
