Abstract
Many data mining tasks can be seen as instances of the problem of finding the most interesting (according to some utility function) patterns in a large database. In recent years, significant progress has been made in scaling algorithms for this task to very large databases through the use of sequential sampling techniques. However, except for sampling-based greedy algorithms, which cannot give absolute quality guarantees, the scalability of existing approaches is only with respect to the data, not with respect to the size of the pattern space: it is universally assumed that the entire hypothesis space fits in main memory. In this paper, we describe how this class of algorithms can be extended to hypothesis spaces that do not fit in memory while maintaining the algorithms' precise ε-δ quality guarantees. We present a constant-memory algorithm for this task and prove that it possesses the required properties. In an empirical study, we compare variable-memory and constant-memory sampling.
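To make the ε-δ guarantee concrete, the following is a minimal sketch, in Python, of the kind of fixed-memory sequential sampling scheme the paper generalizes; it is not the paper's own algorithm. The sketch assumes a utility function that is bounded in [0, 1] and averages over instances, uses Hoeffding confidence intervals with a simple union bound as the stopping rule, and keeps a counter for every hypothesis in memory. All function names and parameters are illustrative assumptions.

```python
import math

def hoeffding_half_width(delta_i, m):
    # Half-width of a two-sided Hoeffding confidence interval after m samples,
    # for a utility bounded in [0, 1], with failure probability delta_i.
    return math.sqrt(math.log(2.0 / delta_i) / (2.0 * m))

def sequential_top_n(hypotheses, utility, sample_stream, n, epsilon, delta,
                     max_samples=100_000):
    # Sample database records one at a time until, with probability at least
    # 1 - delta, every empirical utility estimate is within epsilon / 2 of its
    # true value; the empirical top n are then within epsilon of the truly best.
    # Note that this sketch keeps one counter per hypothesis, i.e. it requires
    # the whole hypothesis space in memory -- exactly the assumption the paper removes.
    sums = {h: 0.0 for h in hypotheses}
    m = 0
    while m < max_samples:
        x = next(sample_stream)              # one record drawn at random
        m += 1
        for h in hypotheses:
            sums[h] += utility(h, x)         # instance-averaging utility in [0, 1]
        # Crude union bound over all hypotheses and all possible stopping rounds.
        delta_i = delta / (len(hypotheses) * max_samples)
        if hoeffding_half_width(delta_i, m) <= epsilon / 2.0:
            break
    ranked = sorted(hypotheses, key=lambda h: sums[h] / m, reverse=True)
    return ranked[:n]
```

The paper's contribution, as stated in the abstract, is to maintain the same ε-δ guarantee with constant memory even when the hypothesis space itself does not fit in main memory.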
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Scheffer, T., Wrobel, S. (2002). A Scalable Constant-Memory Sampling Algorithm for Pattern Discovery in Large Databases. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_33
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0