Abstract
Many data mining tasks can be seen as instances of the problem of finding the most interesting (according to some utility function) patterns in a large database. In recent years, significant progress has been made in scaling algorithms for this task to very large databases through the use of sequential sampling techniques. However, except for sampling-based greedy algorithms, which cannot give absolute quality guarantees, the scalability of existing approaches is only with respect to the data, not with respect to the size of the pattern space: it is universally assumed that the entire hypothesis space fits in main memory. In this paper, we describe how this class of algorithms can be extended to hypothesis spaces that do not fit in memory while maintaining the algorithms' precise ε-δ quality guarantees. We present a constant-memory algorithm for this task and prove that it possesses the required properties. In an empirical study, we compare variable-memory and constant-memory sampling.
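To make the ε-δ guarantee concrete, the following is a minimal sketch, in Python, of the kind of fixed-memory sequential sampling scheme the paper generalizes; it is not the paper's own algorithm. The sketch assumes a utility function that is bounded in [0, 1] and averages over instances, uses Hoeffding confidence intervals with a simple union bound as the stopping rule, and keeps a counter for every hypothesis in memory. All function names and parameters are illustrative assumptions.

```python
import math

def hoeffding_half_width(delta_i, m):
    # Half-width of a two-sided Hoeffding confidence interval after m samples,
    # for a utility bounded in [0, 1], with failure probability delta_i.
    return math.sqrt(math.log(2.0 / delta_i) / (2.0 * m))

def sequential_top_n(hypotheses, utility, sample_stream, n, epsilon, delta,
                     max_samples=100_000):
    # Sample database records one at a time until, with probability at least
    # 1 - delta, every empirical utility estimate is within epsilon / 2 of its
    # true value; the empirical top n are then within epsilon of the truly best.
    # Note that this sketch keeps one counter per hypothesis, i.e. it requires
    # the whole hypothesis space in memory -- exactly the assumption the paper removes.
    sums = {h: 0.0 for h in hypotheses}
    m = 0
    while m < max_samples:
        x = next(sample_stream)              # one record drawn at random
        m += 1
        for h in hypotheses:
            sums[h] += utility(h, x)         # instance-averaging utility in [0, 1]
        # Crude union bound over all hypotheses and all possible stopping rounds.
        delta_i = delta / (len(hypotheses) * max_samples)
        if hoeffding_half_width(delta_i, m) <= epsilon / 2.0:
            break
    ranked = sorted(hypotheses, key=lambda h: sums[h] / m, reverse=True)
    return ranked[:n]
```

The paper's contribution, as stated in the abstract, is to maintain the same ε-δ guarantee with constant memory even when the hypothesis space itself does not fit in main memory.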
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Scheffer, T., Wrobel, S. (2002). A Scalable Constant-Memory Sampling Algorithm for Pattern Discovery in Large Databases. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_33
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0