Abstract
Because discovering association rules in a very large database is time consuming, researchers have developed many algorithms to improve efficiency. Sampling can significantly reduce the cost of mining, since the mining algorithm then deals with only a small dataset rather than the original database. In particular, when data arrives as a stream faster than it can be processed, sampling may be the only option. How to sample the data, and how large the sample must be for a given error bound and confidence level, are key issues for particular data mining tasks. In this paper, we derive a sufficient sample size, based on the central limit theorem, for sampling large datasets with replacement. This approach requires a smaller sample size than one based on Chernoff bounds and is effective for association rule mining. The effectiveness of the method has been evaluated on both dense and sparse datasets.
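The comparison claimed in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the paper's exact derivation: it uses the standard normal-approximation bound n ≥ z²p(1−p)/ε² with worst-case variance p = 0.5, against the common Chernoff/Hoeffding-style bound n ≥ ln(2/δ)/(2ε²). Function names are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist

def clt_sample_size(eps: float, delta: float, p: float = 0.5) -> int:
    """Sufficient sample size via the central limit theorem
    (normal approximation), with worst-case variance p(1 - p)."""
    z = NormalDist().inv_cdf(1 - delta / 2)  # two-sided normal quantile
    return ceil(z * z * p * (1 - p) / (eps * eps))

def chernoff_sample_size(eps: float, delta: float) -> int:
    """Chernoff/Hoeffding-style distribution-free bound."""
    return ceil(log(2 / delta) / (2 * eps * eps))

# Error bound 1%, confidence 95%: the CLT-based size is roughly half.
print(clt_sample_size(0.01, 0.05))       # -> 9604
print(chernoff_sample_size(0.01, 0.05))  # -> 18445
```

For ε = 0.01 and δ = 0.05 the CLT-based bound asks for about 9,604 transactions versus about 18,445 for the Chernoff-style bound, which is consistent with the abstract's claim that the CLT approach needs a smaller sample.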
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Li, Y., Gopalan, R.P. (2004). Effective Sampling for Mining Association Rules. In: Webb, G.I., Yu, X. (eds) AI 2004: Advances in Artificial Intelligence. Lecture Notes in Computer Science, vol 3339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30549-1_35
DOI: https://doi.org/10.1007/978-3-540-30549-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24059-4
Online ISBN: 978-3-540-30549-1
eBook Packages: Computer Science (R0)