Multi-Scaling Sampling: An Adaptive Sampling Method for Discovering Approximate Association Rules

Journal of Computer Science and Technology

Abstract

One obstacle to efficient association rule mining is the explosive growth of data sets: scanning a large database is costly or even impossible, especially when multiple passes are required. A popular way to improve the speed and scalability of association rule mining is to run the algorithm on a random sample instead of the entire database. However, how to effectively define and efficiently estimate the error in the algorithm's outcome, and how to determine the sample size needed, remain open problems. In this paper, an effective and efficient algorithm based on PAC (Probably Approximately Correct) learning theory is given to measure and estimate sample error. Then a new adaptive, on-line, fast sampling strategy, multi-scaling sampling, inspired by MRA (Multi-Resolution Analysis) and the Shannon sampling theorem, is presented for quickly obtaining acceptably approximate association rules at an appropriate sample size. Both theoretical analysis and empirical study show that the sampling strategy achieves a very good speed-accuracy trade-off.
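
As a rough illustration of the sample-size question raised in the abstract, the following Python sketch uses the standard Hoeffding bound to choose a sample size n such that the sampled support of a single itemset lies within epsilon of its true support with probability at least 1 - delta. It is not the paper's multi-scaling algorithm or its PAC-based error measure, which this preview does not describe; the function names and the toy transactions are assumptions made for illustration only.

```python
import math
import random


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n satisfying 2 * exp(-2 * n * epsilon**2) <= delta.

    Assumption: the bound covers ONE fixed itemset; bounding all
    itemsets at once would need a union bound or a similar argument.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def support(itemset, transactions) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def sampled_support(itemset, transactions, epsilon, delta, rng) -> float:
    """Estimate support from a uniform random sample drawn with replacement."""
    n = hoeffding_sample_size(epsilon, delta)
    sample = [rng.choice(transactions) for _ in range(n)]
    return support(itemset, sample)


if __name__ == "__main__":
    # Hypothetical toy database: each transaction is a frozenset of items.
    db = [
        frozenset({"a", "b"}),
        frozenset({"a"}),
        frozenset({"b", "c"}),
        frozenset({"a", "b", "c"}),
    ]
    rng = random.Random(0)
    print("sample size:", hoeffding_sample_size(0.05, 0.001))
    print("exact support:", support({"a", "b"}, db))
    print("sampled support:", sampled_support({"a", "b"}, db, 0.05, 0.001, rng))
```

The required sample size depends only on epsilon and delta, not on the database size, which is why a fixed bound of this kind can be very conservative. Adaptive strategies such as the one the abstract describes aim to stop earlier when the data allow it.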

Author information

Corresponding author

Correspondence to Cai-Yan Jia.

Additional information

Regular Paper. This work is partially supported by the CAS Project of Brain and Mind Science, Pre-973 Project 2001CCA03000, the National High Technology 863 Program of China under Grant No. 2001AA113130, the National Basic Research 973 Program of China under Grant No. 2001CB312004, Innovation Foundation of IOM, AMSS and ICT Projects, the National Natural Science Foundation of China under Grant Nos. 69733020 and 60375021, and the Natural Science Foundation of Hunan Province under Grant No. 03JJY3096.

Cai-Yan Jia is a postdoctoral researcher at the Department of Computer Science and Engineering, Fudan University. She received the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, in July 2004 and the M.S. degree from the Department of Mathematics of Xiangtan University, P.R. China, in July 2001. Her recent research interests include data mining, machine learning, computational intelligence and bioinformatics. She has published several papers in conferences and journals.

Xie-Ping Gao received the B.S. and M.S. degrees from Xiangtan University, P.R. China, in 1985 and 1988, respectively, and the Ph.D. degree from Hunan University, P.R. China, in 2003. Since July 1999, he has been a professor with the Mathematical Department and Information Engineering College, Xiangtan University. From December 2002 to December 2003, he was a visiting professor at the School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore. His current research interests are in the areas of wavelet analysis, neural networks, evolutionary computation, data mining, and image compression. He has co-authored more than 60 journal papers, conference papers, and book chapters.

About this article

Cite this article

Jia, CY., Gao, XP. Multi-Scaling Sampling: An Adaptive Sampling Method for Discovering Approximate Association Rules. J Comput Sci Technol 20, 309–318 (2005). https://doi.org/10.1007/s11390-005-0309-5

  • DOI: https://doi.org/10.1007/s11390-005-0309-5
