Effect of Data Distribution in Parallel Mining of Associations

Data Mining and Knowledge Discovery

Abstract

Association rule mining is an important new problem in data mining, with crucial applications in decision support and marketing strategy. We propose FPM, an efficient parallel algorithm for mining association rules on a distributed shared-nothing parallel system. Its efficiency is attributed to two powerful candidate-set pruning techniques, distributed pruning and global pruning. Both are sensitive to two data distribution characteristics: data skewness and workload balance. The prunings are very effective when both skewness and balance are high. We have implemented FPM on an IBM SP2 parallel system. The performance studies show that FPM consistently outperforms CD, a parallel version of the representative Apriori algorithm (Agrawal and Srikant, 1994). The results also validate our observations on the effectiveness of the two pruning techniques with respect to the data distribution characteristics. Furthermore, they show that FPM has good scalability and parallelism, which can be tuned for different business applications.
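The abstract does not spell out the pruning rules, so the following is only a minimal sketch of the general setting, not the FPM algorithm itself. It assumes a database split into partitions, sums per-partition counts in the style of count distribution, and illustrates one well-known fact that partition-based pruning can exploit: an itemset that is globally frequent must be locally frequent in at least one partition. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, k):
    """Count every k-itemset that occurs in one partition's transactions."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), k):
            counts[itemset] += 1
    return counts


def globally_frequent(partitions, k, min_support):
    """Sum the per-partition counts and keep itemsets whose global
    support fraction is at least min_support."""
    total = sum(len(p) for p in partitions)
    merged = Counter()
    for p in partitions:
        merged.update(local_counts(p, k))
    return {s for s, c in merged.items() if c >= min_support * total}


def locally_frequent_somewhere(partitions, k, min_support):
    """Keep itemsets that meet min_support inside at least one partition.
    Any globally frequent itemset must be in this set, so it can serve
    as a reduced candidate set (an assumption-level illustration)."""
    survivors = set()
    for p in partitions:
        counts = local_counts(p, k)
        survivors |= {s for s, c in counts.items() if c >= min_support * len(p)}
    return survivors


# Hypothetical, deliberately skewed toy data: each partition favours
# a different pair of items.
partitions = [
    [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a", "c"}],
    [{"a", "b"}, {"c", "d"}, {"c", "d"}, {"c", "d"}],
]
frequent = globally_frequent(partitions, k=2, min_support=0.5)
candidates = locally_frequent_somewhere(partitions, k=2, min_support=0.5)
assert frequent <= candidates  # no globally frequent itemset is lost
```

In this toy data the pair ("a", "c") is not locally frequent in either partition, so it never needs a global count, while ("c", "d") survives as a candidate even though it turns out not to be globally frequent.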

References

  • Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data. pp. 207–216.

  • Agrawal, R. and Shafer, J.C. 1996. Parallel mining of association rules: Design, implementation and experience. Special Issue in Data Mining, IEEE Trans. on Knowledge and Data Engineering, IEEE Computer Society, 8(6):962–969.

  • Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. Proc. 1994 Int. Conf. Very Large Data Bases. Santiago, Chile, pp. 487–499.

  • Brin, S., Motwani, R., Ullman, J., and Tsur, S. 1997. Dynamic itemset counting and implication rules for market basket data. Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data. Tucson, Arizona, pp. 255–264.

  • Cheung, D.W., Han, J., Ng, V.T., Fu, A.W., and Fu, Y. 1996. A fast distributed algorithm for mining association rules. Proc. of 4th Int. Conf. on Parallel and Distributed Information Systems. Miami Beach, FL, pp. 31–43.

  • Cheung, D.W., Han, J., Ng, V.T., and Wong, C.Y. 1996. Maintenance of discovered association rules in large databases: An incremental updating technique. Proc. 1996 IEEE Int. Conf. on Data Engineering. New Orleans, Louisiana.

  • Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. John Wiley & Sons.

  • Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. 1995. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.

  • Han, J. and Fu, Y. 1995. Discovery of multiple-level association rules from large databases. Proc. 1995 Int. Conf. Very Large Data Bases. Zurich, Switzerland, pp. 420–431.

  • Han, E., Karypis, G., and Kumar, V. 1997. Scalable parallel data mining for association rules. Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data.

  • Int'l Business Machines. 1995. Scalable POWERparallel Systems, GA23-2475-02 edition.

  • MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, pp. 281–297.

  • Message Passing Interface Forum. 1994. MPI: A Message-Passing Interface Standard.

  • Ng, R., Lakshmanan, L., Han, J., and Pang, A. 1998. Exploratory mining and pruning optimizations of constrained association rules. Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data. Seattle, WA.

  • Park, J.S., Chen, M.S., and Yu, P.S. 1995a. An effective hash-based algorithm for mining association rules. Proc. 1995 ACM-SIGMOD Int. Conf. Management of Data. San Jose, CA, pp. 175–186.

  • Park, J.S., Chen, M.S., and Yu, P.S. 1995b. Efficient parallel data mining for association rules. Proc. 1995 Int. Conf. on Information and Knowledge Management. Baltimore, MD.

  • Savasere, A., Omiecinski, E., and Navathe, S. 1995. An efficient algorithm for mining association rules in large databases. Proc. 1995 Int. Conf. Very Large Data Bases. Zurich, Switzerland, pp. 432–444.

  • Shintani, T. and Kitsuregawa, M. 1996. Hash based parallel algorithms for mining association rules. Proc. of 4th Int. Conf. on Parallel and Distributed Information Systems.

  • Silberschatz, A., Stonebraker, M., and Ullman, J. 1995. Database research: achievements and opportunities into the 21st century. Report of an NSF Workshop on the Future of Database Systems Research.

  • Srikant, R. and Agrawal, R. 1995. Mining generalized association rules. Proc. 1995 Int. Conf. Very Large Data Bases. Zurich, Switzerland, pp. 407–419.

  • Srikant, R. and Agrawal, R. 1996a. Mining sequential patterns: Generalizations and performance improvements. Proc. of the 5th Int. Conf. on Extending Database Technology. Avignon, France.

  • Srikant, R. and Agrawal, R. 1996b. Mining quantitative association rules in large relational tables. Proc. 1996 ACM-SIGMOD Int. Conf. on Management of Data. Montreal, Canada.

  • Zaki, M.J., Ogihara, M., Parthasarathy, S., and Li, W. 1996. Parallel data mining for association rules on shared-memory multi-processors. Supercomputing'96, Pittsburgh, PA, Nov. 17–22.

Cite this article

Cheung, D.W., Xiao, Y. Effect of Data Distribution in Parallel Mining of Associations. Data Mining and Knowledge Discovery 3, 291–314 (1999). https://doi.org/10.1023/A:1009836926181

  • DOI: https://doi.org/10.1023/A:1009836926181