Parallel mining of association rules from text databases

The Journal of Supercomputing

Abstract

In this paper, we propose a new algorithm named Parallel Multipass with Inverted Hashing and Pruning (PMIHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., sets of words) that need to be counted. The new PMIHP algorithm is a parallel version of our Multipass with Inverted Hashing and Pruning (MIHP) algorithm (Holt, Chung in: Proc of the 14th IEEE int’l conf on tools with artificial intelligence, 2002, pp 49–56), which was shown to be more efficient than other existing algorithms for mining text databases. PMIHP reduces the communication overhead between miners running on different processors because the miners work on their local databases asynchronously and prune the global candidates by using the Inverted Hashing and Pruning technique. Compared with the well-known Count Distribution algorithm (Agrawal, Shafer in: IEEE Trans Knowl Data Eng 8(6):962–969, 1996), PMIHP demonstrates superior performance for mining association rules in large text databases, and when the minimum support level is low, its speedup is superlinear as the number of processors increases. The experiments were performed on a cluster of Linux workstations using a collection of Wall Street Journal articles.
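
For readers unfamiliar with the Count Distribution baseline mentioned above, the following is a minimal sketch of one counting pass, not the authors' code and not PMIHP itself. It assumes each processor holds a partition of the documents, counts the same candidate itemsets locally, and that the local counts are then summed to decide which candidates meet the minimum support. The processors are simulated by a plain loop, and the names `local_counts`, `count_distribution_pass`, and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, candidates):
    """Count, within one processor's local documents, how many contain each candidate."""
    counts = Counter()
    for doc in partition:
        words = set(doc)
        for cand in candidates:
            if words.issuperset(cand):
                counts[cand] += 1
    return counts


def count_distribution_pass(partitions, candidates, min_support):
    """One pass: count locally, sum the counts globally, keep the frequent itemsets."""
    global_counts = Counter()
    for partition in partitions:  # each iteration stands in for one processor
        # Summing the local counters stands in for the global count exchange.
        global_counts.update(local_counts(partition, candidates))
    return {c: n for c, n in global_counts.items() if n >= min_support}


# Toy data (assumed for illustration): two processors, two documents each.
partitions = [
    [["stock", "market", "bank"], ["stock", "market"]],
    [["bank", "rate"], ["stock", "market", "rate"]],
]
vocabulary = sorted({w for part in partitions for doc in part for w in doc})
candidates = list(combinations(vocabulary, 2))  # all candidate word pairs
frequent = count_distribution_pass(partitions, candidates, min_support=3)
print(frequent)
```

In this scheme every processor must count every candidate and synchronize at each pass, which is the cost the abstract contrasts with PMIHP's asynchronous local mining and candidate pruning.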

References

  1. Agarwal RC, Aggarwal CC, Prasad VVV (2000) Depth first generation of long patterns. In: Proc of the 6th ACM SIGKDD int’l conf on knowledge discovery and data mining, 2000, pp 108–118

  2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proc of the 20th VLDB conf, 1994, pp 487–499

  3. Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969

  4. Agarwal R, Aggarwal C, Prasad V (2001) A tree projection algorithm for generation of frequent item sets. J Parallel Distrib Comput 61(3):350–371

  5. Bayardo RJ (1998) Efficiently mining long patterns from databases. In: Proc of ACM SIGMOD int’l conf on management of data, 1998, pp 85–93

  6. Brin S, Motwani R, Ullman J, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: Proc of ACM SIGMOD int’l conf on management of data, 1997, pp 255–264

  7. Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transaction databases. In: Proc of int’l conf on data engineering, 2001, pp 443–452

  8. Chen MS, Han J, Yu PS (1996) Data mining: an overview from a database perspective. IEEE Trans Knowl Data Eng 8(6):866–883

  9. Cheung DW, Ng VT, Fu AW, Fu Y (1996) Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8(6):911–922

  10. Cheung DW, Lee SD, Xiao Y (2002) Effect of data skewness and workload balance in parallel data mining. IEEE Trans Knowl Data Eng 14(3):498–514

  11. Chung SM, Yang J (1996) A parallel distributive join algorithm for cube-connected multiprocessors. IEEE Trans Parallel Distrib Syst 7(2):127–137

  12. Chung SM, Luo C (2004) Distributed mining of maximal frequent itemsets from databases on a cluster of workstations. In: Proc of the 4th IEEE/ACM int’l symp on cluster computing and the grid—CCGrid 2004, 2004

  13. Feldman R, Hirsh H (1998) Finding associations in collections of text. In: Michalski R, Bratko I, Kubat M (eds), Machine learning and data mining: methods and applications. Wiley, pp 223–240

  14. Feldman R, Dagan I, Hirsh H (1998) Mining text using keyword distributions. J Intell Inf Syst 10(3):281–300

  15. Fox C (1992) Lexical analysis and stoplists. In: Frakes W, Baeza-Yates R (eds), Information retrieval: data structures and algorithms. Prentice Hall, pp 102–130

  16. Gordon M, Dumais S (1998) Using latent semantic indexing for literature based discovery. J Am Soc Info Sci 49(8):674–685

  17. Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proc of the 1st IEEE int’l conf on data mining, 2001, pp 163–170

  18. Han EH, Karypis G, Kumar V (2000) Scalable parallel data mining for association rules. IEEE Trans Knowl Data Eng 12(3):337–352

  19. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proc of ACM SIGMOD int’l conf on management of data, 2000, pp 1–12

  20. Holt JD, Chung SM (2001) Multipass algorithms for mining association rules in text databases. Knowl Inf Syst 3(2):168–183

  21. Holt JD, Chung SM (2002) Mining association rules using inverted hashing and pruning. Inf Process Lett 83(4):211–220

  22. Holt JD, Chung SM (2002) Mining association rules in text databases using multipass with inverted hashing and pruning. In: Proc of the 14th IEEE int’l conf on tools with artificial intelligence, 2002, pp 49–56

  23. National Institute of Standards and Technology (NIST) (1997) Text Research Collection

  24. Orlando S, Palmerini P, Perego R (2001) Enhancing the apriori algorithm for frequent set counting. In: Proc of int’l conf on data warehousing and knowledge discovery, 2001, pp 71–82

  25. Park JS, Chen MS, Yu PS (1995) Efficient parallel data mining for association rules. In: Proc of ACM int’l conf on information and knowledge management, 1995, pp 31–36

  26. Park JS, Chen MS, Yu PS (1997) Using a hash-based method with transaction trimming for mining association rules. IEEE Trans Knowl Data Eng 9(5):813–825

  27. Salton G (1988) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley

  28. Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large databases. In: Proc of the 21st VLDB conf, 1995, pp 432–444

  29. Shintani T, Kitsuregawa M (1996) Hash based parallel algorithms for mining association rules. In: Proc of the 4th int’l conf on parallel and distributed information systems, 1996, pp 19–30

  30. Toivonen H (1996) Sampling large databases for association rules. In: Proc of the 22nd VLDB conf, 1996, pp 134–145

  31. Zaiane OR, El-Hajj M, Lu P (2001) Fast parallel association rule mining without candidacy generation. In: Proc of IEEE conf on data mining, 2001, pp 665–668

  32. Zaiane OR, Antonie ML (2002) Classifying text documents by associating terms with text categories. In: Proc of the 13th Australasian database conf, 2002

  33. Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) Parallel algorithms for fast discovery of association rules. Data Min Knowl Discov 1(4):343–373

  34. Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390

Author information

Corresponding author

Correspondence to Soon M. Chung.

Additional information

This research was supported in part by the Ohio Board of Regents, LexisNexis, and AFRL/Wright Brothers Institute (WBI).

About this article

Cite this article

Holt, J.D., Chung, S.M. Parallel mining of association rules from text databases. J Supercomput 39, 273–299 (2007). https://doi.org/10.1007/s11227-006-0008-1
