Abstract
In this paper, we propose a new algorithm named Parallel Multipass with Inverted Hashing and Pruning (PMIHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., sets of words) that need to be counted. The new PMIHP algorithm is a parallel version of our Multipass with Inverted Hashing and Pruning (MIHP) algorithm (Holt, Chung in: Proc of the 14th IEEE int’l conf on tools with artificial intelligence, 2002, pp 49–56), which was shown to be more efficient than other existing algorithms for mining text databases. The PMIHP algorithm reduces the communication overhead between miners running on different processors because the miners mine their local databases asynchronously and prune global candidates using the Inverted Hashing and Pruning technique. Compared with the well-known Count Distribution algorithm (Agrawal, Shafer in: IEEE Trans Knowl Data Eng 8(6):962–969, 1996), PMIHP performs better when mining association rules in large text databases, and when the minimum support level is low, its speedup is superlinear as the number of processors increases. The experiments were performed on a cluster of Linux workstations using a collection of Wall Street Journal articles.
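The pruning idea named in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the abstract does not specify the data structures, so the per-word hash table of transaction IDs, the `tid % num_buckets` hash function, and all function names below are assumptions made for illustration. The sketch keeps, for each word, a small table counting how many of its documents fall into each hash bucket. For a candidate word set, summing the per-bucket minimum over its words gives an upper bound on its support, so a candidate whose bound is below the minimum support can be discarded without scanning the documents.

```python
from itertools import combinations


def build_tid_hash_tables(transactions, num_buckets):
    """For each word, count how many of its document IDs hash to each bucket."""
    tables = {}
    for tid, words in enumerate(transactions):
        bucket = tid % num_buckets  # assumed hash function, for illustration only
        for word in set(words):
            tables.setdefault(word, [0] * num_buckets)[bucket] += 1
    return tables


def support_upper_bound(itemset, tables, num_buckets):
    """Sum of per-bucket minimum counts; never smaller than the true support."""
    empty = [0] * num_buckets
    return sum(
        min(tables.get(word, empty)[b] for word in itemset)
        for b in range(num_buckets)
    )


def exact_support(itemset, transactions):
    """Count the documents that contain every word of the itemset."""
    wanted = set(itemset)
    return sum(1 for words in transactions if wanted <= set(words))


def frequent_pairs(transactions, min_support, num_buckets=4):
    """Return frequent word pairs and the number of candidates pruned early."""
    tables = build_tid_hash_tables(transactions, num_buckets)
    frequent_words = sorted(
        word for word, table in tables.items() if sum(table) >= min_support
    )
    result, pruned = {}, 0
    for pair in combinations(frequent_words, 2):
        if support_upper_bound(pair, tables, num_buckets) < min_support:
            pruned += 1  # discarded without scanning the documents
            continue
        count = exact_support(pair, transactions)
        if count >= min_support:
            result[pair] = count
    return result, pruned


if __name__ == "__main__":
    docs = [
        ["stock", "market", "rise"],
        ["stock", "market", "fall"],
        ["bank", "rate", "rise"],
        ["bank", "rate", "market"],
        ["stock", "bank", "rate"],
    ]
    pairs, pruned = frequent_pairs(docs, min_support=2)
    print(pairs, pruned)
```

In a parallel setting, each miner could apply the same bound to candidates received from other processors before counting them locally, which is one plausible reading of how pruning global candidates reduces communication and counting work. The abstract itself does not give that level of detail.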
References
Agarwal RC, Aggarwal CC, Prasad VVV (2000) Depth first generation of long patterns. In: Proc of the 6th ACM SIGKDD int’l conf on knowledge discovery and data mining, 2000, pp 108–118
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proc of the 20th VLDB conf, 1994, pp 487–499
Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969
Agarwal R, Aggarwal C, Prasad V (2001) A tree projection algorithm for generation of frequent item sets. J Parallel Distrib Comput 61(3):350–371
Bayardo RJ (1998) Efficiently mining long patterns from databases. In: Proc of ACM SIGMOD int’l conf on management of data, 1998, pp 85–93
Brin S, Motwani R, Ullman J, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: Proc of ACM SIGMOD int’l conf on management of data, 1997, pp 255–264
Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transaction databases. In: Proc of int’l conf on data engineering, 2001, pp 443–452
Chen MS, Han J, Yu PS (1996) Data mining: an overview from a database perspective. IEEE Trans Knowl Data Eng 8(6):866–883
Cheung DW, Ng VT, Fu AW, Fu Y (1996) Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8(6):911–922
Cheung DW, Lee SD, Xiao Y (2002) Effect of data skewness and workload balance in parallel data mining. IEEE Trans Knowl Data Eng 14(3):498–514
Chung SM, Yang J (1996) A parallel distributive join algorithm for cube-connected multiprocessors. IEEE Trans Parallel Distrib Syst 7(2):127–137
Chung SM, Luo C (2004) Distributed mining of maximal frequent itemsets from databases on a cluster of workstations. In: Proc of the 4th IEEE/ACM int’l symp on cluster computing and the grid—CCGrid 2004, 2004
Feldman R, Hirsh H (1998) Finding associations in collections of text. In: Michalski R, Bratko I, Kubat M (eds), Machine learning and data mining: methods and applications. Wiley, pp 223–240
Feldman R, Dagan I, Hirsh H (1998) Mining text using keyword distributions. J Intell Inf Syst 10(3):281–300
Fox C (1992) Lexical analysis and stoplists. In: Frakes W, Baeza-Yates R (eds), Information retrieval: data structures and algorithms. Prentice Hall, pp 102–130
Gordon M, Dumais S (1998) Using latent semantic indexing for literature based discovery. J Am Soc Info Sci 49(8):674–685
Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proc of the 1st IEEE int’l conf on data mining, 2001, pp 163–170
Han EH, Karypis G, Kumar V (2000) Scalable parallel data mining for association rules. IEEE Trans Knowl Data Eng 12(3):337–352
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proc of ACM SIGMOD int’l conf on management of data, 2000, pp 1–12
Holt JD, Chung SM (2001) Multipass algorithms for mining association rules in text databases. Knowl Inf Syst 3(2):168–183
Holt JD, Chung SM (2002) Mining association rules using inverted hashing and pruning. Inf Process Lett 83(4):211–220
Holt JD, Chung SM (2002) Mining association rules in text databases using multipass with inverted hashing and pruning. In: Proc of the 14th IEEE int’l conf on tools with artificial intelligence, 2002, pp 49–56
National Institute of Standards and Technology (NIST) (1997) Text Research Collection
Orlando S, Palmerini P, Perego R (2001) Enhancing the apriori algorithm for frequent set counting. In: Proc of int’l conf on data warehousing and knowledge discovery, 2001, pp 71–82
Park JS, Chen MS, Yu PS (1995) Efficient parallel data mining for association rules. In: Proc of ACM int’l conf on information and knowledge management, 1995, pp 31–36
Park JS, Chen MS, Yu PS (1997) Using a hash-based method with transaction trimming for mining association rules. IEEE Trans Knowl Data Eng 9(5):813–825
Salton G (1988) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley
Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large databases. In: Proc of the 21st VLDB conf, 1995, pp 432–444
Shintani T, Kitsuregawa M (1996) Hash based parallel algorithms for mining association rules. In: Proc of the 4th int’l conf on parallel and distributed information systems, 1996, pp 19–30
Toivonen H (1996) Sampling large databases for association rules. In: Proc of the 22nd VLDB conf, 1996, pp 134–145
Zaiane OR, El-Hajj M, Lu P (2001) Fast parallel association rule mining without candidacy generation. In: Proc of IEEE conf on data mining, 2001, pp 665–668
Zaiane OR, Antonie ML (2002) Classifying text documents by associating terms with text categories. In: Proc of the 13th Australasian database conf, 2002
Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) Parallel algorithms for fast discovery of association rules. Data Min Knowl Discov 1(4):343–373
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
Additional information
This research was supported in part by Ohio Board of Regents, LexisNexis, and AFRL/Wright Brothers Institute (WBI).
Cite this article
Holt, J.D., Chung, S.M. Parallel mining of association rules from text databases. J Supercomput 39, 273–299 (2007). https://doi.org/10.1007/s11227-006-0008-1
DOI: https://doi.org/10.1007/s11227-006-0008-1