Parallel mining of association rules from text databases

Holt, John D.; Chung, Soon M.

doi:10.1007/s11227-006-0008-1

Published: 02 March 2007

Volume 39, pages 273–299, (2007)

The Journal of Supercomputing


Abstract

In this paper, we propose a new algorithm named Parallel Multipass with Inverted Hashing and Pruning (PMIHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., sets of words) that need to be counted. The new PMIHP algorithm is a parallel version of our Multipass with Inverted Hashing and Pruning (MIHP) algorithm (Holt, Chung in: Proc of the 14th IEEE int’l conf on tools with artificial intelligence, 2002, pp 49–56), which was shown to be more efficient than other existing algorithms for mining text databases. The PMIHP algorithm reduces the overhead of communication between miners running on different processors because the miners mine their local databases asynchronously and prune the global candidates by using the Inverted Hashing and Pruning technique. Compared with the well-known Count Distribution algorithm (Agrawal, Shafer in: IEEE Trans Knowl Data Eng 8(6):962–969, 1996), PMIHP demonstrates superior performance for mining association rules in large text databases, and when the minimum support level is low, its speedup is superlinear as the number of processors increases. The experiments were performed on a cluster of Linux workstations using a collection of Wall Street Journal articles.
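
As a rough illustration of the counting problem the abstract describes, the following minimal sketch shows the general idea behind the Count Distribution baseline: each partition of the documents is counted independently, and the local counts are then summed to decide which itemsets meet the global minimum support. It is not the PMIHP algorithm and not the authors' implementation. The function names, the toy documents, and the restriction to word pairs are all assumptions made for illustration, and the partitions are processed sequentially here rather than on separate processors.

```python
from collections import Counter
from itertools import combinations


def local_pair_counts(documents):
    """For one partition, count how many documents contain each word pair."""
    counts = Counter()
    for words in documents:
        # Treat a document as a set of words; sort so each pair has one canonical form.
        for pair in combinations(sorted(set(words)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(partitions, min_support):
    """Sum the per-partition counts and keep pairs meeting the global minimum support."""
    total = Counter()
    for part in partitions:
        # In a parallel setting each partition would be counted by a different
        # processor and the counts exchanged; here it is a plain loop (assumption).
        total.update(local_pair_counts(part))
    return {pair: n for pair, n in total.items() if n >= min_support}


# Hypothetical toy data: two partitions of two short documents each.
partitions = [
    [["stock", "market", "fall"], ["stock", "market", "rise"]],
    [["market", "stock"], ["bond", "market"]],
]
result = frequent_pairs(partitions, min_support=3)
print(result)
```

Even in this toy case the number of candidate pairs grows quadratically with the number of distinct words per document, which hints at why counting itemsets in text is costly.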

References

Agarwal RC, Aggarwal CC, Prasad VVV (2000) Depth first generation of long patterns. In: Proc of the 6th ACM SIGKDD int’l conf on knowledge discovery and data mining, 2000, pp 108–118
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proc of the 20th VLDB conf, 1994, pp 487–499
Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969
Agarwal R, Aggarwal C, Prasad V (2001) A tree projection algorithm for generation of frequent item sets. J Parallel Distrib Comput 61(3):350–371
Bayardo RJ (1998) Efficiently mining long patterns from databases. In: Proc of ACM SIGMOD int’l conf on management of data, 1998, pp 85–93
Brin S, Motwani R, Ullman J, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: Proc of ACM SIGMOD int’l conf on management of data, 1997, pp 255–264
Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transaction databases. In: Proc of int’l conf on data engineering, 2001, pp 443–452
Chen MS, Han J, Yu PS (1996) Data mining: an overview from a database perspective. IEEE Trans Knowl Data Eng 8(6):866–883
Cheung DW, Ng VT, Fu AW, Fu Y (1996) Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8(6):911–922
Cheung DW, Lee SD, Xiao Y (2002) Effect of data skewness and workload balance in parallel data mining. IEEE Trans Knowl Data Eng 14(3):498–514
Chung SM, Yang J (1996) A parallel distributive join algorithm for cube-connected multiprocessors. IEEE Trans Parallel Distrib Syst 7(2):127–137
Chung SM, Luo C (2004) Distributed mining of maximal frequent itemsets from databases on a cluster of workstations. In: Proc of the 4th IEEE/ACM int’l symp on cluster computing and the grid—CCGrid 2004, 2004
Feldman R, Hirsh H (1998) Finding associations in collections of text. In: Michalski R, Bratko I, Kubat M (eds), Machine learning and data mining: methods and applications. Wiley, pp 223–240
Feldman R, Dagan I, Hirsh H (1998) Mining text using keyword distributions. J Intell Inf Syst 10(3):281–300
Fox C (1992) Lexical analysis and stoplists. In: Frakes W, Baeza-Yates R (eds), Information retrieval: data structures and algorithms. Prentice Hall, pp 102–130
Gordon M, Dumais S (1998) Using latent semantic indexing for literature based discovery. J Am Soc Info Sci 49(8):674–685
Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proc of the 1st IEEE int’l conf on data mining, 2001, pp 163–170
Han EH, Karypis G, Kumar V (2000) Scalable parallel data mining for association rules. IEEE Trans Knowl Data Eng 12(3):337–352
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proc of ACM SIGMOD int’l conf on management of data, 2000, pp 1–12
Holt JD, Chung SM (2001) Multipass algorithms for mining association rules in text databases. Knowl Inf Syst 3(2):168–183
Holt JD, Chung SM (2002) Mining association rules using inverted hashing and pruning. Inf Process Lett 83(4):211–220
Holt JD, Chung SM (2002) Mining association rules in text databases using multipass with inverted hashing and pruning. In: Proc of the 14th IEEE int’l conf on tools with artificial intelligence, 2002, pp 49–56
National Institute of Standards and Technology (NIST) (1997) Text Research Collection
Orlando S, Palmerini P, Perego R (2001) Enhancing the apriori algorithm for frequent set counting. In: Proc of int’l conf on data warehousing and knowledge discovery, 2001, pp 71–82
Park JS, Chen MS, Yu PS (1995) Efficient parallel data mining for association rules. In: Proc of ACM int’l conf on information and knowledge management, 1995, pp 31–36
Park JS, Chen MS, Yu PS (1997) Using a hash-based method with transaction trimming for mining association rules. IEEE Trans Knowl Data Eng 9(5):813–825
Salton G (1988) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley
Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large databases. In: Proc of the 21st VLDB conf, 1995, pp 432–444
Shintani T, Kitsuregawa M (1996) Hash based parallel algorithms for mining association rules. In: Proc of the 4th int’l conf on parallel and distributed information systems, 1996, pp 19–30
Toivonen H (1996) Sampling large databases for association rules. In: Proc of the 22nd VLDB conf, 1996, pp 134–145
Zaiane OR, El-Hajj M, Lu P (2001) Fast parallel association rule mining without candidacy generation. In: Proc of IEEE conf on data mining, 2001, pp 665–668
Zaiane OR, Antonie ML (2002) Classifying text documents by associating terms with text categories. In: Proc of the 13th Australasian database conf, 2002
Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) Parallel algorithms for fast discovery of association rules. Data Min Knowl Discov 1(4):343–373
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Wright State University, Dayton, Ohio 45435, USA
John D. Holt & Soon M. Chung

Corresponding author

Correspondence to Soon M. Chung.

Additional information

This research was supported in part by Ohio Board of Regents, LexisNexis, and AFRL/Wright Brothers Institute (WBI).

About this article

Cite this article

Holt, J.D., Chung, S.M. Parallel mining of association rules from text databases. J Supercomput 39, 273–299 (2007). https://doi.org/10.1007/s11227-006-0008-1

Published: 02 March 2007
Issue Date: March 2007
DOI: https://doi.org/10.1007/s11227-006-0008-1
