Cosine interesting pattern discovery
Introduction
Since the introduction of association rules and frequent patterns in the early 1990s [1], association analysis has become one of the core problems in the data mining and database communities. Given a large set of items (objects) and observation data about co-occurring items, association analysis is concerned with identifying strongly related subsets of items [27], [3]. It plays an important role in many application domains such as market-basket analysis [2], financial data analysis [19], telecommunication network alarm analysis [22], climate studies [29], public health [10], and bioinformatics [20], [33]. For example, in market-basket analysis, association analysis can be used for sales promotion, shelf management, and inventory management. Likewise, in medical informatics, association analysis is used to find combinations of patient symptoms and complaints associated with certain diseases.
Although numerous scalable methods have been developed for mining frequent patterns in association analysis, the classic support-confidence framework employed by most of these algorithms has two well-known defects: (1) confidence, as the measure of rule interestingness, may not disclose truly interesting relationships [14], [13]; (2) the support-based pruning strategy is not effective for data sets with skewed support distributions [35]. The first defect is usually illustrated by the well-known “coffee-tea” example [31]. The second defect can be understood from the dilemma of setting the minimum support threshold. If the threshold is low, we may extract too many spurious patterns involving items with substantially different support levels [35], such as {earrings, milk}, in which earrings occurs rarely but milk appears in a large share of transactions. In contrast, if the threshold is high, we may miss many rare but interesting patterns at low levels of support [15], such as {earrings, gold ring, bracelet}, which contains expensive items.
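The first defect can be made concrete with a small calculation. The sketch below uses contingency counts that are commonly quoted for the coffee-tea example; they are an illustrative assumption here and are not taken from this paper. All variable names are hypothetical.

```python
# Illustrative counts for 1000 people (assumed, not from this paper).
n_tea_and_coffee = 150
n_tea_only = 50
n_coffee_only = 650
n_neither = 150
n_total = n_tea_and_coffee + n_tea_only + n_coffee_only + n_neither

# Confidence of the rule {tea} -> {coffee}: P(coffee | tea).
conf_tea_coffee = n_tea_and_coffee / (n_tea_and_coffee + n_tea_only)

# Overall share of coffee drinkers: P(coffee).
supp_coffee = (n_tea_and_coffee + n_coffee_only) / n_total

# Lift compares the two; a value below 1 means a negative association.
lift = conf_tea_coffee / supp_coffee

print(f"confidence = {conf_tea_coffee:.2f}, P(coffee) = {supp_coffee:.2f}, lift = {lift:.4f}")
```

With these counts the rule has a confidence of 0.75, which looks strong, yet coffee drinkers make up 0.80 of the population, so knowing that someone drinks tea actually lowers the chance that they drink coffee.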
To cope with these problems, many interestingness measures have been proposed for mining truly interesting patterns [12]. Among them, the cosine similarity has attracted particular interest. Indeed, cosine similarity has unique merits for association analysis: (1) It is one of the few interestingness measures that satisfy the symmetry, null-invariance [30], and cross-support [35] properties simultaneously. Many well-known measures, such as confidence and lift, do not satisfy all three; (2) It has been widely used as a proximity measure in text mining [37], information retrieval [4], and bioinformatics [23] to avoid the “curse of dimensionality” [5], a problem we also face in association analysis; (3) It is very simple and has a direct geometric meaning, i.e., it is the cosine of the angle between two vectors. Based on these considerations, in this paper we select the cosine similarity as the interestingness measure for finding interesting patterns. Along this line, two problems must be solved:
- How to define the cosine similarity for a multi-itemset?
- Although the cosine similarity has no anti-monotone property, can we find some alternative properties that function similarly?
The answer to the first question is already available. In [32], Wu et al. used the notion of the “generalized mean” to extend some interestingness measures, including the cosine similarity, to the multi-itemset case. We adopt this generalization in our paper. The bottleneck is therefore the second question, which motivates our study in this paper.
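Both questions can be illustrated with a small sketch. It assumes that the generalized cosine of an itemset X is supp(X) divided by the geometric mean of the supports of its items, which reduces to the usual pairwise form for two items; the exact definition used in the paper is given in its Section 2, which is not reproduced here. The function names and toy data are hypothetical.

```python
from math import prod

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def cosine(itemset, transactions):
    """Assumed generalization: supp(X) / geometric mean of the item supports."""
    items = list(itemset)
    geo_mean = prod(support([i], transactions) for i in items) ** (1.0 / len(items))
    return support(items, transactions) / geo_mean if geo_mean > 0 else 0.0

# Hypothetical toy data: a and c are rare and co-occur; b is in every transaction.
skewed = [{"a", "b", "c"}] + [{"b"}] * 99

cos_ab = cosine(["a", "b"], skewed)        # about 0.10
cos_abc = cosine(["a", "b", "c"], skewed)  # about 0.22, larger than its subset {a, b}
cos_ac = cosine(["a", "c"], skewed)        # about 1.0

# Null-invariance: transactions containing none of the items do not change the value.
padded = skewed + [set()] * 900
cos_ab_padded = cosine(["a", "b"], padded)

print(cos_ab, cos_abc, cos_ac, cos_ab_padded)
```

Here {a, b, c} scores higher than its subset {a, b}, so the measure is not anti-monotone in general. If the items are instead added in ascending order of support (a, c, then b), the value only decreases, from about 1.0 for {a, c} to about 0.22 for {a, c, b}, which is the kind of behavior the second question asks about.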
The main contributions of this paper are threefold. First, we define the concepts of the “conditional anti-monotone property” and the “Support-Ascending Set Enumeration Tree” (SA-SET). We prove that if the itemset traversal sequence is defined by the SA-SET, any measure that possesses the conditional anti-monotone property can be used together with support to mine interesting patterns; the cosine similarity is one such measure. Second, we propose some upper bounds for the cosine similarity and identify their anti-monotone property. This is of great use in further pruning candidate itemsets, especially when the transaction databases contain a very large number of items. We also prove that the cosine similarity holds the cross-support property, which helps eliminate spurious associations among items with substantially different support levels. Third, we develop an Apriori-like algorithm called CosMiner to mine cosine interesting patterns from large-scale multi-item databases. The proposed itemset pruning strategy and the use of the TRIE structure guarantee the completeness of CosMiner. Finally, we conduct extensive experiments on various real-world data sets. The results show that, compared with the classic Apriori algorithm, CosMiner is much more efficient in mining interesting patterns, thanks to the conditional anti-monotone property and the cosine upper bound pruning. CosMiner also finds non-trivial interesting patterns even at extremely low levels of support.
The remainder of this paper is organized as follows. In Section 2, we extend the cosine similarity from the 2-itemset case to the multi-itemset case and discuss various properties of the cosine similarity. In Section 3, we define the conditional anti-monotone property of the cosine similarity and discuss the pruning effect of the cosine upper bounds. The CosMiner algorithm is proposed in Section 4. Section 5 shows the experimental results. Finally, we present the related work and conclude our work in Sections 6 and 7, respectively.
Cosine similarity: from item pairs to multi-itemsets
Cosine similarity is traditionally a measure of proximity between two high-dimensional vectors. In essence, it is the cosine value of the angle between two vectors, e.g., $\mathbf{x}$ and $\mathbf{y}$, as follows:
$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|},$$
where “·” indicates the dot-product of the vectors, and “∥ · ∥” indicates the length of the vector. For vectors with non-negative elements, the cosine similarity value always lies in the range [0, 1], where 1 indicates a perfect match of the two vectors, whereas 0 indicates a
The conditional anti-monotone property
In this section, we explore how to perform interesting pattern mining with the cosine similarity. It is well recognized that, to be incorporated into the pattern mining process, an interestingness measure is expected to have the anti-monotone property, as formulated below.
Definition 5 (The anti-monotone property). Let $\mathcal{I}$ be a universal itemset. A measure f is anti-monotone if $\forall X, Y \subseteq \mathcal{I}: X \subseteq Y \Rightarrow f(X) \geq f(Y)$.
Remark. A measure possessing the anti-monotone property can serve as an IN-evaluation measure in the mining of interesting patterns, since
The CosMiner algorithm
In this section, we present the CosMiner algorithm which utilizes the cosine similarity to find interesting patterns in large-scale multi-item transaction databases. In essence, the CosMiner algorithm is an Apriori-like algorithm which employs the candidate generate-and-test approach and the breadth-first search strategy.
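The generate-and-test, breadth-first idea can be sketched as follows. This is not the authors' CosMiner: it omits the TRIE structure and the cosine upper-bound pruning, assumes the geometric-mean form of the multi-itemset cosine, and all names (`mine_cosine_patterns`, `min_supp`, `min_cos`) are hypothetical. Items are ordered by ascending support, and a candidate is extended only with items that come later in that order, so a candidate that fails either threshold is never extended.

```python
from math import prod

def mine_cosine_patterns(transactions, min_supp, min_cos):
    """Level-wise search over a support-ascending set-enumeration tree (sketch)."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    item_supp = {i: c / n for i, c in counts.items()}

    # Support-ascending item order; ties are broken by item name.
    order = sorted((i for i in item_supp if item_supp[i] >= min_supp),
                   key=lambda i: (item_supp[i], i))
    rank = {item: r for r, item in enumerate(order)}

    def supp(items):
        s = set(items)
        return sum(1 for t in transactions if s <= t) / n

    def cos(items, s):
        # Assumed form: supp(X) / geometric mean of the item supports.
        return s / prod(item_supp[i] for i in items) ** (1.0 / len(items))

    results = {}
    level = [(i,) for i in order]  # frequent 1-itemsets
    while level:
        next_level = []
        for prefix in level:
            # Extend only with items ranked after the last item of the prefix.
            for item in order[rank[prefix[-1]] + 1:]:
                cand = prefix + (item,)
                s = supp(cand)
                if s < min_supp:
                    continue  # support pruning
                c = cos(cand, s)
                if c < min_cos:
                    continue  # cosine pruning along the tree
                results[cand] = (s, c)
                next_level.append(cand)
        level = next_level
    return results

# Hypothetical toy data: a and c are rare and co-occur; b is in every transaction.
data = [{"a", "b", "c"}] + [{"b"}] * 99
patterns = mine_cosine_patterns(data, min_supp=0.005, min_cos=0.2)
print(sorted(patterns))
```

On this toy data the search reports {a, c} and {a, c, b}, while {a, b} and {c, b} fail the cosine threshold. A plain Apriori join that required every subset to pass would therefore miss {a, c, b}, which is why the sketch prunes only along the support-ascending tree.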
Experimental results
In this section, we show the performance of CosMiner on real-world data sets. Specifically, we illustrate: (1) The pruning effect of the conditional anti-monotone property of the cosine similarity; (2) The pruning effect of the cosine upper bound; (3) The interestingness of the patterns found by CosMiner.
Related work
Since its introduction by Agrawal et al. [1], association analysis has been widely used in many application domains [13], [31]. One problem, however, is that the traditional support-confidence framework tends to generate too many rules, many of which are uninteresting.
To cope with this problem, many interestingness measures have been proposed or borrowed to mine truly interesting patterns. Piatetsky-Shapiro proposed the statistical independence of rules as an interestingness
Conclusion
In this paper, we studied the problem of mining interesting patterns from large-scale multi-item transaction databases under cosine similarity thresholds. Specifically, we proved that the cosine similarity possesses the conditional anti-monotone property and can therefore be used together with support in mining interesting itemsets. We also proposed cosine upper bounds that can be used to further prune candidate itemsets. These two pruning strategies were implemented in an Apriori-like
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China (NSFC) (Nos. 70901002, 71171007, 71031001, 70890080, 90924020), the Ph.D. Programs Foundation of Ministry of Education of China (No. 20091102120014), and the Fundamental Research Funds for the Central Universities.
References (37)
- et al., An algorithm to mine general association rules from tabular data, Information Sciences (2009)
- et al., Mining maximal hyperclique pattern: a hybrid search strategy, Information Sciences (2007)
- et al., An approach to discovering multi-temporal patterns and its application to financial databases, Information Sciences (2010)
- et al., Novel alarm correlation analysis system based on association rules mining in telecommunication networks, Information Sciences (2010)
- et al., New algorithms for efficient mining of association rules, Information Sciences (1999)
- R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings...
- Market Models: A Guide to Financial Data Analysis (2001)
- R.J. Bayardo, Y. Ma, R. Srikant, Scaling up all pairs similarity search, in: Proceedings of the 16th International...
- et al., Dynamic Programming (1957)
- J. Blanchard, F. Guillet, R. Gras, H. Briand, Using information-theoretic measures to assess association rule...