Elsevier

Information Sciences

Volume 184, Issue 1, 1 February 2012, Pages 176-195

Cosine interesting pattern discovery

https://doi.org/10.1016/j.ins.2011.09.006

Abstract

Recent years have witnessed an increasing interest in computing the cosine similarity between high-dimensional objects such as documents, transactions, and gene sequences. Most previous studies, however, limited their scope to pairs of items and cannot be adapted to the multi-itemset case. From a frequent pattern mining perspective, there is therefore still a critical need for discovering interesting patterns whose cosine similarity values exceed given thresholds. The knottiest point of this problem is that the cosine similarity has no anti-monotone property. To meet this challenge, we propose the notions of the conditional anti-monotone property and the Support-Ascending Set Enumeration Tree (SA-SET). We prove that the cosine similarity has the conditional anti-monotone property and can therefore be used for interesting pattern mining if the itemset traversal sequence is defined by the SA-SET. We also identify the anti-monotone property of an upper bound of the cosine similarity, which can be used to further prune the candidate itemsets. An Apriori-like algorithm called CosMiner is then put forward to mine cosine interesting patterns from large-scale multi-item databases. Experimental results show that CosMiner can efficiently identify interesting patterns using the conditional anti-monotone property of the cosine similarity and the anti-monotone property of its upper bound, even at extremely low levels of support.

Introduction

Since the introduction of association rules and frequent patterns in the early 1990s [1], association analysis has become one of the core problems in the data mining and database communities. Given a large set of items (objects) and observation data about co-occurring items, association analysis is concerned with the identification of strongly related subsets of items [27], [3]. It plays an important role in many application domains such as market-basket analysis [2], financial data analysis [19], telecommunication network alarm analysis [22], climate studies [29], public health [10], and bioinformatics [20], [33]. For example, in market-basket analysis, association analysis can be used for sales promotion, shelf management, and inventory management. Also, in medical informatics, association analysis is used to find combinations of patient symptoms and complaints associated with certain diseases.

Although numerous scalable methods have been developed for mining frequent patterns in association analysis, the classic support-confidence framework employed by most of these algorithms has two obvious defects: (1) confidence, as the measure of rule interestingness, may not disclose truly interesting relationships [14], [13]; (2) the support-based pruning strategy is not effective for data sets with skewed support distributions [35]. The first defect is usually illustrated by the well-known “coffee-tea” example [31]. The second defect can be understood from the dilemma of setting the minimum support threshold. If the threshold is low, we may extract too many spurious patterns involving items with substantially different support levels [35], such as {earrings, milk}, in which earrings occurs rarely but milk appears in a large fraction of transactions. In contrast, if the threshold is high, we may miss many rare but interesting patterns at low levels of support [15], such as {earrings, gold ring, bracelet}, which contains expensive items.

To cope with these problems, many interestingness measures have been proposed for mining truly interesting patterns [12]. Among them, the cosine similarity has gained particular interest. Indeed, the cosine similarity has unique merits for association analysis: (1) it is one of the few interestingness measures that simultaneously possess the symmetry, null-invariance [30], and cross-support [35] properties; many well-known measures such as confidence and lift do not. (2) It has been widely used as a popular proximity measure in text mining [37], information retrieval [4], and bioinformatics [23] to avoid the “curse of dimensionality” [5] — a problem we also face in association analysis. (3) It is very simple and has a clear physical meaning: it is the cosine of the angle between two vectors. Based on these considerations, in this paper we select the cosine similarity as the interestingness measure for finding interesting patterns. Along this line, two problems must be solved:

  • How to define the cosine similarity for a multi-itemset?

  • Given that the cosine similarity has no anti-monotone property, can we find alternative properties that function similarly?

The answer to the first question is already available: in [32], Wu et al. used the notion of the “generalized mean” to extend several interestingness measures, including the cosine similarity, to the multi-itemset case. We adopt this generalization in our paper. The choke point is therefore the second question, which motivates our study.
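
To make the generalized-mean extension concrete, the sketch below computes a generalized cosine of a multi-itemset as its support divided by the geometric mean of its items' supports; for a 2-itemset this reduces to the familiar cos(X, Y) = P(XY)/√(P(X)P(Y)). This is a minimal illustration under our reading of [32], not the paper's own code; the toy database is hypothetical.

```python
from math import prod

def cosine(itemset, transactions):
    """Generalized cosine of an itemset: supp(I) divided by the
    geometric mean of the supports of its individual items."""
    n = len(transactions)
    supp_all = sum(1 for t in transactions if itemset <= t) / n
    supp_items = [sum(1 for t in transactions if i in t) / n for i in itemset]
    return supp_all / prod(supp_items) ** (1 / len(itemset))

# toy transaction database (hypothetical)
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(round(cosine({"a", "b"}, db), 3))      # supp = 0.6, geo-mean = 0.8
print(round(cosine({"a", "b", "c"}, db), 3))  # supp = 0.4, geo-mean = 0.8
```

Note that for {a, b} the value 0.6/0.8 coincides with the pairwise cosine, confirming the reduction to the 2-itemset case.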

The main contributions of this paper are as follows. First, we define the concepts of the “conditional anti-monotone property” and the “Support-Ascending Set Enumeration Tree” (SA-SET). We prove that if the itemset traversal sequence is defined by the SA-SET, any measure that possesses the conditional anti-monotone property can be used with support to mine interesting patterns — the cosine similarity is one such example. Second, we propose some upper bounds for the cosine similarity and identify their anti-monotone property. This is of great use in further pruning candidate itemsets, especially when the transaction databases contain a large number of items. We also prove that the cosine similarity holds the cross-support property, which can help eliminate spurious associations among items with substantially different support levels. Third, we develop an Apriori-like algorithm called CosMiner to mine cosine interesting patterns from large-scale multi-item databases. The proposed itemset pruning strategy and the use of the TRIE structure guarantee the completeness of CosMiner. Finally, we conduct extensive experiments on various real-world data sets. The results show that, compared with the classic Apriori algorithm, CosMiner enjoys much higher efficiency in mining interesting patterns with the help of the conditional anti-monotone property and the cosine upper bound pruning. CosMiner also shows its merit in finding non-trivial interesting patterns even at extremely low levels of support.

The remainder of this paper is organized as follows. In Section 2, we extend the cosine similarity from the 2-itemset case to the multi-itemset case; various properties of the cosine similarity are also discussed. In Section 3, we define the conditional anti-monotone property of the cosine similarity and discuss the pruning effect of the cosine upper bounds. The CosMiner algorithm is proposed in Section 4. Section 5 shows the experimental results. Finally, we present related work in Section 6 and conclude in Section 7.

Section snippets

Cosine similarity: from item pairs to multi-itemsets

Cosine similarity is traditionally a measure of proximity between two high-dimensional vectors. In essence, it is the cosine of the angle between two vectors X and Y:

cos(X, Y) = (X · Y) / (∥X∥∥Y∥),

where “·” denotes the dot product of the vectors and “∥ · ∥” denotes the length of a vector. For vectors with non-negative elements, the cosine similarity value always lies in the range [0, 1], where 1 indicates a perfect match of the two vectors, whereas 0 indicates a
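
The formula above can be sketched directly; this is a minimal illustration of the standard pairwise definition, with hypothetical example vectors.

```python
import math

def cos(x, y):
    # cosine of the angle between two vectors: (x . y) / (|x| * |y|)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

print(cos([1, 0], [0, 1]))    # orthogonal vectors: 0.0
print(cos([1, 2, 3], [2, 4, 6]))  # parallel vectors: value is 1 (up to rounding)
```

For non-negative vectors the value never drops below 0, matching the [0, 1] range stated above.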

The conditional anti-monotone property

In this section, we explore how to perform interesting pattern mining with the cosine similarity. It is well recognized that to incorporate an interestingness measure into the pattern mining process, the measure is expected to have the anti-monotone property, formulated as follows.

Definition 5 (The anti-monotone property)

Let I be a universal itemset. A measure f is anti-monotone if ∀X, Y ⊆ I: X ⊆ Y ⟹ f(X) ≥ f(Y).
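
The support measure is the canonical example of this definition: adding items to an itemset can only shrink the set of transactions containing it. The sketch below verifies the property exhaustively on a small hypothetical database; it is an illustration of the definition, not part of the paper's algorithm.

```python
from itertools import combinations

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
universe = {"a", "b", "c"}

# Check f(X) >= f(Y) for every pair X <= Y: the anti-monotone property.
subsets = [frozenset(c) for r in range(1, 4) for c in combinations(universe, r)]
for x in subsets:
    for y in subsets:
        if x <= y:
            assert support(x, db) >= support(y, db)
print("support is anti-monotone on this database")
```

The cosine similarity fails this check in general, which is exactly why the conditional anti-monotone property and the SA-SET ordering are needed.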

Remark

A measure possessing the anti-monotone property can serve as an IN-evaluation measure in the mining of interesting patterns, since

The CosMiner algorithm

In this section, we present the CosMiner algorithm which utilizes the cosine similarity to find interesting patterns in large-scale multi-item transaction databases. In essence, the CosMiner algorithm is an Apriori-like algorithm which employs the candidate generate-and-test approach and the breadth-first search strategy.
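
The overall generate-and-test, breadth-first skeleton can be sketched as follows. This is a simplified illustration of an Apriori-style loop with a cosine threshold filter, not the authors' CosMiner implementation: it omits the SA-SET traversal order, the conditional anti-monotone pruning, the upper-bound pruning, and the TRIE structure that the paper describes, and the generalized cosine here follows the geometric-mean form assumed earlier.

```python
from itertools import combinations
from math import prod

def supp(s, db):
    return sum(1 for t in db if s <= t) / len(db)

def gen_cosine(s, db):
    # generalized cosine: supp(I) over the geometric mean of item supports
    return supp(s, db) / prod(supp(frozenset([i]), db) for i in s) ** (1 / len(s))

def mine(db, min_sup, min_cos):
    """Breadth-first candidate generate-and-test, Apriori style."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items if supp(frozenset([i]), db) >= min_sup]
    patterns = []
    while level:
        patterns += [(s, gen_cosine(s, db)) for s in level
                     if len(s) > 1 and gen_cosine(s, db) >= min_cos]
        # join step: merge frequent k-itemsets sharing k-1 items
        nxt = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in nxt if supp(c, db) >= min_sup]
    return patterns

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for s, c in mine(db, min_sup=0.4, min_cos=0.6):
    print(sorted(s), round(c, 2))
```

On this toy database the three 2-itemsets pass both thresholds, while {a, b, c} survives the support filter but is rejected by the cosine threshold, showing why cosine pruning cannot simply piggyback on support pruning.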

Experimental results

In this section, we show the performance of CosMiner on real-world data sets. Specifically, we illustrate: (1) The pruning effect of the conditional anti-monotone property of the cosine similarity; (2) The pruning effect of the cosine upper bound; (3) The interestingness of the patterns found by CosMiner.

Related work

Since its early introduction by Agrawal et al. [1], association analysis has been widely used in many application domains [13], [31]. One problem, however, is that the traditional support-confidence framework tends to generate too many rules — many of which are indeed uninteresting.

To cope with this problem, many interestingness measures have been proposed or borrowed to mine the truly interesting patterns. Piatetsky-Shapiro proposed the statistical independence of rules as an interestingness

Conclusion

In this paper, we studied the problem of mining interesting patterns from large-scale multi-item transaction databases with the cosine similarity thresholds. Specifically, we proved that the cosine similarity possessed the conditional anti-monotone property and therefore could be used with support in mining interesting itemsets. Also, we proposed the cosine upper bounds that could be used in further pruning candidate itemsets. These two pruning strategies were implemented in an Apriori-like

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (NSFC) (Nos. 70901002, 71171007, 71031001, 70890080, 90924020), the Ph.D. Programs Foundation of Ministry of Education of China (No. 20091102120014), and the Fundamental Research Funds for the Central Universities.

References (37)

  • C. Borgelt, Recursion pruning for the apriori algorithm, in: Workshop of Frequent Item Set Mining Implementations FIMI...
  • R. Briandais, File searching using variable length keys, in: IRE-AIEE-ACM ’59 Western Joint Computer Conference, 1959,...
  • S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing association rules to correlations, in:...
  • P. Cohen, J. Cohen, S.G. West, L.S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Science...
  • E. Fredkin, Trie memory, Communications of the ACM (1960)
  • L. Geng et al., Interestingness measures for data mining: a survey, ACM Computing Surveys (2006)
  • J. Han et al., Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery (2007)
  • J. Han et al., Data Mining: Concepts and Techniques (2005)