Scaling up cosine interesting pattern discovery: A depth-first method
Introduction
Since the introduction of frequent patterns in the early nineties of the last century [1], association analysis has become one of the core problems in the data mining and database communities [2]. Given a large set of items (objects) and the observation data about co-occurring items, association analysis is concerned with the identification of strongly correlated subsets of items [27]. It plays an important role in many application domains such as market-basket analysis [3], [8], recommender systems [22], telecommunication network alarm analysis [26], image processing [25], climate studies [32], public health [10], and bio-informatics [38].
Only recently have researchers recognized that the classic support–confidence framework of association analysis has some intrinsic defects. For instance, the well-known “coffee–tea” example reveals that confidence, as an interestingness measure for association rules, may not disclose truly interesting relationships [13], [14]. Also, this framework often generates too many frequent patterns and even more association rules, and finding the useful ones among them is itself a great challenge. Meanwhile, the support-based pruning strategy is not effective for data sets with skewed support distributions [40]; that is, if the minimum support threshold is relatively low, we may extract too many spurious patterns involving items with substantially different support levels, such as {earrings, milk}, in which the support of earrings is much lower than that of milk. Conversely, if the threshold is high, we may miss many interesting but relatively infrequent patterns [16], such as {earrings, gold ring, bracelet}, which contains rare but truly valuable items.
To overcome these problems, many interestingness measures have been proposed to find truly interesting patterns [12]. Patterns with a measure value above some given threshold are called interesting patterns with respect to that measure. Among the proposed measures, the cosine similarity has attracted particular interest. Indeed, cosine similarity has various merits for association analysis. First, it is one of the few interestingness measures that hold the symmetry, null-invariance and anti-cross-support-pattern properties simultaneously (details will be given in Section 2.2). Many well-known measures, such as confidence and lift, cannot achieve this. Second, cosine similarity has been widely used as a proximity measure in text mining [41], information retrieval [17], image processing [11], and bio-informatics [20], to avoid the “curse of dimensionality” [4], a problem we also face in association analysis. Finally, it is very simple and has a physical meaning, i.e., it measures the angle between two vectors in the feature space. As a result, in this paper, we select the cosine similarity as the interestingness measure to define interesting patterns.
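To make the measure concrete, the following minimal sketch computes a cosine value for an itemset on toy data. It assumes the multi-itemset cosine is the support of the itemset divided by the geometric mean of its items' supports (for two items this reduces to the usual vector cosine); the paper's own definition is given in Section 2 and is not reproduced here. The function names and the transactions are hypothetical.

```python
from math import prod

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def cosine(itemset, transactions):
    """Assumed generalization: supp(X) / geometric mean of the item supports."""
    items = list(itemset)
    item_supports = [support([i], transactions) for i in items]
    if min(item_supports) == 0:
        return 0.0
    return support(items, transactions) / prod(item_supports) ** (1 / len(items))

# Hypothetical toy data: milk is common, jewellery is rare but bought together.
transactions = [
    {"milk"}, {"milk"}, {"milk"}, {"milk"}, {"milk"}, {"milk"},
    {"milk", "earrings"},
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "bracelet"},
    {"bread"},
]

cross = cosine({"earrings", "milk"}, transactions)                   # cross-support pair
rare = cosine({"earrings", "gold ring", "bracelet"}, transactions)   # rare but cohesive

# Null-invariance check: transactions containing none of the items leave the value unchanged.
padded = transactions + [set() for _ in range(90)]
rare_padded = cosine({"earrings", "gold ring", "bracelet"}, padded)
```

On this data the cross-support pair {earrings, milk} scores far lower than the rare jewellery itemset, and padding the database with empty transactions does not change the score, which is what null-invariance means under the assumed definition.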
Mining cosine interesting patterns is by no means a trivial task. First of all, the traditional definition of cosine similarity covers only two vectors, so it must be extended to the case of multi-itemsets. More importantly, the cosine similarity may not hold the anti-monotone property, which is the key to the success of frequent pattern mining using the support measure, and therefore may not be able to reduce the search space of interesting patterns. One may argue that a post-evaluation scheme already suffices: first generate frequent patterns, and then identify interesting patterns among them using a cosine threshold. Although simple, the post-evaluation scheme has the same problem as the traditional frequent pattern mining algorithms; that is, rare but interesting patterns will be lost if a high support threshold is set, whereas too many patterns will be generated if the threshold is low.
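The weakness of post-evaluation can be illustrated with a small sketch. It is not the POST implementation evaluated in the paper: brute-force enumeration stands in for FP-growth, the cosine is assumed to be the itemset support divided by the geometric mean of the item supports, and the function names, thresholds and data are hypothetical.

```python
from itertools import combinations
from math import prod

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions)

def post_evaluation(transactions, min_supp, min_cos):
    """Step 1: keep frequent itemsets. Step 2: keep those passing the cosine threshold."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = []
    for k in range(2, len(items) + 1):
        for cand in combinations(items, k):
            c = count(cand, transactions)
            if c / n < min_supp:        # frequency filter (brute force, not FP-growth)
                continue
            geo_mean = prod(count([i], transactions) for i in cand) ** (1 / k)
            if c / geo_mean >= min_cos:  # counts can replace supports: the n's cancel
                result.append(cand)
    return result

# Hypothetical toy data: milk is common, jewellery is rare but bought together.
transactions = [
    {"milk"}, {"milk"}, {"milk"}, {"milk"}, {"milk"}, {"milk"},
    {"milk", "earrings"},
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "bracelet"},
    {"bread"},
]

high = post_evaluation(transactions, min_supp=0.3, min_cos=0.5)
low = post_evaluation(transactions, min_supp=0.1, min_cos=0.5)
```

With the higher support threshold the jewellery itemset never reaches the cosine test, so `high` is empty; with the lower threshold it is recovered, but only after every itemset that clears the weaker frequency filter has been generated and scored.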
In light of these, in this paper, we attempt to design an efficient algorithm to mine cosine interesting patterns from large-scale databases. To this end, we first extend the definition of cosine similarity to multi-itemsets, and highlight some valuable properties of the cosine similarity, such as the null-invariance property and the anti-cross-support-pattern property. We then point out that the cosine similarity does not hold the anti-monotone property, and propose a novel conditional anti-monotone property (CAMP) as an alternative. We argue that a depth-first itemset traversal strategy is more appropriate than a breadth-first strategy when mining cosine interesting patterns using CAMP, and propose a new algorithm on this basis. The proposed algorithm is an FP-growth-like algorithm that uses the cosine as well as support thresholds to prune uninteresting patterns, although its mining of a single-path tree differs from that of the FP-growth algorithm [15]. Extensive experiments on various real-world data sets demonstrate the superiority of the proposed algorithm over the Apriori-like method and the post-evaluation method (POST) in terms of efficiency. We also verify its ability to suppress cross-support patterns and to mine rare but interesting patterns. Finally, we apply the proposed algorithm as a noise-removal tool to a real-world landmark recognition case, in which the performance of image clustering is improved substantially. This, in turn, demonstrates the value of the cosine interesting patterns it mines.
The remainder of this paper is organized as follows. In Section 2, we define and explore the cosine similarity from a pattern mining perspective. In Section 3, we propose the conditional anti-monotone property and suggest the combined use of a depth-first traversal strategy. The algorithm is proposed in Section 4. Section 5 shows the experimental results, followed by a case study on landmark recognition in Section 6. We finally present the related work and conclude our work in Section 7 and Section 8, respectively.
Section snippets
Cosine similarity: preliminaries and problem definition
In this section, we present the definition of cosine similarity for multi-itemsets from an interestingness measure perspective. Some important properties of cosine similarity will also be discussed here. Finally, we formulate the problem to be studied in this paper.
Conditional anti-monotone property for interesting pattern mining
In this section, we present the Conditional Anti-Monotone Property (CAMP), an alternative to the anti-monotone property, for cosine interesting pattern discovery. The itemset traversal schemes coupled with CAMP will also be discussed here.
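The intuition behind a conditional form of anti-monotonicity can be sketched as follows. This is only one sufficient condition, derived under the assumption that the cosine of an itemset is its support divided by the geometric mean of its items' supports; it is not necessarily the exact statement of CAMP given in Section 3. The symbols $g(X)$, $K$ and $j$ are introduced here for illustration.

```latex
% Assumed definition: for an itemset X = \{i_1, \dots, i_K\},
\cos(X) = \frac{\mathrm{supp}(X)}{g(X)}, \qquad
g(X) = \Bigl(\prod_{k=1}^{K} \mathrm{supp}(i_k)\Bigr)^{1/K}.

% Extend X by one item j \notin X whose support is at least that of every item in X:
X' = X \cup \{j\}, \qquad \mathrm{supp}(j) \ge \max_{i \in X} \mathrm{supp}(i) \ge g(X).

% The geometric mean cannot decrease:
g(X')^{K+1} = g(X)^{K} \cdot \mathrm{supp}(j) \ge g(X)^{K+1}
\;\Longrightarrow\; g(X') \ge g(X).

% Support is anti-monotone, so \mathrm{supp}(X') \le \mathrm{supp}(X), and therefore
\cos(X') = \frac{\mathrm{supp}(X')}{g(X')} \le \frac{\mathrm{supp}(X)}{g(X)} = \cos(X).
```

Under this assumption, an itemset that fails the cosine threshold can be pruned together with all of its extensions, provided that patterns are grown only by adding items of equal or higher support. Without that condition the inequality can fail, which is why the plain anti-monotone property does not hold.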
The algorithmic details
The proposed algorithm is essentially an FP-growth-like algorithm for cosine interesting pattern discovery. It can be divided into two main phases: (1) the generation of an FP-tree and (2) the mining of cosine interesting patterns from the FP-tree.
The first phase is almost the same as the FP-growth algorithm, except that the items in the transactions must be sorted in a support-descending order before building the FP-tree. This is crucial for the implementation of the DF-SAT strategy, since the patterns in
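The first phase, building an FP-tree from transactions sorted in support-descending order, can be sketched as follows. This is a generic FP-tree construction, not the authors' implementation; the class and function names, the tie-breaking rule, and the toy transactions are assumptions made for illustration.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count, a parent link and child links."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count=1):
    """Insert each transaction with its items in support-descending order."""
    counts = Counter(i for t in transactions for i in set(t))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))  # ties broken by name
    order = {item: rank for rank, (item, c) in enumerate(ranked) if c >= min_count}
    root = Node(None, None)
    header = {}  # item -> list of tree nodes carrying that item
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def prefix_path(node):
    """Items on the path from the root down to (but excluding) `node`."""
    path = []
    node = node.parent
    while node.item is not None:
        path.append(node.item)
        node = node.parent
    return path[::-1]

# Hypothetical toy data: supports are a=4, b=3, c=3.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
root, header = build_fp_tree(transactions)
c_prefixes = [prefix_path(n) for n in header["c"]]
```

Because of the ordering, every item on the prefix path of a node has support at least as high as the node's own item. A bottom-up, FP-growth-style traversal therefore extends a pattern only with items of equal or higher support, which is the direction in which a conditional anti-monotone argument can be applied.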
Experimental results
In this section, we provide experimental results on six real-world data sets. Two other mining algorithms, the Apriori-like method and the post-evaluation method (denoted as POST), are adopted for the comparative study. Note that the Apriori-like method employs a breadth-first traversal strategy for interesting pattern mining, as described in Section 3.2. POST is more straightforward; that is, it first mines frequent patterns using the FP-growth algorithm, and then computes
An application
In this section, we employ the proposed algorithm as a massive-noise removal tool to enhance the cluster analysis for high-dimensional landmark recognition.
Related work
Since the early introduction by Agrawal et al. [1], association analysis has been widely used in many application domains [13], [35]. But one problem is that the traditional support–confidence framework tends to generate too many rules – many of them are indeed uninteresting to us.
To cope with this problem, many interestingness measures have been proposed or borrowed to mine the truly interesting patterns. Piatetsky-Shapiro proposed the statistical independence of rules as an interestingness
Conclusions
This paper studied the problem of mining cosine interesting patterns from large-scale databases. The conditional anti-monotone property and the depth-first traversal strategy were proposed, leading to a novel FP-growth-like algorithm. Extensive experiments with comparative studies demonstrated the strengths of the proposed algorithm. As an application example, it was successfully applied to a landmark recognition task in which a huge volume of image noise was present.
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 71072172, 71372188 and 61103229, National Center for International Joint Research on E-Business Information Processing under Grant 2013B01035, National Key Technologies R&D Program of China under Grant 2013BAH16F00, the National Soft Science Research Program under Grant 2013GXS4B081, Industry Projects in Jiangsu S&T Pillar Program under Grant BE2012185, and the Natural Science
References (41)
- et al., A framework for mining interesting high utility patterns with a strong frequency affinity, Inform. Sci. (2011)
- et al., Mining frequent patterns in a varying-size sliding window of online transactional data streams, Inform. Sci. (2012)
- et al., Image categorization: graph edit distance + edge direction histogram, Patt. Recog. (2008)
- et al., Modeling term proximity for probabilistic information retrieval models, Inform. Sci. (2011)
- et al., Mining maximal hyperclique pattern: a hybrid search strategy, Inform. Sci. (2007)
- et al., A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups, Inform. Sci. (2013)
- et al., A probabilistic model for image representation via multiple patterns, Patt. Recog. (2012)
- et al., Novel alarm correlation analysis system based on association rules mining in telecommunication networks, Inform. Sci. (2010)
- et al., High utility pattern mining using the maximal itemset property and lexicographic tree structures, Inform. Sci. (2012)
- et al., Cosine interesting pattern discovery, Inform. Sci. (2012)
- Market Models: A Guide to Financial Data Analysis
- Dynamic Programming
- A Course in Universal Algebra
- Mining frequent itemsets without support threshold: with and without item constraints, IEEE Trans. Knowl. Data Eng.
- Applied Multiple Regression/Correlation Analysis for the Behavioral Science
- Interestingness measures for data mining: a survey, ACM Comput. Surv.
- Frequent pattern mining: current status and future directions, Data Min. Knowl. Discov.