
Information Sciences

Volume 266, 10 May 2014, Pages 31-46

Scaling up cosine interesting pattern discovery: A depth-first method

https://doi.org/10.1016/j.ins.2013.12.062

Abstract

This paper presents an efficient algorithm called CosMinert for interesting pattern discovery. The widely used cosine similarity, which possesses the null-invariance property and the anti-cross-support-pattern property, is adopted as the interestingness measure in CosMinert. CosMinert is essentially an FP-growth-like depth-first traversal algorithm that rests on an important property of the cosine similarity: the conditional anti-monotone property (CAMP). The combined use of CAMP and the depth-first support-ascending traversal strategy enables the pre-pruning of uninteresting patterns during the mining process of CosMinert. Extensive experiments demonstrate the high efficiency of CosMinert in interesting pattern discovery, in comparison with the breadth-first strategy and the post-evaluation strategy. In particular, CosMinert demonstrates its capability of suppressing the generation of cross-support patterns and of discovering rare but truly interesting patterns. Finally, an interesting case of landmark recognition is presented to illustrate the value of the cosine interesting patterns found by CosMinert in real-world applications.

Introduction

Since the introduction of frequent patterns in the early 1990s [1], association analysis has become one of the core problems in the data mining and database communities [2]. Given a large set of items (objects) and observation data about co-occurring items, association analysis is concerned with the identification of strongly correlated subsets of items [27]. It plays an important role in many application domains such as market-basket analysis [3], [8], recommender systems [22], telecommunication network alarm analysis [26], image processing [25], climate studies [32], public health [10], and bio-informatics [38].

Only recently did researchers find that the classic support–confidence framework of association analysis has some intrinsic defects. For instance, the well-known “coffee–tea” example reveals that confidence as an interestingness measure for association rules may not disclose truly interesting relationships [13], [14]. Moreover, this framework often generates too many frequent patterns and even more association rules, from which finding the useful patterns is itself a great challenge. Meanwhile, the support-based pruning strategy is not effective for data sets with skewed support distributions [40]; that is, if the minimum support threshold is relatively low, we may extract too many spurious patterns involving items with substantially different support levels, such as {earrings, milk}, in which the support of earrings is much lower than that of milk. Conversely, if the threshold is high, we may miss many interesting but relatively infrequent patterns [16], such as {earrings, gold ring, bracelet}, which contains rare but truly valuable items.
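To make the “coffee–tea” pitfall concrete, consider a hypothetical store with 100 transactions, of which 25 contain tea, 90 contain coffee, and 20 contain both (illustrative numbers, not taken from the cited studies):

$$
\mathrm{conf}(\text{tea}\Rightarrow\text{coffee})=\frac{\mathrm{supp}(\{\text{tea},\text{coffee}\})}{\mathrm{supp}(\{\text{tea}\})}=\frac{20}{25}=0.8,\qquad P(\text{coffee})=\frac{90}{100}=0.9 .
$$

The rule looks strong (80% confidence), yet buying tea actually lowers the chance of buying coffee from 0.9 to 0.8, so the rule is misleading.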

To overcome these problems, many interestingness measures have been proposed to find truly interesting patterns [12]. Patterns with a measure value above some given threshold are called interesting patterns with respect to that measure. Among the proposed measures, the cosine similarity has gained particular interest. Indeed, cosine similarity has various merits for association analysis. First, it is one of the few interestingness measures that simultaneously possess the symmetry, null-invariance, and anti-cross-support-pattern properties (details will be given in Section 2.2). Many well-known measures, such as confidence and lift, cannot achieve this. Second, cosine similarity has been widely used as a proximity measure in text mining [41], information retrieval [17], image processing [11], and bio-informatics [20], to avoid the “curse of dimensionality” [4] – a problem we also face in association analysis. Finally, it is very simple and has a clear physical meaning, i.e., it measures the angle between two vectors in the feature space. As a result, in this paper, we select the cosine similarity as the interestingness measure to define interesting patterns.
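For a pair of items A and B, the cosine similarity used in association analysis can be written in terms of supports, and it equals the geometric mean of the two rule confidences – which is why it depends only on transactions containing A or B and is therefore null-invariant (the formal definition for multi-itemsets is given in Section 2):

$$
\cos(A,B)=\frac{\mathrm{supp}(A\cup B)}{\sqrt{\mathrm{supp}(A)\,\mathrm{supp}(B)}}=\sqrt{\mathrm{conf}(A\Rightarrow B)\,\mathrm{conf}(B\Rightarrow A)} .
$$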

Mining cosine interesting patterns is by no means a trivial task. First of all, the traditional definition of cosine similarity applies only to two vectors and must be extended to the case of multi-itemsets. More importantly, the cosine similarity may not hold the anti-monotone property – the key to the success of frequent pattern mining with the support measure – and therefore may not be able to reduce the search space of interesting patterns. One may argue that a POST-evaluation scheme would already work: first generate frequent patterns, and then identify interesting patterns among them using a cosine threshold. While simple, the post-evaluation scheme has the same problem as traditional frequent pattern mining algorithms; that is, rare but interesting patterns will be lost if a high support threshold is set, whereas too many patterns will be generated if the threshold is low.
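One natural way to extend the pairwise formula above to a multi-itemset X is to divide supp(X) by the geometric mean of its single-item supports. This geometric-mean form is assumed in the sketches below purely for illustration; the paper's formal definition appears in Section 2. A minimal Python sketch:

```python
from math import prod

def cosine(itemset, supp):
    """Generalized cosine of an itemset: supp(X) divided by the geometric
    mean of the single-item supports. `supp` maps frozensets to relative
    supports. The geometric-mean form is an assumption for illustration."""
    k = len(itemset)
    joint = supp[frozenset(itemset)]
    geo_mean = prod(supp[frozenset([i])] for i in itemset) ** (1.0 / k)
    return joint / geo_mean

# A cross-support pattern gets a low cosine even though its confidence is high:
supp = {frozenset(["earrings"]): 0.01, frozenset(["milk"]): 0.60,
        frozenset(["earrings", "milk"]): 0.01}
print(cosine(["earrings", "milk"], supp))  # ~0.13, despite conf(earrings => milk) = 1.0
```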

In light of these issues, in this paper we attempt to design an efficient algorithm to mine cosine interesting patterns from large-scale databases. To this end, we first extend the definition of cosine similarity to the scope of multi-itemsets, and highlight some valuable properties of the cosine similarity, such as the null-invariance property and the anti-cross-support-pattern property. We then point out that the cosine similarity does not hold the anti-monotone property, and propose a novel conditional anti-monotone property (CAMP) as an alternative. We argue that a depth-first itemset traversal strategy is more appropriate than a breadth-first strategy when mining cosine interesting patterns using CAMP, and on this basis we propose an algorithm named CosMinert. CosMinert is an FP-growth-like algorithm that uses both the cosine and the support thresholds to prune uninteresting patterns, although its mining of single-path trees differs from the FP-growth algorithm [15]. Extensive experiments on various real-world data sets demonstrate the superiority of CosMinert, in terms of efficiency, over the Apriori-like method (CosMinera) and the post-evaluation method (POST). We also verify the abilities of CosMinert to suppress cross-support patterns and to mine rare but interesting patterns. Finally, we apply CosMinert as a noise-removal tool in a real-world landmark recognition case, where it substantially improves the performance of image clustering. This, in turn, demonstrates the value of the cosine interesting patterns mined by CosMinert.
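To convey the intuition behind the depth-first, support-ascending traversal and CAMP-based pruning, the following simplified sketch enumerates itemsets over a plain transaction list (not the FP-tree used by CosMinert) and uses the geometric-mean cosine assumed above; the function and parameter names are placeholders. Because each extension adds an item whose support is at least as large as those already in the pattern, the cosine can only stay equal or decrease along a branch, so a branch falling below the cosine threshold can be pruned safely.

```python
def mine_cosine_patterns_sketch(transactions, min_supp, min_cos):
    """Simplified depth-first, support-ascending enumeration with
    CAMP-style pruning (illustration only, not the FP-tree-based CosMinert)."""
    n = len(transactions)
    transactions = [set(t) for t in transactions]

    # Single-item relative supports, filtered by the support threshold.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    supp1 = {i: c / n for i, c in counts.items() if c / n >= min_supp}

    # Items are extended in support-ascending order.
    order = sorted(supp1, key=supp1.get)
    results = []

    def grow(pattern, tids, geo_prod, start):
        for idx in range(start, len(order)):
            item = order[idx]
            new_tids = [t for t in tids if item in t]
            supp_x = len(new_tids) / n
            if supp_x < min_supp:
                continue                      # support-based pruning
            new_geo = geo_prod * supp1[item]
            k = len(pattern) + 1
            cos_x = supp_x / new_geo ** (1.0 / k)
            if cos_x < min_cos:
                continue                      # CAMP-style pruning: no support-ascending
                                              # extension can raise the cosine
            new_pattern = pattern + [item]
            if k >= 2:
                results.append((new_pattern, supp_x, cos_x))
            grow(new_pattern, new_tids, new_geo, idx + 1)

    grow([], transactions, 1.0, 0)
    return results
```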

The remainder of this paper is organized as follows. In Section 2, we define and explore the cosine similarity from a pattern mining perspective. In Section 3, we propose the conditional anti-monotone property and suggest its combined use with a depth-first traversal strategy. The CosMinert algorithm is presented in Section 4. Section 5 shows the experimental results, followed by a case study on landmark recognition in Section 6. We finally review related work and conclude in Section 7 and Section 8, respectively.

Section snippets

Cosine similarity: preliminaries and problem definition

In this section, we present the definition of cosine similarity for multi-itemsets from an interestingness measure perspective. Some important properties of cosine similarity will also be discussed. Finally, we formulate the problem to be studied in this paper.

Conditional anti-monotone property for interesting pattern mining

In this section, we present the Conditional Anti-Monotone Property (CAMP), an alternative to the anti-monotone property, for cosine interesting pattern discovery. The itemset traversal schemes coupled with CAMP will also be discussed.

CosMinert: the algorithmic details

CosMinert is essentially an FP-growth-like algorithm for cosine interesting pattern discovery. It can be divided into two main phases: (1) the generation of an FP-tree and (2) the mining of cosine interesting patterns from the FP-tree.

The first phase is almost the same as in the FP-growth algorithm, except that the items in the transactions must be sorted in support-descending order before building the FP-tree. This is crucial for the implementation of the DF-SAT strategy, since the patterns in
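A minimal sketch of this first phase – counting single-item supports, discarding infrequent items, and inserting each transaction with its items in support-descending order – is given below. The class and variable names are illustrative and not CosMinert's actual implementation:

```python
class FPNode:
    """A node of the prefix tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_supp_count):
    # 1) Count single-item supports and drop infrequent items.
    counts = {}
    for t in transactions:
        for i in set(t):
            counts[i] = counts.get(i, 0) + 1
    counts = {i: c for i, c in counts.items() if c >= min_supp_count}

    # 2) Insert each transaction with its items in support-descending order,
    #    so that frequent prefixes are shared near the root.
    root = FPNode(None)
    header = {}  # item -> list of nodes holding that item, used when mining
    for t in transactions:
        items = sorted((i for i in set(t) if i in counts),
                       key=lambda i: (-counts[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, parent=node)
                header.setdefault(i, []).append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header, counts
```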

Experimental results

In this section, we provide experimental results on six real-world data sets. Two other mining algorithms, the Apriori-like method (denoted as CosMinera) and the post-evaluation method (denoted as POST), are adopted for the comparative study. Note that CosMinera employs a breadth-first traversal strategy for interesting pattern mining, as described in Section 3.2. POST is more straightforward; that is, it first mines frequent patterns using the FP-growth algorithm, and then computes
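For clarity, the POST baseline amounts to a two-step filter. The sketch below assumes the frequent patterns and their supports have already been produced by an FP-growth implementation and reuses the geometric-mean cosine assumed earlier; the names are placeholders:

```python
from math import prod

def post_filter(frequent_itemsets, supp, min_cos):
    """POST baseline: keep only those frequent patterns whose generalized
    cosine reaches min_cos. `frequent_itemsets` is an iterable of itemsets
    (of size >= 2); `supp` maps frozensets to relative supports."""
    interesting = []
    for itemset in frequent_itemsets:
        geo = prod(supp[frozenset([i])] for i in itemset) ** (1.0 / len(itemset))
        if supp[frozenset(itemset)] / geo >= min_cos:
            interesting.append(itemset)
    return interesting
```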

An application

In this section, we employ CosMinert as a tool for removing massive noise, in order to enhance cluster analysis for high-dimensional landmark recognition.

Related work

Since its introduction by Agrawal et al. [1], association analysis has been widely used in many application domains [13], [35]. One problem, however, is that the traditional support–confidence framework tends to generate too many rules, many of which are uninteresting.

To cope with this problem, many interestingness measures have been proposed or borrowed to mine the truly interesting patterns. Piatetsky-Shapiro proposed the statistical independence of rules as an interestingness

Conclusions

This paper studied the problem of mining cosine interesting patterns from large-scale databases. The conditional anti-monotone property, together with the depth-first traversal strategy, was proposed to elicit a novel FP-growth-like algorithm: CosMinert. Extensive experiments with comparative studies demonstrated the strengths of CosMinert. As an application example, CosMinert was successfully applied to a landmark recognition task where a huge volume of image noise was present.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 71072172, 71372188 and 61103229, National Center for International Joint Research on E-Business Information Processing under Grant 2013B01035, National Key Technologies R&D Program of China under Grant 2013BAH16F00, the National Soft Science Research Program under Grant 2013GXS4B081, Industry Projects in Jiangsu S&T Pillar Program under Grant BE2012185, and the Natural Science

References (41)

  • Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining association rules between sets of items in large databases, in:...
  • Carol Alexander, Market Models: A Guide to Financial Data Analysis (2001)
  • Richard Bellman et al., Dynamic Programming (1957)
  • Julien Blanchard, Fabrice Guillet, Regis Gras, Henri Briand, Using information-theoretic measures to assess association...
  • Sergey Brin, Rajeev Motwani, Craig Silverstein, Beyond market basket: generalizing association rules to correlations,...
  • Stanley Burris et al., A Course in Universal Algebra (1981)
  • Yin-Ling Cheung et al., Mining frequent itemsets without support threshold: with and without item constraints, IEEE Trans. Knowl. Data Eng. (2004)
  • Jacob Cohen et al., Applied Multiple Regression/Correlation Analysis for the Behavioral Science (2002)
  • Liqiang Geng et al., Interestingness measures for data mining: a survey, ACM Comput. Surv. (2006)
  • Jiawei Han et al., Frequent pattern mining: current status and future directions, Data Min. Knowl. Discov. (2007)