Scaling up cosine interesting pattern discovery: A depth-first method

doi:10.1016/j.ins.2013.12.062

Information Sciences

Volume 266, 10 May 2014, Pages 31-46

Abstract

This paper presents an efficient algorithm called ${CosMiner}_{t}$ for interesting pattern discovery. The widely used cosine similarity, which possesses the null-invariance property and the anti-cross-support-pattern property, is adopted as the interestingness measure in ${CosMiner}_{t}$. ${CosMiner}_{t}$ is essentially an FP-growth-like depth-first traversal algorithm that rests on an important property of the cosine similarity: the conditional anti-monotone property (CAMP). The combined use of CAMP and the depth-first support-ascending traversal strategy enables the pre-pruning of uninteresting patterns during the mining process of ${CosMiner}_{t}$. Extensive experiments demonstrate the high efficiency of ${CosMiner}_{t}$ in interesting pattern discovery, in comparison to the breadth-first strategy and the post-evaluation strategy. In particular, ${CosMiner}_{t}$ shows its capability in suppressing the generation of cross-support patterns and discovering rare but truly interesting patterns. Finally, an interesting case of landmark recognition is presented to illustrate the value of cosine interesting patterns found by ${CosMiner}_{t}$ in real-world applications.

Introduction

Since the introduction of frequent patterns in the early 1990s [1], association analysis has become one of the core problems in the data mining and database communities [2]. Given a large set of items (objects) and the observation data about co-occurring items, association analysis is concerned with the identification of strongly correlated subsets of items [27]. It plays an important role in many application domains such as market-basket analysis [3], [8], recommender systems [22], telecommunication network alarm analysis [26], image processing [25], climate studies [32], public health [10], and bio-informatics [38].

Only recently have researchers found that the classic support–confidence framework of association analysis has some intrinsic defects. For instance, the well-known “coffee–tea” example reveals that confidence, as an interestingness measure for association rules, may not disclose truly interesting relationships [13], [14]. This framework also often generates too many frequent patterns and even more association rules, and finding useful patterns among them is itself a great challenge. Meanwhile, the support-based pruning strategy is not effective for data sets with skewed support distributions [40]: if the minimum support threshold is relatively low, we may extract too many spurious patterns involving items with substantially different support levels, such as {earrings, milk}, in which the support of earrings is much lower than that of milk. Conversely, if the threshold is high, we may miss many interesting but relatively infrequent patterns [16], such as {earrings, gold ring, bracelet}, which contains rare but truly valuable items.

To overcome these problems, many interestingness measures have been proposed to find truly interesting patterns [12]. Patterns with a measure value above some given threshold are called interesting patterns with respect to that measure. Among the proposed measures, the cosine similarity has attracted particular interest, as it has several merits for association analysis. First, it is one of the few interestingness measures that hold the symmetry, null-invariance and anti-cross-support-pattern properties simultaneously (details will be given in Section 2.2). Many well-known measures, such as confidence and lift, cannot achieve this. Second, cosine similarity has been widely used as a proximity measure in text mining [41], information retrieval [17], image processing [11], and bio-informatics [20], to avoid the “curse of dimensionality” [4] – a problem we also face in association analysis. Finally, it is very simple and has a physical meaning, i.e., it measures the angle between two vectors in the feature space. Therefore, in this paper, we select the cosine similarity as the interestingness measure to define interesting patterns.
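As a small illustration of the null-invariance claim, the sketch below computes the pairwise cosine of two items from their binary occurrence vectors, which reduces to supp(AB) / sqrt(supp(A) · supp(B)), and compares it with lift. The function names and the toy transactions are hypothetical and not taken from the paper; the example only covers the two-item case, not the multi-itemset extension discussed later.

```python
from math import sqrt

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def cosine_pair(transactions, a, b):
    """Cosine of the binary occurrence vectors of items a and b."""
    denom = sqrt(support(transactions, [a]) * support(transactions, [b]))
    return support(transactions, [a, b]) / denom if denom else 0.0

def lift_pair(transactions, a, b):
    """Lift of items a and b, shown only for contrast with cosine."""
    denom = support(transactions, [a]) * support(transactions, [b])
    return support(transactions, [a, b]) / denom if denom else 0.0

# Toy data (illustrative assumption): a rare but coherent pair and a
# cross-support pair that shares a single transaction.
base = [{"earrings", "gold ring"}] * 3 + [{"milk"}] * 6 + [{"earrings", "milk"}]
# Null transactions contain neither item of the pair under study.
padded = base + [set()] * 90

for name, data in (("base", base), ("padded", padded)):
    print(name,
          round(cosine_pair(data, "earrings", "gold ring"), 3),
          round(lift_pair(data, "earrings", "gold ring"), 3))
```

On this toy data the cosine of {earrings, gold ring} is unaffected by the 90 added null transactions, whereas lift grows tenfold; the cross-support pair {earrings, milk} receives a low cosine value.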

Mining cosine interesting patterns is by no means a trivial task. First of all, the traditional definition of cosine similarity covers only two vectors and must be extended to the case of multi-itemsets. More importantly, the cosine similarity may not hold the anti-monotone property – the key to the success of frequent pattern mining using the support measure – and therefore may not be able to reduce the search space of interesting patterns. One may argue that a post-evaluation (POST) scheme already works: first generate frequent patterns, and then identify the interesting ones using a cosine threshold. While simple, the post-evaluation scheme has the same problem as traditional frequent pattern mining algorithms; that is, rare but interesting patterns will be lost if a high support threshold is set, whereas too many patterns will be generated for a low threshold.

In light of this, we attempt to design an efficient algorithm to mine cosine interesting patterns from large-scale databases. To this end, we first extend the definition of cosine similarity to multi-itemsets, and highlight some valuable properties of the cosine similarity, such as the null-invariance property and the anti-cross-support-pattern property. We then point out that the cosine similarity does not hold the anti-monotone property, and propose a novel conditional anti-monotone property (CAMP) as an alternative. We argue that a depth-first itemset traversal strategy is more appropriate than a breadth-first strategy when mining cosine interesting patterns using CAMP, and on this basis propose an algorithm named ${CosMiner}_{t}$. ${CosMiner}_{t}$ is an FP-growth-like algorithm that uses the cosine as well as support thresholds to prune uninteresting patterns, although its mining of single-path trees differs from that of the FP-growth algorithm [15]. Extensive experiments on various real-world data sets demonstrate the superiority of ${CosMiner}_{t}$ over the Apriori-like method (${CosMiner}_{a}$) and the post-evaluation method (POST) in terms of efficiency. We also verify the ability of ${CosMiner}_{t}$ to suppress cross-support patterns and to mine rare but interesting patterns. Finally, we apply ${CosMiner}_{t}$ as a noise-removal tool to a real-world landmark recognition case, where it substantially improves the performance of image clustering. This, in turn, demonstrates the value of the cosine interesting patterns mined by ${CosMiner}_{t}$.

The remainder of this paper is organized as follows. In Section 2, we define and explore the cosine similarity from a pattern mining perspective. In Section 3, we propose the conditional anti-monotone property and suggest the combined use of a depth-first traversal strategy. The ${CosMiner}_{t}$ algorithm is proposed in Section 4. Section 5 shows the experimental results, followed by a case study on landmark recognition in Section 6. We finally present the related work and conclude our work in Section 7 and Section 8, respectively.

Section snippets

Cosine similarity: preliminaries and problem definition

In this section, we present the definition of cosine similarity for multi-itemsets from an interestingness measure perspective. Some important properties of cosine similarity are also discussed. Finally, we formulate the problem to be studied in this paper.

Conditional anti-monotone property for interesting pattern mining

In this section, we present the Conditional Anti-Monotone Property (CAMP), an alternative to the anti-monotone property, for cosine interesting pattern discovery. The itemset traversal schemes coupled with CAMP are also discussed.
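To make the idea of a conditional anti-monotone property concrete, the following is a minimal sketch under a stated assumption: it takes the cosine of an itemset to be its support divided by the geometric mean of its items' supports, a definition this excerpt does not spell out. Under that assumption, adding an item whose support is at least as large as every item already in the pattern cannot raise the cosine, so a depth-first search that extends patterns in support-ascending order can discard a pattern and all of its extensions as soon as it falls below the threshold. The function `mine_cosine_patterns`, its parameters, and the toy data are hypothetical; this is a plain tidset-based enumerator, not the FP-tree-based ${CosMiner}_{t}$.

```python
from math import prod

def mine_cosine_patterns(transactions, min_cos, min_supp_count=1):
    """Depth-first, support-ascending search with cosine pre-pruning."""
    # Map each item to the set of transaction ids that contain it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    count = {i: len(s) for i, s in tidsets.items()}
    # Support-ascending item order; ties broken by item name.
    order = sorted(count, key=lambda i: (count[i], i))
    found = {}

    def cosine(items, supp_count):
        # Assumed definition: support over geometric mean of item supports.
        geo_mean = prod(count[i] for i in items) ** (1.0 / len(items))
        return supp_count / geo_mean

    def extend(prefix, tids, start):
        for k in range(start, len(order)):
            item = order[k]
            new_tids = tids & tidsets[item]
            if len(new_tids) < min_supp_count:
                continue  # support pruning (anti-monotone)
            pattern = prefix + [item]
            cos = cosine(pattern, len(new_tids))
            if cos < min_cos:
                continue  # cosine pruning: later items have support >= current ones
            if len(pattern) > 1:
                found[frozenset(pattern)] = cos
            extend(pattern, new_tids, k + 1)

    extend([], set(range(len(transactions))), 0)
    return found

# Toy data (illustrative assumption).
toy = [
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "milk"},
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"},
    {"milk"},
    {"milk", "earrings"},
]
patterns = mine_cosine_patterns(toy, min_cos=0.6)
for p, c in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(sorted(p), round(c, 3))
```

With a support threshold of a single transaction, the sketch still keeps the rare pattern {earrings, gold ring, bracelet} while dropping the cross-support pair {earrings, milk}, which is the behaviour the text attributes to cosine-based mining.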

${CosMiner}_{t}$ : the algorithmic details

${CosMiner}_{t}$ is essentially an FP-growth-like algorithm for cosine interesting pattern discovery. It can be divided into two main phases: (1) the generation of an FP-tree and (2) the mining of cosine interesting patterns from the FP-tree.

The first phase is almost the same as the FP-growth algorithm, except that the items in the transactions must be sorted in a support-descending order before building the FP-tree. This is crucial for the implementation of the DF-SAT strategy, since the patterns in
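The first phase described above can be sketched as a standard FP-tree construction in which each transaction is sorted in support-descending order before insertion. This is a minimal sketch, not the paper's implementation: the class `FPNode`, the functions `build_fp_tree` and `path_to_root`, the tie-breaking by item name, and the toy data are all hypothetical.

```python
from collections import Counter

class FPNode:
    """One node of a prefix tree over support-sorted transactions."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_supp_count=1):
    # First scan: count item supports and drop infrequent items.
    counts = Counter(i for t in transactions for i in set(t))
    keep = {i for i, c in counts.items() if c >= min_supp_count}
    root = FPNode(None)
    header = {}  # item -> list of tree nodes carrying that item
    # Second scan: insert each transaction in support-descending order.
    for t in transactions:
        items = sorted((i for i in set(t) if i in keep),
                       key=lambda i: (-counts[i], i))
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, node)
                node.children[i] = child
                header.setdefault(i, []).append(child)
            child.count += 1
            node = child
    return root, header, counts

def path_to_root(node):
    """Items met when walking from a node's parent up to the root."""
    items = []
    node = node.parent
    while node is not None and node.item is not None:
        items.append(node.item)
        node = node.parent
    return items

# Toy data (illustrative assumption).
toy = [
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "bracelet"},
    {"earrings", "gold ring", "milk"},
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"},
    {"milk"},
    {"milk", "earrings"},
]
root, header, counts = build_fp_tree(toy)
print(path_to_root(header["bracelet"][0]))
```

Because high-support items sit near the root, walking upward from any node visits items of non-decreasing support, which is the order a support-ascending depth-first traversal needs.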

Experimental results

In this section, we provide experimental results on six real-world data sets. Two other mining algorithms, the Apriori-like method (denoted as ${CosMiner}_{a}$) and the post-evaluation method (denoted as POST), are adopted for the comparative study. Note that ${CosMiner}_{a}$ employs a breadth-first traversal strategy for interesting pattern mining, as described in Section 3.2. POST is more straightforward; that is, it first mines frequent patterns using the FP-growth algorithm, and then computes

An application

In this section, we employ ${CosMiner}_{t}$ , as a massive-noise removal tool, to enhance the cluster analysis for high-dimensional landmark recognition.

Related work

Since the early introduction by Agrawal et al. [1], association analysis has been widely used in many application domains [13], [35]. But one problem is that the traditional support–confidence framework tends to generate too many rules – many of them are indeed uninteresting to us.

To cope with this problem, many interestingness measures have been proposed or borrowed to mine the truly interesting patterns. Piatetsky-Shapiro proposed the statistical independence of rules as an interestingness

Conclusions

This paper studied the problem of mining cosine interesting patterns from large-scale databases. The conditional anti-monotone property and the depth-first traversal strategy were proposed, leading to a novel FP-growth-like algorithm: ${CosMiner}_{t}$. Extensive experiments with comparative studies demonstrated the strengths of ${CosMiner}_{t}$. As an application example, ${CosMiner}_{t}$ was successfully applied to a landmark recognition task in which a huge volume of image noise was present.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 71072172, 71372188 and 61103229, National Center for International Joint Research on E-Business Information Processing under Grant 2013B01035, National Key Technologies R&D Program of China under Grant 2013BAH16F00, the National Soft Science Research Program under Grant 2013GXS4B081, Industry Projects in Jiangsu S&T Pillar Program under Grant BE2012185, and the Natural Science

References (41)

Chowdhury Farhan Ahmed et al., A framework for mining interesting high utility patterns with a strong frequency affinity, Inform. Sci. (2011)
Hui Chen et al., Mining frequent patterns in a varying-size sliding window of online transactional data streams, Inform. Sci. (2012)
Xinbo Gao et al., Image categorization: graph edit distance + edge direction histogram, Patt. Recog. (2008)
Ben He et al., Modeling term proximity for probabilistic information retrieval models, Inform. Sci. (2011)
Yaochun Huang et al., Mining maximal hyperclique pattern: a hybrid search strategy, Inform. Sci. (2007)
Ahmad A. Kardan et al., A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups, Inform. Sci. (2013)
Jing Li et al., A probabilistic model for image representation via multiple patterns, Patt. Recog. (2012)
Tongyan Li et al., Novel alarm correlation analysis system based on association rules mining in telecommunication networks, Inform. Sci. (2010)
Ming-Yen Lin et al., High utility pattern mining using the maximal itemset property and lexicographic tree structures, Inform. Sci. (2012)
Junjie Wu et al., Cosine interesting pattern discovery, Inform. Sci. (2012)
Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining association rules between sets of items in large databases, in:...
Carol Alexander, Market Models: A Guide to Financial Data Analysis (2001)
Richard Bellman et al., Dynamic Programming (1957)
Julien Blanchard, Fabrice Guillet, Regis Gras, Henri Briand, Using information-theoretic measures to assess association...
Sergey Brin, Rajeev Motwani, Craig Silverstein, Beyond market basket: generalizing association rules to correlations,...
Stanley Burris et al., A Course in Universal Algebra (1981)
Yin-Ling Cheung et al., Mining frequent itemsets without support threshold: with and without item constraints, IEEE Trans. Knowl. Data Eng. (2004)
Jacob Cohen et al., Applied Multiple Regression/Correlation Analysis for the Behavioral Science (2002)
Liqiang Geng et al., Interestingness measures for data mining: a survey, ACM Comput. Surv. (2006)
Jiawei Han et al., Frequent pattern mining: current status and future directions, Data Min. Knowl. Discov. (2007)

