
Knowledge-Based Systems

Volume 53, November 2013, Pages 40-50

Finding multiple global linear correlations in sparse and noisy data sets

https://doi.org/10.1016/j.knosys.2013.08.015

Abstract

Finding linear correlations is an important research problem with numerous real-world applications. In real-world data sets, a linear correlation may not hold over the entire data set; some linear correlations are only visible in certain data subsets. On one hand, many local correlation clustering algorithms assume that the data points of a linear correlation are locally dense, and may therefore miss global correlations whose points are sparsely distributed. On the other hand, existing global correlation clustering methods may fail when the data set contains a large number of non-correlated points or when the actual correlations are coarse. This paper proposes a simple and fast algorithm, DCSearch, for finding multiple global linear correlations in a data set. The algorithm is able to find coarse, global linear correlations in noisy and sparse data sets. Using the classical divide and conquer strategy, it first divides the data set into subsets to reduce the search space, and then recursively searches and prunes the candidate correlations from the subsets. Empirical studies show that DCSearch efficiently reduces the number of candidate correlations during each iteration. Experimental results on both synthetic and real data sets demonstrate that DCSearch is effective and efficient in finding global linear correlations in sparse and noisy data sets.

Introduction

Linear correlations reveal the linear dependencies among several features in a data set. Finding these correlations is an interesting research problem with many real-world applications. Extensive research efforts have been devoted to mining linear correlation patterns over data sets [34], [8], [35], [27]. Most of them focus on the entire data set with the full feature space. However, in real-world applications, linear correlations may appear only in some subsets of the entire data set.

A linear correlation in a Euclidean space is equivalent to a linear subspace. Given a data space $\mathbb{R}^d$, a subspace $S$ can be formulated by the parametric linear equation $x = x_0 + t_1 v_1 + t_2 v_2 + \cdots + t_k v_k$, where $S$ is spanned by $k$ orthogonal $d$-dimensional vectors $v_1, \ldots, v_k$, the parameters $t_1, \ldots, t_k$ are scalars, $1 \le k < d$, and $x_0$ is a point in $S$. In this paper, "linear correlation", "linear subspace" and "correlation pattern" mean the same thing and are used interchangeably.

Fig. 1 shows an example data set $D$ with two different subspaces $S_1$ and $S_2$ in $\mathbb{R}^3$. The data set contains three subsets, $D = D_1 \cup D_2 \cup D_3$. $D_1$ and $D_2$ lie in $S_1$ and $S_2$ respectively, and $D_3$ is the set of non-correlated points. $S_1$ and $S_2$ are defined below:

$S_1: x = (-13, 4, -9)^T + t_1 (0.9, 0.2, -0.4)^T + t_2 (0.2, 0.5, 0.8)^T$
$S_2: x = (-4, 18, 33)^T + t_1 (0.2, 0.3, 0.9)^T + t_2 (-0.9, 0.4, 0.1)^T$
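
For concreteness, the following minimal sketch (Python with NumPy) generates a data set like that of Fig. 1. The sampling ranges, point counts and coarseness value are illustrative assumptions, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)

def sample_subspace(x0, v1, v2, n, coarseness=0.0):
    # sample n points x = x0 + t1*v1 + t2*v2, blurred by Gaussian noise
    # of scale `coarseness` (the correlation's epsilon)
    t = rng.uniform(-20, 20, size=(n, 2))
    points = x0 + t[:, :1] * v1 + t[:, 1:] * v2
    return points + rng.normal(scale=coarseness, size=points.shape)

D1 = sample_subspace(np.array([-13.0, 4.0, -9.0]), np.array([0.9, 0.2, -0.4]),
                     np.array([0.2, 0.5, 0.8]), n=200, coarseness=0.5)
D2 = sample_subspace(np.array([-4.0, 18.0, 33.0]), np.array([0.2, 0.3, 0.9]),
                     np.array([-0.9, 0.4, 0.1]), n=200, coarseness=0.5)
D3 = rng.uniform(-40, 40, size=(100, 3))   # non-correlated points
D = np.vstack([D1, D2, D3])                # D = D1 ∪ D2 ∪ D3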

In this paper, our goal is to find the global linear correlations, such as $S_1$ and $S_2$ in Fig. 1. A straightforward approach is to enumerate all possible subspaces of the data set. If the data set $D$ is of size $N$ in $\mathbb{R}^d$, every $k+1$ points can construct a $k$-dimensional subspace, $k < d$, so the total number of subspaces is $\sum_{k=1}^{d-1} \binom{N}{k+1}$. An alternative approach is to enumerate all data subsets, of which there are $2^N$.
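
To see how quickly these brute-force search spaces grow, here is a quick back-of-the-envelope calculation in plain Python; the values N = 1000 and d = 3 are arbitrary illustrations.

from math import comb

N, d = 1000, 3
# every k+1 points span a k-dimensional subspace, 1 <= k < d
n_subspaces = sum(comb(N, k + 1) for k in range(1, d))
print(f"{n_subspaces:.3e}")   # about 1.667e+08 candidate subspaces
print(f"2^{N} data subsets")  # astronomically many candidate subsets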

Many previous studies on subspace clustering, projected clustering and correlation clustering are related to this problem [8], [7], [9], [33], [22], [18], [10], [14], [21], [13], [3], [5], [36], [37]. Some works on projected clustering and subspace clustering only find axis-parallel subspaces, not arbitrarily oriented ones [9], [22], [18], [10]. Other works assume that the data points of a correlation are close to one another [34], [2], [1], [35], [16]. However, in Fig. 1, $S_1$ has both dense areas and sparse areas. Local linear correlation clustering algorithms are able to identify the correlations in the dense areas, but have difficulty capturing the correlations in the sparse areas. Without the "locality assumption", finding the global correlations is regarded as a "chicken-and-egg" problem [29]: we do not know which subspace contains a data subset of a certain size, or which data subset fits a subspace. Therefore, this problem is more challenging.

CASH is a recent algorithm that can identify correlations in sparse data sets [2], [1]. It is based on the Hough transform: it first converts the data space into a parametric space, and then searches for dense areas in the parametric space. A dense area indicates that many data points are associated with the subspaces corresponding to those parameters. The Hough transform is a traditional method for detecting lines in images; however, it has a well-known limitation in that it cannot deal with higher-dimensional spaces, even of 4 or 5 dimensions [25]. The CASH algorithm applies a binary search strategy to identify the dense grid cells in the parametric space without exploring all cells in the Hough space [2]. There are two types of noise in a data set. One is non-correlated data points, i.e., points far away from the correlations. The other is the coarseness of a correlation, i.e., its blurring: the correlated data points may not lie exactly on the correlation hyperplane, but within a small distance ($\epsilon > 0$) of it. CASH is robust to non-correlated data points. However, the Hough transform in the CASH algorithm does not apply any smoothing accumulator for grid voting, and many previous studies on the Hough transform in image processing have pointed out that, without a smoothing accumulator, the Hough transform is very sensitive to the coarseness of the actual correlations [11], [26].
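
For intuition, below is a minimal 2-D Hough-transform sketch (Python with NumPy; the grid resolutions are arbitrary), illustrating the classical method rather than CASH itself: each point votes for all parameter cells $(\theta, \rho)$ with $\rho = x \cos\theta + y \sin\theta$, and a heavily voted cell corresponds to a line shared by many points. Note that a coarse (blurred) line scatters its votes over neighbouring cells, which is why an unsmoothed accumulator is sensitive to coarseness.

import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    # accumulator over the (theta, rho) parameter space
    points = np.asarray(points, dtype=float)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.abs(points).max() * np.sqrt(2)   # bound on |rho|
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        # every line through (x, y) satisfies rho = x cos(theta) + y sin(theta)
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        bins = ((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        acc[np.arange(n_theta), bins] += 1        # one vote per (theta, rho) cell
    return acc, thetas, rho_max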

Generalized Principal Component Analysis (GPCA) [30], [17], [29], [28] formulates the subspace clustering problem as a polynomial factorization problem. Since each subspace represents a linear equation $b_i^T x = 0$, GPCA tries to find the coefficients $b_i$. It constructs an integrated equation $p_n(x) = \prod_{i=1}^{n} (b_i^T x) = 0$, where $n$ is the number of actual subspaces and the $b_i$ are considered as variables. Once we compute the factors of the equation, we obtain all the subspaces $b_i^T x = 0$. However, as shown in Fig. 1, not all points lie in $S_1$ or $S_2$; there are non-correlated (noisy) points in the data set. Every $k+1$ non-correlated points can construct a noisy $k$-dimensional subspace, so in Fig. 1, $n$ is close to $\binom{|D_{non}|}{3}$, where $D_{non}$ is the set of non-correlated points in $D$. The number of monomials of the integrated equation, $R_n(d) = \binom{n+d-1}{n}$, increases exponentially with $n$ [30]. Generally, GPCA is developed to find the principal linear subspaces of the data set, but the true correlations may not be "principal" in a noisy data set, since most data points are not on any correlation hyperplane. Therefore, GPCA may not be able to find the true correlations in noisy data sets.
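
As a toy illustration of the GPCA idea (Python with SymPy; the normals b1 and b2 are made-up values), for $n = 2$ one-dimensional subspaces in $\mathbb{R}^2$ the union of $b_1^T x = 0$ and $b_2^T x = 0$ is the zero set of a single degree-2 polynomial, and factoring that polynomial recovers the subspace coefficients:

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
b1, b2 = (1, -2), (3, 1)    # hypothetical subspace normals
# integrated equation p_2(x) = (b1^T x)(b2^T x)
p2 = sp.expand((b1[0]*x1 + b1[1]*x2) * (b2[0]*x1 + b2[1]*x2))
print(p2)                   # 3*x1**2 - 5*x1*x2 - 2*x2**2
print(sp.factor(p2))        # (x1 - 2*x2)*(3*x1 + x2): recovers b1 and b2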

LMCLUS is an approximation algorithm for global linear correlation clustering [16]. It samples a number of points in the data set to find the subspaces. LMCLUS randomly picks $m$ points from the data set, and the probability that the $m$ points include at least $k$ points from one subspace is $p_r = 1 - [1 - (1/n)^k]^m$, where $n$ is the number of actual correlation subspaces. As shown in Fig. 1, due to the large amount of non-correlated data points, the actual number of subspaces may be very large: every $k+1$ non-correlated points can construct a noisy $k$-dimensional subspace, so $n$ is at least $\binom{|D_{non}|}{k+1}$, where $D_{non}$ is the set of non-correlated points. Therefore, the number of sampled points $m$ must be very large to maintain a high probability $p_r$.
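
A quick calculation of the stated bound (plain Python; the target probability 0.95 and the values of n and k are illustrative) shows how fast the required sample size m grows with n:

import math

def min_sample_size(n, k, target_pr=0.95):
    # smallest m with 1 - (1 - (1/n)**k)**m >= target_pr
    hit = (1.0 / n) ** k    # chance one subspace contributes all k points
    return math.ceil(math.log(1.0 - target_pr) / math.log(1.0 - hit))

for n in (2, 10, 100):
    print(n, min_sample_size(n, k=3))   # 23, 2995, then about 3.0e6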

In practice, people only care about correlations having at least a certain number of data points; a correlation covering very few points may be a mere coincidence. This criterion is similar to the minimum support of an association rule. In this paper, we apply the divide and conquer strategy to find the linear correlations having at least a certain number of data points. Given a data set $D$ consisting of $N$ data points in $\mathbb{R}^d$, let $D = D_1 \cup D_2$ with $|D_1| = |D_2|$. If a subspace $S$ contains at least $m$ points of $D$, then $S$ must contain at least $\lfloor m/2 \rfloor$ points in $D_1$ or in $D_2$ (for example, with $m = 100$, any qualifying subspace covers at least 50 points in one of the two halves). Thus, we first search for the subspaces having at least $\lfloor m/2 \rfloor$ points in $D_1$ and in $D_2$. Then we combine the two results and check whether each candidate subspace has at least $m$ points in $D$, pruning the unqualified candidates for $D$. The pruning process is executed at each level of the recursion, so the divide and conquer strategy reduces the search space.

If a data subset fits a $(k-1)$-dimensional subspace, it must fit a $k$-dimensional subspace as well [29], $k \ge 1$. Therefore, in this paper, we only focus on finding data subsets fitting $(k-1)$-dimensional subspaces of the $k$-dimensional data space. Data subsets fitting lower-dimensional subspaces can be discovered by recursively applying this method.

We develop a simple and fast algorithm, DCSearch, to solve the problem of finding multiple global linear correlations in a data set. DCSearch is able to find coarse, global linear correlations in sparse and noisy data sets. In contrast to local linear correlation clustering, global linear correlation does not require the "locality assumption". We show that each global linear correlation can be captured by the algorithm in noisy data sets. Empirical studies show that DCSearch efficiently reduces the number of candidate correlations at each iteration. Experiments on both synthetic and real data sets show that DCSearch is more effective and efficient than existing methods in finding multiple global linear correlations in sparse and noisy data sets.

Section snippets

Related work

In the classification literature, discriminant analysis approaches are well known to be able to learn discriminative feature correlations and transformations [15], [19]. However, our work focuses on finding linear correlations in an unsupervised way. Subspace clustering and projected clustering have attracted a great deal of previous research. The goal of subspace clustering is to divide the data set into several subsets such that each subset belongs to a subspace. The studies of subspace clustering

Definition of global linear correlation

The main goal of this paper is to find the linear correlations that may be interesting to the users. Intuitively, the interestingness of a correlation can be determined by how many data points are described by it: a correlation draws people's attention only if many data points fit it; otherwise, the correlation has little meaning. Recently, [1], [2] proposed the concept of global linear correlation. In this paper, we propose the definition of the global

Divide and conquer strategy

In the discussion of the related work, we mentioned that many existing methods work well on small data sets for finding global correlations, but as the data size increases, their search space grows exponentially with the dimensionality. Therefore, we apply the divide and conquer strategy to handle large data sets. In general, a divide and conquer strategy is composed of two procedures: recursion and a base case. Recursion is a procedure to extract the problem solution from the

DCSearch algorithm

In this section, we present the algorithm proposed in this paper. Theorem 1 provides a strategy for searching global linear correlations from subsets of the entire data set.

Algorithm 1

DCSearch (D, ε, δ, θ, dmin)

Parameter: D: the data set; ε, δ, θ: thresholds;
1: if ∣D∣ ⩽ dmin then
2:  corrs ← SampleCorrSpace (D)
3: else
4:  Arbitrarily bisect D into D1 and D2
5:  corrs1 ← DCSearch (D1, ε, δ, θ/2, dmin)
6:  corrs2 ← DCSearch (D2, ε, δ, θ/2, dmin)
7:  corrs ← corrs1 ∪ corrs2
8:  Remove redundant correlations in corrs with θ.
9:  for each S ∈
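
For illustration, the following is a minimal, self-contained sketch of this recursion in Python with NumPy, specialized to $(d-1)$-dimensional hyperplanes. The helpers sample_corr_space and merge_similar are simplified stand-ins for the paper's base-case search and redundancy removal, not the authors' exact procedures; following the parameter list of Algorithm 1, ε is taken to be the coarseness distance and θ the minimum support, while assigning δ the role of a hyperplane-similarity threshold is an assumption here.

import numpy as np

rng = np.random.default_rng(0)

def fit_hyperplane(points):
    # least-squares hyperplane (unit normal n, offset c) with n·x = c
    center = points.mean(axis=0)
    normal = np.linalg.svd(points - center)[2][-1]   # direction of least variance
    return normal, float(normal @ center)

def sample_corr_space(D, trials=50):
    # base case: fit candidate hyperplanes through random d-point samples
    d = D.shape[1]
    if len(D) < d:
        return []
    return [fit_hyperplane(D[rng.choice(len(D), d, replace=False)])
            for _ in range(trials)]

def merge_similar(corrs, delta):
    # keep one representative of near-parallel, near-coincident hyperplanes
    kept = []
    for n1, c1 in corrs:
        if not any(abs(abs(n1 @ n2) - 1.0) < delta and abs(abs(c1) - abs(c2)) < delta
                   for n2, c2 in kept):
            kept.append((n1, c1))
    return kept

def support(S, D, eps):
    # number of points of D within distance eps of hyperplane S
    normal, offset = S
    return int(np.sum(np.abs(D @ normal - offset) <= eps))

def dc_search(D, eps, delta, theta, d_min):
    if len(D) <= d_min:                                    # line 1: base case
        return sample_corr_space(D)
    half = len(D) // 2                                     # line 4: arbitrary bisection
    corrs = (dc_search(D[:half], eps, delta, theta / 2, d_min)
             + dc_search(D[half:], eps, delta, theta / 2, d_min))
    corrs = merge_similar(corrs, delta)                    # line 8: drop redundancy
    # lines 9 ff.: prune candidates lacking theta-point support in the whole D
    return [S for S in corrs if support(S, D, eps) >= theta]

For instance, on a data set like Fig. 1, one might call dc_search(D, eps=0.5, delta=0.01, theta=150, d_min=30); these parameter values are guesses for illustration only.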

Experiments

In this section, we test our algorithm on synthetic and real data sets. For the synthetic data sets, since we have the ground truth of the true correlations, we precisely compare the absolute errors between the discovered correlations and the true correlations for each algorithm. For the real data sets, we do not have the ground truth, so we can only compare the correlations discovered by the different algorithms. Experiments are performed on a 2.83 GHz PC with 1 GB of memory running Windows XP.

Conclusions

In this paper, we investigate the problem of finding multiple global linear correlations in sparse and noisy data sets. We propose a simple and fast algorithm, DCSearch, for this problem. The algorithm is able to find coarse, global linear correlations in noisy and sparse data sets. Using the classical divide and conquer strategy, it first divides the data set into subsets to reduce the search space, and then recursively searches and

Acknowledgements

The work of S. Zhu is supported by the National Natural Science Foundation of China under Grants 61070151 and 61373147, and the work of L. Tang and T. Li is supported in part by National Science Foundation under Grants HRD-0833093, CNS-1126619 and IIS-1213026, US Department of Homeland Security under Grant Award No. 2010-ST-062-000039, and Army Research Office under Grant No. W911NF-10-1-0366 and W911NF-12-1-0431.

References (37)

  • D.H. Ballard

    Generalizing the Hough transform to detect arbitrary shapes

    Pattern Recogn.

    (1981)
  • S. Shapiro

    Properties of transforms for the detection of curves in noisy pictures

    Comput. Graph. Image Process

    (1978)
  • K.-G. Woo et al.

    Findit: a fast and intelligent subspace clustering algorithm using dimension voting

    Inform. Software Technol.

    (2004)
  • E. Achtert et al.

    Global correlation clustering based on the Hough transform

    Stat. Anal. Data Min.

    (2008)
  • E. Achtert, C. Bohm, J. David, P. Kroger, A. Zimek, Robust clustering in arbitrarily oriented subspaces, in:...
  • E. Achtert, C. Bohm, H.-P. Kriegel, P. Kroger, A. Zimek, Deriving quantitative models for correlation clusters, in:...
  • E. Achtert, C. Bohm, H.-P. Kriegel, P. Kroger, A. Zimek, On exploring complex relationships of correlation clusters,...
  • E. Achtert, C. Bohm, H.-P. Kriegel, P. Kroger, A. Zimek, Robust, complete, and efficient correlation clustering, in:...
  • E. Achtert, C. Bohm, P. Kroger, A. Zimek, Mining hierarchies of correlation clusters, in: Proceedings of the...
  • C.C. Aggarwal, C.M. Procopiuc, J.L. Wolf, P.S. Yu, J.S. Park, Fast algorithms for projected clustering, in: Proceedings...
  • C.C. Aggarwal, P.S. Yu, Finding generalized projected clusters in high dimensional spaces, in: Proceedings of ACM...
  • R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data...
  • I. Assent, R. Krieger, E. Muller, T. Seidl, Dusc: Dimensionality unbiased subspace clustering, in: Proceedings of ICDM...
  • A. Bjoerck et al.

    Numerical methods for computing angles between linear subspaces

    Math. Comput.

    (1973)
  • C. Bohm, K. Kailing, P. Kroger, A. Zimek. Computing clusters of correlation connected objects, in: Proceedings of ACM...
  • C.H. Cheng, A.W.-C. Fu, Y. Zhang, Entropy-based subspace clustering for mining numerical data, in: Proceedings of ACM...
  • K. Fukunaga

    Introduction to Statistical Pattern Recognition

    (1990)
  • R. Harpaz et al.

    Linear manifold correlation clustering

    Int. J. Inform. Technol. Intell. Comput.

    (2007)