Projected memory clustering☆
Introduction
Clustering, one of the fundamental tools of data analysis, aims to partition data into homogeneous groups; in particular, it focuses on extracting non-trivial or hidden patterns from a set of objects. Although various clustering algorithms have been proposed for grouping low-dimensional data [19], [25], [29], [38], they are difficult to apply in high dimensional spaces. Nevertheless, high dimensional data is of great importance in pattern recognition, natural language processing, computational biology, etc. [11], [26].
Subspace clustering is a class of algorithms that work well for high dimensional problems and focus on detecting groups described by arbitrary affine subspaces [20], [30], [37]. The arbitrary choice of affine subspaces negatively affects the computational complexity, which partially limits the practical applicability of such methods to big data. Projected clustering instead uses affine subspaces whose axes are parallel to the elements of the coordinate basis, see Fig. 1, which reduces the computational cost of the algorithm [17], [39]. In other words, projected clustering algorithms define a projected cluster as a pair (X, Y), where X is a subset of data points and Y is a subset of their attributes, such that the points in X are “close” when projected on the attributes in Y, but “not close” when projected on the remaining attributes, see Fig. 1. In consequence, every cluster is described by its most informative attributes, which is partially related to co-clustering [8], [24], [33].
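As a minimal numerical illustration of this definition (a toy construction of ours, not an example from the paper), the sketch below builds a single projected cluster in R^10 that is tight on the attribute subset Y = {0, 1, 2} and spread on the remaining attributes; within the cluster, coordinate-wise variance already exposes Y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy projected cluster: 200 points in R^10, tight on Y = {0, 1, 2}
# ("close" when projected on Y) and spread uniformly elsewhere
# ("not close" on the remaining attributes).
n, d, Y = 200, 10, [0, 1, 2]
X = rng.uniform(-5.0, 5.0, size=(n, d))
X[:, Y] = 1.0 + 0.05 * rng.standard_normal((n, len(Y)))

# Coordinate-wise variance separates the two attribute groups: it is
# tiny on the relevant attributes and large on the irrelevant ones.
var = X.var(axis=0)
print(sorted(np.argsort(var)[:3]))   # -> [0, 1, 2], the attributes in Y
```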
In this paper we propose an efficient projected clustering algorithm, PMC (projected memory clustering), which can process high dimensional data with more than 10^6 attributes. It adapts a recent state-of-the-art subspace clustering algorithm, SuMC [28], to the projected case. Optimizing the PMC objective function requires computing coordinate-wise variances (instead of cluster eigenvalues, as in SuMC), which is linear with respect to both the data dimension and the number of samples, see Theorem 3.1. Theoretical details of PMC, together with an optimization algorithm, are given in Section 3.
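The linearity claim can be made concrete with a small sketch (our illustration; the array sizes are arbitrary). The per-cluster statistic a coordinate-wise criterion needs is the vector of attribute variances, computable in one pass over the data, whereas a SuMC-style subspace step requires the eigenvalues of an N × N covariance matrix, which is out of reach for N on the order of 10^6:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 5_000))   # n samples, N attributes

# Coordinate-wise variances: one pass, O(n * N) time and O(N) memory,
# so it still scales when N grows towards ~10^6 attributes.
var = X.var(axis=0)

# The eigenvalue route costs O(n * N^2 + N^3) time and O(N^2) memory for
# the covariance matrix alone; already heavy at N = 5000 and hopeless at
# N ~ 10^6 (the matrix would need terabytes of storage):
# eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
```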
Experiments performed on synthetic and real datasets show that PMC recovers the original data structure (measured by the Adjusted Rand Index) better than related projected clustering methods, see Section 4. Moreover, it is competitive with, or even better than, state-of-the-art subspace clustering methods on high dimensional data, which is of great practical importance in big data applications, see Fig. 2. To briefly illustrate its effect, we present the results of PMC applied to the MNIST data. Fig. 2 shows the coordinates used to describe each axis-parallel cluster (we use ten clusters). For better visualization, we also present the mean of each cluster. The confusion matrix, presented in Table 1, demonstrates that PMC was able to identify correct patterns for most clusters (except digits 4, 5, and 7). One can also discover nonlinear structures by applying PMC to a data set transformed by nonlinear basis functions, such as RBFs (radial basis functions) [23], see Fig. 3.
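As a rough sketch of this nonlinear route (a toy construction of ours; Fig. 3 refers to the paper's own experiment), two concentric rings form no axis-parallel clusters in R^2, but an RBF feature map built on a handful of data points as centers yields a representation on which a projected method can then operate:

```python
import numpy as np

def rbf_features(X, centers, gamma=1.0):
    """Map each point x to (exp(-gamma * ||x - c||^2))_c over the centers c,
    a standard RBF feature map; clustering is then run on the result."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Two concentric rings: no axis-parallel cluster structure in R^2.
rng = np.random.default_rng(3)
t = rng.uniform(0.0, 2.0 * np.pi, 400)
r = np.where(rng.random(400) < 0.5, 1.0, 3.0)
X = np.c_[r * np.cos(t), r * np.sin(t)]

# Transform into a 20-dimensional RBF feature space.
Z = rbf_features(X, centers=X[rng.choice(400, 20, replace=False)], gamma=0.5)
print(Z.shape)   # (400, 20)
```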
We summarize the main contributions of our paper:
- 1.
We modify the SuMC objective function to define a clustering model that discovers axis-parallel affine subspaces. This allows each cluster to be interpreted through its most informative features, analogously to co-clustering.
- 2.
We propose an extremely efficient algorithm for its optimization, which can easily process data with more than 10^6 attributes.
- 3.
Experiments performed on artificial data show its suitability for detecting axis-parallel clusters and confirm its low computational complexity.
- 4.
An experimental study demonstrates that PMC gives better results than state-of-the-art projected and subspace clustering methods on very high dimensional data, which is crucial in practical use cases and big data applications.
Section snippets
Related works
Subspace clustering has received considerable attention in recent years due to the growing number of high dimensional practical problems [20], [27], [30], [37]. Prior work includes iterative methods [3], which alternate between assigning the data points to the identified subspaces and updating the subspaces. Algebraic approaches aim to describe clusters using polynomials whose gradients at a point are orthogonal to the subspace containing that point [31]. Variations of spectral clustering focus on
Projected clustering model
Our method can be understood as a modification of SuMC [28], a recent subspace clustering method. Instead of looking for an arbitrary subspace for each cluster, we restrict our attention to subspaces parallel to the main axes of the canonical basis, and therefore obtain an axis-parallel projected clustering method. More precisely, we use an affine subspace defined as

mean(X_i) + span(e_{j_1}, …, e_{j_m}),

where mean(X_i) is the mean of X_i, m ≤ N, and (e_j)_j is the canonical basis of ℝ^N.
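Under this reading of the model (a sketch of ours, not the paper's implementation), projecting a point onto such a subspace keeps the coordinates j_1, …, j_m and replaces the remaining ones with the cluster mean; the mean squared projection error then equals the sum of coordinate-wise variances over the dropped axes, which is why the optimization only needs those variances:

```python
import numpy as np

def project_axis_parallel(X, S):
    """Project each row of X onto mean(X) + span(e_j : j in S):
    coordinates in S are kept, the rest are replaced by the mean."""
    P = np.tile(X.mean(axis=0), (X.shape[0], 1))
    P[:, S] = X[:, S]
    return P

# Sanity check: mean squared projection error == sum of coordinate-wise
# variances over the dropped axes.
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 8))
S = [0, 2, 5]
dropped = [j for j in range(8) if j not in S]
mse = ((X - project_axis_parallel(X, S)) ** 2).sum(axis=1).mean()
print(np.isclose(mse, X.var(axis=0)[dropped].sum()))   # True
```

When this error is minimized for a fixed m, the dropped coordinates are the low-variance ones — precisely the attributes on which the cluster's points stay close to the mean, i.e. the "close" attributes in the sense of the introduction.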
Our basic
Experiments
In this section we present the evaluation of our method, implemented in C. All experiments were run on an Ubuntu 16.04 (64-bit) workstation with a 3.3 GHz quad-core Intel Xeon processor and 32 GB of RAM.
We compare our method with leading state-of-the-art projected clustering approaches: PROCLUS, P3C [17], PreDeCon [2] (implemented in Java, Elki
Conclusion
In this paper, we have presented PMC, a new algorithm for finding axis-parallel clusters. Making use of compression theory, we obtained a method that automatically detects the optimal dimensions of clusters. Moreover, the axis-parallel character of the clusters yields the most informative coordinates within each cluster. Extensive experiments performed on various types of data showed that PMC detects clustering structures better than related projected clustering methods, in a reasonable amount of time.
Acknowledgements
The work of P. Spurek was supported by the National Centre of Science (Poland) Grant No. UMO-2015/19/D/ST6/01472. The work of J. Tabor, Ł. Struski and M. Śmieja was supported by the National Centre of Science (Poland) Grant No. UMO-2017/25/B/ST6/01271.
References (39)
- et al., Three learning phases for radial-basis-function networks, Neural Netw. (2001)
- et al., Lossy compression approach to subspace clustering, Inf. Sci. (2018)
- et al., Cross-entropy clustering, Pattern Recognit. (2014)
- et al., Fast algorithms for projected clustering, ACM SIGMOD Record (1999)
- et al., Density connected clustering with local subspace preferences, in: Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04) (2004)
- et al., K-plane clustering, J. Global Optim. (2000)
- et al., Locally adaptive metrics for clustering high dimensional data, Data Min. Knowl. Discov. (2007)
- et al., Sparse subspace clustering: algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
- et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD (1996)
- The Minimum Description Length Principle (2007)
- Metacluster-based projective clustering ensembles, Mach. Learn.
- Comparing partitions, J. Classif.
- Principal Component Analysis
- Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Discov. Data (TKDD)
- Learning multiple layers of features from tiny images, Technical Report
- Gradient-based learning applied to document recognition, Proc. IEEE
- Robust and efficient subspace segmentation via least squares regression, in: European Conference on Computer Vision
- New routes from minimal approximation error to principal components, Neural Process. Lett.
- ☆
Conflicts of Interest Statement: The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements) or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.