On evolutionary subspace clustering with symbiosis

  • Research Paper
  • Published in: Evolutionary Intelligence

Abstract

Subspace clustering identifies the attribute support for each cluster as well as the location and number of clusters. In the most general case, the attributes associated with each cluster could be unique. A multi-objective evolutionary method is proposed to identify the unique attribute support of each cluster while detecting its data instances. The proposed algorithm, symbiotic evolutionary subspace clustering (S-ESC), borrows from ‘symbiosis’ in the sense that each clustering solution is defined in terms of a host (a single member of the host population) and a number of coevolved cluster centroids (symbionts in an independent symbiont population). Symbionts define clusters, and therefore attribute subspaces, whereas hosts define sets of clusters that constitute a non-degenerate solution. The symbiotic representation of S-ESC is the key to making it scalable to high-dimensional datasets, while an integrated subsampling process makes it scalable to tasks with a large number of data items. Benchmarking is performed against a test suite of 59 subspace clustering tasks and four well-known comparator algorithms from both the full-dimensional and subspace clustering literature: EM, MINECLUS, PROCLUS and STATPC. The performance of S-ESC was found to be robust across a wide cross-section of dataset properties with a common parameterization used throughout. This was not the case for the comparator algorithms: their performance could be sensitive to the particular data distribution, or parameter sweeps might be necessary to reach comparable performance. An additional evaluation is performed against a non-symbiotic GA, with S-ESC still returning superior clustering solutions.


Notes

  1. Queller identifies reproductive fission as the second mechanism by which major transitions appear.

  2. Similar conclusions are drawn in the other recent survey, by Sim et al. [36], albeit with further refinements to the categorization.

  3. PCA- and Hough-transform-based methods currently appear to be the dominant approaches for the case of arbitrarily oriented subspaces [18, 36].

  4. The earlier survey of Patrikainen and Meila recognized axis-aligned and non-axis-aligned categories [29].

  5. k-medoid and k-means algorithms have several similarities. The principal difference, however, is that the k-medoid algorithm defines centroids in terms of a sample of data instances, whereas the k-means algorithm defines clusters in terms of coordinates representing the centroids directly.

  6. http://dme.rwth-aachen.de/OpenSubspace/.

  7. For a recent review of evolutionary computation as applied to full-dimensional clustering see [14].

  8. The earlier evolutionary approach to subspace clustering attempted to first build cluster centroids and then describe clustering solutions through two independent cycles of evolution. This is difficult because the ‘performance’ of a cluster centroid depends on the clustering solution (a group of cluster centroids) to which it contributes, and such groups are undefined at the point of cluster-centroid evolution.

  9. For example, either the X-means [30] or the EM [22] clustering algorithm would be an appropriate choice.

  10. http://web.cs.dal.ca/~mheywood/Code/S-ESC.

  11. Outliers are data instances for which all attribute data values represent noise.

  12. Outlier points are labeled as an extra cluster in incremental datasets and are therefore straightforward to explicitly remove.

  13. http://web.cs.dal.ca/~mheywood/Code/S-ESC.

  14. 16-core Intel Xeon 2.67 GHz server, 48 GB RAM, Linux CentOS 5.5.

  15. Six parameter values {10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^-1} were considered for each of the three STATPC parameters, or a total of 6 × 6 × 6 = 216 parameter settings.

  16. This is to be expected given that EM is the only full-space clustering algorithm; however, it does not preclude better results following additional parameter optimization.

  17. Statistical significance holds for 6 of 7 datasets in Fig. 10a and 5 of 7 in Fig. 10b.

  18. For example, the time complexity of connectivity (without subsampling) can be reduced to O(M N log N) following [39], where M is the number of nearest neighbours needed for each instance.

  19. Recall that when k exceeds the actual number of clusters, MINECLUS need not use all the k clusters specified a priori.

References

  1. Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In ACM SIGMOD International conference on management of data, pp 61–72

  2. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec 27:94–105

  3. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In ACM International conference on very large data bases, pp 487–499

  4. Assent I, Krieger R, Steffens A, Seidl T (2006) A novel biology inspired model for evolutionary subspace clustering. In Proceedings of the annual symposium on nature inspired smart information systems (NiSIS)

  5. Bacquet C, Zincir-Heywood AN, Heywood MI (2011) Genetic optimization and hierarchical clustering applied to encrypted traffic identification. In IEEE symposium on computational intelligence in cyber security, pp 194–201

  6. Boudjeloud-Assala L, Blansché A (2012) Iterative evolutionary subspace clustering. In International Conference on neural information processing (ICONIP), pp 424–431. Springer

  7. Calcott B, Sterelny K, Szathmáry E (2001) The major transitions in evolution revisited. The Vienna series in theoretical biology. MIT Press, Cambridge

  8. Cho H, Dhillon IS, Guan Y, Sra S (2004) Minimum sum-squared residue co-clustering of gene expression data. In SIAM International conference on data mining

  9. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evolut Comput 6(2):182–197

  10. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B (Methodological) 39(1):1–38

  11. Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the 21st international conference on machine learning, pp 36–. ACM

  12. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor 11(1):10–18

  13. Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evolut Comput 11(1):56–76

  14. Hruschka ER, Campello RJGB, Freitas AA, de Carvalho ACPLF (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern Part C 39(2):133–155

  15. Jensen MT (2003) Reducing the run-time complexity of multiobjective EAs: The NSGA-II and other algorithms. IEEE Trans Evolut Comput 7(5):503–515

  16. Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral bi-clustering of microarray data: co-clustering genes and conditions. Genome Res 13:703–716

  17. Kriegel H-P, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58

  18. Kriegel H-P, Kröger P, Zimek A (2012) Subspace clustering. WIREs Data Mining Knowl Discov 2:351–364

  19. Liebovitch L, Toth T (1989) A fast algorithm to determine fractal dimensions by box counting. Phys Lett A 141(8)

  20. Lu Y, Wang S, Li S, Zhou C (2011) Particle swarm optimizer for variable weighting in clustering high-dimensional data. Machine Learn 82(1):43–70

  21. Margulis L, Fester R (1991) Symbiosis as a source of evolutionary innovation. MIT Press, Cambridge

  22. McLachlan G, Krishnan T (1997) The EM algorithm and extensions. Wiley-Interscience

  23. Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In ACM International conference on knowledge discovery and data mining, pp 533–541

  24. Moise G, Zimek A, Kröger P, Kriegel H-P, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inform Syst 21:299–326

  25. Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Int Conf Very Large Data Bases 2:1270–1281

  26. Nourashrafeddin S, Arnold D, Milios E (2012) An evolutionary subspace clustering algorithm for high-dimensional data. In Proceedings of the ACM genetic and evolutionary computation conference companion, pp 1497–1498

  27. Okasha S (2005) Multilevel selection and the major transitions in evolution. Philos Sci 72:1013–1025

  28. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6:90–105

  29. Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18:902–916

  30. Pelleg D, Moore AW et al (2000) X-means: extending k-means with efficient estimation of the number of clusters. In International conference on machine learning, pp 727–734

  31. Procopiuc CM, Jones M, Agarwal PK, Murali TM (2002) A Monte Carlo algorithm for fast projective clustering. In ACM International conference on management of data (SIGMOD ’02), pp 418–427

  32. Queller DC (2000) Relatedness and the fraternal major transitions. Philos Trans R Soc Lond B 355:1647–1655

  33. Rachmawati L, Srinivasan D (2009) Multiobjective evolutionary algorithm with controllable focus on the knees of the pareto front. IEEE Trans Evolut Comput 13(4):810–824

  34. Sarafis IA, Trinder PW, Zalzala AMS (2003) Towards effective subspace clustering with an evolutionary algorithm. In IEEE Congress on Evolutionary Computation, pp 797–806

  35. Schütze O, Laumanns M, Coello CAC (2008) Approximating the knee of an MOP with stochastic search algorithms. In Parallel problem solving from nature, volume 5199 of LNCS, pp 795–804

  36. Sim K, Gopalkrishnan V, Zimek A, Cong G (2012) A survey on enhanced subspace clustering. Data Mining Knowl Discov 26:332–397

  37. Vahdat A, Heywood MI, Zincir-Heywood AN (2010) Bottom-up evolutionary subspace clustering. In IEEE Congress on Evolutionary Computation, pp 1371–1378

  38. Vahdat A, Heywood MI, Zincir-Heywood AN (2012) Symbiotic evolutionary subspace clustering. In IEEE Congress on Evolutionary Computation, pp 2724–2731

  39. Vaidya PM (1989) An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete Comput Geom 4(1):101–115

  40. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann

  41. Wu SX, Banzhaf W (2011) A hierarchical cooperative evolutionary algorithm. In ACM Genetic and Evolutionary Computation Conference, pp 233–240

  42. Yiu ML, Mamoulis N (2003) Frequent-pattern based iterative projected clustering. In IEEE International conference on data mining, p 689

  43. Zhu L, Cao L, Yang J (2012) Multiobjective evolutionary algorithm-based soft subspace clustering. In IEEE Congress on Evolutionary Computation, pp 2732–2739

Acknowledgment

The authors gratefully acknowledge support from the NSERC Discovery grant, NSERC RTI and CFI New Opportunities programs (Canada).

Author information

Correspondence to Ali Vahdat.

Appendix

1.1 Flat evolutionary subspace clustering

Flat evolutionary subspace clustering (F-ESC) is a simplified version of S-ESC in which the two-level (hierarchical) symbiotic representation is replaced with a single-level flat representation, thereby eliminating the symbiotic relationship. Apart from removing the symbiotic process and replacing the single-level mutation operator with a crossover operator, everything is shared. Grid generation, multi-objective evolutionary optimization using the compactness and connectivity objectives, the atomic mutation operators that remove and add attributes and modify the 1-d centroid within an attribute, subsampling and knee detection are all used in F-ESC as per S-ESC. This section characterizes the two main differences between the ESC variants: the representation and the crossover operator.

1.1.1 Representation

The two-level representation of S-ESC is simplified (and condensed) into a flat single-level representation as shown in Fig. 17. Each individual encodes all k cluster centroids necessary for a partitioning; the chromosome is therefore composed of k cluster centroids laid out as a single string of integer pairs. Similar to the symbiont representation in S-ESC, an F-ESC chromosome has two connected integer strings: one indexing attributes, and one indexing 1-d cluster centroids within each attribute. Both integer strings start with the number of cluster centroids the individual encodes (k). The k cluster centroids are then encoded similarly to an S-ESC symbiont representation, with one minor change: the first integer of each cluster centroid is the number of attributes the cluster centroid supports, in both the attribute and 1-d centroid strings. In other words, L_1, L_2 and L_k in Fig. 17 represent the attribute counts for cluster centroids 1, 2 and k respectively; a_{1,1} and a_{1,L_1} are the first and last attributes for cluster centroid 1, whereas c_{1,1} and c_{1,L_1} are the first and last 1-d centroids for cluster centroid 1.

The same limits are set for F-ESC with respect to the minimum and maximum number of clusters in a dataset and attributes per cluster. The range [2, 20] is selected for both constraints, which means that a single individual can vary between 4 and 400 integer pairs in length. A sketch of the encoding follows.
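As an illustration only, the following Python sketch builds one such flat chromosome; the function name and the grid_values structure (the per-attribute candidate 1-d centroids produced by grid generation) are our own assumptions, not the original implementation.

    import random

    MIN_CLUSTERS, MAX_CLUSTERS = 2, 20  # clusters per individual
    MIN_ATTRS, MAX_ATTRS = 2, 20        # attributes per cluster centroid

    def random_flat_individual(num_attributes, grid_values):
        """Build an F-ESC chromosome as two connected integer strings.

        grid_values[a] lists the candidate 1-d centroid indices for
        attribute a, i.e. the genetic material from grid generation.
        Assumes num_attributes >= MAX_ATTRS.
        """
        k = random.randint(MIN_CLUSTERS, MAX_CLUSTERS)
        attr_string, cent_string = [k], [k]  # both strings begin with k
        for _ in range(k):
            L = random.randint(MIN_ATTRS, MAX_ATTRS)  # attribute count L_i
            attrs = random.sample(range(num_attributes), L)
            attr_string.append(L)  # L_i heads the centroid block...
            cent_string.append(L)  # ...in both strings
            for a in attrs:
                attr_string.append(a)                              # a_{i,j}
                cent_string.append(random.choice(grid_values[a]))  # c_{i,j}
        return attr_string, cent_string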

1.1.2 Crossover

The second main difference between S-ESC and F-ESC—due to the ‘flat’ representation—is the use of a crossover operator in place of the single-level mutation operator of S-ESC. The single-level mutation operator in S-ESC is responsible for removing / adding / swapping symbionts (cluster centroids, CC) from / to / between hosts (clustering solutions, CS). The crossover operator in F-ESC performs essentially the same modification between two flat individuals. The variation operator is a 2-point crossover which swaps one or more cluster centroids from parent a with one or more cluster centroids from parent b. A repair mechanism ensures that the offspring meet the cluster-limit constraints.
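A minimal sketch of this centroid-level 2-point crossover follows, assuming each chromosome has been decoded into a list of cluster centroids (each a list of (attribute, 1-d centroid) pairs); the truncation/duplication repair shown is one plausible reading, not necessarily the authors’ mechanism.

    import random

    MIN_CLUSTERS, MAX_CLUSTERS = 2, 20

    def two_point_crossover(parent_a, parent_b):
        """Swap a contiguous run of cluster centroids between two parents."""
        i, j = sorted(random.sample(range(len(parent_a) + 1), 2))
        m, n = sorted(random.sample(range(len(parent_b) + 1), 2))
        child_a = parent_a[:i] + parent_b[m:n] + parent_a[j:]
        child_b = parent_b[:m] + parent_a[i:j] + parent_b[n:]
        return repair(child_a), repair(child_b)

    def repair(child):
        """Enforce the [MIN_CLUSTERS, MAX_CLUSTERS] cluster-count limits."""
        if len(child) > MAX_CLUSTERS:
            child = random.sample(child, MAX_CLUSTERS)  # drop surplus centroids
        while len(child) < MIN_CLUSTERS:
            child.append(random.choice(child))  # duplicate an existing centroid
        return child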

1.1.3 Similarities

The remaining components and sub-components of F-ESC are the same as in S-ESC. Grid generation is the pre-processing component, its output being the genetic material used by the evolutionary process. F-ESC employs the same EMO algorithm using both the compactness and connectivity objectives. The selection operator is a tournament between four individuals in which the best individual is selected as parent a and the runner-up as parent b; selection is thus elitist. The same atomic mutation operators are implemented to remove and add attributes to a randomly selected cluster centroid within an individual, with a third mutation operator modifying the 1-d centroid of a randomly selected attribute within a cluster centroid.
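For concreteness, the four-way elitist tournament might look as follows; rank_key is a placeholder of our own for the EMO ordering (e.g. Pareto rank then crowding distance as in NSGA-II).

    import random

    def select_parents(population, rank_key):
        """Four-way tournament: best is parent a, runner-up is parent b.

        rank_key orders individuals (smaller is better), e.g. by Pareto
        rank and crowding distance.
        """
        contestants = random.sample(population, 4)
        contestants.sort(key=rank_key)
        return contestants[0], contestants[1]  # parent a, parent b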

To retain robustness with respect to dataset cardinality, the same subsampling process is used in F-ESC: M points are randomly selected anew at each generation and individuals’ objectives are evaluated against this set instead of the whole dataset. This set (the active set) is refreshed at the end of each generation. Once the evolutionary process has produced a pool of solutions, the knee-detection procedure suggested by [35] identifies the knee solution as the champion solution.
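A sketch of the per-generation active-set refresh; the evaluate and vary hooks are hypothetical stand-ins for objective evaluation and the selection/variation step.

    import random

    def evolve(population, dataset, M, generations, evaluate, vary):
        """Evaluate objectives against a fresh 'active set' of M points.

        evaluate(individual, points) -> (compactness, connectivity)
        vary(population)             -> population after selection/variation
        """
        for _ in range(generations):
            active_set = random.sample(dataset, M)  # refreshed each generation
            for individual in population:
                individual.objectives = evaluate(individual, active_set)
            population = vary(population)
        return population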

1.2 Statistical tests

Tests were performed to support or reject the hypothesis that the performance of the S-ESC solutions is drawn from the same distribution as that of the comparator methods. The following tables report p values at a confidence level of 99 %; hence values smaller than 0.01 imply that the distributions are distinct with a confidence of 99 %. The tests do not say whether S-ESC outperforms the comparator method or vice versa, only whether the two are statistically different at the 99 % confidence level. Results in bold indicate that it is not possible to reject the hypothesis (i.e. neither S-ESC nor the comparator is outperformed by the other method), whereas in most cases the hypothesis is rejected (i.e. one of the methods is outperformed by the other).

Some care is necessary in the case of distributions concentrated about extreme values; the * symbol is therefore used to denote the use of a single-tailed test rather than a two-tailed test. Similarly, in the case of the outlier datasets a Normal distribution could not be assumed, so the Kruskal–Wallis non-parametric hypothesis test was used in place of the Student’s t-test. ‘NaN’ values imply that the comparator algorithm failed to provide any results for the task within the given time.
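As an illustration of this testing protocol (not the authors’ scripts), each table cell could be produced along the following lines with SciPy; the 0.01 threshold corresponds to the 99 % confidence level above.

    from scipy import stats

    ALPHA = 0.01  # 99 % confidence level

    def compare_runs(sesc_scores, comparator_scores, normal=True):
        """Test whether two samples of F-measures share a distribution.

        Student's t-test when normality holds; otherwise the
        Kruskal-Wallis non-parametric test (as for the outlier datasets).
        """
        if normal:
            _, p = stats.ttest_ind(sesc_scores, comparator_scores)
        else:
            _, p = stats.kruskal(sesc_scores, comparator_scores)
        return p, p < ALPHA  # p value and significance at 99 %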

Note that the numbers in parentheses in Table 3 define the specific parameter value for the dataset to be tested. For the GEGDUE and UD experiments, it is the average dimensionality of the dataset. For the D, N and k experiments, it is the dimensionality, cardinality and cluster count of the dataset, respectively. For the Extent experiment, it is the spread of values for relevant attributes. For the Overlap experiment, it is the overlap between the relevant attributes of the different clusters, and for the ClusterSize experiment, it is the average instance count of the clusters.

Out of 54 × 4 = 216 tests between S-ESC and the comparator methods on the incremental datasets of Table 1, there were 9 cases (approximately 4 %) in which the comparator algorithms (MINECLUS, STATPC and EM) did not return results (NaNs in Table 3) and 32 cases (approximately 15 %) in which there was no statistically significant difference between S-ESC and the comparator results (the bold cases in Table 3). There were 134 cases (approximately 62 %) in which S-ESC outperformed the comparator method in a statistically significant way, and only 41 cases (approximately 19 %) in which S-ESC was outperformed by a comparator algorithm.

Table 3 The t-test p values for F-measure significance of the incremental benchmarks of Moise et al. (Sect. 5; Table 1). Values in bold indicate that there is no statistically significant difference between S-ESC and the comparator method

Out of the 5 × 4 = 20 tests on the large-scale datasets of Table 2, there are 5 cases in which the comparator methods (MINECLUS and STATPC) fail to produce a result (NaNs in Table 4) and 5 cases in which the results are not significantly different (the bold cases in Table 4). In 8 cases S-ESC outperforms the comparator methods, and in only 2 cases is S-ESC outperformed by a comparator method. As before, the * symbol denotes the use of a single-tailed test rather than a two-tailed test.

Table 4 The t-test p values for F-measure significance in the large-scale benchmark (Sect. 5; Table 2). Values in bold indicate that there is no statistically significant difference between S-ESC and the comparator method


About this article

Cite this article

Vahdat, A., Heywood, M.I. On evolutionary subspace clustering with symbiosis. Evol. Intel. 6, 229–256 (2014). https://doi.org/10.1007/s12065-013-0103-1
