On evolutionary subspace clustering with symbiosis

  • Research Paper
  • Published in: Evolutionary Intelligence

Abstract

Subspace clustering identifies the attribute support for each cluster as well as the location and number of clusters. In the most general case, the attributes associated with each cluster could be unique. A multi-objective evolutionary method is proposed to identify the unique attribute support of each cluster while detecting its data instances. The proposed algorithm, symbiotic evolutionary subspace clustering (S-ESC), borrows from ‘symbiosis’ in the sense that each clustering solution is defined in terms of a host (a single member of the host population) and a number of coevolved cluster centroids (symbionts in an independent symbiont population). Symbionts define clusters, and therefore attribute subspaces, whereas hosts define sets of clusters that constitute a non-degenerate solution. The symbiotic representation of S-ESC is the key to making it scalable to high-dimensional datasets, while an integrated subsampling process makes it scalable to tasks with a large number of data items. Benchmarking is performed against a test suite of 59 subspace clustering tasks and four well-known comparator algorithms from both the full-dimensional and subspace clustering literature: EM, MINECLUS, PROCLUS and STATPC. The performance of S-ESC was found to be robust across a wide cross-section of dataset properties with a common parameterization used throughout. This was not the case for the comparator algorithms: their performance could be sensitive to the particular data distribution, or parameter sweeps might be necessary to reach comparable performance. An additional evaluation is performed against a non-symbiotic GA, with S-ESC still returning superior clustering solutions.


Notes

  1. Queller identifies reproductive fission as the second mechanism by which major transitions appear.

  2. Similar conclusions are drawn in the other recent survey, by Sim et al. [36], albeit with further refinements to the categorization.

  3. PCA- and Hough-transform-based methods currently appear to be the dominant approaches for the case of arbitrarily oriented subspaces [18, 36].

  4. The earlier survey of Patrikainen and Meila recognized axis-aligned and non-axis-aligned categories [29].

  5. k-medoid and k-means algorithms have several similarities. The principal difference, however, is that the k-medoid algorithm defines centroids in terms of a sample of data instances, whereas the k-means algorithm defines clusters in terms of coordinates representing the centroids directly.

  6. http://dme.rwth-aachen.de/OpenSubspace/.

  7. For a recent review of evolutionary computation as applied to full-dimensional clustering see [14].

  8. The earlier evolutionary approach to subspace clustering attempted to first build cluster centroids and then describe clustering solutions through two independent cycles of evolution. This is difficult because the ‘performance’ of a cluster centroid depends on the clustering solution (a group of cluster centroids) to which it contributes, and such groups are undefined at the point of cluster-centroid evolution.

  9. For example, either the X-means [30] or the EM [22] clustering algorithm would be an appropriate choice.

  10. http://web.cs.dal.ca/~mheywood/Code/S-ESC.

  11. Outliers are data instances for which all attribute data values represent noise.

  12. Outlier points are labeled as an extra cluster in incremental datasets and are therefore straightforward to explicitly remove.

  13. http://web.cs.dal.ca/~mheywood/Code/S-ESC.

  14. 16-core Intel Xeon 2.67 GHz server, 48 GB RAM, Linux CentOS 5.5.

  15. Six parameter values {10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^-1} were considered for each of the three STATPC parameters, or a total of 6 × 6 × 6 = 216 parameter settings.

  16. This is to be expected given that EM is the only full-space clustering algorithm; however, it does not preclude better results following additional parameter optimization.

  17. Statistical significance holds for 6 of 7 datasets in Fig. 10a and 5 of 7 in Fig. 10b.

  18. For example, the time complexity of connectivity (without subsampling) can be reduced to O(M N log N) following [39], where M is the number of nearest neighbours needed for each instance.

  19. Recall that when k exceeds the actual number of clusters, MINECLUS need not use all the k clusters specified a priori.

References

  1. Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In ACM SIGMOD International conference on management of data, pp 61–72

  2. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec 27:94–105

  3. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In ACM International conference on very large data bases, pp 487–499

  4. Assent I, Krieger R, Steffens A, Seidl T (2006) A novel biology inspired model for evolutionary subspace clustering. In Proceedings of the annual symposium on nature inspired smart information systems (NiSIS)

  5. Bacquet C, Zincir-Heywood AN, Heywood MI (2011) Genetic optimization and hierarchical clustering applied to encrypted traffic identification. In IEEE symposium on computational intelligence in cyber security, pp 194–201

  6. Boudjeloud-Assala L, Blansché A (2012) Iterative evolutionary subspace clustering. In International Conference on neural information processing (ICONIP), pp 424–431. Springer

  7. Calcott B, Sterelny K, Szathmáry E (2001) The major transitions in evolution revisited. The Vienna series in theoretical biology. MIT Press, Cambridge

  8. Cho H, Dhillon IS, Guan Y, Sra S (2004) Minimum sum-squared residue co-clustering of gene expression data. In SIAM International conference on data mining

  9. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evolut Comput 6(2):182–197

  10. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B (Methodological) 39(1):1–38

  11. Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the 21st international conference on machine learning, pp 36–. ACM

  12. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor 11(1):10–18

  13. Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evolut Comput 11(1):56–76

  14. Hruschka ER, Campello RJGB, Freitas AA, de Carvalho ACPLF (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern Part C 39(2):133–155

  15. Jensen MT (2003) Reducing the run-time complexity of multiobjective EAs: The NSGA-II and other algorithms. IEEE Trans Evolut Comput 7(5):503–515

  16. Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral bi-clustering of microarray data: co-clustering genes and conditions. Genome Res 13:703–716

  17. Kriegel H-P, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58

  18. Kriegel H-P, Kröger P, Zimek A (2012) Subspace clustering. WIREs Data Mining Knowl Discov 2:351–364

  19. Liebovitch L, Toth T (1989) A fast algorithm to determine fractal dimensions by box counting. Phys Lett A 141(8)

  20. Lu Y, Wang S, Li S, Zhou C (2011) Particle swarm optimizer for variable weighting in clustering high-dimensional data. Machine Learn 82(1):43–70

  21. Margulis L, Fester R (1991) Symbiosis as a source of evolutionary innovation. MIT Press, Cambridge

  22. McLachlan G, Krishnan T (1997) The EM algorithm and extensions. Wiley-Interscience

  23. Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In ACM International conference on knowledge discovery and data mining, pp 533–541

  24. Moise G, Zimek A, Kröger P, Kriegel H-P, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inform Syst 21:299–326

  25. Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. Int Conf Very Large Data Bases 2:1270–1281

  26. Nourashrafeddin S, Arnold D, Milios E (2012) An evolutionary subspace clustering algorithm for high-dimensional data. In Proceedings of the ACM genetic and evolutionary computation conference companion, pp 1497–1498

  27. Okasha S (2005) Multilevel selection and the major transitions in evolution. Philos Sci 72:1013–1025

  28. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6:90–105

  29. Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18:902–916

  30. Pelleg D, Moore AW et al (2000) X-means: extending k-means with efficient estimation of the number of clusters. In International conference on machine learning, pp 727–734

  31. Procopiuc CM, Jones M, Agarwal PK, Murali TM (2002) A Monte Carlo algorithm for fast projective clustering. In ACM International conference on management of data (SIGMOD ’02), pp 418–427

  32. Queller DC (2000) Relatedness and the fraternal major transitions. Philos Trans R Soc Lond B 355:1647–1655

  33. Rachmawati L, Srinivasan D (2009) Multiobjective evolutionary algorithm with controllable focus on the knees of the pareto front. IEEE Trans Evolut Comput 13(4):810–824

  34. Sarafis IA, Trinder PW, Zalzala AMS (2003) Towards effective subspace clustering with an evolutionary algorithm. In IEEE Congress on Evolutionary Computation, pp 797–806

  35. Schütze O, Laumanns M, Coello CAC (2008) Approximating the knee of an MOP with stochastic search algorithms. In Parallel problem solving from nature, volume 5199 of LNCS, pp 795–804

  36. Sim K, Gopalkrishnan V, Zimek A, Cong G (2012) A survey on enhanced subspace clustering. Data Mining Knowl Discov 26:332–397

  37. Vahdat A, Heywood MI, Zincir-Heywood AN (2010) Bottom-up evolutionary subspace clustering. In IEEE Congress on Evolutionary Computation, pp 1371–1378

  38. Vahdat A, Heywood MI, Zincir-Heywood AN (2012) Symbiotic evolutionary subspace clustering. In IEEE Congress on Evolutionary Computation, pp 2724–2731

  39. Vaidya PM (1989) An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete Comput Geom 4(1):101–115

  40. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann

  41. Wu SX, Banzhaf W (2011) A hierarchical cooperative evolutionary algorithm. In ACM Genetic and Evolutionary Computation Conference, pp 233–240

  42. Yiu ML, Mamoulis N (2003) Frequent-pattern based iterative projected clustering. In IEEE International conference on data mining, p 689

  43. Zhu L, Cao L, Yang J (2012) Multiobjective evolutionary algorithm-based soft subspace clustering. In IEEE Congress on Evolutionary Computation, pp 2732–2739

Acknowledgment

The authors gratefully acknowledge support from the NSERC Discovery grant, NSERC RTI and CFI New Opportunities programs (Canada).

Author information

Correspondence to Ali Vahdat.

Appendix

1.1 Flat evolutionary subspace clustering

Flat evolutionary subspace clustering (F-ESC) is a simplified version of S-ESC in which the two-level (hierarchical) symbiotic representation is replaced with a single-level flat representation, thereby eliminating the symbiotic relationship. Apart from removing the symbiotic process and replacing the single-level mutation operator with a crossover operator, everything is shared. Grid generation, multi-objective evolutionary optimization using the compactness and connectivity objectives, the atomic mutation operators that remove and add attributes and modify the 1-d centroid within an attribute, subsampling and knee detection are all used in F-ESC as per S-ESC. This section characterizes the two main differences between the ESC variants: the representation and the crossover operator.

1.1.1 Representation

The two-level representation of S-ESC is simplified (and condensed) into a flat single-level representation as shown in Fig. 17. Each individual encodes all k cluster centroids necessary for a partitioning; the chromosome is therefore composed of k cluster centroids laid out as a single string of integer pairs. Similar to the symbiont representation in S-ESC, an F-ESC chromosome has two connected integer strings: one indexing attributes, and one indexing 1-d cluster centroids within each attribute. Both integer strings start with the number of cluster centroids the individual encodes (k). The k cluster centroids are then encoded similarly to an S-ESC symbiont representation, with one minor change: the first integer of each cluster centroid is the number of attributes the cluster centroid supports, in both the attribute and 1-d centroid strings. In other words, L_1, L_2 and L_k in Fig. 17 represent the attribute counts for cluster centroids 1, 2 and k respectively; a_{1,1} and a_{1,L_1} are the first and last attributes for cluster centroid 1, whereas c_{1,1} and c_{1,L_1} are the first and last 1-d centroids for cluster centroid 1.

The same limits are set for F-ESC with respect to the minimum and maximum number of clusters in a dataset and attributes per cluster. The range [2, 20] is selected for both constraints, which means that a single individual can vary between 4 and 400 integer pairs in length. A sketch of the encoding follows.
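As an illustration only, the following Python sketch builds one such flat chromosome; the function name and the grid_values structure (the per-attribute candidate 1-d centroids produced by grid generation) are our own assumptions, not the original implementation.

    import random

    MIN_CLUSTERS, MAX_CLUSTERS = 2, 20  # clusters per individual
    MIN_ATTRS, MAX_ATTRS = 2, 20        # attributes per cluster centroid

    def random_flat_individual(num_attributes, grid_values):
        """Build an F-ESC chromosome as two connected integer strings.

        grid_values[a] lists the candidate 1-d centroid indices for
        attribute a, i.e. the genetic material from grid generation.
        Assumes num_attributes >= MAX_ATTRS.
        """
        k = random.randint(MIN_CLUSTERS, MAX_CLUSTERS)
        attr_string, cent_string = [k], [k]  # both strings begin with k
        for _ in range(k):
            L = random.randint(MIN_ATTRS, MAX_ATTRS)  # attribute count L_i
            attrs = random.sample(range(num_attributes), L)
            attr_string.append(L)  # L_i heads the centroid block...
            cent_string.append(L)  # ...in both strings
            for a in attrs:
                attr_string.append(a)                              # a_{i,j}
                cent_string.append(random.choice(grid_values[a]))  # c_{i,j}
        return attr_string, cent_string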

1.1.2 Crossover

The second main difference between S-ESC and F-ESC—due to the ‘flat’ representation—is the use of a crossover operator in place of the single-level mutation operator of S-ESC. The single-level mutation operator in S-ESC is responsible for removing / adding / swapping symbionts (cluster centroids, CC) from / to / between hosts (clustering solutions, CS). The crossover operator in F-ESC performs essentially the same modification between two flat individuals. The variation operator is a 2-point crossover which swaps one or more cluster centroids from parent a with one or more cluster centroids from parent b. A repair mechanism ensures that the offspring meet the cluster-limit constraints.
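A minimal sketch of this centroid-level 2-point crossover follows, assuming each chromosome has been decoded into a list of cluster centroids (each a list of (attribute, 1-d centroid) pairs); the truncation/duplication repair shown is one plausible reading, not necessarily the authors’ mechanism.

    import random

    MIN_CLUSTERS, MAX_CLUSTERS = 2, 20

    def two_point_crossover(parent_a, parent_b):
        """Swap a contiguous run of cluster centroids between two parents."""
        i, j = sorted(random.sample(range(len(parent_a) + 1), 2))
        m, n = sorted(random.sample(range(len(parent_b) + 1), 2))
        child_a = parent_a[:i] + parent_b[m:n] + parent_a[j:]
        child_b = parent_b[:m] + parent_a[i:j] + parent_b[n:]
        return repair(child_a), repair(child_b)

    def repair(child):
        """Enforce the [MIN_CLUSTERS, MAX_CLUSTERS] cluster-count limits."""
        if len(child) > MAX_CLUSTERS:
            child = random.sample(child, MAX_CLUSTERS)  # drop surplus centroids
        while len(child) < MIN_CLUSTERS:
            child.append(random.choice(child))  # duplicate an existing centroid
        return child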

1.1.3 Similarities

The remaining components and sub-components of F-ESC are the same as in S-ESC. Grid generation is the pre-processing component, its output being the genetic material used by the evolutionary process. F-ESC employs the same EMO algorithm using both the compactness and connectivity objectives. The selection operator is a tournament between four individuals in which the best individual is selected as parent a and the runner-up as parent b; selection is thus elitist. The same atomic mutation operators are implemented to remove and add attributes to a randomly selected cluster centroid within an individual, with a third mutation operator modifying the 1-d centroid of a randomly selected attribute within a cluster centroid.
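For concreteness, the four-way elitist tournament might look as follows; rank_key is a placeholder of our own for the EMO ordering (e.g. Pareto rank then crowding distance as in NSGA-II).

    import random

    def select_parents(population, rank_key):
        """Four-way tournament: best is parent a, runner-up is parent b.

        rank_key orders individuals (smaller is better), e.g. by Pareto
        rank and crowding distance.
        """
        contestants = random.sample(population, 4)
        contestants.sort(key=rank_key)
        return contestants[0], contestants[1]  # parent a, parent b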

To retain robustness with respect to dataset cardinality, the same subsampling process is used in F-ESC: M points are randomly selected anew at each generation and individuals’ objectives are evaluated against this set instead of the whole dataset. This set (the active set) is refreshed at the end of each generation. Once the evolutionary process has produced a pool of solutions, the knee-detection procedure suggested by [35] identifies the knee solution as the champion solution.
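A sketch of the per-generation active-set refresh; the evaluate and vary hooks are hypothetical stand-ins for objective evaluation and the selection/variation step.

    import random

    def evolve(population, dataset, M, generations, evaluate, vary):
        """Evaluate objectives against a fresh 'active set' of M points.

        evaluate(individual, points) -> (compactness, connectivity)
        vary(population)             -> population after selection/variation
        """
        for _ in range(generations):
            active_set = random.sample(dataset, M)  # refreshed each generation
            for individual in population:
                individual.objectives = evaluate(individual, active_set)
            population = vary(population)
        return population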

1.2 Statistical tests

Tests were performed to support or reject the hypothesis that the performance of the S-ESC solutions is drawn from the same distribution as that of the comparator methods. The following tables report p values at a confidence level of 99 %; hence values smaller than 0.01 imply that the distributions are distinct with a confidence of 99 %. The tests do not say whether S-ESC outperforms the comparator method or vice versa, only whether the two are statistically different at the 99 % confidence level. Results in bold indicate that it is not possible to reject the hypothesis (i.e. neither S-ESC nor the comparator is outperformed by the other method), whereas in most cases the hypothesis is rejected (i.e. one of the methods is outperformed by the other).

Some care is necessary in the case of distributions concentrated about extreme values; the * symbol is therefore used to denote the use of a single-tailed test rather than a two-tailed test. Similarly, in the case of the outlier datasets a Normal distribution could not be assumed, so the Kruskal–Wallis non-parametric hypothesis test was used in place of the Student’s t-test. ‘NaN’ values imply that the comparator algorithm failed to provide any results for the task within the given time.
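As an illustration of this testing protocol (not the authors’ scripts), each table cell could be produced along the following lines with SciPy; the 0.01 threshold corresponds to the 99 % confidence level above.

    from scipy import stats

    ALPHA = 0.01  # 99 % confidence level

    def compare_runs(sesc_scores, comparator_scores, normal=True):
        """Test whether two samples of F-measures share a distribution.

        Student's t-test when normality holds; otherwise the
        Kruskal-Wallis non-parametric test (as for the outlier datasets).
        """
        if normal:
            _, p = stats.ttest_ind(sesc_scores, comparator_scores)
        else:
            _, p = stats.kruskal(sesc_scores, comparator_scores)
        return p, p < ALPHA  # p value and significance at 99 %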

Note that the numbers in parentheses in Table 3 define the specific parameter value for the dataset to be tested. For the GEGDUE and UD experiments, it is the average dimensionality of the dataset. For the D, N and k experiments, it is the dimensionality, cardinality and cluster count of the dataset, respectively. For the Extent experiment, it is the spread of values for relevant attributes. For the Overlap experiment, it is the overlap between the relevant attributes of the different clusters, and for the ClusterSize experiment, it is the average instance count of the clusters.

Out of 54 × 4 = 216 tests between S-ESC and the comparator methods on the incremental datasets of Table 1, there were 9 cases (approximately 4 %) in which the comparator algorithms (MINECLUS, STATPC and EM) did not return results (NaNs in Table 3) and 32 cases (approximately 15 %) in which there was no statistically significant difference between S-ESC and the comparator results (the bold cases in Table 3). There were 134 cases (approximately 62 %) in which S-ESC outperformed the comparator method in a statistically significant way, and only 41 cases (approximately 19 %) in which S-ESC was outperformed by a comparator algorithm.

Table 3 The t-test p values for F-measure significance of the incremental benchmarks of Moise et al. (Sect. 5; Table 1). Values in bold indicate that there is no statistically significant difference between S-ESC and the comparator method

Out of the 5 × 4 = 20 tests on the large-scale datasets of Table 2, there are 5 cases in which the comparator methods (MINECLUS and STATPC) fail to produce a result (NaNs in Table 4) and 5 cases in which the results are not significantly different (the bold cases in Table 4). In 8 cases S-ESC outperforms the comparator methods, and in only 2 cases is S-ESC outperformed by a comparator method. As before, the * symbol denotes the use of a single-tailed test rather than a two-tailed test.

Table 4 The t-test p values for F-measure significance in the large-scale benchmark (Sect. 5; Table 2). Values in bold indicate that there is no statistically significant difference between S-ESC and the comparator method


About this article

Cite this article

Vahdat, A., Heywood, M.I. On evolutionary subspace clustering with symbiosis. Evol. Intel. 6, 229–256 (2014). https://doi.org/10.1007/s12065-013-0103-1
