Utilizing Structure-Rich Features to Improve Clustering

Schelling, Benjamin; Bauer, Lena Greta Marie; Behzadi, Sahar; Plant, Claudia

doi:10.1007/978-3-030-67658-2_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12457)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

For successful clustering, an algorithm needs to find the boundaries between clusters. While this is comparatively easy if the clusters are compact and non-overlapping, so that the boundaries are clearly defined, features in which the clusters blend into each other hinder clustering methods from correctly estimating these boundaries. Therefore, we aim to extract features that show clear cluster boundaries and thus enhance the cluster structure in the data. Our novel technique creates a condensed version of the data set that contains the structure important for clustering, but without the noise information. We demonstrate that this transformed data set is much easier to cluster, both for k-means and for various other algorithms. Furthermore, we introduce a deterministic initialisation strategy for k-means based on these structure-rich features.

Notes

1. We follow the argument given in [16] in regard to the explicit form of the derivative.

References

Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA (2007)
Celebi, M., Kingravi, H., Vela, P.: A comparative study of efficient initialisation methods for the K-Means clustering algorithm. Expert Syst. Appl. 40(1), 200–210 (2013)
Chronis, P., Athanasiou, S., Skiadopoulos, S.: Automatic clustering by detecting significant density dips in multiple dimensions. In: ICDM (2019)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum-Likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39(1), 1–22 (1977)
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD (1996)
Goebl, S., He, X., Plant, C., Böhm, C.: Finding the optimal subspace for clustering. In: ICDM (2014)
Guo, X., Gao, L., Liu, X., Yin, J.: Improved deep embedded clustering with local structure preservation. In: IJCAI (2017)
Hartigan, J.A., Hartigan, P.M.: The dip test of unimodality. Ann. Stat. 13(1), 70–84 (1985)
Jing, L., Ng, M.K., Huang, J.Z.: An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. In: TKDE (2007)
Kalogeratos, A., Likas, A.: Dip-means: an incremental clustering method for estimating the number of clusters. In: NIPS (2012)
Krause, A., Liebscher, V.: Multimodal projection pursuit using the dip statistic. Preprint-Reihe Mathematik (2005)
Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. In: TKDD (2009)
Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. (2008)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Berkeley Symposium on Math. Stat. and Prob. (1967)
McInnes, L., Healy, J., Melville, J.: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2018)
Maurus, S., Plant, C.: Skinny-dip: clustering in a sea of noise. In: KDD (2016)
Mautz, D., Ye, W., Plant, C., Böhm, C.: Towards an optimal subspace for k-means. In: KDD (2017)
Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: NIPS (2002)
Schelling, B., Plant, C.: DipTransformation: enhancing the structure of a dataset and thereby improving clustering. In: ICDM (2018)
Schelling, B., Plant, C.: Dataset-transformation: improving clustering by enhancing the structure with DipScaling and DipTransformation. In: KAIS (2019)
Sibson, R.: SLINK: an optimally efficient algorithm for the single-link cluster method. Comput. J. 16(1), 30–34 (1973)
Siffer, A., Fouque, P.A., Termier, A., Largouet, C.: Are your data gathered? In: KDD (2018)
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. JMLR 11, 2837–2854 (2010)
Wu, H., Gu, X.: Max-Pooling dropout for regularization of convolutional neural networks. In: ICONIP (2015)
Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: ICML (2016)
Yang, B., Fu, X., Sidiropoulos, N.: Learning from hidden traits: joint factor analysis and latent clustering. IEEE Trans. Signal Process. (2017)
Yang, B., Fu, X., Sidiropoulos, N., Hong, M.: Towards K-means-friendly spaces: simultaneous deep learning and clustering. In: ICML (2017)

Author information

Authors and Affiliations

MCML, Munich, Germany
Benjamin Schelling
Ludwig-Maximilians-Universität München, Munich, Germany
Benjamin Schelling
Faculty of Computer Science, University of Vienna, Vienna, Austria
Benjamin Schelling, Sahar Behzadi & Claudia Plant
ds:UniVie, Vienna, Austria
Lena Greta Marie Bauer & Claudia Plant

Corresponding author

Correspondence to Benjamin Schelling.

Editor information

Editors and Affiliations

Albert-Ludwigs-Universität, Freiburg, Germany
Frank Hutter
TU Darmstadt, Darmstadt, Germany
Kristian Kersting
Ghent University, Ghent, Belgium
Jefrey Lijffijt
Saarland University, Saarbrücken, Germany
Isabel Valera

About this paper

Cite this paper

Schelling, B., Bauer, L.G.M., Behzadi, S., Plant, C. (2021). Utilizing Structure-Rich Features to Improve Clustering. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_6

DOI: https://doi.org/10.1007/978-3-030-67658-2_6
Published: 25 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67657-5
Online ISBN: 978-3-030-67658-2
eBook Packages: Computer Science, Computer Science (R0)