
Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components

Journal of Classification

Abstract

This paper deals with a batch self-organizing map algorithm for data described by distributional-valued variables (DBSOM). Such variables take as values probability or frequency distributions on a numeric support. In accordance with the nature of the data, the loss function is based on the L2 Wasserstein distance, one of the most widely used metrics for comparing distributions in distributional data analysis. Moreover, to account for the different contributions of the variables, four adaptive versions of the DBSOM algorithm are proposed: relevance weights, one for each distributional-valued variable, are learned automatically in an additional step of the algorithm. Since the L2 Wasserstein metric allows the distance to be decomposed into two components, one related to the means and one related to the size and shape of the distributions, relevance weights are also learned automatically for each of these two components, so as to emphasize the importance of the different characteristics of the distributions on the distance value. The proposed algorithms are corroborated by applications to real distributional-valued data sets.
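For reference, the decomposition mentioned here, as recalled from Irpino and Verde (2015) (the symbols \(Q_{f}\), \(\mu_{f}\), \(Q_{f}^{c}\) are our notation, not taken from this page), writes the squared L2 Wasserstein distance between two distributions f and g as a mean component plus a size-and-shape component:

$$ d_{W}^{2}(f,g)={\int_{0}^{1}}\left( Q_{f}(t)-Q_{g}(t)\right)^{2}dt=(\mu_{f}-\mu_{g})^{2}+{\int_{0}^{1}}\left( Q_{f}^{c}(t)-Q_{g}^{c}(t)\right)^{2}dt, $$

where \(Q_{f}\) and \(Q_{g}\) are the quantile functions of f and g, \(\mu_{f}\) and \(\mu_{g}\) their means, and \(Q^{c}(t)=Q(t)-\mu\) the centered quantile functions; the second addend compares the distributions net of their locations.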


Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. We remark that gmj is distributional data because the corresponding quantile function Qmj is a weighted average of quantile functions. For further details, see Irpino and Verde (2015).

  2. It is available at https://github.com/Airpino/ADBSOM

  3. See the ?? for the proof.

  4. https://archive.ics.uci.edu/ml/datasets/daily+and+sports+activities

  5. https://cran.r-project.org/package=HistDAWass

References

  • Altun, K., Barshan, B., & Tunçel, O. (2010). Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10), 3605–3620.

  • Badran, F., Yacoub, M., & Thiria, S. (2005). Self-organizing maps and unsupervised classification. In G. Dreyfus (Ed.), Neural networks: Methodology and applications (pp. 379–442). Singapore: Springer.

  • Bao, C., Peng, H., He, D., & Wang, J. (2018). Adaptive fuzzy c-means clustering algorithm for interval data type based on interval-dividing technique. Pattern Analysis and Applications, 21, 803–812.

  • Barshan, B., & Yuksek, M. C. (2014). Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal, 57(11), 1649–1667.

  • Barshan, B., & Yurtman, A. (2016). Investigating inter-subject and inter-activity variations in activity recognition using wearable motion sensors. The Computer Journal, 59(9), 1345–1362.

  • Bock, H. H. (2002). Clustering algorithms and Kohonen maps for symbolic data. Journal of the Japanese Society of Computational Statistics, 15, 1–13.

  • Bock, H. H., & Diday, E. (2000). Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data. Berlin: Springer.

  • Cabanes, G., Bennani, Y., Destenay, R., & Hardy, A. (2013). A new topological clustering algorithm for interval data. Pattern Recognition, 46, 3030–3039.

  • Cabanes, G., Bennani, Y., Verde, R., & Irpino, A. (2021). On the use of Wasserstein metric in topological clustering of distributional data. https://arxiv.org/abs/2109.04301.

  • Campello, R. J. G. B., & Hruschka, E. R. (2006). A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems, 157(21), 2858–2875.

  • de Carvalho, F. A. T., & De Souza, R. M. C. R. (2010). Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognition Letters, 31, 430–443.

  • de Carvalho, F. A. T., & Lechevallier, Y. (2009). Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42(7), 1223–1236.

  • de Carvalho, F. A. T., Irpino, A., & Verde, R. (2015). Fuzzy clustering of distribution-valued data using an adaptive L2 Wasserstein distance. In 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 1–8). https://doi.org/10.1109/FUZZ-IEEE.2015.7337847.

  • de Carvalho, F. A. T., Bertrand, P., & Simões, E. C. (2016). Batch SOM algorithms for interval-valued data with automatic weighting of the variables. Neurocomputing, 182, 66–81.

  • Diday, E., & Govaert, G. (1977). Classification automatique avec distances adaptatives. RAIRO Informatique/Computer Science, 11(4), 329–349.

  • Diday, E., & Simon, J. C. (1976). Clustering analysis. In K. Fu (Ed.), Digital pattern classification (pp. 47–94). Berlin: Springer.

  • D’Urso, P., & Giovanni, L. D. (2011). Midpoint radius self-organizing maps for interval-valued data with telecommunications application. Applied Soft Computing, 11, 3877–3886.

  • Friedman, J. H., & Meulman, J. J. (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B, 66, 815–849.

  • Gibbs, A. L., & Su, F. E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3), 419–435.

  • Hajjar, C., & Hamdan, H. (2011a). Self-organizing map based on city-block distance for interval-valued data. In Complex Systems Design and Management (CSDM 2011) (pp. 281–292).

  • Hajjar, C., & Hamdan, H. (2011b). Self-organizing map based on Hausdorff distance for interval-valued data. In IEEE International Conference on Systems, Man, and Cybernetics (SMC 2011) (pp. 1747–1752).

  • Hajjar, C., & Hamdan, H. (2011c). Self-organizing map based on L2 distance for interval-valued data. In 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI 2011) (pp. 317–322).

  • Hajjar, C., & Hamdan, H. (2013). Interval data clustering using self-organizing maps based on adaptive Mahalanobis distances. Neural Networks, 46, 124–132.

  • Huang, J. Z., Ng, M. K., Rong, H., & Li, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668.

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.

  • Hulse, D., Gregory, S., & Baker, J. (2002). Willamette River Basin: Trajectories of environmental and ecological change. Oregon State University Press. http://www.fsl.orst.edu/pnwerc/wrb/Atlas_web_compressed/PDFtoc.html.

  • Irpino, A. (2018). HistDAWass: Histogram data analysis using Wasserstein distance. R package version 1.0.1.

  • Irpino, A., & Romano, E. (2007). Optimal histogram representation of large data sets: Fisher vs piecewise linear approximation. Revue des Nouvelles Technologies de l’Information, RNTI-E-9, 99–110.

  • Irpino, A., & Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In V. Batagelj et al. (Eds.), Data science and classification (pp. 185–192). Berlin: Springer.

  • Irpino, A., & Verde, R. (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.

  • Irpino, A., Verde, R., & de Carvalho, F. A. T. (2014). Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems with Applications, 41(7), 3351–3366.

  • Irpino, A., Verde, R., & de Carvalho, F. A. T. (2017). Fuzzy clustering of distributional data with automatic weighting of variable components. Information Sciences, 406–407, 248–268.

  • Kim, J., & Billard, L. (2011). A polythetic clustering process and cluster validity indexes for histogram-valued objects. Computational Statistics and Data Analysis, 55(7), 2250–2262.

  • Kim, J., & Billard, L. (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods, 42(2), 283–303.

  • Kiviluoto, K. (1996). Topology preservation in self-organizing maps. In IEEE International Conference on Neural Networks 1996 (Vol. 1, pp. 294–299). https://doi.org/10.1109/ICNN.1996.548907.

  • Kohonen, T. (1995). Self-organizing maps. New York: Springer.

  • Kohonen, T. (2013). Essentials of the self-organizing map. Neural Networks, 37(1), 52–65.

  • Kohonen, T. (2014). MATLAB implementations and applications of the self-organizing map. Helsinki: Unigrafia Oy.

  • Korenjak-Černe, S., & Batagelj, V. (2002). Symbolic data analysis approach to clustering large datasets (pp. 319–327). Berlin: Springer.

  • Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.

  • Meila, M. (2007). Comparing clusterings: An information based distance. Journal of Multivariate Analysis, 98(5), 873–895.

  • Milligan, G. W., & Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5(2), 181–204. https://doi.org/10.1007/BF01897163.

  • Modha, D. S., & Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217–237.

  • Mount, N. J., & Weaver, D. (2011). Self-organizing maps and boundary effects: Quantifying the benefits of torus wrapping for mapping SOM trajectories. Pattern Analysis and Applications, 14(2), 139–148.

  • Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

  • Rüschendorf, L. (2001). Wasserstein metric. In Encyclopedia of mathematics. Springer.

  • Terada, T., & Yadohisa, H. (2010). Non-hierarchical clustering for distribution-valued data. In Y. Lechevallier & G. Saporta (Eds.), Proceedings of COMPSTAT 2010 (pp. 1653–1660). Berlin: Springer.

  • Verde, R., & Irpino, A. (2008a). Comparing histogram data using a Mahalanobis-Wasserstein distance. In P. Brito (Ed.), Proceedings of COMPSTAT 2008 (pp. 77–89). Heidelberg: Springer.

  • Verde, R., & Irpino, A. (2008b). Dynamic clustering of histogram data: Using the right metric. In P. Brito et al. (Eds.), Selected contributions in data analysis and classification (pp. 123–134). Berlin: Springer.

  • Verde, R., & Irpino, A. (2018). Multiple factor analysis of distributional data. Statistica Applicata: Italian Journal of Applied Statistics. To appear.

  • Verde, R., Irpino, A., & Lechevallier, Y. (2006). Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi & M. Vichi (Eds.), Proceedings of COMPSTAT 2006 (pp. 869–876). Heidelberg: Physica-Verlag.

  • Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (1999). Self-organizing map in Matlab: The SOM Toolbox. In Proceedings of the Matlab DSP Conference (pp. 35–40).

  • Vrac, M., Billard, L., Diday, E., & Chedin, A. (2012). Copula analysis of mixture models. Computational Statistics, 27, 427–457.

  • Zhang, L., Bing, Z., & Zhang, L. (2015). A hybrid clustering algorithm based on missing attribute interval estimation for incomplete data. Pattern Analysis and Applications, 18, 377–384.

Download references

Acknowledgements

The authors would like to thank the anonymous referees for their careful reviews.

Funding

The first author would like to acknowledge the partial funding of the Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (311164/2020-0) and of the University of Campania L. Vanvitelli (1049/2018).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Francisco de A. T. de Carvalho.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Fast Computation of Silhouette Coefficient for Euclidean Spaces

Let us consider a table Y containing N numerical data vectors \(\mathbf{y}_{i}=[y_{i1},\ldots,y_{iP}]\) of P components. Without loss of generality, the N vectors are clustered into two groups, namely A and B, having, respectively, sizes \(n_{A}\) and \(n_{B}\) such that \(n_{A}+n_{B}=N\). Let \(\bar {y}_{Aj}= n_{A}^{-1}\sum \limits _{\mathbf {y}_{i}\in A}y_{ij}\) and \(\bar {y}_{Bj}= n_{B}^{-1}\sum \limits _{\mathbf {y}_{i}\in B}y_{ij}\), j = 1,…,P, be the two cluster means, and \(\text {SSE}_{Aj}=\sum \limits _{\mathbf {y}_{i}\in A}{y_{ij}^{2}}- n_{A} (\bar {y}_{Aj})^{2}\) and \(\text {SSE}_{Bj}=\sum \limits _{\mathbf {y}_{i}\in B}{y_{ij}^{2}}- n_{B} (\bar {y}_{Bj})^{2}\) the two sums of squared deviations from the respective cluster means.

1.1 The Average Euclidean Distance from a Point to All the Other Points of the Set That Contains It

Let us consider \(\mathbf{y}_{i}\in A\); the average squared distance of \(\mathbf{y}_{i}\) from all the other vectors in A is computed as follows:

$$(n_{A}-1)^{-1}\sum\limits_{\mathbf{y}_{k}\in A}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}.$$

It is easy to show, for a single variable j, that:

$$\begin{aligned}\sum\limits_{\mathbf{y}_{k}\in A}{(y_{ij}-y_{kj})^{2}}&=n_{A}(y_{ij})^{2}+\sum\limits_{\mathbf{y}_{k}\in A}{(y_{kj})^{2}}-2 y_{ij}\sum\limits_{\mathbf{y}_{k}\in A}{y_{kj}}\\&=n_{A}(y_{ij})^{2}+\left[\text{SSE}_{Aj}+n_{A}(\bar{y}_{Aj})^{2}\right]-2n_{A}y_{ij}\bar{y}_{Aj}\\&=n_{A}\left( y_{ij}-\bar{y}_{Aj}\right)^{2}+\text{SSE}_{Aj}.\end{aligned}$$

Then, for a single variable j, the average distance is:

$$(n_{A}-1)^{-1}\sum\limits_{\mathbf{y}_{k}\in A} {(y_{ij}-y_{kj})^{2}}=\frac{n_{A}\left( y_{ij}-\bar{y}_{Aj}\right)^{2}}{n_{A}-1}+\frac{\text{SSE}_{Aj}}{n_{A}-1}.$$
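As a quick numerical check of this identity, the following minimal Python sketch (the toy data and variable names are ours, not from the paper) compares the brute-force average against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))   # cluster A: n_A = 50 vectors, P = 3 variables
i = 7                          # index of y_i inside A
n_A = A.shape[0]

# Brute force: average squared Euclidean distance from y_i to all other points of A
# (the k = i term contributes zero, so summing over all rows is harmless)
brute = ((A[i] - A) ** 2).sum() / (n_A - 1)

# Closed form: sum over variables of n_A*(y_ij - mean_j)^2/(n_A-1) + SSE_j/(n_A-1)
mean_A = A.mean(axis=0)
sse_A = ((A - mean_A) ** 2).sum(axis=0)
closed = (n_A * (A[i] - mean_A) ** 2 / (n_A - 1) + sse_A / (n_A - 1)).sum()

assert np.isclose(brute, closed)
```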

1.2 The Average Euclidean Distance from a Point to All the Points of a Set That Does Not Contain It

Let us consider \(\mathbf{y}_{i}\in A\) and suppose we want to compute the average distance of \(\mathbf{y}_{i}\) from all the vectors in B. The average squared Euclidean distance between \(\mathbf{y}_{i}\) and the vectors in B is given by:

$$(n_{B})^{-1}\sum\limits_{\mathbf{y}_{k}\in B}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}.$$

For a single variable j, it is easy to show that:

$$\begin{aligned}\sum\limits_{\mathbf{y}_{k}\in B}{(y_{ij}-y_{kj})^{2}}&=n_{B}(y_{ij})^{2}+\sum\limits_{\mathbf{y}_{k}\in B}{(y_{kj})^{2}}-2 y_{ij}\sum\limits_{\mathbf{y}_{k}\in B}{y_{kj}}\\&=n_{B}(y_{ij})^{2}+\left[\text{SSE}_{Bj}+n_{B}(\bar{y}_{Bj})^{2}\right]-2n_{B}y_{ij}\bar{y}_{Bj}\\&=n_{B}\left( y_{ij}-\bar{y}_{Bj}\right)^{2}+\text{SSE}_{Bj}.\end{aligned}$$

Then, the average distance is:

$$(n_{B})^{-1}\sum\limits_{\mathbf{y}_{k}\in B} {(y_{ij}-y_{kj})^{2}}=\left( y_{ij}-\bar{y}_{Bj}\right)^{2}+\frac{\text{SSE}_{Bj}}{n_{B}}.$$

1.3 The Silhouette Coefficient

As stated above, the silhouette coefficient for a single \(\mathbf{y}_{i}\) is given by:

$$ s(i)=\frac{b(i)-a(i)}{\max[a(i),b(i)]}, $$

where, in the case of two groups A and B, if \(\mathbf{y}_{i}\in A\):

$$a(i)=(n_{A}-1)^{-1}\sum\limits_{\mathbf{y}_{k}\in A}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}$$

and

$$b(i)=(n_{B})^{-1}\sum\limits_{\mathbf{y}_{k}\in B}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}.$$

With the original formulation, computing the N silhouette indexes has a computational complexity of order \(\mathcal {O}(N^{2})\), provided N is sufficiently larger than P. With the formulas suggested here, once the cluster means and the SSE terms have been computed, the computational cost is approximately \(\mathcal {O}(N)\). In fact, it is possible to compute a(i) and b(i) as follows:

$$a(i)=\sum\limits_{j=1}^{P}\left[\frac{n_{A}\left( y_{ij}-\bar{y}_{Aj}\right)^{2}}{n_{A}-1}+\frac{\text{SSE}_{Aj}}{n_{A}-1}\right]$$

and

$$b(i)=\sum\limits_{j=1}^{P}\left[ \left( y_{ij}-\bar{y}_{Bj}\right)^{2}+\frac{\text{SSE}_{Bj}}{n_{B}}\right].$$

Since the squared L2 Wasserstein distance is a Euclidean distance between quantile functions, the SSE computed for distributions is defined as a sum of squared distances, and the average distributions are Fréchet means with respect to the squared L2 Wasserstein distance, the same simplification can be applied to compute the silhouette coefficient when the data are distributions compared using the squared L2 Wasserstein distance.
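The following Python sketch implements the O(N) two-cluster silhouette computation derived above; it is a minimal illustration under our own naming, not the authors' code. For distributional data, the rows of Y would hold discretized quantile functions, so that squared Euclidean distances between rows approximate squared L2 Wasserstein distances.

```python
import numpy as np

def fast_silhouette_two_clusters(Y, labels):
    """Silhouette coefficients for a two-cluster partition (labels in {0, 1})
    in O(N), using per-cluster means and SSEs instead of pairwise distances.
    Distances are squared Euclidean, as in the derivation above; clusters are
    assumed to have at least two members."""
    out = np.empty(len(Y))
    stats = {}
    for c in (0, 1):
        Yc = Y[labels == c]
        mean = Yc.mean(axis=0)
        sse = ((Yc - mean) ** 2).sum(axis=0)   # SSE_cj, one entry per variable
        stats[c] = (len(Yc), mean, sse)
    for i, (y, c) in enumerate(zip(Y, labels)):
        n_a, m_a, sse_a = stats[c]             # own cluster A
        n_b, m_b, sse_b = stats[1 - c]         # other cluster B
        a = (n_a * (y - m_a) ** 2 / (n_a - 1) + sse_a / (n_a - 1)).sum()
        b = ((y - m_b) ** 2 + sse_b / n_b).sum()
        out[i] = (b - a) / max(a, b)
    return out

# Toy usage: two well-separated Gaussian clusters
Y = np.vstack([np.random.normal(0, 1, (40, 2)), np.random.normal(4, 1, (60, 2))])
labels = np.array([0] * 40 + [1] * 60)
s = fast_silhouette_two_clusters(Y, labels)
```

Each silhouette value thus requires only the precomputed cluster means and SSEs plus one pass over the data, rather than all pairwise distances.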


Cite this article

de Carvalho, F.d.A.T., Irpino, A., Verde, R. et al. Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components. J Classif 39, 343–375 (2022). https://doi.org/10.1007/s00357-022-09411-1
