
Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components

Journal of Classification

Abstract

This paper deals with a batch self-organizing map algorithm for data described by distributional-valued variables (DBSOM). Such variables take as values probability or frequency distributions on a numeric support. In accordance with the nature of the data, the loss function is based on the L2 Wasserstein distance, one of the most widely used metrics for comparing distributions in distributional data analysis. Moreover, to account for the different contributions of the variables, four adaptive versions of the DBSOM algorithm are proposed: relevance weights, one for each distributional-valued variable, are learned automatically in an additional step of the algorithm. Since the L2 Wasserstein metric allows the distance to be decomposed into two components, one related to the means and one related to the size and shape of the distributions, relevance weights are also learned automatically for each of these two components, so as to emphasize the importance of the different characteristics of the distributions on the distance value. The proposed algorithms are corroborated by applications to real distributional-valued data sets.
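For reference, the decomposition mentioned here, as recalled from Irpino and Verde (2015) (the symbols \(Q_{f}\), \(\mu_{f}\), \(Q_{f}^{c}\) are our notation, not taken from this page), writes the squared L2 Wasserstein distance between two distributions f and g as a mean component plus a size-and-shape component:

$$ d_{W}^{2}(f,g)={\int_{0}^{1}}\left( Q_{f}(t)-Q_{g}(t)\right)^{2}dt=(\mu_{f}-\mu_{g})^{2}+{\int_{0}^{1}}\left( Q_{f}^{c}(t)-Q_{g}^{c}(t)\right)^{2}dt, $$

where \(Q_{f}\) and \(Q_{g}\) are the quantile functions of f and g, \(\mu_{f}\) and \(\mu_{g}\) their means, and \(Q^{c}(t)=Q(t)-\mu\) the centered quantile functions; the second addend compares the distributions net of their locations.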


Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. We remark that gmj is distributional data because the corresponding quantile function Qmj is a weighted average of quantile functions. For further details, see Irpino and Verde (2015).

  2. It is available at https://github.com/Airpino/ADBSOM

  3. See the ?? for the proof.

  4. https://archive.ics.uci.edu/ml/datasets/daily+and+sports+activities

  5. https://cran.r-project.org/package=HistDAWass

References

  • Altun, K., Barshan, B., & Tunçel, O. (2010). Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10), 3605–3620.

  • Badran, F., Yacoub, M., & Thiria, S. (2005). Self-organizing maps and unsupervised classification. In G. Dreyfus (Ed.), Neural networks: Methodology and applications (pp. 379–442). Singapore: Springer.

  • Bao, C., Peng, H., He, D., & Wang, J. (2018). Adaptive fuzzy c-means clustering algorithm for interval data type based on interval-dividing technique. Pattern Analysis and Applications, 21, 803–812.

  • Barshan, B., & Yuksek, M. C. (2014). Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal, 57(11), 1649–1667.

  • Barshan, B., & Yurtman, A. (2016). Investigating inter-subject and inter-activity variations in activity recognition using wearable motion sensors. The Computer Journal, 59(9), 1345–1362.

  • Bock, H. H. (2002). Clustering algorithms and Kohonen maps for symbolic data. Journal of the Japanese Society of Computational Statistics, 15, 1–13.

  • Bock, H. H., & Diday, E. (2000). Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data. Berlin: Springer.

  • Cabanes, G., Bennani, Y., Destenay, R., & Hardy, A. (2013). A new topological clustering algorithm for interval data. Pattern Recognition, 46, 3030–3039.

  • Cabanes, G., Bennani, Y., Verde, R., & Irpino, A. (2021). On the use of Wasserstein metric in topological clustering of distributional data. https://arxiv.org/abs/2109.04301.

  • Campello, R. J. G. B., & Hruschka, E. R. (2006). A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems, 157(21), 2858–2875.

  • de Carvalho, F. A. T., & De Souza, R. M. C. R. (2010). Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognition Letters, 31, 430–443.

  • de Carvalho, F. A. T., & Lechevallier, Y. (2009). Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42(7), 1223–1236.

  • de Carvalho, F. A. T., Irpino, A., & Verde, R. (2015). Fuzzy clustering of distribution-valued data using an adaptive L2 Wasserstein distance. In 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 1–8). https://doi.org/10.1109/FUZZ-IEEE.2015.7337847.

  • de Carvalho, F. A. T., Bertrand, P., & Simões, E. C. (2016). Batch SOM algorithms for interval-valued data with automatic weighting of the variables. Neurocomputing, 182, 66–81.

  • Diday, E., & Govaert, G. (1977). Classification automatique avec distances adaptatives. RAIRO Informatique/Computer Science, 11(4), 329–349.

  • Diday, E., & Simon, J. C. (1976). Clustering analysis. In K. Fu (Ed.), Digital pattern classification (pp. 47–94). Berlin: Springer.

  • D’Urso, P., & Giovanni, L. D. (2011). Midpoint radius self-organizing maps for interval-valued data with telecommunications application. Applied Soft Computing, 11, 3877–3886.

  • Friedman, J. H., & Meulman, J. J. (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B, 66, 815–849.

  • Gibbs, A. L., & Su, F. E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3), 419–435.

  • Hajjar, C., & Hamdan, H. (2011a). Self-organizing map based on city-block distance for interval-valued data. In Complex Systems Design and Management (CSDM 2011) (pp. 281–292).

  • Hajjar, C., & Hamdan, H. (2011b). Self-organizing map based on Hausdorff distance for interval-valued data. In IEEE International Conference on Systems, Man, and Cybernetics (SMC 2011) (pp. 1747–1752).

  • Hajjar, C., & Hamdan, H. (2011c). Self-organizing map based on L2 distance for interval-valued data. In 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI 2011) (pp. 317–322).

  • Hajjar, C., & Hamdan, H. (2013). Interval data clustering using self-organizing maps based on adaptive Mahalanobis distances. Neural Networks, 46, 124–132.

  • Huang, J. Z., Ng, M. K., Rong, H., & Li, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668.

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.

  • Hulse, D., Gregory, S., & Baker, J. (2002). Willamette River Basin: Trajectories of environmental and ecological change. Oregon State University Press. http://www.fsl.orst.edu/pnwerc/wrb/Atlas_web_compressed/PDFtoc.html.

  • Irpino, A. (2018). HistDAWass: Histogram data analysis using Wasserstein distance. R package version 1.0.1.

  • Irpino, A., & Romano, E. (2007). Optimal histogram representation of large data sets: Fisher vs piecewise linear approximation. Revue des Nouvelles Technologies de l’Information, RNTI-E-9, 99–110.

  • Irpino, A., & Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In V. Batagelj et al. (Eds.), Data science and classification (pp. 185–192). Berlin: Springer.

  • Irpino, A., & Verde, R. (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.

  • Irpino, A., Verde, R., & de Carvalho, F. A. T. (2014). Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems with Applications, 41(7), 3351–3366.

  • Irpino, A., Verde, R., & de Carvalho, F. A. T. (2017). Fuzzy clustering of distributional data with automatic weighting of variable components. Information Sciences, 406–407, 248–268.

  • Kim, J., & Billard, L. (2011). A polythetic clustering process and cluster validity indexes for histogram-valued objects. Computational Statistics and Data Analysis, 55(7), 2250–2262.

  • Kim, J., & Billard, L. (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods, 42(2), 283–303.

  • Kiviluoto, K. (1996). Topology preservation in self-organizing maps. In IEEE International Conference on Neural Networks 1996 (Vol. 1, pp. 294–299). https://doi.org/10.1109/ICNN.1996.548907.

  • Kohonen, T. (1995). Self-organizing maps. New York: Springer.

  • Kohonen, T. (2013). Essentials of the self-organizing map. Neural Networks, 37(1), 52–65.

  • Kohonen, T. (2014). MATLAB implementations and applications of the self-organizing map. Helsinki: Unigrafia Oy.

  • Korenjak-Černe, S., & Batagelj, V. (2002). Symbolic data analysis approach to clustering large datasets (pp. 319–327). Berlin: Springer.

  • Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.

  • Meila, M. (2007). Comparing clusterings: An information based distance. Journal of Multivariate Analysis, 98(5), 873–895.

  • Milligan, G. W., & Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5(2), 181–204. https://doi.org/10.1007/BF01897163.

  • Modha, D. S., & Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217–237.

  • Mount, N. J., & Weaver, D. (2011). Self-organizing maps and boundary effects: Quantifying the benefits of torus wrapping for mapping SOM trajectories. Pattern Analysis and Applications, 14(2), 139–148.

  • Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

  • Rüschendorf, L. (2001). Wasserstein metric. In Encyclopedia of mathematics. Springer.

  • Terada, T., & Yadohisa, H. (2010). Non-hierarchical clustering for distribution-valued data. In Y. Lechevallier & G. Saporta (Eds.), Proceedings of COMPSTAT 2010 (pp. 1653–1660). Berlin: Springer.

  • Verde, R., & Irpino, A. (2008a). Comparing histogram data using a Mahalanobis-Wasserstein distance. In P. Brito (Ed.), Proceedings of COMPSTAT 2008 (pp. 77–89). Heidelberg: Springer.

  • Verde, R., & Irpino, A. (2008b). Dynamic clustering of histogram data: Using the right metric. In P. Brito et al. (Eds.), Selected contributions in data analysis and classification (pp. 123–134). Berlin: Springer.

  • Verde, R., & Irpino, A. (2018). Multiple factor analysis of distributional data. Statistica Applicata: Italian Journal of Applied Statistics. To appear.

  • Verde, R., Irpino, A., & Lechevallier, Y. (2006). Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi & M. Vichi (Eds.), Proceedings of COMPSTAT 2006 (pp. 869–876). Heidelberg: Physica-Verlag.

  • Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (1999). Self-organizing map in Matlab: The SOM Toolbox. In Proceedings of the Matlab DSP Conference (pp. 35–40).

  • Vrac, M., Billard, L., Diday, E., & Chedin, A. (2012). Copula analysis of mixture models. Computational Statistics, 27, 427–457.

  • Zhang, L., Bing, Z., & Zhang, L. (2015). A hybrid clustering algorithm based on missing attribute interval estimation for incomplete data. Pattern Analysis and Applications, 18, 377–384.

Download references

Acknowledgements

The authors would like to thank the anonymous referees for their careful reviews.

Funding

The first author would like to acknowledge the partial funding of the Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (311164/2020-0) and of the University of Campania L. Vanvitelli (1049/2018).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Francisco de A. T. de Carvalho.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Fast Computation of Silhouette Coefficient for Euclidean Spaces

Let us consider a table Y containing N numerical data vectors \(\mathbf{y}_{i}=[y_{i1},\ldots,y_{iP}]\) of P components. Without loss of generality, the N vectors are clustered into two groups, namely A and B, having, respectively, sizes \(n_{A}\) and \(n_{B}\) such that \(n_{A}+n_{B}=N\). Let \(\bar {y}_{Aj}= n_{A}^{-1}\sum \limits _{\mathbf {y}_{i}\in A}y_{ij}\) and \(\bar {y}_{Bj}= n_{B}^{-1}\sum \limits _{\mathbf {y}_{i}\in B}y_{ij}\), j = 1,…,P, be the two cluster means, and \(\text {SSE}_{Aj}=\sum \limits _{\mathbf {y}_{i}\in A}{y_{ij}^{2}}- n_{A} (\bar {y}_{Aj})^{2}\) and \(\text {SSE}_{Bj}=\sum \limits _{\mathbf {y}_{i}\in B}{y_{ij}^{2}}- n_{B} (\bar {y}_{Bj})^{2}\) the two sums of squared deviations from the respective cluster means.

1.1 The Average Euclidean Distance from a Point to All the Other Points of the Set That Contains It

Let us consider \(\mathbf{y}_{i}\in A\); the average squared distance of \(\mathbf{y}_{i}\) from all the other vectors in A is computed as follows:

$$(n_{A}-1)^{-1}\sum\limits_{\mathbf{y}_{k}\in A}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}.$$

It is easy to show, for a single variable j, that:

$$\begin{aligned}\sum\limits_{\mathbf{y}_{k}\in A}{(y_{ij}-y_{kj})^{2}}&=n_{A}(y_{ij})^{2}+\sum\limits_{\mathbf{y}_{k}\in A}{(y_{kj})^{2}}-2 y_{ij}\sum\limits_{\mathbf{y}_{k}\in A}{y_{kj}}\\&=n_{A}(y_{ij})^{2}+\left[\text{SSE}_{Aj}+n_{A}(\bar{y}_{Aj})^{2}\right]-2n_{A}y_{ij}\bar{y}_{Aj}\\&=n_{A}\left( y_{ij}-\bar{y}_{Aj}\right)^{2}+\text{SSE}_{Aj}.\end{aligned}$$

Then, for a single variable j, the average distance is:

$$(n_{A}-1)^{-1}\sum\limits_{\mathbf{y}_{k}\in A} {(y_{ij}-y_{kj})^{2}}=\frac{n_{A}\left( y_{ij}-\bar{y}_{Aj}\right)^{2}}{n_{A}-1}+\frac{\text{SSE}_{Aj}}{n_{A}-1}.$$
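As a quick numerical check of this identity, the following minimal Python sketch (the toy data and variable names are ours, not from the paper) compares the brute-force average against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))   # cluster A: n_A = 50 vectors, P = 3 variables
i = 7                          # index of y_i inside A
n_A = A.shape[0]

# Brute force: average squared Euclidean distance from y_i to all other points of A
# (the k = i term contributes zero, so summing over all rows is harmless)
brute = ((A[i] - A) ** 2).sum() / (n_A - 1)

# Closed form: sum over variables of n_A*(y_ij - mean_j)^2/(n_A-1) + SSE_j/(n_A-1)
mean_A = A.mean(axis=0)
sse_A = ((A - mean_A) ** 2).sum(axis=0)
closed = (n_A * (A[i] - mean_A) ** 2 / (n_A - 1) + sse_A / (n_A - 1)).sum()

assert np.isclose(brute, closed)
```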

1.2 The Average Euclidean Distance from a Point to All the Points of a Set That Does Not Contain It

Let us consider \(\mathbf{y}_{i}\in A\) and suppose we want to compute the average distance of \(\mathbf{y}_{i}\) from all the vectors in B. The average squared Euclidean distance between \(\mathbf{y}_{i}\) and the vectors in B is given by:

$$(n_{B})^{-1}\sum\limits_{\mathbf{y}_{k}\in B}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}.$$

For a single variable j, it is easy to show that:

$$\begin{aligned}\sum\limits_{\mathbf{y}_{k}\in B}{(y_{ij}-y_{kj})^{2}}&=n_{B}(y_{ij})^{2}+\sum\limits_{\mathbf{y}_{k}\in B}{(y_{kj})^{2}}-2 y_{ij}\sum\limits_{\mathbf{y}_{k}\in B}{y_{kj}}\\&=n_{B}(y_{ij})^{2}+\left[\text{SSE}_{Bj}+n_{B}(\bar{y}_{Bj})^{2}\right]-2n_{B}y_{ij}\bar{y}_{Bj}\\&=n_{B}\left( y_{ij}-\bar{y}_{Bj}\right)^{2}+\text{SSE}_{Bj}.\end{aligned}$$

Then, the average distance is:

$$(n_{B})^{-1}\sum\limits_{\mathbf{y}_{k}\in B} {(y_{ij}-y_{kj})^{2}}=\left( y_{ij}-\bar{y}_{Bj}\right)^{2}+\frac{\text{SSE}_{Bj}}{n_{B}}.$$

1.3 The Silhouette Coefficient

As stated above, the silhouette coefficient for a single \(\mathbf{y}_{i}\) is given by:

$$ s(i)=\frac{b(i)-a(i)}{\max[a(i),b(i)]}, $$

where, in the case of two groups A and B, if \(\mathbf{y}_{i}\in A\):

$$a(i)=(n_{A}-1)^{-1}\sum\limits_{\mathbf{y}_{k}\in A}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}$$

and

$$b(i)=(n_{B})^{-1}\sum\limits_{\mathbf{y}_{k}\in B}\sum\limits_{j=1}^{P}{(y_{ij}-y_{kj})^{2}}.$$

With the original formulation, computing the N silhouette indexes has a computational complexity of order \(\mathcal {O}(N^{2})\), provided N is sufficiently larger than P. With the formulas suggested here, once the cluster means and the SSE terms have been computed, the computational cost is approximately \(\mathcal {O}(N)\). In fact, it is possible to compute a(i) and b(i) as follows:

$$a(i)=\sum\limits_{j=1}^{P}\left[\frac{n_{A}\left( y_{ij}-\bar{y}_{Aj}\right)^{2}}{n_{A}-1}+\frac{\text{SSE}_{Aj}}{n_{A}-1}\right]$$

and

$$b(i)=\sum\limits_{j=1}^{P}\left[ \left( y_{ij}-\bar{y}_{Bj}\right)^{2}+\frac{\text{SSE}_{Bj}}{n_{B}}\right].$$

Since the squared L2 Wasserstein distance is a Euclidean distance between quantile functions, the SSE computed for distributions is defined as a sum of squared distances, and the average distributions are Fréchet means with respect to the squared L2 Wasserstein distance, the same simplification can be applied to compute the silhouette coefficient when the data are distributions compared using the squared L2 Wasserstein distance.
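The following Python sketch implements the O(N) two-cluster silhouette computation derived above; it is a minimal illustration under our own naming, not the authors' code. For distributional data, the rows of Y would hold discretized quantile functions, so that squared Euclidean distances between rows approximate squared L2 Wasserstein distances.

```python
import numpy as np

def fast_silhouette_two_clusters(Y, labels):
    """Silhouette coefficients for a two-cluster partition (labels in {0, 1})
    in O(N), using per-cluster means and SSEs instead of pairwise distances.
    Distances are squared Euclidean, as in the derivation above; clusters are
    assumed to have at least two members."""
    out = np.empty(len(Y))
    stats = {}
    for c in (0, 1):
        Yc = Y[labels == c]
        mean = Yc.mean(axis=0)
        sse = ((Yc - mean) ** 2).sum(axis=0)   # SSE_cj, one entry per variable
        stats[c] = (len(Yc), mean, sse)
    for i, (y, c) in enumerate(zip(Y, labels)):
        n_a, m_a, sse_a = stats[c]             # own cluster A
        n_b, m_b, sse_b = stats[1 - c]         # other cluster B
        a = (n_a * (y - m_a) ** 2 / (n_a - 1) + sse_a / (n_a - 1)).sum()
        b = ((y - m_b) ** 2 + sse_b / n_b).sum()
        out[i] = (b - a) / max(a, b)
    return out

# Toy usage: two well-separated Gaussian clusters
Y = np.vstack([np.random.normal(0, 1, (40, 2)), np.random.normal(4, 1, (60, 2))])
labels = np.array([0] * 40 + [1] * 60)
s = fast_silhouette_two_clusters(Y, labels)
```

Each silhouette value thus requires only the precomputed cluster means and SSEs plus one pass over the data, rather than all pairwise distances.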


Cite this article

de Carvalho, F.d.A.T., Irpino, A., Verde, R. et al. Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components. J Classif 39, 343–375 (2022). https://doi.org/10.1007/s00357-022-09411-1
