
Dynamic clustering of interval data based on hybrid \(L_q\) distance

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Dynamic clustering partitions the data and assigns a prototype to each partition. Distance metrics measure the closeness between instances and prototypes. In the literature on interval data, distances depend only on the interval bounds, and the information inside the intervals is ignored. This paper proposes new distances that exploit the information inside the intervals. It also presents a mapping of intervals to points that preserves their spatial location and internal variation. We formulate a new hybrid distance for interval data based on the well-known \(L_q\) distance for point data. This new distance allows a weighted formulation of the hybridism. Hence, we propose a Hybrid \(L_q\) distance, a Weighted Hybrid \(L_q\) distance, and an adaptive version of the Hybrid \(L_q\) distance for interval data. Experiments with synthetic and real interval data sets illustrate the usefulness of the hybrid approach for improving dynamic clustering of interval data.



References

  1. Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. Wiley, Chichester

  2. Billard L, Le-Rademacher J (2012) Principal component analysis for interval data. Wiley Interdiscip Rev Comput Stat 4(6):535–540

  3. Burden RL, Faires JD (2011) Numerical analysis. Cengage Learning, Brooks/Cole

  4. Chavent M, Lechevallier Y (2002) Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance. In: Classification, clustering, and data analysis, pp 53–60

  5. Chavent M (2004) An Hausdorff distance between hyper-rectangles for clustering interval data. In: Banks D et al (eds) Classification, clustering and data mining applications, proceedings of the IFCS04. Springer, Berlin, pp 333–340

  6. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, Cambridge

  7. De Carvalho FAT, Brito P, Bock H-H (2006b) Dynamic clustering for interval data based on L2 distance. Comput Stat 21:231–250

  8. De Carvalho FAT, Souza RMCR, Chavent M, Lechevallier Y (2006a) Adaptive Hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recognit Lett 27:167–179

  9. De Carvalho FAT, Lechevallier Y (2009a) Dynamic clustering of interval-valued data based on adaptive quadratic distances. Trans Syst Man Cyber Part A 39:1295–1306

  10. De Carvalho FAT, Lechevallier Y (2009b) Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognit 42:1223–1236

  11. De Carvalho FAT, Souza RMCR (2010) Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognit Lett 31(5):430–443

  12. Diday E, Simon JC (1976) Clustering analysis. In: Fu KS (ed) Digital pattern recognition. Springer, Berlin, pp 47–94

  13. Diday E, Noirhomme-Fraiture M (2008) Symbolic data analysis and the SODAS software. Wiley, Chichester

  14. Diday E (2016) Thinking by classes in data science: the symbolic data analysis paradigm. Wiley Interdiscip Rev Comput Stat 8(5):172–205

  15. Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4(2):229–246

  16. Fränti P, Kivijärvi J (2000) Randomised local search algorithm for the clustering problem. Pattern Anal Appl 3:358–369

  17. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  18. Lichman M (2013) UCI machine learning repository

  19. Lima Neto EA, De Carvalho FAT (2010) Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 54:333–347

  20. Lima Neto EA, De Carvalho FAT (2008) Centre and range method for fitting a linear regression model to symbolic interval data. Comput Stat Data Anal 52:1500–1515

  21. Martinez WL, Martinez AR (2007) Computational statistics handbook with MATLAB. Chapman & Hall/CRC, New York

  22. Rencher AC, Christensen WF (2012) Methods of multivariate analysis, 3rd edn. Wiley, New York

  23. Silva Filho TM, Souza RMCR (2015) A swarm-trained k-nearest prototypes adaptive classifier with automatic feature selection for interval data. Neural Netw 80:19–33

  24. Silva APD, Brito P (2006) Linear discriminant analysis for interval data. Comput Stat 21(2):289–308

  25. Silva APD, Brito P (2015) Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. J Classif 32(3):516–541

  26. Souza LC (2016) Agrupamento e regressão linear de dados simbólicos intervalares baseados em novas representações. PhD Thesis, Universidade Federal de Pernambuco, PE, Brazil, https://repositorio.ufpe.br/handle/123456789/17640

  27. Souza RMCR, De Carvalho FAT (2004) Clustering of interval data based on city–block distances. Pattern Recognit Lett 25:353–365


Acknowledgements

The authors would like to thank CNPq and CAPES (Brazilian Agencies) for their financial support.

Author information

Corresponding author

Correspondence to Renata Maria Cardoso Rodrigues de Souza.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Proposition 1

Fixing cluster k and dimension j, the hybrid weights of the \(WHL_q\) distance are obtained using Lagrange multipliers under the restrictions: \(w_{k,1}^j + w_{k,2}^j = 1\); \(w_{k,1}^j \ge 0 \); \(w_{k,2}^j \ge 0\); and \(t > 1\). Let

$$\begin{aligned} \xi _{k,1}^j = \sum _{\gamma _n \in C_k} | \underline{\gamma }_n^j -\underline{g}_k^j |^q \quad \text{ and } \quad \,\, \xi _{k,2}^j =\sum _{\gamma _n \in C_k} | \breve{\gamma }_n^j - \breve{g}_k^j |^q. \end{aligned}$$

The Hybrid weight values are computed by:

$$\begin{aligned} w_{k,1}^j = \left\{ 1 + \left( \frac{\xi _{k,1}^j}{\xi _{k,2}^j} \right) ^{\frac{1}{t-1}} \right\} ^{-1} \text{ and } \,\,\, w_{k,2}^j = \left\{ 1 + \left( \frac{\xi _{k,2}^j}{\xi _{k,1}^j} \right) ^{\frac{1}{t-1}} \right\} ^{-1}. \end{aligned}$$

Proof

The partitional dynamic clustering criterion for the \(WHL_q\) distance is given by

$$\begin{aligned} J_{d_{WHLq}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (w_{k,1}^j)^t \, |\underline{\gamma }_n^j -\underline{g}_k^j|^q + (w_{k,2}^j)^t \, |\breve{\gamma }_n^j-\breve{g}_k^j|^q \right\} , \end{aligned}$$
(28)

under the restrictions: \(w_{k,1}^j + w_{k,2}^j = 1\), \(w_{k,1}^j \ge 0\), \(w_{k,2}^j \ge 0\) and \(t > 1\). The solution can be found using Lagrange multipliers. Let \(J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)\) be the version of Eq. (28) with the Lagrange multipliers (\(\varLambda _k^j\)) and their associated restrictions incorporated. Thus, it becomes

$$\begin{aligned} J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)= & {} \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (w_{k,1}^j)^t \, |\underline{\gamma }_n^j-\underline{g}_k^j|^q + (w_{k,2}^j)^t \, |\breve{\gamma }_n^j-\breve{g}_k^j|^q \right\} \nonumber \\&- \sum _{k=1}^K \sum _{j=1}^p \left\{ \varLambda _k^j \, (w_{k,1}^j + w_{k,2}^j - 1) \right\} . \end{aligned}$$
(29)

Weights can be found when the partial derivatives of \(J_{d_{WHLq}}\) are equal to 0. Fixing cluster k and dimension j and differentiating \(J_{d_{WHLq}}\) with respect to the first weight component (\(w_{k,1}^j\)), we get

$$\begin{aligned}&\frac{\partial J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)}{\partial w_{k,1}^j} = \sum _{\gamma _n \in C_k} \left\{ t \, (w_{k,1}^j)^{t-1} \, |\underline{\gamma }_n^j -\underline{g}_k^j|^q \right\} - \varLambda _k^j = 0 \end{aligned}$$
(30)
$$\begin{aligned}&t \, (w_{k,1}^j)^{t-1} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j-\underline{g}_k^j|^q - \varLambda _k^j = 0. \end{aligned}$$
(31)

Defining

$$\begin{aligned} \xi _{k,1}^j = \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j-\underline{g}_k^j|^q, \end{aligned}$$
(32)

and isolating the \(w_{k,1}^j\) term, we get

$$\begin{aligned} t \, (w_{k,1}^j)^{t-1} \xi _{k,1}^j - \varLambda _k^j = 0 \Longrightarrow w_{k,1}^j = \left( \frac{\varLambda _k^j}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(33)

Now, differentiating \(J_{d_{WHLq}}\) with respect to the second weight component (\(w_{k,2}^j\)), we obtain

$$\begin{aligned}&\displaystyle \frac{\partial J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)}{\partial w_{k,2}^j} = \sum _{\gamma _n \in C_k} \left\{ t \, (w_{k,2}^j)^{t-1} \, |\breve{\gamma }_n^j -\breve{g}_k^j|^q \right\} - \varLambda _k^j = 0 \end{aligned}$$
(34)
$$\begin{aligned}&\displaystyle t \, (w_{k,2}^j)^{t-1} \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j-\breve{g}_k^j|^q - \varLambda _k^j = 0. \end{aligned}$$
(35)

Defining

$$\begin{aligned} \xi _{k,2}^j = \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j-\breve{g}_k^j|^q, \end{aligned}$$
(36)

we get \(w_{k,2}^j\), as follows:

$$\begin{aligned} t \, (w_{k,2}^j)^{t-1} \xi _{k,2}^j - \varLambda _k^j = 0 \Longrightarrow w_{k,2}^j = \left( \frac{\varLambda _k^j}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(37)

Using the weight expressions above, we compute the Lagrange multiplier (\(\varLambda _k^j\)) from the restriction \(w_{k,1}^j + w_{k,2}^j =1\). Then,

$$\begin{aligned}&\displaystyle w_{k,1}^j + w_{k,2}^j = 1 \end{aligned}$$
(38)
$$\begin{aligned}&\displaystyle \left( \frac{\varLambda _k^j}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{\varLambda _k^j}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}} = 1 \end{aligned}$$
(39)
$$\begin{aligned}&\displaystyle \varLambda _k^j = \left\{ \left( \frac{1}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} \end{aligned}$$
(40)

Substituting Eq. (40) into Eq. (33), we get

$$\begin{aligned} w_{k,1}^j = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,1}^j) }\right) ^{\frac{1}{t-1}} = \left\{ 1+ \left( \frac{\xi _{k,1}^j}{\xi _{k,2}^j} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(41)

Now, substituting Eq. (40) into Eq. (37), we get

$$\begin{aligned} w_{k,2}^j = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,2}^j) }\right) ^{\frac{1}{t-1}} = \left\{ 1+ \left( \frac{\xi _{k,2}^j}{\xi _{k,1}^j} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(42)

So, the hybrid weights can be computed using Eqs. (41) and (42). \(\square \)
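For concreteness, the closed form in Eqs. (41) and (42) is simple to implement. The following is a minimal Python sketch (not the authors' code) for one cluster k and dimension j; the argument names and the small guard against zero residual sums are assumptions.

```python
import numpy as np

def hybrid_weights(lower_res, range_res, q=2.0, t=2.0, eps=1e-12):
    """Hybrid weights of Eqs. (41)-(42) for a fixed cluster k and dimension j.

    lower_res : |instance lower bound - prototype lower bound| for every
                instance assigned to the cluster (1-d array).
    range_res : the analogous deviations for the range component.
    """
    xi1 = np.sum(np.abs(lower_res) ** q)      # xi_{k,1}^j
    xi2 = np.sum(np.abs(range_res) ** q)      # xi_{k,2}^j
    xi1, xi2 = max(xi1, eps), max(xi2, eps)   # guard against zero sums (assumption)
    w1 = 1.0 / (1.0 + (xi1 / xi2) ** (1.0 / (t - 1.0)))
    w2 = 1.0 / (1.0 + (xi2 / xi1) ** (1.0 / (t - 1.0)))
    return w1, w2                             # w1 + w2 = 1 by construction

# toy check: the component with the smaller residual sum receives the larger weight
print(hybrid_weights(np.array([0.1, 0.2]), np.array([1.0, 1.5])))
```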

B Proof of Proposition 2

Fixing cluster k, the hybrid weights of the \(WHL_{\infty }\) distance are obtained using Lagrange multipliers under the following restrictions: \(w_{k,1} + w_{k,2} = 1\); \(w_{k,1} \ge 0\); \(w_{k,2} \ge 0\); and \( t > 1\). Let

$$\begin{aligned} \xi _{k,1} = \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ | \underline{\gamma }_n^j - \underline{g}_k^j |\right\} \, \text{ and } \, \xi _{k,2} = \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ | \breve{\gamma }_n^j - \breve{g}_k^j |\right\} . \end{aligned}$$

Then,

$$\begin{aligned} w_{k,1} = \left\{ 1 + \left( \frac{\xi _{k,1}}{\xi _{k,2}} \right) ^{\frac{1}{t-1}} \right\} ^{-1} \, \text{ and } \, \, w_{k,2} = \left\{ 1 + \left( \frac{\xi _{k,2}}{\xi _{k,1}} \right) ^{\frac{1}{t-1}} \right\} ^{-1}. \end{aligned}$$

Proof

The partitional dynamic clustering criterion for the \(WHL_{\infty }\) distance is given by

$$\begin{aligned} J_{d_{WHL_{\infty }}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left\{ (w_{k,1})^t \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j|\right\} +(w_{k,2})^t\max _{j=1}^p \left\{ |\breve{\gamma }_n^j -\breve{g}_k^j| \right\} \right\} \end{aligned}$$
(43)

under the restrictions: \(w_{k,1} + w_{k,2} = 1\); \(w_{k,1} \ge 0\); \(w_{k,2} \ge 0\); and \(t > 1\). The solution can be found using Lagrange multipliers. Eq. (43) is rewritten to incorporate the Lagrange multipliers (\(\varLambda _k\)) and the associated restrictions. It then becomes

$$\begin{aligned} J_{d_{WHL_{\infty }}}(\varLambda _1, \ldots , \varLambda _K)= & {} \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left( (w_{k,1})^t \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} \right. \nonumber \\&+\left. (w_{k,2})^t \max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} \right) -\sum _{k=1}^K \varLambda _k (w_{k,1} + w_{k,2} - 1)\qquad \end{aligned}$$
(44)

Weights can be found when the partial derivatives of \(J_{d_{WHL_{\infty }}}\), with respect to the weights, are equal to 0. For a fixed cluster k, differentiating \(J_{d_{WHL_{\infty }}}\) with respect to the first weight component (\(w_{k,1}\)), we get

$$\begin{aligned}&\displaystyle \frac{\partial J_{d_{WHL_{\infty }}}(\varLambda _1, \ldots , \varLambda _K)}{\partial w_{k,1}} = \sum _{\gamma _n \in C_k} t \, (w_{k,1})^{t-1} \, \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} - \varLambda _k = 0 \end{aligned}$$
(45)
$$\begin{aligned}&\displaystyle t \, (w_{k,1})^{t-1} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} - \varLambda _k = 0. \end{aligned}$$
(46)

Defining

$$\begin{aligned} \xi _{k,1} = \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} , \end{aligned}$$
(47)

we get

$$\begin{aligned} t \, (w_{k,1})^{t-1} \xi _{k,1} - \varLambda _k = 0 \Longrightarrow w_{k,1} = \left( \frac{\varLambda _k}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(48)

Now, differentiating \(J_{d_{WHL_{\infty }}}\) with respect to the second weight component (\(w_{k,2}\)), we get

$$\begin{aligned}&\displaystyle \frac{\partial J_{d_{WHL_{\infty }}}(\varLambda _1, \ldots , \varLambda _K)}{\partial w_{k,2}} = \sum _{\gamma _n \in C_k} t \, (w_{k,2})^{t-1} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} - \varLambda _k = 0 \end{aligned}$$
(49)
$$\begin{aligned}&\displaystyle t \, (w_{k,2})^{t-1} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} - \varLambda _k = 0. \end{aligned}$$
(50)

Defining

$$\begin{aligned} \xi _{k,2} = \sum _{\gamma _n \in C_k}\max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} , \end{aligned}$$
(51)

it becomes

$$\begin{aligned} t \, (w_{k,2})^{t-1} \xi _{k,2} - \varLambda _k = 0 \Longrightarrow w_{k,2} = \left( \frac{\varLambda _k}{t \, \xi _{k,2}} \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(52)

The Lagrange multiplier (\(\varLambda _k\)) is computed based on restriction \(w_{k,1}+ w_{k,2} =1\). Then,

$$\begin{aligned}&\displaystyle w_{k,1} + w_{k,2} = 1 \end{aligned}$$
(53)
$$\begin{aligned}&\displaystyle \left( \frac{\varLambda _k}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{\varLambda _k}{t \, \xi _{k,2} } \right) ^{\frac{1}{t-1}} = 1 \end{aligned}$$
(54)
$$\begin{aligned}&\displaystyle \varLambda _k = \left\{ \left( \frac{1}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}} \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} \end{aligned}$$
(55)

Substituting Eq. (55) into Eq. (48), we get

$$\begin{aligned} w_{k,1} = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}} \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,1}) }\right) ^{\frac{1}{t-1}} = \left\{ 1+\left( \frac{\xi _{k,1}}{\xi _{k,2}} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(56)

Now, substituting Eq. (55) into Eq. (52), we get

$$\begin{aligned} w_{k,2} = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2} } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,2}) } \right) ^{\frac{1}{t-1}} =\left\{ 1+ \left( \frac{\xi _{k,2}}{\xi _{k,1}} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(57)

So, the hybrid weights are computed by Eqs. (56) and (57). \(\square \)
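The \(L_{\infty }\) case differs only in how the residual sums are formed: \(\xi _{k,1}\) and \(\xi _{k,2}\) aggregate, per instance, the maximum absolute deviation over the p dimensions, and a single pair of weights is kept per cluster. A minimal sketch, under the same assumptions as the previous one:

```python
import numpy as np

def hybrid_weights_linf(lower_dev, range_dev, t=2.0, eps=1e-12):
    """Hybrid weights of Eqs. (56)-(57) for a fixed cluster k.

    lower_dev, range_dev : arrays of shape (n_instances_in_cluster, p) with the
    absolute deviations of the lower bounds and of the ranges, respectively.
    """
    xi1 = max(np.sum(np.max(lower_dev, axis=1)), eps)   # xi_{k,1}
    xi2 = max(np.sum(np.max(range_dev, axis=1)), eps)   # xi_{k,2}
    w1 = 1.0 / (1.0 + (xi1 / xi2) ** (1.0 / (t - 1.0)))
    return w1, 1.0 - w1                                 # the restriction gives w2 = 1 - w1
```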

C Proof of Proposition 3

Fixing cluster k and dimension j, the prototypes for the \(HL_1\) and \(HL_{\infty }\) distances have an analytic solution, given by Eq. (58),

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_{n}^j \right\} \, \text{ and } \, b_{g_k}^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_{n}^j \right\} . \end{aligned}$$
(58)

Proof

The criterion to be minimized for the \(HL_1\) distance is

$$\begin{aligned} J_{d_{HL_1}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | + |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} . \end{aligned}$$
(59)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j | + \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j -\breve{g}_k^j | . \end{aligned}$$
(60)

The problem reduces to optimizing the two sums

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j | \, \text{ and } \, \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j -\breve{g}_k^j | . \end{aligned}$$
(61)

Each sum is minimized by the median of the respective set [27]. Then,

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \text{ and } \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(62)

The criterion to be minimized for the \(HL_{\infty }\) distance is

$$\begin{aligned} J_{d_{HL_{\infty }}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left( \max _{j=1}^p \{ |\underline{\gamma }_n^j - \underline{g}_k^j | \} +\max _{j=1}^p \{ |\breve{\gamma }_n^j - \breve{g}_k^j | \} \right) . \end{aligned}$$
(63)

Fixing the cluster k, the optimization problem can be reduced to

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j |\right\} + \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} . \end{aligned}$$
(64)

The problem is reduced to optimizing the two sums independently,

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} \, \text{ and } \, \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j|\right\} . \end{aligned}$$
(65)

The \(\max \) function can be rewritten as the limit of an \(L_q\) sum when \(q \rightarrow \infty \), so

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} = \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ \sum _{j=1}^p |\underline{\gamma }_n^j -\underline{g}_k^j |^q \right\} ^{\frac{1}{q}} \end{aligned}$$
(66)

and

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j -\breve{g}_k^j | \right\} = \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ \sum _{j=1}^p |\breve{\gamma }_n^j -\breve{g}_k^j |^q \right\} ^{\frac{1}{q}}. \end{aligned}$$
(67)

As the terms of the sums are nonnegative, minimizing each term minimizes the whole sums. Fixing dimension j, the problem is reduced to

$$\begin{aligned} \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j |^q\right\} ^{\frac{1}{q}} = \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j | \end{aligned}$$
(68)

and

$$\begin{aligned} \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j |^q\right\} ^{\frac{1}{q}} =\sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |. \end{aligned}$$
(69)

This matches the \(HL_1\) optimization problem, whose solution is given by the medians of the lower bounds and ranges. Then,

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \text{ and } \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(70)

Using the inverse mapping [see Eq. (15)], we compute the upper bounds as

$$\begin{aligned} b_{g_k}^j = \underline{g}_k^j + \breve{g}_k^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(71)

\(\square \)
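A minimal Python sketch of the prototype update in Proposition 3 (an illustration, not the paper's implementation), assuming that \(\breve{\gamma }\) denotes the interval range, so that the upper bound is recovered as lower bound plus range, as in Eq. (71):

```python
import numpy as np

def hl1_prototype(lower, upper):
    """Median-based prototype of Eq. (58) for one cluster.

    lower, upper : arrays of shape (n_instances_in_cluster, p) with the interval
    bounds of the instances assigned to the cluster.
    """
    ranges = upper - lower                      # the range component of the mapping
    g_lower = np.median(lower, axis=0)          # Me{ lower bounds }, per dimension
    g_range = np.median(ranges, axis=0)         # Me{ ranges }, per dimension
    return g_lower, g_lower + g_range           # inverse mapping: upper = lower + range

lo = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])
up = np.array([[2.0, 4.0], [5.0, 2.0], [4.0, 6.0]])
print(hl1_prototype(lo, up))                    # lower bounds (2.0, 1.0), upper bounds (3.0, 5.0)
```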

D Proof of Proposition 4

Fixing cluster k and dimension j, the prototype for the \(HL_2\) distance has an analytic solution, given by the means of the lower bounds and ranges. It is computed by Eq. (72),

$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j\, \text{ and } \, b_{g_k}^j = \underline{g}_k^j + \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_{n}^j, \end{aligned}$$
(72)

where \(|C_k|\) is the number of instances allocated in the cluster \(C_k\).

Proof

The criterion to be minimized for the \(HL_2\) distance is

$$\begin{aligned} J_{d_{HL_2}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (\underline{\gamma }_n^j - \underline{g}_k^j)^2 + (\breve{\gamma }_n^j - \breve{g}_k^j)^2 \right\} . \end{aligned}$$
(73)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} J_k^j=\sum _{\gamma _n \in C_k} (\underline{\gamma }_n^j -\underline{g}_k^j)^2+ \sum _{\gamma _n \in C_k} (\breve{\gamma }_n^j -\breve{g}_k^j)^2 . \end{aligned}$$
(74)

The solution is found by least squares. The partial derivatives of \(J_k^j\) with respect to \(\underline{g}_k^j\) and \(\breve{g}_k^j\) must be null. So,

$$\begin{aligned} \frac{\partial J_k^j}{\partial \underline{g}_k^j} = -\sum _{\gamma _n \in C_k} 2 \cdot (\underline{\gamma }_n^j - \underline{g}_k^j) = 0 \Longrightarrow \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \end{aligned}$$
(75)

and

$$\begin{aligned} \frac{\partial J_k^j}{\partial \breve{g}_k^j} = -\sum _{\gamma _n \in C_k} 2 \cdot (\breve{\gamma }_n^j - \breve{g}_k^j) = 0 \Longrightarrow \breve{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_n^j \end{aligned}$$
(76)

where \(|C_k|\) is the number of instances allocated in cluster k. The \(HL_2\) prototypes are computed by:

$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _ {\gamma _n \in C_k } \underline{\gamma }_n^j \, \text{ and } \, \breve{g}_k^j = \frac{1}{|C_k|} \sum _ {\gamma _n \in C_k } \breve{\gamma }_n^j. \end{aligned}$$
(77)

Using the inverse mapping (see Eq. (15)), we compute the upper bounds as

$$\begin{aligned} b_{g_k}^j = \underline{g}_k^j + \breve{g}_k^j = \underline{g}_k^j + \frac{1}{|C_k|} \sum _ {\gamma _n \in C_k } \breve{\gamma }_{n}^j. \end{aligned}$$
(78)

\(\square \)
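The \(HL_2\) prototype only replaces the medians of Proposition 3 with means. A minimal sketch under the same representation assumptions:

```python
import numpy as np

def hl2_prototype(lower, upper):
    """Mean-based prototype of Eq. (72) for one cluster."""
    ranges = upper - lower
    g_lower = lower.mean(axis=0)                # (1/|C_k|) * sum of lower bounds
    g_range = ranges.mean(axis=0)               # (1/|C_k|) * sum of ranges
    return g_lower, g_lower + g_range           # inverse mapping for the upper bound
```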

E Proof of Proposition 5

Fixing cluster k and dimension j, the prototype for the \(HL_q\) distance (when \(q > 1\)) can be found using the Newton–Raphson numerical method. Let \(L_k^j =\left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \). Algorithm 2 shows how to compute the prototype components \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively.

Proof

Let \(X = \left\{ x_1, x_2, \ldots , x_N\right\} \) be a set sorted in ascending order, i.e., \(x_i \le x_{i+1}\), and let \(f:\mathfrak {R}\rightarrow \mathfrak {R}\) be the function

$$\begin{aligned} f(v) = \sum _{i=1}^N |x_i- v|^q \end{aligned}$$
(79)

with \(q > 1\). We are interested in the value of v that minimizes f(v). This function can be rewritten as follows:

$$\begin{aligned} f(v) = \sum _{i=1}^N (x_i- v)^q \cdot \left\{ sgn(x_i-v) \right\} ^q, \end{aligned}$$
(80)

where \(sgn(\cdot )\) is the sign function, defined as

$$\begin{aligned} sgn(x)= {\left\{ \begin{array}{ll} \,\,\, 1 &{} \text{ if } x \ge 0 \\ -1 &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
(81)

The first derivative of \(f(\cdot )\) is given by:

$$\begin{aligned} f'(v)&= -q \sum _{i=1}^N (x_i- v)^{q-1} \cdot \left\{ sgn(x_i-v) \right\} ^{q} \end{aligned}$$
(82)
$$\begin{aligned} f'(v)&= -q \sum _{i=1}^N |x_i- v|^{q-1} \cdot sgn(x_i-v), \end{aligned}$$
(83)

and the second derivative of \(f(\cdot )\) is given by:

$$\begin{aligned} f''(v)&= q (q-1) \sum _{i=1}^N (x_i- v)^{q-2} \cdot \left\{ sgn(x_i-v) \right\} ^{q} \end{aligned}$$
(84)
$$\begin{aligned} f''(v)&= q (q-1) \sum _{i=1}^N |x_i- v|^{q-2}. \end{aligned}$$
(85)

When \(q > 1\) the second derivative is always positive. We conclude that the first derivative is monotonically increasing for any v.

The value \(v_*\) which minimizes f(v) must satisfy \(f'(v_*)=0\). Suppose a value \(v_{-}\) with \(v_{-} < x_1\). Then, \(v_{-} < x_i, \forall x_i\). So, \(x_i -v_{-} > 0\), which implies that \(sgn(x_i-v_{-})=1, \forall x_i\). Then, the first derivative becomes

$$\begin{aligned} f'(v_{-}) = -q \sum _{i=1}^N |x_i- v_{-}|^{q-1} \end{aligned}$$
(86)

which always assumes a negative value. So, \(f'(v_{-}) < 0\).

Now, suppose that \(v_+ > x_N\). Then, \(v_+ > x_i, \forall x_i\). So, \(x_i-v_+<0\), implying \(sgn(x_i-v_+)=-1\). Then, the first derivative becomes

$$\begin{aligned} f'(v_+) = q \sum _{i=1}^N |x_i- v_+|^{q-1} \end{aligned}$$
(87)

which assumes positive values. So, \(f'(v_+) > 0\).

When \(v < x_1\), \(f'(v) < 0\), and when \(v > x_N\), \(f'(v) > 0\); hence \(f'(v)\) changes sign on the interval \([x_1, x_N]\), so \(\exists \, v_* \in [x_1, x_N]\) such that \(f'(v_*)=0\). As \(f'(v)\) is monotonically increasing, this solution is unique. Unfortunately, the expression of \(f'(v)\) is too complex for a general analytic solution to be obtained. We propose the use of the Newton–Raphson numerical method to find \(v_*\). In this case, an initial value \(v_0\) is chosen randomly in the interval \([x_1,x_N]\). Iterative values \(\left\{ v_i \right\} \) are computed as follows:

$$\begin{aligned} v_i = v_{i-1} - \frac{f'(v_{i-1})}{f''(v_{i-1})}. \end{aligned}$$
(88)

Convergence occurs when \(|v_i - v_{i-1}| < \epsilon \), with \(\epsilon > 0\).

The criterion to be minimized for the \(HL_q\) distance is given by

$$\begin{aligned} J_{d_{HL_q}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j|^q +|\breve{\gamma }_n^j - \breve{g}_k^j|^q \right\} . \end{aligned}$$
(89)

Fixing the kth cluster and jth dimension results in

$$\begin{aligned} \sum _{\gamma _n \in C_k} \left\{ |\underline{\gamma }_n^j -\underline{g}_k^j|^q + |\breve{\gamma }_n^j - \breve{g}_k^j|^q\right\} , \end{aligned}$$
(90)

and the two sums must be minimized:

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j|^q \quad \text{ and } \quad \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j|^q. \end{aligned}$$
(91)

The steps described above (for the function f) can be applied to each sum independently. The sets \(L_k^j =\left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) play the role of the set X, and the components \(\underline{g}_k^j\) and \(\breve{g}_k^j\) are determined. Algorithm 2 shows the steps to compute them using the Newton–Raphson numerical method. \(\square \)
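Since Algorithm 2 is not reproduced in this preview, the following is only an illustrative Python version of the Newton–Raphson iteration described above, applied to one coordinate set (either \(L_k^j\) or \(R_k^j\)); the deterministic starting point, the numerical floor and the iteration cap are assumptions, not part of the paper.

```python
import numpy as np

def lq_center(x, q, tol=1e-8, max_iter=100):
    """Minimize f(v) = sum_i |x_i - v|**q (q > 1) by Newton-Raphson, Eq. (88)."""
    x = np.asarray(x, dtype=float)
    v = 0.5 * (x.min() + x.max())                 # start inside [x_1, x_N] (the text draws v_0 at random)
    for _ in range(max_iter):
        diff = x - v
        ad = np.maximum(np.abs(diff), 1e-12)      # floor avoids 0**(q-2) when q < 2 (numerical safeguard)
        f1 = -q * np.sum(ad ** (q - 1) * np.sign(diff))    # f'(v), Eq. (83)
        f2 = q * (q - 1) * np.sum(ad ** (q - 2))           # f''(v), Eq. (85)
        v_new = v - f1 / f2                       # Newton step, Eq. (88)
        if abs(v_new - v) < tol:                  # convergence test |v_i - v_{i-1}| < epsilon
            return v_new
        v = v_new
    return v

# prototype components for a fixed cluster k and dimension j (hypothetical arrays L_kj, R_kj):
# g_lower = lq_center(L_kj, q=3.0)   # lower-bound component
# g_range = lq_center(R_kj, q=3.0)   # range component
```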

F Proof of Proposition 6

Fixing cluster k, dimension j and parameter q (\(q \ge 1\)), the prototypes for \(WHL_q\) and \(AHL_q\) distances are computed according to one of three cases:

  1.

    If \((q=1\) or \(q=\infty )\), prototypes have an analytic solution, given by:

    $$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \quad \text{ and } \quad \, b_{g_k}^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j \right\} . \end{aligned}$$
  2.

    If \((q=2)\), prototypes have an analytic solution, given by:

    $$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \quad \text{ and } \quad \,\, b_{g_k}^j =\underline{g}_k^j + \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_n^j, \end{aligned}$$

    where \(|C_k|\) is the number of instances in cluster k.

  3.

    If (\(q\ne 1\), \(q \ne 2\) and \(q \ne \infty \)), the Newton–Raphson numerical method is used, as described by Algorithm 2. The sets \(L_k^j = \left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) are given to it as input, yielding the values of \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively. Prototype upper bounds are found by \(b_{g_k}^j=\underline{g}_k^j +\breve{g}_k^j\), for \(j=1,\ldots ,p\).

Proof

The optimization criterion for the \(AHL_q\) distance is given by:

$$\begin{aligned} J_{d_{AHLq}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \lambda _k^j \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j |^q +|\breve{\gamma }_n^j - \breve{g}_k^j |^q \right\} . \end{aligned}$$
(92)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} \lambda _k^j \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j |^q + \lambda _k^j \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |^q . \end{aligned}$$
(93)

The optimization criterion for the \(WHL_q\) distance is given by:

$$\begin{aligned} J_{d_{WHLq}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (w_{k,1}^j)^t \, |\underline{\gamma }_n^j - \underline{g}_k^j |^q + (w_{k,2}^j)^t \, |\breve{\gamma }_n^j - \breve{g}_k^j |^q \right\} . \end{aligned}$$
(94)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} (w_{k,1}^j)^t \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j |^q + (w_{k,2}^j)^t \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |^q . \end{aligned}$$
(95)

Adaptive and Hybrid weights become constants when the cluster and the dimension are fixed. The problem is reduced to optimizing the following two sums:

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j |^q \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |^q . \end{aligned}$$
(96)

Solutions are proposed according to the value of the q parameter. If \(q=1\), the optimization becomes

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j | \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j -\breve{g}_k^j |, \end{aligned}$$
(97)

whose solution, according to Proposition 3, is given by:

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \quad \text{ and } \quad \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(98)

If \(q=2\), the optimization becomes

$$\begin{aligned} \sum _{\gamma _n \in C_k} (\underline{\gamma }_n^j -\underline{g}_k^j)^2 \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} (\breve{\gamma }_n^j - \breve{g}_k^j)^2. \end{aligned}$$
(99)

The solution, according to Proposition 4, is given by:

$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \quad \text{ and } \quad \,\, \breve{g}_k^j =\frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_{n}^j, \end{aligned}$$
(100)

where \(|C_k|\) is the number of instances in cluster k.

The optimization criterion for the \(WHL_{\infty }\) distance is given by:

$$\begin{aligned} J_{d_{WHL_{\infty }}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left( (w_{k,1} )^t \max _{j=1}^p \left\{ |\underline{\gamma }_n^j -\underline{g}_k^j | \right\} + (w_{k,2})^t \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} \right) . \end{aligned}$$
(101)

Fixing cluster k, the optimization problem can be reduced to

$$\begin{aligned} (w_{k,1} )^t \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} + (w_{k,2} )^t \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j- \breve{g}_k^j | \right\} . \end{aligned}$$
(102)

When the cluster is fixed, the hybrid weights become constants; then, the problem is reduced to optimizing the two sums independently:

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} . \end{aligned}$$
(103)

The solutions are the medians of the lower bounds and ranges (as shown in Proposition 3). Then,

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \quad \text{ and } \quad \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j \right\} . \end{aligned}$$
(104)

If \(q>1\) with \(q \ne 2\) and \(q\ne \infty \), it is not possible to express an analytic solution for the sums in Eq. (96). So the solution is computed as discussed in Proposition 5. The Newton–Raphson numerical method is used, as described by Algorithm 2. The sets \(L_k^j = \left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) are parameters for this algorithm, yielding the values of \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively.

Using the inverse mapping (see Eq. (15)), we compute the upper bounds as

$$\begin{aligned} b_{g_k}^j&= \underline{g}_k^j + \breve{g}_k^j. \end{aligned}$$
(105)

\(\square \)
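Putting Propositions 3–6 together, the prototype update of a cluster depends only on q. The sketch below is an illustration of that case analysis (it is not Algorithm 2): medians for \(q=1\) or \(q=\infty \), means for \(q=2\), and a Newton–Raphson search otherwise, with the upper bounds restored through the inverse mapping. Array names and numerical safeguards are assumptions.

```python
import numpy as np

def _lq_center(x, q, tol=1e-8, max_iter=100):
    """Newton-Raphson minimizer of sum_i |x_i - v|**q for q > 1 (Proposition 5)."""
    x = np.asarray(x, dtype=float)
    v = 0.5 * (x.min() + x.max())
    for _ in range(max_iter):
        d = x - v
        ad = np.maximum(np.abs(d), 1e-12)                       # numerical floor (assumption)
        step = (-q * np.sum(ad ** (q - 1) * np.sign(d))) / (q * (q - 1) * np.sum(ad ** (q - 2)))
        v, prev = v - step, v
        if abs(v - prev) < tol:
            break
    return v

def prototype(lower, upper, q):
    """Cluster prototype for the (W/A)HL_q criteria (Propositions 3, 4 and 6).

    lower, upper : arrays of shape (n_instances_in_cluster, p) with interval bounds.
    q            : 1, 2, numpy.inf, or any other q > 1.
    Returns (proto_lower, proto_upper).
    """
    ranges = upper - lower
    if q == 1 or np.isinf(q):                                   # case 1: medians
        g_lo, g_rg = np.median(lower, axis=0), np.median(ranges, axis=0)
    elif q == 2:                                                # case 2: means
        g_lo, g_rg = lower.mean(axis=0), ranges.mean(axis=0)
    else:                                                       # case 3: Newton-Raphson per dimension
        g_lo = np.array([_lq_center(lower[:, j], q) for j in range(lower.shape[1])])
        g_rg = np.array([_lq_center(ranges[:, j], q) for j in range(ranges.shape[1])])
    return g_lo, g_lo + g_rg                                    # inverse mapping: upper = lower + range
```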


Cite this article

de Souza, L.C., de Souza, R.M.C.R. & do Amaral, G.J.A. Dynamic clustering of interval data based on hybrid \(L_q\) distance. Knowl Inf Syst 62, 687–718 (2020). https://doi.org/10.1007/s10115-019-01367-w
