
Dynamic clustering of interval data based on hybrid \(L_q\) distance

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Dynamic clustering partitions the data and assigns a prototype to each partition. Distance metrics measure the closeness between instances and prototypes. In the literature on interval data, distances depend only on the interval bounds, and the information inside the intervals is ignored. This paper proposes new distances that exploit the information inside the intervals. It also presents a mapping of intervals to points that preserves their spatial location and internal variation. We formulate a new hybrid distance for interval data based on the well-known \(L_q\) distance for point data. This new distance allows a weighted formulation of the hybridism. Hence, we propose a Hybrid \(L_q\) distance, a Weighted Hybrid \(L_q\) distance, and an adaptive version of the Hybrid \(L_q\) distance for interval data. Experiments with synthetic and real interval data sets illustrate the usefulness of the hybrid approach for improving dynamic clustering of interval data.



References

  1. Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. Wiley, Chichester

  2. Billard L, Le-Rademacher J (2012) Principal component analysis for interval data. Wiley Interdiscip Rev Comput Stat 4(6):535–540

  3. Burden RL, Faires JD (2011) Numerical analysis. Cengage Learning, Brooks/Cole

  4. Chavent M, Lechevallier Y (2002) Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance. In: Classification, clustering, and data analysis, pp 53–60

  5. Chavent M (2004) An Hausdorff distance between hyper-rectangles for clustering interval data. In: Banks D et al (eds) Classification, clustering and data mining applications, proceedings of the IFCS04. Springer, Berlin, pp 333–340

  6. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, Cambridge

  7. De Carvalho FAT, Brito P, Bock H-H (2006b) Dynamic clustering for interval data based on L2 distance. Comput Stat 21:231–250

  8. De Carvalho FAT, Souza RMCR, Chavent M, Lechevallier Y (2006a) Adaptive Hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recognit Lett 27:167–179

  9. De Carvalho FAT, Lechevallier Y (2009a) Dynamic clustering of interval-valued data based on adaptive quadratic distances. Trans Syst Man Cyber Part A 39:1295–1306

  10. De Carvalho FAT, Lechevallier Y (2009b) Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognit 42:1223–1236

  11. De Carvalho FAT, Souza RMCR (2010) Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognit Lett 31(5):430–443

  12. Diday E, Simon JC (1976) Clustering analysis. In: Fu KS (ed) Digital pattern recognition. Springer, Berlin, pp 47–94

  13. Diday E, Noirhomme-Fraiture M (2008) Symbolic data analysis and the SODAS software. Wiley, Chichester

  14. Diday E (2016) Thinking by classes in data science: the symbolic data analysis paradigm. Wiley Interdiscip Rev Comput Stat 8(5):172–205

  15. Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4(2):229–246

  16. Fränti P, Kivijärvi J (2000) Randomised local search algorithm for the clustering problem. Pattern Anal Appl 3:358–369

  17. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  18. Lichman M (2013) UCI machine learning repository

  19. Lima Neto EA, De Carvalho FAT (2010) Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 54:333–347

  20. Lima Neto EA, De Carvalho FAT (2008) Centre and range method for fitting a linear regression model to symbolic interval data. Comput Stat Data Anal 52:1500–1515

  21. Martinez WL, Martinez AR (2007) Computational statistics handbook with MATLAB. Chapman & Hall/CRC, New York

  22. Rencher AC, Christensen WF (2012) Methods of multivariate analysis, 3rd edn. Wiley, New York

  23. Silva Filho TM, Souza RMCR (2015) A swarm-trained k-nearest prototypes adaptive classifier with automatic feature selection for interval data. Neural Netw 80:19–33

  24. Silva APD, Brito P (2006) Linear discriminant analysis for interval data. Comput Stat 21(2):289–308

  25. Silva APD, Brito P (2015) Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. J Classif 32(3):516–541

  26. Souza LC (2016) Agrupamento e regressão linear de dados simbólicos intervalares baseados em novas representações. PhD Thesis, Universidade Federal de Pernambuco, PE, Brazil, https://repositorio.ufpe.br/handle/123456789/17640

  27. Souza RMCR, De Carvalho FAT (2004) Clustering of interval data based on city–block distances. Pattern Recognit Lett 25:353–365


Acknowledgements

The authors would like to thank CNPq and CAPES (Brazilian Agencies) for their financial support.

Author information

Corresponding author

Correspondence to Renata Maria Cardoso Rodrigues de Souza.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Proposition 1

Fixing cluster k and dimension j, the hybrid weights of the \(WHL_q\) distance are obtained using Lagrange multipliers under the restrictions: \(w_{k,1}^j + w_{k,2}^j = 1\); \(w_{k,1}^j \ge 0 \); \(w_{k,2}^j \ge 0\); and \(t > 1\). Let

$$\begin{aligned} \xi _{k,1}^j = \sum _{\gamma _n \in C_k} | \underline{\gamma }_n^j -\underline{g}_k^j |^q \quad \text{ and } \quad \,\, \xi _{k,2}^j =\sum _{\gamma _n \in C_k} | \breve{\gamma }_n^j - \breve{g}_k^j |^q. \end{aligned}$$

The Hybrid weight values are computed by:

$$\begin{aligned} w_{k,1}^j = \left\{ 1 + \left( \frac{\xi _{k,1}^j}{\xi _{k,2}^j} \right) ^{\frac{1}{t-1}} \right\} ^{-1} \text{ and } \,\,\, w_{k,2}^j = \left\{ 1 + \left( \frac{\xi _{k,2}^j}{\xi _{k,1}^j} \right) ^{\frac{1}{t-1}} \right\} ^{-1}. \end{aligned}$$

Proof

The partitional dynamic clustering criterion for the \(WHL_q\) distance is given by

$$\begin{aligned} J_{d_{WHLq}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (w_{k,1}^j)^t \, |\underline{\gamma }_n^j -\underline{g}_k^j|^q + (w_{k,2}^j)^t \, |\breve{\gamma }_n^j-\breve{g}_k^j|^q \right\} , \end{aligned}$$
(28)

under the restrictions: \(w_{k,1}^j + w_{k,2}^j = 1\), \(w_{k,1}^j \ge 0\), \(w_{k,2}^j \ge 0\) and \(t > 1\). The solution can be found using Lagrange multipliers. Let \(J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)\) be the version of Eq. (28) with the Lagrange multipliers (\(\varLambda _k^j\)) and their associated restrictions incorporated. Thus, it becomes

$$\begin{aligned} J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)= & {} \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (w_{k,1}^j)^t \, |\underline{\gamma }_n^j-\underline{g}_k^j|^q + (w_{k,2}^j)^t \, |\breve{\gamma }_n^j-\breve{g}_k^j|^q \right\} \nonumber \\&- \sum _{k=1}^K \sum _{j=1}^p \left\{ \varLambda _k^j \, (w_{k,1}^j + w_{k,2}^j - 1) \right\} . \end{aligned}$$
(29)

Weights can be found when the partial derivatives of \(J_{d_{WHLq}}\) are equal to 0. Fixing cluster k and dimension j and differentiating \(J_{d_{WHLq}}\) with respect to the first weight component (\(w_{k,1}^j\)), we get

$$\begin{aligned}&\frac{\partial J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)}{\partial w_{k,1}^j} = \sum _{\gamma _n \in C_k} \left\{ t \, (w_{k,1}^j)^{t-1} \, |\underline{\gamma }_n^j -\underline{g}_k^j|^q \right\} - \varLambda _k^j = 0 \end{aligned}$$
(30)
$$\begin{aligned}&t \, (w_{k,1}^j)^{t-1} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j-\underline{g}_k^j|^q - \varLambda _k^j = 0. \end{aligned}$$
(31)

Defining

$$\begin{aligned} \xi _{k,1}^j = \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j-\underline{g}_k^j|^q, \end{aligned}$$
(32)

and isolating the \(w_{k,1}^j\) term, we get

$$\begin{aligned} t \, (w_{k,1}^j)^{t-1} \xi _{k,1}^j - \varLambda _k^j = 0 \Longrightarrow w_{k,1}^j = \left( \frac{\varLambda _k^j}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(33)

Now, differentiating \(J_{d_{WHLq}}\) with respect to the second weight component (\(w_{k,2}^j\)), we obtain

$$\begin{aligned}&\displaystyle \frac{\partial J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)}{\partial w_{k,2}^j} = \sum _{\gamma _n \in C_k} \left\{ t \, (w_{k,2}^j)^{t-1} \, |\breve{\gamma }_n^j -\breve{g}_k^j|^q \right\} - \varLambda _k^j = 0 \end{aligned}$$
(34)
$$\begin{aligned}&\displaystyle t \, (w_{k,2}^j)^{t-1} \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j-\breve{g}_k^j|^q - \varLambda _k^j = 0. \end{aligned}$$
(35)

Defining

$$\begin{aligned} \xi _{k,2}^j = \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j-\breve{g}_k^j|^q, \end{aligned}$$
(36)

we get \(w_{k,2}^j\), as follows:

$$\begin{aligned} t \, (w_{k,2}^j)^{t-1} \xi _{k,2}^j - \varLambda _k^j = 0 \Longrightarrow w_{k,2}^j = \left( \frac{\varLambda _k^j}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(37)

Using the weight expressions above, we compute the Lagrange multiplier (\(\varLambda _k^j\)) from the restriction \(w_{k,1}^j + w_{k,2}^j =1\). Then,

$$\begin{aligned}&\displaystyle w_{k,1}^j + w_{k,2}^j = 1 \end{aligned}$$
(38)
$$\begin{aligned}&\displaystyle \left( \frac{\varLambda _k^j}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{\varLambda _k^j}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}} = 1 \end{aligned}$$
(39)
$$\begin{aligned}&\displaystyle \varLambda _k^j = \left\{ \left( \frac{1}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} \end{aligned}$$
(40)

Substituting Eq. (40) into Eq. (33), we get

$$\begin{aligned} w_{k,1}^j = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,1}^j) }\right) ^{\frac{1}{t-1}} = \left\{ 1+ \left( \frac{\xi _{k,1}^j}{\xi _{k,2}^j} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(41)

Now, substituting Eq. (40) into Eq. (37), we get

$$\begin{aligned} w_{k,2}^j = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1}^j } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}^j } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,2}^j) }\right) ^{\frac{1}{t-1}} = \left\{ 1+ \left( \frac{\xi _{k,2}^j}{\xi _{k,1}^j} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(42)

So, the hybrid weights can be computed using Eqs. (41) and (42). \(\square \)
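For concreteness, the closed form in Eqs. (41) and (42) is simple to implement. The following is a minimal Python sketch (not the authors' code) for one cluster k and dimension j; the argument names and the small guard against zero residual sums are assumptions.

```python
import numpy as np

def hybrid_weights(lower_res, range_res, q=2.0, t=2.0, eps=1e-12):
    """Hybrid weights of Eqs. (41)-(42) for a fixed cluster k and dimension j.

    lower_res : |instance lower bound - prototype lower bound| for every
                instance assigned to the cluster (1-d array).
    range_res : the analogous deviations for the range component.
    """
    xi1 = np.sum(np.abs(lower_res) ** q)      # xi_{k,1}^j
    xi2 = np.sum(np.abs(range_res) ** q)      # xi_{k,2}^j
    xi1, xi2 = max(xi1, eps), max(xi2, eps)   # guard against zero sums (assumption)
    w1 = 1.0 / (1.0 + (xi1 / xi2) ** (1.0 / (t - 1.0)))
    w2 = 1.0 / (1.0 + (xi2 / xi1) ** (1.0 / (t - 1.0)))
    return w1, w2                             # w1 + w2 = 1 by construction

# toy check: the component with the smaller residual sum receives the larger weight
print(hybrid_weights(np.array([0.1, 0.2]), np.array([1.0, 1.5])))
```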

B Proof of Proposition 2

Fixing cluster k, the hybrid weights of the \(WHL_{\infty }\) distance are obtained using Lagrange multipliers under the following restrictions: \(w_{k,1} + w_{k,2} = 1\); \(w_{k,1} \ge 0\); \(w_{k,2} \ge 0\); and \( t > 1\). Let

$$\begin{aligned} \xi _{k,1} = \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ | \underline{\gamma }_n^j - \underline{g}_k^j |\right\} \, \text{ and } \, \xi _{k,2} = \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ | \breve{\gamma }_n^j - \breve{g}_k^j |\right\} . \end{aligned}$$

Then,

$$\begin{aligned} w_{k,1} = \left\{ 1 + \left( \frac{\xi _{k,1}}{\xi _{k,2}} \right) ^{\frac{1}{t-1}} \right\} ^{-1} \, \text{ and } \, \, w_{k,2} = \left\{ 1 + \left( \frac{\xi _{k,2}}{\xi _{k,1}} \right) ^{\frac{1}{t-1}} \right\} ^{-1}. \end{aligned}$$

Proof

The partitional dynamic clustering criterion for the \(WHL_{\infty }\) distance is given by

$$\begin{aligned} J_{d_{WHL_{\infty }}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left\{ (w_{k,1})^t \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j|\right\} +(w_{k,2})^t\max _{j=1}^p \left\{ |\breve{\gamma }_n^j -\breve{g}_k^j| \right\} \right\} \end{aligned}$$
(43)

under the restrictions: \(w_{k,1} + w_{k,2} = 1\); \(w_{k,1} \ge 0\); \(w_{k,2} \ge 0\); and \(t > 1\). The solution can be found using Lagrange multipliers. Eq. (43) is rewritten to incorporate the Lagrange multipliers (\(\varLambda _k\)) and the associated restrictions. It then becomes

$$\begin{aligned} J_{d_{WHL_{\infty }}}(\varLambda _1, \ldots , \varLambda _K)= & {} \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left( (w_{k,1})^t \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} \right. \nonumber \\&+\left. (w_{k,2})^t \max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} \right) -\sum _{k=1}^K \varLambda _k (w_{k,1} + w_{k,2} - 1)\qquad \end{aligned}$$
(44)

Weights can be found when the partial derivatives of \(J_{d_{WHL_{\infty }}}\), with respect to the weights, are equal to 0. For a fixed cluster k, differentiating \(J_{d_{WHL_{\infty }}}\) with respect to the first weight component (\(w_{k,1}\)), we get

$$\begin{aligned}&\displaystyle \frac{\partial J_{d_{WHL_{\infty }}}(\varLambda _1, \ldots , \varLambda _K)}{\partial w_{k,1}} = \sum _{\gamma _n \in C_k} t \, (w_{k,1})^{t-1} \, \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} - \varLambda _k = 0 \end{aligned}$$
(45)
$$\begin{aligned}&\displaystyle t \, (w_{k,1})^{t-1} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} - \varLambda _k = 0. \end{aligned}$$
(46)

Defining

$$\begin{aligned} \xi _{k,1} = \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j-\underline{g}_k^j| \right\} , \end{aligned}$$
(47)

we get

$$\begin{aligned} t \, (w_{k,1})^{t-1} \xi _{k,1} - \varLambda _k = 0 \Longrightarrow w_{k,1} = \left( \frac{\varLambda _k}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(48)

Now, differentiating \(J_{d_{WHL_{\infty }}}\) with respect to the second weight component (\(w_{k,2}\)), we get

$$\begin{aligned}&\displaystyle \frac{\partial J_{d_{WHL_{\infty }}}(\varLambda _1, \ldots , \varLambda _K)}{\partial w_{k,2}} = \sum _{\gamma _n \in C_k} t \, (w_{k,2})^{t-1} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} - \varLambda _k = 0 \end{aligned}$$
(49)
$$\begin{aligned}&\displaystyle t \, (w_{k,2})^{t-1} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} - \varLambda _k = 0. \end{aligned}$$
(50)

Defining

$$\begin{aligned} \xi _{k,2} = \sum _{\gamma _n \in C_k}\max _{j=1}^p \left\{ |\breve{\gamma }_n^j-\breve{g}_k^j| \right\} , \end{aligned}$$
(51)

it becomes

$$\begin{aligned} t \, (w_{k,2})^{t-1} \xi _{k,2} - \varLambda _k = 0 \Longrightarrow w_{k,2} = \left( \frac{\varLambda _k}{t \, \xi _{k,2}} \right) ^{\frac{1}{t-1}}. \end{aligned}$$
(52)

The Lagrange multiplier (\(\varLambda _k\)) is computed based on restriction \(w_{k,1}+ w_{k,2} =1\). Then,

$$\begin{aligned}&\displaystyle w_{k,1} + w_{k,2} = 1 \end{aligned}$$
(53)
$$\begin{aligned}&\displaystyle \left( \frac{\varLambda _k}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{\varLambda _k}{t \, \xi _{k,2} } \right) ^{\frac{1}{t-1}} = 1 \end{aligned}$$
(54)
$$\begin{aligned}&\displaystyle \varLambda _k = \left\{ \left( \frac{1}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}} \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} \end{aligned}$$
(55)

Substituting Eq. (55) into Eq. (48), we get

$$\begin{aligned} w_{k,1} = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2}} \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,1}) }\right) ^{\frac{1}{t-1}} = \left\{ 1+\left( \frac{\xi _{k,1}}{\xi _{k,2}} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(56)

Now, substituting Eq. (55) into Eq. (52), we get

$$\begin{aligned} w_{k,2} = \left( \frac{\left\{ \left( \frac{1}{t \, \xi _{k,1} } \right) ^{\frac{1}{t-1}} + \left( \frac{1}{t \, \xi _{k,2} } \right) ^{\frac{1}{t-1}}\right\} ^{-(t-1)} }{(t \, \xi _{k,2}) } \right) ^{\frac{1}{t-1}} =\left\{ 1+ \left( \frac{\xi _{k,2}}{\xi _{k,1}} \right) ^{\frac{1}{t-1}}\right\} ^{-1}. \end{aligned}$$
(57)

So, the hybrid weights are computed by Eqs. (56) and (57). \(\square \)
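The \(L_{\infty }\) case differs only in how the residual sums are formed: \(\xi _{k,1}\) and \(\xi _{k,2}\) aggregate, per instance, the maximum absolute deviation over the p dimensions, and a single pair of weights is kept per cluster. A minimal sketch, under the same assumptions as the previous one:

```python
import numpy as np

def hybrid_weights_linf(lower_dev, range_dev, t=2.0, eps=1e-12):
    """Hybrid weights of Eqs. (56)-(57) for a fixed cluster k.

    lower_dev, range_dev : arrays of shape (n_instances_in_cluster, p) with the
    absolute deviations of the lower bounds and of the ranges, respectively.
    """
    xi1 = max(np.sum(np.max(lower_dev, axis=1)), eps)   # xi_{k,1}
    xi2 = max(np.sum(np.max(range_dev, axis=1)), eps)   # xi_{k,2}
    w1 = 1.0 / (1.0 + (xi1 / xi2) ** (1.0 / (t - 1.0)))
    return w1, 1.0 - w1                                 # the restriction gives w2 = 1 - w1
```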

C Proof of Proposition 3

Fixing cluster k and dimension j, the prototypes for the \(HL_1\) and \(HL_{\infty }\) distances have an analytic solution, given by Eq. (58),

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_{n}^j \right\} \, \text{ and } \, b_{g_k}^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_{n}^j \right\} . \end{aligned}$$
(58)

Proof

The criterion to be minimized for the \(HL_1\) distance is

$$\begin{aligned} J_{d_{HL_1}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | + |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} . \end{aligned}$$
(59)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j | + \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j -\breve{g}_k^j | . \end{aligned}$$
(60)

The problem reduces to optimizing the two sums

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j | \, \text{ and } \, \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j -\breve{g}_k^j | . \end{aligned}$$
(61)

Each sum is minimized by the median of the respective set [27]. Then,

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \text{ and } \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(62)

The criterion to be minimized for the \(HL_{\infty }\) distance is

$$\begin{aligned} J_{d_{HL_{\infty }}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left( \max _{j=1}^p \{ |\underline{\gamma }_n^j - \underline{g}_k^j | \} +\max _{j=1}^p \{ |\breve{\gamma }_n^j - \breve{g}_k^j | \} \right) . \end{aligned}$$
(63)

Fixing the cluster k, the optimization problem can be reduced to

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j |\right\} + \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} . \end{aligned}$$
(64)

The problem is reduced to optimizing the two sums independently,

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} \, \text{ and } \, \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j|\right\} . \end{aligned}$$
(65)

The \(\max \) function can be rewritten as the limit of an \(L_q\) sum when \(q \rightarrow \infty \), so

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} = \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ \sum _{j=1}^p |\underline{\gamma }_n^j -\underline{g}_k^j |^q \right\} ^{\frac{1}{q}} \end{aligned}$$
(66)

and

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j -\breve{g}_k^j | \right\} = \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ \sum _{j=1}^p |\breve{\gamma }_n^j -\breve{g}_k^j |^q \right\} ^{\frac{1}{q}}. \end{aligned}$$
(67)

As the terms of the sums are nonnegative, minimizing each term minimizes the whole sums. Fixing dimension j, the problem is reduced to

$$\begin{aligned} \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j |^q\right\} ^{\frac{1}{q}} = \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j | \end{aligned}$$
(68)

and

$$\begin{aligned} \sum _{\gamma _n \in C_k} \lim _{q \rightarrow \infty } \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j |^q\right\} ^{\frac{1}{q}} =\sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |. \end{aligned}$$
(69)

This matches the \(HL_1\) optimization problem, whose solution is given by the medians of the lower bounds and ranges. Then,

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \text{ and } \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(70)

Using the inverse mapping [see Eq. (15)], we compute the upper bounds as

$$\begin{aligned} b_{g_k}^j = \underline{g}_k^j + \breve{g}_k^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(71)

\(\square \)
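A minimal Python sketch of the prototype update in Proposition 3 (an illustration, not the paper's implementation), assuming that \(\breve{\gamma }\) denotes the interval range, so that the upper bound is recovered as lower bound plus range, as in Eq. (71):

```python
import numpy as np

def hl1_prototype(lower, upper):
    """Median-based prototype of Eq. (58) for one cluster.

    lower, upper : arrays of shape (n_instances_in_cluster, p) with the interval
    bounds of the instances assigned to the cluster.
    """
    ranges = upper - lower                      # the range component of the mapping
    g_lower = np.median(lower, axis=0)          # Me{ lower bounds }, per dimension
    g_range = np.median(ranges, axis=0)         # Me{ ranges }, per dimension
    return g_lower, g_lower + g_range           # inverse mapping: upper = lower + range

lo = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])
up = np.array([[2.0, 4.0], [5.0, 2.0], [4.0, 6.0]])
print(hl1_prototype(lo, up))                    # lower bounds (2.0, 1.0), upper bounds (3.0, 5.0)
```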

D Proof of Proposition 4

Fixing cluster k and dimension j, the prototype for the \(HL_2\) distance has an analytic solution, given by the means of the lower bounds and ranges. It is computed by Eq. (72),

$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j\, \text{ and } \, b_{g_k}^j = \underline{g}_k^j + \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_{n}^j, \end{aligned}$$
(72)

where \(|C_k|\) is the number of instances allocated in the cluster \(C_k\).

Proof

The criterion to be minimized for the \(HL_2\) distance is

$$\begin{aligned} J_{d_{HL_2}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (\underline{\gamma }_n^j - \underline{g}_k^j)^2 + (\breve{\gamma }_n^j - \breve{g}_k^j)^2 \right\} . \end{aligned}$$
(73)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} J_k^j=\sum _{\gamma _n \in C_k} (\underline{\gamma }_n^j -\underline{g}_k^j)^2+ \sum _{\gamma _n \in C_k} (\breve{\gamma }_n^j -\breve{g}_k^j)^2 . \end{aligned}$$
(74)

The solution is found by least squares. The partial derivatives of \(J_k^j\) with respect to \(\underline{g}_k^j\) and \(\breve{g}_k^j\) must be null. So,

$$\begin{aligned} \frac{\partial J_k^j}{\partial \underline{g}_k^j} = -\sum _{\gamma _n \in C_k} 2 \cdot (\underline{\gamma }_n^j - \underline{g}_k^j) = 0 \Longrightarrow \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \end{aligned}$$
(75)

and

$$\begin{aligned} \frac{\partial J_k^j}{\partial \breve{g}_k^j} = -\sum _{\gamma _n \in C_k} 2 \cdot (\breve{\gamma }_n^j - \breve{g}_k^j) = 0 \Longrightarrow \breve{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_n^j \end{aligned}$$
(76)

where \(|C_k|\) is the number of instances allocated in cluster k. The \(HL_2\) prototypes are computed by:

$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _ {\gamma _n \in C_k } \underline{\gamma }_n^j \, \text{ and } \, \breve{g}_k^j = \frac{1}{|C_k|} \sum _ {\gamma _n \in C_k } \breve{\gamma }_n^j. \end{aligned}$$
(77)

Using the inverse mapping (see Eq. (15)), we compute the upper bounds as

$$\begin{aligned} b_{g_k}^j = \underline{g}_k^j + \breve{g}_k^j = \underline{g}_k^j + \frac{1}{|C_k|} \sum _ {\gamma _n \in C_k } \breve{\gamma }_{n}^j. \end{aligned}$$
(78)

\(\square \)
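The \(HL_2\) prototype only replaces the medians of Proposition 3 with means. A minimal sketch under the same representation assumptions:

```python
import numpy as np

def hl2_prototype(lower, upper):
    """Mean-based prototype of Eq. (72) for one cluster."""
    ranges = upper - lower
    g_lower = lower.mean(axis=0)                # (1/|C_k|) * sum of lower bounds
    g_range = ranges.mean(axis=0)               # (1/|C_k|) * sum of ranges
    return g_lower, g_lower + g_range           # inverse mapping for the upper bound
```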

E Proof of Proposition 5

Fixing cluster k and dimension j, the prototype for the \(HL_q\) distance (when \(q > 1\)) can be found using the Newton–Raphson numerical method. Let \(L_k^j =\left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \). Algorithm 2 shows how to compute the prototype components \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively.

Proof

Let \(X = \left\{ x_1, x_2, \ldots , x_N\right\} \) be a set sorted in ascending order, i.e., \(x_i \le x_{i+1}\), and let \(f:\mathfrak {R}\rightarrow \mathfrak {R}\) be the function

$$\begin{aligned} f(v) = \sum _{i=1}^N |x_i- v|^q \end{aligned}$$
(79)

with \(q > 1\). We are interested in the value of v that minimizes f(v). This function can be rewritten as follows:

$$\begin{aligned} f(v) = \sum _{i=1}^N (x_i- v)^q \cdot \left\{ sgn(x_i-v) \right\} ^q, \end{aligned}$$
(80)

where \(sgn(\cdot )\) is the sign function, defined as

$$\begin{aligned} sgn(x)= {\left\{ \begin{array}{ll} \,\,\, 1 &{} \text{ if } x \ge 0 \\ -1 &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
(81)

The first derivative of \(f(\cdot )\) is given by:

$$\begin{aligned} f'(v)&= -q \sum _{i=1}^N (x_i- v)^{q-1} \cdot \left\{ sgn(x_i-v) \right\} ^{q} \end{aligned}$$
(82)
$$\begin{aligned} f'(v)&= -q \sum _{i=1}^N |x_i- v|^{q-1} \cdot sgn(x_i-v), \end{aligned}$$
(83)

and the second derivative of \(f(\cdot )\) is given by:

$$\begin{aligned} f''(v)&= q (q-1) \sum _{i=1}^N (x_i- v)^{q-2} \cdot \left\{ sgn(x_i-v) \right\} ^{q} \end{aligned}$$
(84)
$$\begin{aligned} f''(v)&= q (q-1) \sum _{i=1}^N |x_i- v|^{q-2}. \end{aligned}$$
(85)

When \(q > 1\) the second derivative is always positive. We conclude that the first derivative is monotonically increasing for any v.

The value \(v_*\) which minimizes f(v) must satisfy \(f'(v_*)=0\). Suppose a value \(v_{-}\) with \(v_{-} < x_1\). Then, \(v_{-} < x_i, \forall x_i\). So, \(x_i -v_{-} > 0\), which implies that \(sgn(x_i-v_{-})=1, \forall x_i\). Then, the first derivative becomes

$$\begin{aligned} f'(v_{-}) = -q \sum _{i=1}^N |x_i- v_{-}|^{q-1} \end{aligned}$$
(86)

which always assumes a negative value. So, \(f'(v_{-}) < 0\).

Now, suppose that \(v_+ > x_N\). Then, \(v_+ > x_i, \forall x_i\). So, \(x_i-v_+<0\), implying \(sgn(x_i-v_+)=-1\). Then, the first derivative becomes

$$\begin{aligned} f'(v_+) = q \sum _{i=1}^N |x_i- v_+|^{q-1} \end{aligned}$$
(87)

which assumes positive values. So, \(f'(v_+) > 0\).

When \(v < x_1\), \(f'(v) < 0\), and when \(v > x_N\), \(f'(v) > 0\); hence \(f'(v)\) changes sign on the interval \([x_1, x_N]\), so \(\exists \, v_* \in [x_1, x_N]\) such that \(f'(v_*)=0\). As \(f'(v)\) is monotonically increasing, this solution is unique. Unfortunately, the expression of \(f'(v)\) is too complex for a general analytic solution to be obtained. We propose the use of the Newton–Raphson numerical method to find \(v_*\). In this case, an initial value \(v_0\) is chosen randomly in the interval \([x_1,x_N]\). Iterative values \(\left\{ v_i \right\} \) are computed as follows:

$$\begin{aligned} v_i = v_{i-1} - \frac{f'(v_{i-1})}{f''(v_{i-1})}. \end{aligned}$$
(88)

Convergence occurs when \(|v_i - v_{i-1}| < \epsilon \), with \(\epsilon > 0\).

The criterion to be minimized for the \(HL_q\) distance is given by

$$\begin{aligned} J_{d_{HL_q}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j|^q +|\breve{\gamma }_n^j - \breve{g}_k^j|^q \right\} . \end{aligned}$$
(89)

Fixing the kth cluster and jth dimension results in

$$\begin{aligned} \sum _{\gamma _n \in C_k} \left\{ |\underline{\gamma }_n^j -\underline{g}_k^j|^q + |\breve{\gamma }_n^j - \breve{g}_k^j|^q\right\} , \end{aligned}$$
(90)

and the two sums must be minimized:

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j|^q \quad \text{ and } \quad \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j|^q. \end{aligned}$$
(91)

The steps described above (for the function f) can be applied to each sum independently. The sets \(L_k^j =\left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) play the role of the set X, and the components \(\underline{g}_k^j\) and \(\breve{g}_k^j\) are determined. Algorithm 2 shows the steps to compute them using the Newton–Raphson numerical method. \(\square \)
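Since Algorithm 2 is not reproduced in this preview, the following is only an illustrative Python version of the Newton–Raphson iteration described above, applied to one coordinate set (either \(L_k^j\) or \(R_k^j\)); the deterministic starting point, the numerical floor and the iteration cap are assumptions, not part of the paper.

```python
import numpy as np

def lq_center(x, q, tol=1e-8, max_iter=100):
    """Minimize f(v) = sum_i |x_i - v|**q (q > 1) by Newton-Raphson, Eq. (88)."""
    x = np.asarray(x, dtype=float)
    v = 0.5 * (x.min() + x.max())                 # start inside [x_1, x_N] (the text draws v_0 at random)
    for _ in range(max_iter):
        diff = x - v
        ad = np.maximum(np.abs(diff), 1e-12)      # floor avoids 0**(q-2) when q < 2 (numerical safeguard)
        f1 = -q * np.sum(ad ** (q - 1) * np.sign(diff))    # f'(v), Eq. (83)
        f2 = q * (q - 1) * np.sum(ad ** (q - 2))           # f''(v), Eq. (85)
        v_new = v - f1 / f2                       # Newton step, Eq. (88)
        if abs(v_new - v) < tol:                  # convergence test |v_i - v_{i-1}| < epsilon
            return v_new
        v = v_new
    return v

# prototype components for a fixed cluster k and dimension j (hypothetical arrays L_kj, R_kj):
# g_lower = lq_center(L_kj, q=3.0)   # lower-bound component
# g_range = lq_center(R_kj, q=3.0)   # range component
```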

F Proof of Proposition 6

Fixing cluster k, dimension j and parameter q (\(q \ge 1\)), the prototypes for \(WHL_q\) and \(AHL_q\) distances are computed according to one of three cases:

  1.

    If \((q=1\) or \(q=\infty )\), prototypes have an analytic solution, given by:

    $$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \quad \text{ and } \quad \, b_{g_k}^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j \right\} . \end{aligned}$$
  2.

    If \((q=2)\), prototypes have an analytic solution, given by:

    $$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \quad \text{ and } \quad \,\, b_{g_k}^j =\underline{g}_k^j + \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_n^j, \end{aligned}$$

    where \(|C_k|\) is the number of instances in cluster k.

  3.

    If (\(q\ne 1\), \(q \ne 2\) and \(q \ne \infty \)), the Newton–Raphson numerical method is used, as described by Algorithm 2. The sets \(L_k^j = \left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) are given to it as input, yielding the values of \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively. Prototype upper bounds are found by \(b_{g_k}^j=\underline{g}_k^j +\breve{g}_k^j\), for \(j=1,\ldots ,p\).

Proof

The optimization criterion for the \(AHL_q\) distance is given by:

$$\begin{aligned} J_{d_{AHLq}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \lambda _k^j \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j |^q +|\breve{\gamma }_n^j - \breve{g}_k^j |^q \right\} . \end{aligned}$$
(92)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} \lambda _k^j \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j |^q + \lambda _k^j \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |^q . \end{aligned}$$
(93)

The optimization criterion for the \(WHL_q\) distance is given by:

$$\begin{aligned} J_{d_{WHLq}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \sum _{j=1}^p \left\{ (w_{k,1}^j)^t \, |\underline{\gamma }_n^j - \underline{g}_k^j |^q + (w_{k,2}^j)^t \, |\breve{\gamma }_n^j - \breve{g}_k^j |^q \right\} . \end{aligned}$$
(94)

Fixing cluster k and dimension j, the optimization problem can be reduced to

$$\begin{aligned} (w_{k,1}^j)^t \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j -\underline{g}_k^j |^q + (w_{k,2}^j)^t \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |^q . \end{aligned}$$
(95)

Adaptive and Hybrid weights become constants when the cluster and the dimension are fixed. The problem is reduced to optimizing the following two sums:

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j |^q \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j - \breve{g}_k^j |^q . \end{aligned}$$
(96)

Solutions are proposed according to the value of the q parameter. If \(q=1\), the optimization becomes

$$\begin{aligned} \sum _{\gamma _n \in C_k} |\underline{\gamma }_n^j - \underline{g}_k^j | \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} |\breve{\gamma }_n^j -\breve{g}_k^j |, \end{aligned}$$
(97)

whose solution, according to Proposition 3, is given by:

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \quad \text{ and } \quad \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j\right\} . \end{aligned}$$
(98)

If \(q=2\), the optimization becomes

$$\begin{aligned} \sum _{\gamma _n \in C_k} (\underline{\gamma }_n^j -\underline{g}_k^j)^2 \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} (\breve{\gamma }_n^j - \breve{g}_k^j)^2. \end{aligned}$$
(99)

The solution, according to Proposition 4, is given by:

$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \quad \text{ and } \quad \,\, \breve{g}_k^j =\frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_{n}^j, \end{aligned}$$
(100)

where \(|C_k|\) is the number of instances in cluster k.

The optimization criterion for the \(WHL_{\infty }\) distance is given by:

$$\begin{aligned} J_{d_{WHL_{\infty }}} = \sum _{k=1}^K \sum _{\gamma _n \in C_k} \left( (w_{k,1} )^t \max _{j=1}^p \left\{ |\underline{\gamma }_n^j -\underline{g}_k^j | \right\} + (w_{k,2})^t \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} \right) . \end{aligned}$$
(101)

Fixing cluster k, the optimization problem can be reduced to

$$\begin{aligned} (w_{k,1} )^t \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} + (w_{k,2} )^t \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j- \breve{g}_k^j | \right\} . \end{aligned}$$
(102)

When the cluster is fixed, the hybrid weights become constants; then, the problem is reduced to optimizing the two sums independently:

$$\begin{aligned} \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\underline{\gamma }_n^j - \underline{g}_k^j | \right\} \, \quad \text{ and } \quad \, \sum _{\gamma _n \in C_k} \max _{j=1}^p \left\{ |\breve{\gamma }_n^j - \breve{g}_k^j | \right\} . \end{aligned}$$
(103)

The solutions are the medians of the lower bounds and ranges (as shown in Proposition 3). Then,

$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \, \quad \text{ and } \quad \, \breve{g}_k^j =\underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j \right\} . \end{aligned}$$
(104)

If \(q>1\) with \(q \ne 2\) and \(q\ne \infty \), it is not possible to express an analytic solution for the sums in Eq. (96). So the solution is computed as discussed in Proposition 5. The Newton–Raphson numerical method is used, as described by Algorithm 2. The sets \(L_k^j = \left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) are parameters for this algorithm, yielding the values of \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively.

Using the inverse mapping (see Eq. (15)), we compute the upper bounds as

$$\begin{aligned} b_{g_k}^j&= \underline{g}_k^j + \breve{g}_k^j. \end{aligned}$$
(105)

\(\square \)
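Putting Propositions 3–6 together, the prototype update of a cluster depends only on q. The sketch below is an illustration of that case analysis (it is not Algorithm 2): medians for \(q=1\) or \(q=\infty \), means for \(q=2\), and a Newton–Raphson search otherwise, with the upper bounds restored through the inverse mapping. Array names and numerical safeguards are assumptions.

```python
import numpy as np

def _lq_center(x, q, tol=1e-8, max_iter=100):
    """Newton-Raphson minimizer of sum_i |x_i - v|**q for q > 1 (Proposition 5)."""
    x = np.asarray(x, dtype=float)
    v = 0.5 * (x.min() + x.max())
    for _ in range(max_iter):
        d = x - v
        ad = np.maximum(np.abs(d), 1e-12)                       # numerical floor (assumption)
        step = (-q * np.sum(ad ** (q - 1) * np.sign(d))) / (q * (q - 1) * np.sum(ad ** (q - 2)))
        v, prev = v - step, v
        if abs(v - prev) < tol:
            break
    return v

def prototype(lower, upper, q):
    """Cluster prototype for the (W/A)HL_q criteria (Propositions 3, 4 and 6).

    lower, upper : arrays of shape (n_instances_in_cluster, p) with interval bounds.
    q            : 1, 2, numpy.inf, or any other q > 1.
    Returns (proto_lower, proto_upper).
    """
    ranges = upper - lower
    if q == 1 or np.isinf(q):                                   # case 1: medians
        g_lo, g_rg = np.median(lower, axis=0), np.median(ranges, axis=0)
    elif q == 2:                                                # case 2: means
        g_lo, g_rg = lower.mean(axis=0), ranges.mean(axis=0)
    else:                                                       # case 3: Newton-Raphson per dimension
        g_lo = np.array([_lq_center(lower[:, j], q) for j in range(lower.shape[1])])
        g_rg = np.array([_lq_center(ranges[:, j], q) for j in range(ranges.shape[1])])
    return g_lo, g_lo + g_rg                                    # inverse mapping: upper = lower + range
```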


Cite this article

de Souza, L.C., de Souza, R.M.C.R. & do Amaral, G.J.A. Dynamic clustering of interval data based on hybrid \(L_q\) distance. Knowl Inf Syst 62, 687–718 (2020). https://doi.org/10.1007/s10115-019-01367-w
