Abstract
Dynamic clustering partitions data and assigns a prototype to each partition. Distance metrics measure the closeness between instances and prototypes. In the interval-data literature, distances depend only on the interval bounds, and the information inside the intervals is ignored. This paper proposes new distances that exploit the information inside intervals. It also presents a mapping of intervals to points that preserves their spatial location and internal variation. We formulate a new hybrid distance for interval data based on the well-known \(L_q\) distance for point data. This new distance allows for a weighted formulation of the hybridism. Hence, we propose a Hybrid \(L_q\) distance, a Weighted Hybrid \(L_q\) distance, and an adaptive version of the Hybrid \(L_q\) distance for interval data. Experiments with synthetic and real interval data sets illustrate the usefulness of the hybrid approach for improving dynamic clustering of interval data.
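As context for the proofs in the appendices, the sketch below illustrates the interval-to-point mapping used throughout: an interval is represented by its lower bound and its range, with the upper bound recovered by the inverse mapping (cf. Eq. (15) and the appendices, where \(b_{g_k}^j = \underline{g}_k^j + \breve{g}_k^j\)). The additive combination of an \(L_q\) term on lower bounds and an \(L_q\) term on ranges in hybrid_lq is an illustrative assumption about the form of \(HL_q\), not the paper's exact definition; all names are hypothetical.

```python
import numpy as np

def to_point(lower, upper):
    """Map an interval [lower, upper] to (lower bound, range)."""
    return lower, upper - lower

def from_point(lower, rng):
    """Inverse mapping: recover the upper bound (cf. Eq. (15))."""
    return lower + rng

def hybrid_lq(a, b, q=2.0):
    """Hedged sketch of a hybrid L_q distance between two interval
    vectors, each given as (lowers, ranges) arrays per dimension.
    The additive combination of the two L_q terms is an assumption."""
    a_low, a_rng = a
    b_low, b_rng = b
    term_low = np.sum(np.abs(a_low - b_low) ** q)
    term_rng = np.sum(np.abs(a_rng - b_rng) ** q)
    return (term_low + term_rng) ** (1.0 / q)

# Example: two 2-dimensional interval instances as (lowers, ranges)
x = (np.array([1.0, 2.0]), np.array([0.5, 1.0]))
y = (np.array([1.5, 1.0]), np.array([0.5, 2.0]))
print(hybrid_lq(x, y, q=2))
```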








References
Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. Wiley, Chichester
Billard L, Le-Rademacher J (2012) Principal component analysis for interval data. Wiley Interdiscip Rev Comput Stat 4(6):535–540
Burden RL, Faires JD (2011) Numerical analysis. Cengage Learning, Brooks/Cole
Chavent M, Lechevallier Y (2002) Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance. In: Classification, clustering, and data analysis, pp 53–60
Chavent M (2004) An Hausdorff distance between hyper-rectangles for clustering interval data. In: Banks D et al (eds) Classification, clustering, and data mining applications, proceedings of the IFCS04. Springer, Berlin, pp 333–340
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, Cambridge
De Carvalho FAT, Brito P, Bock H-H (2006b) Dynamic clustering for interval data based on L2 distance. Comput Stat 21:231–250
De Carvalho FAT, Souza RMCR, Chavent M, Lechevallier Y (2006a) Adaptive Hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recognit Lett 27:167–179
De Carvalho FAT, Lechevallier Y (2009a) Dynamic clustering of interval-valued data based on adaptive quadratic distances. Trans Syst Man Cyber Part A 39:1295–1306
De Carvalho FAT, Lechevallier Y (2009b) Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognit 42:1223–1236
De Carvalho FAT, Souza RMCR (2010) Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognit Lett 31(5):430–443
Diday E, Simon JC (1976) Clustering analysis. In: Fu KS (ed) Digital pattern recognition. Springer, Berlin, pp 47–94
Diday E, Noirhomme-Fraiture M (2008) Symbolic data analysis and the SODAS software. Wiley, Chichester
Diday E (2016) Thinking by classes in data science: the symbolic data analysis paradigm. Wiley Interdiscip Rev Comput Stat 8(5):172–205
Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4(2):229–246
Fränti P, Kivijärvi J (2000) Randomised local search algorithm for the clustering problem. Pattern Anal Appl 3:358–369
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
Lima Neto EA, De Carvalho FAT (2010) Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 54:333–347
Lima Neto EA, De Carvalho FAT (2008) Centre and range method for fitting a linear regression model to symbolic interval data. Comput Stat Data Anal 52:1500–1515
Martinez WL, Martinez AR (2007) Computational statistics handbook with MATLAB. Chapman & Hall CRC, New York
Rencher AC, Christensen WF (2012) Methods of multivariate analysis, 3rd edn. Wiley, New York
Silva Filho TM, Souza RMCR (2015) A swarm-trained k-nearest prototypes adaptive classifier with automatic feature selection for interval data. Neural Netw 80:19–33
Silva APD, Brito P (2006) Linear discriminant analysis for interval data. Comput Stat 21(2):289–308
Silva APD, Brito P (2015) Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. J Classif 32(3):516–541
Souza LC (2016) Agrupamento e regressão linear de dados simbólicos intervalares baseados em novas representações. PhD thesis, Universidade Federal de Pernambuco, Recife, PE, Brazil. https://repositorio.ufpe.br/handle/123456789/17640
Souza RMCR, De Carvalho FAT (2004) Clustering of interval data based on city–block distances. Pattern Recognit Lett 25:353–365
Acknowledgements
The authors would like to thank CNPq and CAPES (Brazilian Agencies) for their financial support.
Appendices
A Proof of Proposition 1
Fixing cluster k and dimension j, the hybrid weights of the \(WHL_q\) distance are obtained using Lagrange multipliers under the restrictions: \(w_{k,1}^j + w_{k,2}^j = 1\); \(w_{k,1}^j \ge 0 \); \(w_{k,2}^j \ge 0\); and \(t > 1\). Let
The Hybrid weight values are computed by:
Proof
The partitional dynamic clustering criterion for the \(WHL_q\) distance is given by
under the restrictions: \(w_{k,1}^j + w_{k,2}^j = 1\), \(w_{k,1}^j \ge 0\), \(w_{k,2}^j \ge 0\) and \(t > 1\). The solution can be found using Lagrange multipliers. Let \(J_{d_{WHLq}}(\varLambda _1^1,\ldots , \varLambda _K^p)\) be the version of Eq. (28) that incorporates the Lagrange multipliers (\(\varLambda _k^j\)) and their associated restrictions. Thus, it becomes
The weights are found by setting the partial derivatives of \(J_{d_{WHLq}}\) to zero. Fixing cluster k and dimension j, and differentiating \(J_{d_{WHLq}}\) with respect to the first weight component (\(w_{k,1}^j\)), we get
Defining
and isolating the \(w_{k,1}^j\) term, we get
Now, differentiating \(J_{d_{WHLq}}\) with respect to the second weight component (\(w_{k,2}^j\)), we obtain
Defining
we get \(w_{k,2}^j\), as follows:
Using the expressions above for the weights, we compute the Lagrange multiplier (\(\varLambda _k^j\)) from the restriction \(w_{k,1}^j + w_{k,2}^j =1\). Then,
Substituting Eq. (40) into Eq. (33), we get
Now, substituting Eq. (40) into Eq. (37), we get
So, the hybrid weights can be computed using Eqs. (41) and (42). \(\square \)
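The display equations above are not reproduced in this version. For orientation only: a Lagrangian argument of this type, assuming a criterion of the form \(J = \sum _k \sum _j \left[ (w_{k,1}^j)^t J_{k,1}^j + (w_{k,2}^j)^t J_{k,2}^j \right] \) with per-component dispersions \(J_{k,1}^j\) and \(J_{k,2}^j\) (illustrative notation, not the paper's), yields the standard closed form
$$\begin{aligned} w_{k,h}^j = \left[ \sum _{m=1}^{2} \left( \frac{J_{k,h}^j}{J_{k,m}^j} \right) ^{\frac{1}{t-1}} \right] ^{-1}, \quad h \in \{1,2\}, \end{aligned}$$
which should be read as a sketch of the role of Eqs. (41) and (42), not their exact statement. Proposition 2 (Appendix B) yields the analogous form with the dimension index j dropped.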
B Proof of Proposition 2
Fixing cluster k, the hybrid weights of the \(WHL_{\infty }\) distance are obtained using Lagrange multipliers under the following restrictions: \(w_{k,1} + w_{k,2} = 1\); \(w_{k,1} \ge 0\); \(w_{k,2} \ge 0\); and \( t > 1\). Let
Then,
Proof
The partitional dynamic clustering criterion for the \(WHL_{\infty }\) distance is given by
under the restrictions: \(w_{k,1} + w_{k,2} = 1\); \(w_{k,1} \ge 0\); \(w_{k,2} \ge 0\); and \(t > 1\). The solution can be computed using Lagrange multipliers. Eq. (43) is rewritten to incorporate the Lagrange multipliers (\(\varLambda _k\)) and their respective restrictions. It then becomes
The weights are found by setting the partial derivatives of \(J_{d_{WHL_{\infty }}}\) with respect to the weights to zero. Fixing cluster k, and differentiating \(J_{d_{WHL_{\infty }}}\) with respect to the first weight component (\(w_{k,1}\)), we get
Defining
we get
Now, differentiating \(J_{d_{WHL_{\infty }}}\) with respect to the second weight component (\(w_{k,2}\)), we get
Defining
it becomes
The Lagrange multiplier (\(\varLambda _k\)) is computed from the restriction \(w_{k,1}+ w_{k,2} =1\). Then,
Substituting Eq. (55) into Eq. (48), we get
Now, substituting Eq. (55) into Eq. (52), we get
So, the hybrid weights are computed by Eqs. (56) and (57). \(\square \)
C Proof of Proposition 3
Fixing cluster k and dimension j, the prototypes for the \(HL_1\) and \(HL_{\infty }\) distances have an analytic solution, given by Eq. (58),
Proof
The criterion to be minimized for the \(HL_1\) distance is
Fixing cluster k and dimension j, it is possible to reduce the optimization problem to
The problem then reduces to optimizing the two sums
Each sum is minimized by the median of the respective set [27]. Then,
The criterion to be minimized for the \(HL_{\infty }\) distance is
Fixing cluster k, it is possible to reduce the optimization problem to
The problem is reduced to optimizing the two sums independently,
The \(\max \) function can be rewritten as the limit of the \(HL_q\) distance as \(q \rightarrow \infty \), so
and
As the terms of the sums are nonnegative, minimizing each of them minimizes the entire sum. Fixing dimension j, the problem reduces to
and
This is exactly the \(HL_1\) optimization problem, whose solution is given by the medians of the lower bounds and ranges. Then,
Using the inverse mapping [see Eq. (15)], we compute the upper bounds as
\(\square \)
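A minimal runnable sketch of the median-based prototype from Proposition 3, assuming the cluster is given as arrays of lower bounds and ranges per dimension (all names are illustrative):

```python
import numpy as np

def hl1_prototype(lowers, ranges):
    """Median-based prototype for the HL_1 (and HL_inf) distance.
    lowers, ranges: arrays of shape (n_instances, n_dims) holding the
    lower bounds and ranges of the intervals in one cluster."""
    g_low = np.median(lowers, axis=0)   # medians of lower bounds
    g_rng = np.median(ranges, axis=0)   # medians of ranges
    g_up = g_low + g_rng                # inverse mapping (cf. Eq. (15))
    return g_low, g_up

# Example cluster: 3 instances over 2 interval dimensions
lowers = np.array([[1.0, 2.0], [1.5, 2.5], [0.5, 1.5]])
ranges = np.array([[0.4, 1.0], [0.6, 0.8], [0.2, 1.2]])
print(hl1_prototype(lowers, ranges))
```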
D Proof of Proposition 4
Fixing cluster k and dimension j, the prototype for the \(HL_2\) distance has an analytic solution, which is the mean of the interval bounds. It is computed by Eq. (72),
where \(|C_k|\) is the number of instances allocated in the cluster \(C_k\).
Proof
The criterion to be minimized for the \(HL_2\) distance is
Fixing cluster k and dimension j, it is possible to reduce the optimization problem to
The solution is found using least squares: the partial derivatives of \(J_k^j\) with respect to \(\underline{g}_k^j\) and \(\breve{g}_k^j\) must vanish. So,
and
where \(|C_k|\) is the number of instances allocated in cluster k. The \(HL_2\) prototypes are computed by:
Using the inverse mapping (see Eq. (15)), we compute the upper bounds as
\(\square \)
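As a worked sketch of the least-squares step above, assuming the fixed-cluster criterion has the quadratic form \(J_k^j = \sum _{\gamma _n \in C_k} ( \underline{\gamma }_n^j - \underline{g}_k^j )^2 + ( \breve{\gamma }_n^j - \breve{g}_k^j )^2\) (an illustrative reconstruction consistent with the surrounding definitions), the lower-bound component follows from
$$\begin{aligned} \frac{\partial J_k^j}{\partial \underline{g}_k^j} = -2 \sum _{\gamma _n \in C_k} \left( \underline{\gamma }_n^j - \underline{g}_k^j \right) = 0 \quad \Longrightarrow \quad \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j, \end{aligned}$$
and the range component \(\breve{g}_k^j\) is analogous.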
E Proof of Proposition 5
Fixing cluster k and dimension j, the prototype for the \(HL_q\) distance (when \(q > 1\)) can be found using the Newton–Raphson numerical method. Let the sets \(L_k^j =\left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \). Algorithm 2 shows how to compute the prototype components \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively.
Proof
Let \(X = \left\{ x_1, x_2, \ldots , x_N\right\} \) be a set in ascending order, i.e., \(x_i \le x_{i+1}\), and let \(f:\mathfrak {R}\rightarrow \mathfrak {R}\) be the function
with \(q > 1\). We are interested in the value that minimizes f(v). This function can be rewritten as follows:
where \(sgn(\cdot )\) is the sign function, defined as
The first derivative of \(f(\cdot )\) is given by:
and the second derivative of \(f(\cdot )\) is given by:
When \(q > 1\), the second derivative is always positive, so the first derivative is monotonically increasing in v.
The value \(v_*\) which minimizes f(v) must satisfy \(f'(v_*)=0\). Suppose a value \(v_{-}\) with \(v_{-} < x_1\). Then, \(v_{-} < x_i, \forall x_i\). So, \(x_i -v_{-} > 0\), which implies that \(sgn(x_i-v_{-})=1, \forall x_i\). Then, the first derivative becomes
which always assumes a negative value. So, \(f'(v_{-}) < 0\).
Now, suppose that \(v_+ > x_N\). Then, \(v_+ > x_i, \forall x_i\). So, \(x_i-v_+<0\), implying \(sgn(x_i-v_+)=-1\). Then, the first derivative becomes
which assumes positive values. So, \(f'(v_+) > 0\).
When \(v < x_1\), \(f'(v) < 0\), and when \(v > x_N\), \(f'(v) > 0\); hence \(f'(v)\) changes its sign on the interval \([x_1, x_N]\), so \(\exists \, v_* \in [x_1, x_N]\) such that \(f'(v_*)=0\). As \(f'(v)\) is monotonically increasing, this solution is unique. Unfortunately, the expression for \(f'(v)\) is too complex for a general analytic solution. We propose using the Newton–Raphson numerical method to find \(v_*\). In this case, an initial value \(v_0\) is chosen randomly on the interval \([x_1,x_N]\). Iterative values \(\left\{ v_i \right\} \) are computed as follows:
Convergence occurs when \(|v_i - v_{i-1}| < \epsilon \), with \(\epsilon > 0\).
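A minimal runnable sketch of this Newton–Raphson step for \(f(v)=\sum _i |x_i - v|^q\), using the first and second derivatives derived above (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def newton_lq_center(x, q, eps=1e-8, max_iter=100, rng=None):
    """Find v* minimizing f(v) = sum_i |x_i - v|**q for q > 1
    via Newton-Raphson, following the proof of Proposition 5."""
    x = np.sort(np.asarray(x, dtype=float))
    rng = np.random.default_rng() if rng is None else rng
    v = rng.uniform(x[0], x[-1])            # random start in [x_1, x_N]
    for _ in range(max_iter):
        d = x - v
        # f'(v)  = -q * sum sgn(x_i - v) * |x_i - v|^(q-1)
        f1 = -q * np.sum(np.sign(d) * np.abs(d) ** (q - 1))
        # f''(v) =  q * (q - 1) * sum |x_i - v|^(q-2)
        # (for 1 < q < 2 this term is undefined at d = 0; a random
        #  start makes an exact tie with a data point unlikely)
        f2 = q * (q - 1) * np.sum(np.abs(d) ** (q - 2))
        v_new = v - f1 / f2                 # Newton-Raphson update
        if abs(v_new - v) < eps:            # convergence criterion
            return v_new
        v = v_new
    return v

print(newton_lq_center([1.0, 2.0, 4.0, 8.0], q=3.0))
```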
The criterion to be minimized for the \(HL_q\) distance is given by
Fixing the kth cluster and jth dimension results in
and the two sums must be minimized:
The steps described above (for the function f) apply to each sum independently. The sets \(L_k^j =\left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) each play the role of the set X, determining the components \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively. Algorithm 2 shows the steps to compute them using the Newton–Raphson numerical method. \(\square \)
F Proof of Proposition 6
Fixing cluster k, dimension j and parameter q (\(q \ge 1\)), the prototypes for the \(WHL_q\) and \(AHL_q\) distances are computed according to one of three cases (a consolidated code sketch follows the list):
- 1.
If \(q=1\) or \(q=\infty \), the prototypes have an analytic solution, given by:
$$\begin{aligned} \underline{g}_k^j = \underset{\gamma _n \in C_k }{Me } \left\{ \underline{\gamma }_n^j \right\} \quad \text{ and } \quad b_{g_k}^j = \underline{g}_k^j + \underset{\gamma _n \in C_k }{Me } \left\{ \breve{\gamma }_n^j \right\} . \end{aligned}$$
- 2.
If \(q=2\), the prototypes have an analytic solution, given by:
$$\begin{aligned} \underline{g}_k^j = \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \underline{\gamma }_n^j \quad \text{ and } \quad b_{g_k}^j =\underline{g}_k^j + \frac{1}{|C_k|} \sum _{\gamma _n \in C_k} \breve{\gamma }_n^j, \end{aligned}$$
where \(|C_k|\) is the number of instances in cluster k.
- 3.
If \(q\ne 1\), \(q \ne 2\) and \(q \ne \infty \), the Newton–Raphson numerical method is used, as described by Algorithm 2. The sets \(L_k^j = \left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) are passed to it, yielding the values of \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively. The prototype upper bounds are found by \(b_{g_k}^j=\underline{g}_k^j +\breve{g}_k^j\), for \(j=1,\ldots ,p\).
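A consolidated sketch of this three-case prototype computation, assuming the cluster's lower bounds and ranges are given as arrays and reusing the hypothetical newton_lq_center helper from the Proposition 5 sketch above:

```python
import numpy as np

def hlq_prototype(lowers, ranges, q):
    """Per-dimension prototype for the weighted/adaptive HL_q distances.
    lowers, ranges: (n_instances, n_dims) arrays for one cluster.
    Assumes newton_lq_center (Proposition 5 sketch) is in scope."""
    if q == 1 or q == np.inf:            # case 1: medians
        g_low = np.median(lowers, axis=0)
        g_rng = np.median(ranges, axis=0)
    elif q == 2:                         # case 2: means
        g_low = lowers.mean(axis=0)
        g_rng = ranges.mean(axis=0)
    else:                                # case 3: Newton-Raphson per dim
        g_low = np.array([newton_lq_center(col, q) for col in lowers.T])
        g_rng = np.array([newton_lq_center(col, q) for col in ranges.T])
    return g_low, g_low + g_rng          # (lower bounds, upper bounds)
```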
Proof
The optimization criterion for the \(AHL_q\) distance is given by:
Fixing cluster k and dimension j, the optimization problem can be reduced to
The optimization criterion for the \(WHL_q\) distance is given by:
Fixing cluster k and dimension j, the optimization problem can be reduced to
Adaptive and hybrid weights become constants when the cluster and dimension are fixed. The problem then reduces to optimizing the following two sums:
The solution depends on the value of the parameter q. If \(q=1\), the optimization becomes
whose solution, according to Proposition 3, is given by:
If \(q=2\), the optimization becomes
The solution, according to Proposition 4, is given by:
where \(|C_k|\) is the number of instances in cluster k.
The optimization criterion for the \(WHL_{\infty }\) distance is given by:
Fixing cluster k, the optimization problem can be reduced to
When the cluster is fixed, the hybrid weights become constants; then, the problem is reduced to optimizing the two sums independently:
The solutions are the medians of the lower bounds and ranges (as shown in Proposition 3). Then,
If \(q>1\), \(q \ne 2\) and \(q\ne \infty \), it is not possible to express an analytic solution for Eq. (96). So the solution is computed as discussed in Proposition 5: the Newton–Raphson numerical method is used, as described by Algorithm 2. The sets \(L_k^j = \left\{ \underline{\gamma }_n^j | \gamma _n \in C_k \right\} \) and \(R_k^j = \left\{ \breve{\gamma }_n^j | \gamma _n \in C_k \right\} \) are parameters for this algorithm, yielding the values of \(\underline{g}_k^j\) and \(\breve{g}_k^j\), respectively.
Using the inverse mapping (see Eq. (15)), we compute the upper bounds as
\(\square \)