Abstract
The defining feature of a longitudinal data set is that individuals are measured repeatedly through time, giving rise to vectors of observations that tend to be intercorrelated. In longitudinal studies with a large number of subjects, it is often of interest to cluster the longitudinal trajectories and to define a much smaller number of mean trajectories. Several methods have been developed to extend cluster analysis to longitudinal data. In this work, we first introduce a novel non-parametric methodology for clustering longitudinal data. The correlations between the observations of an individual trajectory are taken into account through pre-defined correlation matrices whose parameters are estimated from the data. An original Mahalanobis-type distance based on this correlation matrix is defined, and a longitudinal K-Means algorithm is then applied. Regarding the computation of the clustering, we introduce a very useful result that allows us to use the well-known kml or kml3d algorithm (Genolini et al J Stat Softw 65(4):1–34, 2015), thus avoiding the need for a new computer program. Specifically, we show that our method with the new Mahalanobis-type distance coincides with the application of the longitudinal K-Means algorithm (kml), using the Euclidean distance, to certain transformed trajectories. This property simplifies the process for general users. Secondly, in circumstances where it is the relative behavior of the trajectories that matters, rather than their absolute values, we propose converting the trajectories to profiles before they enter the algorithm. The methodology is tested on simulated data with different time behaviors and also on real data. The results are compared with those obtained from the direct application of the K-Means algorithm to the original data and to the profiled data. In general, the new methodology produces better results than the straightforward application of the longitudinal K-Means algorithm (kml) to the raw data.
In addition, a comparison with a parametric model, lcmm (Proust-Lima et al in J Stat Softw 78(2), 2017), is also presented.
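The equivalence between a Mahalanobis-type distance and the Euclidean distance on transformed trajectories can be illustrated with a minimal sketch. It is not the authors' implementation: the AR(1) correlation structure, the value rho = 0.7, the Cholesky-based transformation, and all function names below are assumptions made for illustration, and the paper may use a different correlation model or matrix square root.

```python
import numpy as np


def ar1_correlation(n_times, rho):
    """Hypothetical AR(1) correlation matrix: entry (i, j) is rho**|i - j|."""
    idx = np.arange(n_times)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def mahalanobis_type_distance(x, y, corr):
    """sqrt((x - y)^T corr^{-1} (x - y)) for two trajectories x and y."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.solve(corr, diff)))


def transform_trajectories(trajectories, corr):
    """Map each row x to L^{-1} x, where corr = L L^T (Cholesky factor).

    With this choice, the squared Euclidean distance between two
    transformed rows equals (x - y)^T corr^{-1} (x - y).
    """
    chol = np.linalg.cholesky(corr)
    return np.linalg.solve(chol, trajectories.T).T


# Simulated trajectories: 5 subjects observed at 6 time points (assumed sizes).
rng = np.random.default_rng(0)
trajectories = rng.normal(size=(5, 6))
corr = ar1_correlation(n_times=6, rho=0.7)  # rho is an assumed value

transformed = transform_trajectories(trajectories, corr)

d_mahal = mahalanobis_type_distance(trajectories[0], trajectories[1], corr)
d_eucl = float(np.linalg.norm(transformed[0] - transformed[1]))

# The two distances agree up to floating-point error.
assert np.isclose(d_mahal, d_eucl)
```

Because the two distances agree for every pair of trajectories, any Euclidean K-Means routine applied to the transformed rows groups them exactly as the Mahalanobis-type distance would on the original rows, which is the kind of shortcut the abstract describes for kml.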
References
Abraham C, Cornillon P, Matzner-Lober E, Molinari N. Unsupervised curve clustering using B-splines. Scand J Stat. 2003;30:581–95.
Bagirov MA, Karmitsa N, Taheri S. Metaheuristic clustering algorithms. In: Partitional clustering via nonsmooth optimization. Unsupervised and semi-supervised learning. Springer, 2020.
Beauchaine TP, Beauchaine RJ. A comparison of maximum covariance and K-means cluster analysis in classifying cases into known taxon groups. Psychol Methods. 2002;7(2):245–61.
Calinski T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27.
Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Softw. 2014;63(6):1–36.
Ciampi A, et al. Model-based clustering of longitudinal data: application to modeling disease course and gene expression trajectories. Commun Stat. 2012;41(7):992–1005.
Delmelle EC. Mapping the DNA of urban neighborhoods: clustering longitudinal sequences of neighborhood socioeconomic change. Ann Am Assoc Geogr. 2016;106(1):36–56.
Den Teuling NGP, Pauws SC, van den Heuvel ER. A comparison of methods for clustering longitudinal data with slowly changing trends. Commun Stat. (Published online: 19 Jan 2021).
Diggle P, Heagerty P, Liang K-Y, Zeger S. Analysis of longitudinal data. New York: Oxford University Press Inc.; 2002.
Fitzmaurice G, Laird N, Ware J. Applied longitudinal analysis. New Jersey: Wiley; 2004.
Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–31.
Genolini C, et al. KmL: k-means for longitudinal data. Berlin: Springer; 2009.
Genolini C, et al. kml and kml3d: packages to cluster longitudinal data. J Stat Softw. 2015;65(4):1–34.
Genolini C, Écochard R, Jacqmin-Gadda H. Copy mean: a new method to impute intermittent missing values in longitudinal studies. Open J Stat. 2013;3(04):26.
Hastie T, et al. The elements of statistical learning: data mining, inference, and prediction. Berlin: Springer; 2009.
Hedeker D, Gibbons RD. Longitudinal data analysis. Wiley Series in Probability and Statistics; 2006.
Heggeseth BC. Longitudinal cluster analysis with applications to growth trajectories. Berkeley: University of California; 2013.
James G, Sugar C. Clustering for sparsely sampled functional data. J Am Stat Assoc. 2003;98:397–408.
Kurum E, Li R, Shiffman S, Yao W. Time-varying coefficient models for joint modeling binary and continuous outcomes in longitudinal data. Stat Sin. 2016;26:979–1000.
Céline LP, et al. Using a continuous riverscape survey to examine the effects of the spatial structure of functional habitats on fish distribution. J Freshwater Ecol. 2015;31(1):1–19.
Lu Y, Lu S, Fotouhi F, Deng Y, Brown SJ. Incremental genetic K-means algorithm and its application in gene expression data analysis. BMC Bioinform. 2004;5:172.
Maruotti A, et al. Time-varying clustering of multivariate longitudinal observations. Commun Stat. 2016;45(2):430–43.
Melnykov V, Maitra R. Finite mixture models and model-based clustering. Stat Surv. 2010;4:80–116.
Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50(2):159–79.
Morris R, et al. Developmental classification of reading-disabled children. J Clin Exp Neuropsychol. 1986;8(4):371–92.
Ng CC. Examining the self-congruent engagement hypothesis: the link between academic self-schemas, motivational goals, learning approaches and achievement within an academic year. Educ Psychol. 2014;43(6):730–62.
Oh M-S, Raftery AE. Model-based clustering with dissimilarities: a Bayesian approach. J Comput Graph Stat. 2007;16:559–85.
Pourahmadi M. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika. 1999;86(3):677–90.
Proust-Lima C, Philipps V, Liquet B. Estimation of extended mixed models using latent classes and latent processes: the R package lcmm. J Stat Softw. 2017;78(2):1–56.
Qin S, et al. Forage crops alter soil bacterial and fungal communities in an apple orchard. Acta Agriculturae Scandinavica. 2016;66(3):229–36.
Rossi F, Conan-Guez B, Golli AE. Clustering functional data with the SOM algorithm. In: Proceedings of ESANN, 2004;305–312.
Shim Y, Chung J, Choi I-C. A comparison study of cluster validity indices using a nonhierarchical clustering algorithm. IEEE Comput Soc. 2005.
Sousa P, Oliveira A, Gomes M, Gaio AR, Duarte R. Longitudinal clustering of tuberculosis incidence and predictors for the time profiles: the impact of HIV. Int J Tuberc Lung Dis. 2016;20(8):1027–32.
Tarpey T, Kinateder K. Clustering functional data. J Classif. 2003;20:93–114.
Vu DQ, Hunter DR, Schweinberger M. Model-based clustering of large networks. Ann Appl Stat. 2013;7(2):1010.
Zhong P-S, Li R, Santo S. Homogeneity test of covariance matrices and change-points identification with high-Dimensional longitudinal data. Biometrika. 2019;106:619–34.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Pinto da Costa, J.F., Ferreira, F., Mascarello, M. et al. Clustering of Longitudinal Trajectories Using Correlation-Based Distances. SN COMPUT. SCI. 2, 432 (2021). https://doi.org/10.1007/s42979-021-00822-2
DOI: https://doi.org/10.1007/s42979-021-00822-2