Clustering of Longitudinal Trajectories Using Correlation-Based Distances

  • Original Research
  • SN Computer Science

Abstract

The defining feature of a longitudinal data set is that individuals are measured repeatedly through time, giving rise to a vector of observations that tend to be intercorrelated. In longitudinal studies with a large number of subjects, it is often of interest to cluster the longitudinal trajectories and to define a much smaller number of mean trajectories. Several methods have been developed to extend cluster analysis to longitudinal data. Firstly, we introduce a novel non-parametric methodology for clustering longitudinal data. The correlations between the observations within individual trajectories are taken into account through pre-defined correlation matrices whose parameters are estimated from the data. An original Mahalanobis-type distance based on this correlation matrix is considered, and a longitudinal K-Means algorithm is then applied. Regarding the computation of the clustering, we introduce a very useful result that allows us to use the well-known kml or kml3d algorithms (Genolini et al., J Stat Softw 65(4):1–34, 2015), thus avoiding the need for a new computer program. In fact, we show that our method with the new Mahalanobis-type distance coincides with the application of the longitudinal K-Means algorithm (kml), using the Euclidean distance, to certain transformed trajectories. This property simplifies the process for general users. Secondly, in circumstances where it is the relative behavior of the trajectories that matters, rather than their absolute values, we propose the use of profiles before entering the algorithm. The methodology is tested on simulated data with different time behaviors and also on real data. The results are compared with those obtained from the direct application of the K-Means algorithm to the original data and to the profiled data. In general, the new methodology produces better results than the straightforward application of the longitudinal K-Means algorithm (kml) to the raw data. In addition, a comparison with a parametric model, lcmm (Proust-Lima et al., J Stat Softw 78(2), 2017), is presented.
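
The equivalence stated in the abstract, between a Mahalanobis-type distance on the original trajectories and the Euclidean distance on transformed trajectories, can be illustrated with a minimal sketch. The code below is not the authors' implementation. It assumes an AR(1) correlation structure with a fixed, hand-picked parameter `rho` (in the paper the parameters are estimated from the data), and it uses a Cholesky factor as the transformation, which is one standard choice and may differ from the one the paper uses. The function names are hypothetical.

```python
import numpy as np


def ar1_correlation(n_times, rho):
    """AR(1) correlation matrix (assumed structure): entry (i, j) is rho**|i - j|."""
    idx = np.arange(n_times)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def mahalanobis_type_distance(x, y, corr):
    """sqrt((x - y)^T corr^{-1} (x - y)) for two trajectories x and y."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.solve(corr, diff)))


def whiten_trajectories(trajectories, corr):
    """Map each row x to chol^{-1} x, where corr = chol @ chol.T (Cholesky)."""
    chol = np.linalg.cholesky(corr)  # lower-triangular factor
    # Solving chol @ Z.T = X.T gives Z with rows z = chol^{-1} x.
    return np.linalg.solve(chol, trajectories.T).T


# Illustrative data only: 5 random trajectories observed at 6 time points.
rng = np.random.default_rng(0)
trajectories = rng.normal(size=(5, 6))
corr = ar1_correlation(6, rho=0.7)  # rho is an arbitrary illustrative value
transformed = whiten_trajectories(trajectories, corr)

# The Mahalanobis-type distance on the raw trajectories matches the
# Euclidean distance on the transformed trajectories.
d_mahal = mahalanobis_type_distance(trajectories[0], trajectories[1], corr)
d_euclid = float(np.linalg.norm(transformed[0] - transformed[1]))
assert np.isclose(d_mahal, d_euclid)
```

Under these assumptions, any K-Means routine that uses the Euclidean distance can be run on `transformed` instead of implementing the new distance directly, which is the practical point the abstract makes about reusing kml.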

Notes

  1. https://vincentarelbundock.github.io/Rdatasets/datasets.html.

  2. http://www.stat.ufl.edu/~winner/datasets.html.

References

  1. Abraham C, Cornillon P, Matzner-Lober E, Molinari N. Unsupervised curve clustering using B-splines. Scand J Stat. 2003;30:581–95.

  2. Bagirov MA, Karmitsa N, Taheri S. Metaheuristic clustering algorithms. In: Partitional clustering via nonsmooth optimization. Unsupervised and semi-supervised learning. Springer, 2020.

  3. Beauchaine TP, Beauchaine RJ. A comparison of maximum covariance and K-means cluster analysis in classifying cases into known taxon groups. Psychol Methods. 2002;7(2):245–61.

  4. Calinski T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27.

  5. Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Softw. 2014;63(6):1–36.

  6. Ciampi A, et al. Model-based clustering of longitudinal data: application to modeling disease course and gene expression trajectories. Commun Stat. 2012;41(7):992–1005.

  7. Delmelle EC. Mapping the DNA of urban neighborhoods: clustering longitudinal sequences of neighborhood socioeconomic change. Ann Am Assoc Geogr. 2016;106(1):36–56.

  8. Den Teuling NGP, Pauws SC, van den Heuvel ER. A comparison of methods for clustering longitudinal data with slowly changing trends. Commun Stat. (Published online: 19 Jan 2021).

  9. Diggle P, Heagerty P, Liang K-Y, Zeger S. Analysis of longitudinal data. New York: Oxford University Press Inc.; 2002.

  10. Fitzmaurice G, Laird N, Ware J. Applied longitudinal analysis. New Jersey: Wiley; 2004.

  11. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–31.

  12. Genolini C, et al. KmL: k-means for longitudinal data. Berlin: Springer; 2009.

  13. Genolini C, et al. kml and kml3d: packages to cluster longitudinal data. J Stat Softw. 2015;65(4):1–34.

  14. Genolini C, Écochard R, Jacqmin-Gadda H. Copy mean: a new method to impute intermittent missing values in longitudinal studies. Open J Stat. 2013;3(04):26.

  15. Hastie T, et al. The elements of statistical learning. Data mining inference and predictions. Berlin: Springer; 2009.

  16. Hedeker D, Gibbons RD. Longitudinal data analysis. Wiley Series in Probability and Statistics; 2006.

  17. Heggeseth BC. Longitudinal cluster analysis with applications to growth trajectories. Berkeley: University of California; 2013.

  18. James G, Sugar C. Clustering for sparsely sampled functional data. J Am Stat Assoc. 2003;98:397–408.

  19. Kurum E, Li R, Shiffman S, Yao W. Time-varying coefficient models for joint modeling binary and continuous outcomes in longitudinal data. Stat Sin. 2016;26:979–1000.

  20. Céline LP, et al. Using a continuous riverscape survey to examine the effects of the spatial structure of functional habitats on fish distribution. J Freshwater Ecol. 2015;31(1):1–19.

  21. Lu Y, Lu S, Fotouhi F, Deng Y, Brown SJ. Incremental genetic K-means algorithm and its application in gene expression data analysis. BMC Bioinform. 2004;5:172.

  22. Maruotti A, et al. Time-varying clustering of multivariate longitudinal observations. Commun Stat. 2016;45(2):430–43.

  23. Melnykov V, Maitra R. Finite mixture models and model-based clustering. Stat Surv. 2010;4:80–116.

  24. Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50(2):159–79.

  25. Morris R, et al. Developmental classification of reading-disabled children. J Clin Exp Neuropsychol. 1986;8(4):371–92.

  26. Ng CC. Examining the self-congruent engagement hypothesis: the link between academic self-schemas, motivational goals, learning approaches and achievement within an academic year. Educ Psychol. 2014;43(6):730–62.

  27. Oh M-S, Raftery AE. Model-based clustering with dissimilarities: a Bayesian approach. J Comput Graph Stat. 2007;16:559–85.

  28. Pourahmadi M. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika. 1999;86(3):677–90.

  29. Proust-Lima C, Philipps V, Liquet B. Estimation of extended mixed models using latent classes and latent processes: the R package lcmm. J Stat Softw. 2017;78(2):1–56.

  30. Qin S, et al. Forage crops alter soil bacterial and fungal communities in an apple orchard. Acta Agriculturae Scandinavica. 2016;66(3):229–36.

  31. Rossi F, Conan-Guez B, Golli AE. Clustering functional data with the SOM algorithm. In: Proceedings of ESANN, 2004;305–312.

  32. Shim Y, Chung J, Choi I-C. A comparison study of cluster validity indices using a nonhierarchical clustering algorithm. IEEE Comput Soc. 2005.

  33. Sousa P, Oliveira A, Gomes M, Gaio AR, Duarte R. Longitudinal clustering of tuberculosis incidence and predictors for the time profiles: the impact of HIV. Int J Tuberc Lung Dis. 2016;20(8):1027–32.

  34. Tarpey T, Kinateder K. Clustering functional data. J Classif. 2003;20:93–114.

  35. Vu DQ, Hunter DR, Schweinberger M. Model-based clustering of large networks. Ann Appl Stat. 2013;7(2):1010.

  36. Zhong P-S, Li R, Santo S. Homogeneity test of covariance matrices and change-points identification with high-Dimensional longitudinal data. Biometrika. 2019;106:619–34.

Author information

Corresponding author

Correspondence to Joaquim F. Pinto da Costa.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pinto da Costa, J.F., Ferreira, F., Mascarello, M. et al. Clustering of Longitudinal Trajectories Using Correlation-Based Distances. SN COMPUT. SCI. 2, 432 (2021). https://doi.org/10.1007/s42979-021-00822-2

  • DOI: https://doi.org/10.1007/s42979-021-00822-2
