Clustering of Longitudinal Trajectories Using Correlation-Based Distances

Pinto da Costa, Joaquim F.; Ferreira, Fábio; Mascarello, Martina; Gaio, Rita

doi:10.1007/s42979-021-00822-2

Original Research
Published: 28 August 2021

Volume 2, article number 432, (2021)

SN Computer Science

Joaquim F. Pinto da Costa¹ (ORCID: 0000-0002-3991-2715), Fábio Ferreira¹, Martina Mascarello¹ & Rita Gaio¹

Abstract

The defining feature of a longitudinal data set is that individuals are measured repeatedly through time, giving rise to vectors of observations that tend to be intercorrelated. In longitudinal studies with a large number of subjects, it is often of interest to cluster the longitudinal trajectories and to define a much smaller number of mean trajectories. Several methods have been developed to extend cluster analysis to longitudinal data. Firstly, we introduce a novel non-parametric methodology for clustering longitudinal data. The correlations between the observations from individual trajectories are taken into account by pre-defined correlation matrices with parameters that are estimated from the data. An original Mahalanobis-type distance using this correlation matrix is considered, and a longitudinal K-Means algorithm is then applied. Regarding the computation of the clustering, a very useful result is introduced which allows us to use the well-known kml or kml3d algorithm (Genolini et al. J Stat Softw 65(4):1–34, 2015), thus avoiding the need for a new computer program. In fact, we show that our method with the new Mahalanobis-type distance coincides with the application of the longitudinal K-Means algorithm (kml), using the Euclidean distance, to certain transformed trajectories. This property simplifies the process for general users. Secondly, in circumstances where it is the relative behavior of the trajectories that matters, rather than their absolute values, we propose the use of profiles before entering the algorithm. The methodology is tested on simulated data with different time behaviors and also on real data. The results are compared with those obtained from the direct application of the K-Means algorithm to the original data and to the profiled data. In general, the new methodology produces better results than the straightforward application of the longitudinal K-Means algorithm (kml) to the raw data. In addition, a comparison with a parametric model, lcmm (Proust-Lima et al. J Stat Softw 78(2), 2017), is presented.
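
The equivalence stated in the abstract, between a Mahalanobis-type distance and the Euclidean distance on transformed trajectories, can be illustrated with a small numerical check. The sketch below is not the authors' implementation: it assumes the squared distance has the form (x - y)^T R^{-1} (x - y), it assumes an AR(1) correlation matrix R with a hand-picked parameter rho (the abstract only says the parameters are estimated from the data), and it assumes the transformation is z = L^{-1} x with R = L L^T (Cholesky factor). All function names are hypothetical.

```python
import math
import random


def ar1_correlation(n_times, rho):
    """Assumed AR(1) correlation structure: R[i][j] = rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]


def cholesky(R):
    """Lower-triangular L with L L^T = R (R must be positive definite)."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b for lower-triangular L, i.e. z = L^{-1} b."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def back_solve(L, b):
    """Solve L^T w = b for lower-triangular L."""
    n = len(L)
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(L[k][i] * w[k] for k in range(i + 1, n))) / L[i][i]
    return w


def mahalanobis_sq(x, y, L):
    """(x - y)^T R^{-1} (x - y) with R = L L^T, via two triangular solves."""
    d = [a - b for a, b in zip(x, y)]
    r_inv_d = back_solve(L, forward_solve(L, d))
    return sum(a * b for a, b in zip(d, r_inv_d))


def euclidean_sq(u, v):
    """Ordinary squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Two synthetic trajectories observed at 6 time points (illustrative data only).
random.seed(0)
n_times = 6
x = [random.gauss(0.0, 1.0) for _ in range(n_times)]
y = [random.gauss(0.0, 1.0) for _ in range(n_times)]

L = cholesky(ar1_correlation(n_times, rho=0.7))

# Distance computed directly with the correlation matrix ...
d_mahal = mahalanobis_sq(x, y, L)
# ... and as a plain Euclidean distance between transformed trajectories.
d_eucl = euclidean_sq(forward_solve(L, x), forward_solve(L, y))

print(d_mahal, d_eucl)
```

Because the transformation is linear, the mean of the transformed trajectories is the transform of the mean trajectory, so under these assumptions a K-Means run with the Euclidean distance on the transformed data minimizes the same objective as a K-Means run with the Mahalanobis-type distance on the original data. This is consistent with the abstract's remark that an existing tool such as kml can be reused without a new program.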

References

Abraham C, Cornillon P, Matzner-Lober E, Molinari N. Unsupervised curve clustering using B-splines. Scand J Stat. 2003;30:581–95.
Bagirov MA, Karmitsa N, Taheri S. Metaheuristic clustering algorithms. In: Partitional clustering via nonsmooth optimization. Unsupervised and semi-supervised learning. Springer, 2020.
Beauchaine TP, Beauchaine RJ. A comparison of maximum covariance and K-means cluster analysis in classifying cases into known taxon groups. Psychol Methods. 2002;7(2):245–61.
Calinski T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27.
Charrad M, Ghazzali N, Boiteau V, Niknafs A. NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Softw. 2014;63(6):1–36.
Ciampi A, et al. Model-based clustering of longitudinal data: application to modeling disease course and gene expression trajectories. Commun Stat. 2012;41(7):992–1005.
Delmelle EC. Mapping the DNA of urban neighborhoods: clustering longitudinal sequences of neighborhood socioeconomic change. Ann Am Assoc Geogr. 2016;106(1):36–56.
Den Teuling NGP, Pauws SC, van den Heuvel ER. A comparison of methods for clustering longitudinal data with slowly changing trends. Commun Stat. (Published online: 19 Jan 2021).
Diggle P, Heagerty P, Liang K-Y, Zeger S. Analysis of longitudinal data. New York: Oxford University Press Inc.; 2002.
Fitzmaurice G, Laird N, Ware J. Applied longitudinal analysis. New Jersey: Wiley; 2004.
Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–31.
Genolini C, et al. KmL: k-means for longitudinal data. Berlin: Springer; 2009.
Genolini C, et al. kml and kml3d: packages to cluster longitudinal data. J Stat Softw. 2015;65(4):1–34.
Genolini C, Écochard R, Jacqmin-Gadda H. Copy mean: a new method to impute intermittent missing values in longitudinal studies. Open J Stat. 2013;3(04):26.
Hastie T, et al. The elements of statistical learning. Data mining inference and predictions. Berlin: Springer; 2009.
Hedeker D, Gibbons RD. Longitudinal data analysis. Wiley Series in Probability and Statistics; 2006.
Heggeseth BC. Longitudinal cluster analysis with applications to growth trajectories. Berkeley: University of California; 2013.
James G, Sugar C. Clustering for sparsely sampled functional data. J Am Stat Assoc. 2003;98:397–408.
Kurum E, Li R, Shiffman S, Yao W. Time-varying coefficient models for joint modeling binary and continuous outcomes in longitudinal data. Stat Sin. 2016;26:979–1000.
Céline LP, et al. Using a continuous riverscape survey to examine the effects of the spatial structure of functional habitats on fish distribution. J Freshwater Ecol. 2015;31(1):1–19.
Lu Y, Lu S, Fotouhi F, Deng Y, Brown SJ. Incremental genetic K-means algorithm and its application in gene expression data analysis. BMC Bioinform. 2004;5:172.
Maruotti A, et al. Time-varying clustering of multivariate longitudinal observations. Commun Stat. 2016;45(2):430–43.
Melnykov V, Maitra R. Finite mixture models and model-based clustering. Stat Surv. 2010;4:80–116.
Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50(2):159–79.
Morris R, et al. Developmental classification of reading-disabled children. J Clin Exp Neuropsychol. 1986;8(4):371–92.
Ng CC. Examining the self-congruent engagement hypothesis: the link between academic self-schemas, motivational goals, learning approaches and achievement within an academic year. Educ Psychol. 2014;43(6):730–62.
Oh M-S, Raftery AE. Model-based clustering with dissimilarities: a Bayesian approach. J Comput Graph Stat. 2007;16:559–85.
Pourahmadi M. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika. 1999;86(3):677–90.
Proust-Lima C, Philipps V, Liquet B. Estimation of extended mixed models using latent classes and latent processes: the R package lcmm. J Stat Softw. 2017;78(2):1–56.
Qin S, et al. Forage crops alter soil bacterial and fungal communities in an apple orchard. Acta Agriculturae Scandinavica. 2016;66(3):229–36.
Rossi F, Conan-Guez B, Golli AE. Clustering functional data with the SOM algorithm. In: Proceedings of ESANN, 2004;305–312.
Shim Y, Chung J, Choi I-C. A comparison study of cluster validity indices using a nonhierarchical clustering algorithm. IEEE Comput Soc. 2005.
Sousa P, Oliveira A, Gomes M, Gaio AR, Duarte R. Longitudinal clustering of tuberculosis incidence and predictors for the time profiles: the impact of HIV. Int J Tuberc Lung Dis. 2016;20(8):1027–32.
Tarpey T, Kinateder K. Clustering functional data. J Classif. 2003;20:93–114.
Vu DQ, Hunter DR, Schweinberger M. Model-based clustering of large networks. Ann Appl Stat. 2013;7(2):1010.
Zhong P-S, Li R, Santo S. Homogeneity test of covariance matrices and change-points identification with high-Dimensional longitudinal data. Biometrika. 2019;106:619–34.

Author information

Authors and Affiliations

Departamento de Matemática, Faculdade de Ciências, Universidade do Porto, R. do Campo Alegre, 687, 4169-007, Porto, Portugal
Joaquim F. Pinto da Costa, Fábio Ferreira, Martina Mascarello & Rita Gaio

Corresponding author

Correspondence to Joaquim F. Pinto da Costa.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pinto da Costa, J.F., Ferreira, F., Mascarello, M. et al. Clustering of Longitudinal Trajectories Using Correlation-Based Distances. SN COMPUT. SCI. 2, 432 (2021). https://doi.org/10.1007/s42979-021-00822-2

Received: 01 March 2021
Accepted: 12 August 2021
Published: 28 August 2021
DOI: https://doi.org/10.1007/s42979-021-00822-2
