Abstract
Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. This paper investigates the performance of five methods for handling missing data in PCA: the missing data passive (MDP) method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression (TSR) method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, and missing data were created in them both randomly and non-randomly. The censored data were then analyzed by the five methods, and their parameter recovery capability, as measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, with the DA method a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not excessive.
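The evaluation design described above (censor a complete data set at random, fit PCA to the incomplete data, and compare loadings by a mean congruence coefficient) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic data, the function names `pca_loadings`, `mean_congruence`, and `iterative_pca_impute`, and the generic iterative low-rank imputation, which is not any of the five methods as implemented in the paper. Components are compared column by column in their extracted order, with sign differences ignored and no rotation to congruence applied.

```python
import numpy as np

def pca_loadings(X, r):
    """Loadings (p x r) of the first r principal components of column-centered X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r].T * (s[:r] / np.sqrt(X.shape[0]))

def mean_congruence(A, B):
    """Mean Tucker congruence coefficient over matched columns of A and B.

    The absolute value makes the measure insensitive to sign flips of components.
    """
    num = np.abs(np.sum(A * B, axis=0))
    den = np.sqrt(np.sum(A**2, axis=0) * np.sum(B**2, axis=0))
    return float(np.mean(num / den))

def iterative_pca_impute(X, mask, r, n_iter=200, tol=1e-8):
    """Fill entries where mask is True by alternating rank-r fits and re-imputation.

    Illustrative stand-in only; not the MDP, WLRA, RPCA, TSR, or DA method.
    """
    Xf = X.copy()
    # Start from column means computed on the observed entries only.
    col_means = np.nanmean(np.where(mask, np.nan, X), axis=0)
    Xf[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_iter):
        mu = Xf.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        fit = (U[:, :r] * s[:r]) @ Vt[:r] + mu
        delta = np.max(np.abs(Xf[mask] - fit[mask])) if mask.any() else 0.0
        Xf[mask] = fit[mask]  # observed entries are never overwritten
        if delta < tol:
            break
    return Xf

# Synthetic complete data with two well-separated components (assumed, not from the paper).
rng = np.random.default_rng(0)
n, p, r = 200, 8, 2
scores = rng.normal(size=(n, r)) * np.array([3.0, 1.0])
load = rng.normal(size=(p, r))
X = scores @ load.T + 0.3 * rng.normal(size=(n, p))

# Random censoring at a 10% censor rate.
mask = rng.random(X.shape) < 0.10
X_imp = iterative_pca_impute(X, mask, r)

phi = mean_congruence(pca_loadings(X, r), pca_loadings(X_imp, r))
print(f"mean congruence at 10% censoring: {phi:.4f}")
```

Repeating the last few lines over a grid of `r` values and censor rates would give a rough analogue of the comparison summarized in the abstract, though only for this stand-in method.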
References
Bergami M, Bagozzi RP (2000) Self-categorization, affective commitment and group-esteem as distinct aspects of social identity in the organization. Brit J Soc Psychol 39:555–577
Bernaards CA, Sijtsma K (2000) Influence of imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivar Behav Res 35:321–364
Dray S, Josse J (2015) Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216:657–667
Folch-Fortuny A, Arteaga F, Ferrer A (2015) PCA model building with missing data. Chemom Intell Lab 146:77–88
Folch-Fortuny A, Arteaga F, Ferrer A (2016) Missing data imputation toolbox for MATLAB. Chemom Intell Lab 154:93–100
Gabriel KR, Zamir S (1979) Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 21:489–498
Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester
Grung B, Manne R (1998) Missing values in principal component analysis. Chemom Intell Lab 42:125–139
Hwang H, Takane Y (2014) Generalized structured component analysis: a component-based approach to structural equation modeling. Chapman and Hall/CRC Press, Boca Raton
Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957–2000
Josse J, Husson F, Pagès J (2009) Gestion des données manquantes en analyse en composantes principales. J de la Société Française de Statistique 150:28–51
Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. J de la Société Française de Statistique 153:79–99
Josse J, Timmerman ME, Kiers HAL (2013) Missing values in multi-level simultaneous component analysis. Chemom Intell Lab 129:21–32
Kiers HAL (1997) Weighted least squares fitting using iterative ordinary least squares algorithms. Psychometrika 62:251–266
Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York
McDonald RP, Burr EJ (1967) A comparison of four methods of constructing factor scores. Psychometrika 32:381–401
Meulman JJ (1982) Homogeneity analysis of incomplete data. DSWO Press, Leiden
Mezzich JE (1978) Evaluating clustering methods for psychiatric diagnosis. Biol Psychol 13:265–281
Mori Y, Iizuka M, Tarumi T, Tanaka Y (2007) Variable selection in principal component analysis. In: Härdle W, Mori Y, Vieu P (eds) Statistical methods for biostatistics and related fields. Springer, Berlin, pp 265–283
Overall JE, Gorham DR (1962) The brief psychiatric rating scale. Psychol Rep 10:799–812
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Schafer JL (1997) Analysis of incomplete multivariate data. Wiley, New York
Segi M (1979) Age-adjusted death rates for cancer for selected sites (A-classification) in 51 countries in 1974. Segi Institute of Cancer Epidemiology, Nagoya
Serneels S, Verdonck T (2008) Principal component analysis for data containing outliers and missing elements. Comput Stat Data Anal 52:1712–1727
Shibayama T (1995) A linear composite method for test scores with missing values. Mem Faculty Educ Niigata Univ 36:445–455
Stanimirova I, Daszykowski M, Walczak B (2008) Dealing with missing values and outliers in principal component analysis. Talanta 72:172–178
Takane Y (2013) Constrained principal component analysis and related techniques. Chapman and Hall/CRC Press, Boca Raton
Takane Y, Oshima-Takane Y (2003) Relationships between two methods for dealing with missing data in principal component analysis. Behaviometrika 30:145–154
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation (with discussion). J Am Stat Assoc 82:528–550
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc B 61:611–622
Tucker LR (1951) A method of synthesis of factor analysis studies. Personnel Research Section Report No. 984, U.S. Department of the Army, Washington, DC
Van Ginkel JR, Kroonenberg PM (2014) Using generalized procrustes analysis for multiple imputation in principal component analysis. J Classif 31:242–269
Van Ginkel JR, Kroonenberg PM, Kiers HAL (2014) Missing data in principal component analysis of questionnaire data. J Stat Comput Sim 84:2298–2315
Walczak B, Massart DL (2001) Dealing with missing data, part 1. Chemom Intell Lab 58:15–27
Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR (1997) Maximum likelihood principal component analysis. J Chemom 11:339–366
Acknowledgements
The work reported in this paper has been supported by a research grant (Discovery Grant: 10630) from the Natural Sciences and Engineering Research Council of Canada to the second author. We thank Aida Eslami for providing the reference to Josse and Husson (2012) on RPCA.
About this article
Cite this article
Loisel, S., Takane, Y. Comparisons among several methods for handling missing data in principal component analysis (PCA). Adv Data Anal Classif 13, 495–518 (2019). https://doi.org/10.1007/s11634-018-0310-9
DOI: https://doi.org/10.1007/s11634-018-0310-9
Keywords
- Homogeneity criterion
- Missing data passive (MDP) method
- Alternating least squares (ALS) algorithm
- Weighted low rank approximation (WLRA) method
- Regularized PCA (RPCA) method
- Trimmed scores regression (TSR) method
- Data augmentation (DA) method
- Congruence coefficient