Comparisons among several methods for handling missing data in principal component analysis (PCA)

Loisel, Sébastien; Takane, Yoshio

doi:10.1007/s11634-018-0310-9

Regular Article
Published: 18 January 2018

Volume 13, pages 495–518, (2019)

Advances in Data Analysis and Classification

Abstract

Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, as measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solutions was not excessive.
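The evaluation design described above (censor a complete data set, refit, and compare loadings by a mean congruence coefficient) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the function names are hypothetical, the data are synthetic, components are assumed to be already matched in order (only sign flips are handled, by taking absolute values), and missing entries are filled by column means, a naive baseline that is not one of the five methods compared in the article.

```python
import numpy as np


def censor_at_random(X, rate, rng):
    """Return a copy of X in which each entry is set to NaN with probability `rate`."""
    X = np.asarray(X, dtype=float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X


def pca_loadings(X, k):
    """Loadings (variables x k) from the SVD of the column-centered complete matrix X."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T * s[:k] / np.sqrt(X.shape[0])


def mean_congruence(A, B):
    """Mean absolute Tucker congruence between corresponding columns of A and B."""
    num = np.sum(A * B, axis=0)
    den = np.sqrt(np.sum(A ** 2, axis=0) * np.sum(B ** 2, axis=0))
    return float(np.mean(np.abs(num / den)))


def demo(rate=0.1, k=2, seed=0):
    """Compare full-data loadings with loadings from mean-imputed censored data."""
    rng = np.random.default_rng(seed)
    # Synthetic rank-k signal plus noise; a stand-in for a real complete data set.
    signal = rng.normal(size=(200, k)) @ rng.normal(size=(k, 6))
    X = signal + 0.1 * rng.normal(size=(200, 6))
    X_missing = censor_at_random(X, rate, rng)
    # Baseline only (assumption): replace each missing entry by its column mean.
    col_means = np.nanmean(X_missing, axis=0)
    X_filled = np.where(np.isnan(X_missing), col_means, X_missing)
    return mean_congruence(pca_loadings(X, k), pca_loadings(X_filled, k))
```

Calling `demo(rate=r, k=d)` over a grid of censor rates `r` and dimensionalities `d` gives a recovery surface of the kind the abstract describes; values near 1 indicate that the loadings from the censored data closely reproduce the full-data loadings.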

References

Bergami M, Bagozzi RP (2000) Self-categorization, affective commitment and group-esteem as distinct aspects of social identity in the organization. Brit J Soc Psychol 39:555–577
Bernaards CA, Sijtsma K (2000) Influence of imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivar Behav Res 35:321–364
Dray S, Josse J (2015) Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216:657–667
Folch-Fortuny A, Arteaga F, Ferrer A (2015) PCA model building with missing data. Chemom Intell Lab 146:77–88
Folch-Fortuny A, Arteaga F, Ferrer A (2016) Missing data imputation toolbox for MATLAB. Chemom Intell Lab 154:93–100
Gabriel KR, Zamir S (1979) Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 21:489–498
Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester
Grung B, Manne R (1998) Missing values in principal component analysis. Chemom Intell Lab 42:125–139
Hwang H, Takane Y (2014) Generalized structured component analysis: a component-based approach to structural equation modeling. Chapman and Hall/CRC Press, Boca Raton
Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957–2000
Josse J, Husson F, Pagès J (2009) Gestion des données manquantes en analyse en composantes principales. J de la Société Française de Statistique 150:28–51
Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. J de la Société Française de Statistique 153:79–99
Josse J, Timmerman ME, Kiers HAL (2013) Missing values in multi-level simultaneous component analysis. Chemom Intell Lab 129:21–32
Kiers HAL (1997) Weighted least squares fitting using iterative ordinary least squares algorithms. Psychometrika 62:251–266
Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York
McDonald RP, Burr EJ (1967) A comparison of four methods of constructing factor scores. Psychometrika 32:381–401
Meulman JJ (1982) Homogeneity analysis of incomplete data. DSWO Press, Leiden
Mezzich JE (1978) Evaluating clustering methods for psychiatric diagnosis. Biol Psychol 13:265–281
Mori Y, Iizuka M, Tarumi T, Tanaka Y (2007) Variable selection in principal component analysis. In: Härdle W, Mori Y, Vieu P (eds) Statistical methods for biostatistics and related fields. Springer, Berlin, pp 265–283
Overall JE, Gorham DR (1962) The brief psychiatric rating scale. Psychol Rep 10:799–812
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Schafer JL (1997) Analysis of incomplete multivariate data. Wiley, New York
Segi M (1979) Age-adjusted death rates for cancer for selected sites (A-classification) in 51 countries in 1974. Segi Institute of Cancer Epidemiology, Nagoya
Serneels S, Verdonck T (2008) Principal component analysis for data containing outliers and missing elements. Comput Stat Data Anal 52:1712–1727
Shibayama T (1995) A linear composite method for test scores with missing values. Mem Faculty Educ Niigata Univ 36:445–455
Stanimirova I, Daszykowski M, Walczak B (2008) Dealing with missing values and outliers in principal component analysis. Talanta 72:172–178
Takane Y (2013) Constrained principal component analysis and related techniques. Chapman and Hall/CRC Press, Boca Raton
Takane Y, Oshima-Takane Y (2003) Relationships between two methods for dealing with missing data in principal component analysis. Behaviometrika 30:145–154
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation (with discussion). J Am Stat Assoc 82:528–550
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc B 61:611–622
Tucker LR (1951) A method of synthesis of factor analysis studies. Personnel Research Section Report No. 984, U. S. Department of Army, Washington, DC
Van Ginkel JR, Kroonenberg PM (2014) Using generalized procrustes analysis for multiple imputation in principal component analysis. J Classif 31:242–269
Van Ginkel JR, Kroonenberg PM, Kiers HAL (2014) Missing data in principal component analysis of questionnaire data. J Stat Comput Sim 84:2298–2315
Walczak B, Massart DL (2001) Dealing with missing data, part 1. Chemom Intell Lab 58:15–27
Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR (1997) Maximum likelihood principal component analysis. J Chemom 11:339–366

Acknowledgements

The work reported in this paper has been supported by a research grant (Discovery Grant: 10630) from the Natural Sciences and Engineering Research Council of Canada to the second author. We thank Aida Eslami for providing the reference to Josse and Husson (2012) on RPCA.

Author information

Authors and Affiliations

Department of Mathematics, Heriot-Watt University, Edinburgh, EH14 4AS, UK
Sébastien Loisel
Department of Psychology, University of Victoria, 5173 Del Monte Avenue, Victoria, BC, V8Y 1X3, Canada
Yoshio Takane

Corresponding author

Correspondence to Yoshio Takane.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 56 KB)

About this article

Cite this article

Loisel, S., Takane, Y. Comparisons among several methods for handling missing data in principal component analysis (PCA). Adv Data Anal Classif 13, 495–518 (2019). https://doi.org/10.1007/s11634-018-0310-9

Received: 18 December 2016
Revised: 19 December 2017
Accepted: 11 January 2018
Published: 18 January 2018
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s11634-018-0310-9
