Many methods have recently been proposed for representing high-dimensional data in a lower-dimensional space. Matrix Factorization (MF), an efficient dimension-reduction method, is increasingly used in a wide range of applications. However, these methods often cannot handle data with missing entries. In a Semi-Supervised Learning (SSL) scenario, many commonly used missing-value imputation methods, e.g., KNN imputation, cannot exploit the available label information, which is among the most discriminative information in the data. To account for outliers in the observed entries, in this paper we propose an algorithm called Correntropy based Constraint Nonnegative Matrix Factorization Completion (CCNMF) for simultaneously constructing a robust representation and imputing high-dimensional data in an SSL scenario. Specifically, the Maximum Correntropy Criterion (MCC) is used to construct the CCNMF model so as to alleviate the negative effects of non-Gaussian noise and outliers in the data. To solve the optimization problem, an iterative algorithm based on the Fenchel Conjugate (FC) and the Block Coordinate Update (BCU) framework is proposed. We show that the proposed algorithm guarantees convergence not only of the objective sequence but also of the iterate sequence. Experiments are conducted on real-world image datasets and a community health dataset. In many cases, the proposed method outperforms several state-of-the-art methods in both representation and imputation.
The ORL and Yale datasets are publicly available. The community health data is available from the corresponding author, Kup-Sze Choi, upon reasonable request.
This work was made possible by support from the National Natural Science Foundation of China (No. 11901063), the National Key R&D Program of China (No. 2021ZD0112701), the Innovation and Technology Fund of Hong Kong (No. MRP/015/18), the General Research Fund of the Hong Kong Research Grants Council (No. PolyU 152006/19E), and the Scientific Research Fund of the Sichuan Provincial Science and Technology Department (Nos. 2022NSFSC0462, 2021YFG0133, 2021YFG0295, 21ZDYF3598 and 2021YFH0069).
Appendix: A. Proof of Theorem 1
The problem in (19) can be decomposed into n × d independent problems, each involving a single element of \({\mathbf {B}}\) and taking the form
If (i, j) ∈ Ω, then, due to the constraint \({\mathcal {P}}_{\Omega }({\mathbf {B}})={\mathcal {P}}_{\Omega }({\mathbf {A}})\), \({\mathbf {B}}^{*}_{i,j}\) must equal \({\mathbf {A}}_{i,j}\). If (i, j) ∉ Ω, problem (A.1) can be written as
which is a standard quadratic function whose minimizer is \({\mathbf {B}}^{*}_{i,j}=D_{i,j}\). Therefore, (20) is the optimal solution of (19). □
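The elementwise argument above amounts to a masked merge: observed entries are copied from A, and all other entries are taken from D. The following is a minimal NumPy sketch of that rule, not the authors' implementation. The function name `complete_entries` and the boolean-mask encoding of Ω (`omega_mask`) are assumptions introduced here for illustration; D stands for whatever matrix appears in (A.1).

```python
import numpy as np


def complete_entries(A, D, omega_mask):
    """Return B with B[i, j] = A[i, j] on observed entries and D[i, j] elsewhere.

    A          : data matrix (values outside the mask are ignored)
    D          : matrix supplying the minimizer on unobserved entries
    omega_mask : boolean array, True where (i, j) is in Omega (assumed encoding)
    """
    A = np.asarray(A, dtype=float)
    D = np.asarray(D, dtype=float)
    omega_mask = np.asarray(omega_mask, dtype=bool)
    # np.where picks from A where the mask is True and from D where it is False.
    return np.where(omega_mask, A, D)


# Small illustrative example with made-up numbers.
A = np.array([[1.0, 0.0], [0.0, 4.0]])
D = np.array([[9.0, 2.0], [3.0, 9.0]])
mask = np.array([[True, False], [False, True]])
B = complete_entries(A, D, mask)
```

By construction, the result agrees with A on the mask, so the constraint \({\mathcal {P}}_{\Omega }({\mathbf {B}})={\mathcal {P}}_{\Omega }({\mathbf {A}})\) holds for any choice of D.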
Appendix: B. The Equivalence of (12c) and (15)
Let \({\mathbf {C}} = \nabla _{{\mathbf {X}}} f(\hat {{\mathbf {X}}}^{t},{\mathbf {Z}}^{t},{\mathbf {P}}^{t+1},{\mathbf {B}}^{t})\) and \(L = L_{{\mathbf {X}}}^{t}\). Then the objective of (12c) can be written as
After dropping the terms in (B.1) that are independent of \({\mathbf {X}}\), (12c) can be reformulated as follows:
Finally, (12c) can be reformulated as:
which is the formulation of (15).
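The completing-the-square step behind this kind of equivalence can be sketched as follows. This is a sketch under an assumption, not a restatement of (B.1) to (B.3): it supposes that (12c) has the standard prox-linear form used in BCU schemes, with a linearization at \(\hat {{\mathbf {X}}}^{t}\) plus a quadratic proximal term. The feasible set \({\mathcal {X}}\) and the projection \({\mathcal {P}}_{{\mathcal {X}}}\) are written generically and are likewise assumptions.

```latex
% Assumed prox-linear form of the X-update, with C and L as defined above:
\[
\min_{\mathbf{X} \in \mathcal{X}} \;
\langle \mathbf{C}, \mathbf{X} - \hat{\mathbf{X}}^{t} \rangle
+ \frac{L}{2} \lVert \mathbf{X} - \hat{\mathbf{X}}^{t} \rVert_F^2 .
\]
% Completing the square in X:
\[
\langle \mathbf{C}, \mathbf{X} - \hat{\mathbf{X}}^{t} \rangle
+ \frac{L}{2} \lVert \mathbf{X} - \hat{\mathbf{X}}^{t} \rVert_F^2
= \frac{L}{2} \Bigl\lVert \mathbf{X}
  - \Bigl( \hat{\mathbf{X}}^{t} - \tfrac{1}{L} \mathbf{C} \Bigr) \Bigr\rVert_F^2
- \frac{1}{2L} \lVert \mathbf{C} \rVert_F^2 .
\]
% The last term does not depend on X, so the minimizer is the projection
% of a gradient step onto the (assumed) feasible set:
\[
\mathbf{X}^{t+1}
= \arg\min_{\mathbf{X} \in \mathcal{X}}
  \Bigl\lVert \mathbf{X}
  - \Bigl( \hat{\mathbf{X}}^{t} - \tfrac{1}{L} \mathbf{C} \Bigr) \Bigr\rVert_F^2
= \mathcal{P}_{\mathcal{X}}\Bigl( \hat{\mathbf{X}}^{t} - \tfrac{1}{L} \mathbf{C} \Bigr).
\]
```

If the feasible set is the nonnegative orthant, as is usual for nonnegative factors, the projection reduces to an elementwise maximum with zero.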
Zhou, N., Du, Y., Liu, J. et al. Robust semi-supervised data representation and imputation by correntropy based constraint nonnegative matrix factorization. Appl Intell 53, 11599–11617 (2023). https://doi.org/10.1007/s10489-022-03884-8