1 Introduction

1.1 An Overview of the Matrix Completion Approach

With the advancements in data collection and the increased availability of data, the problem of missing values will only intensify. Traditional approaches to treating this problem simply remove rows and/or columns that have missing values but, especially in online applications, this can mean removing most of the rows and columns, as most collected data is sparse. Naïve approaches impute missing values with the mean or median of the column, which changes the distribution of the variables and increases the bias in the model. More complex approaches create one model for each column based on the other variables; our tests show that this works well for small matrices, but the computational time grows rapidly as more columns are added. For purely numerical datasets, matrix factorization using SVD-based models proved effective on the Netflix Prize data, but it has the drawbacks of inferring a linear combination between variables and of not working well with mixed datasets (continuous and categorical). For sequential data, research has been done using Recurrent Neural Networks (RNNs). However, the purpose of this paper is to create a general matrix completion algorithm that does not depend on the data being sequential, works with both continuous and categorical variables, and can serve as the building block of a recommendation system. A novel model is proposed that uses an autoencoder to reconstruct each row and impute the unknown values from the known values, with a cost function that optimizes the continuous and categorical variables separately. Tests show that this method outperforms more complex models in a fraction of the execution time.

Matrix completion is a problem that has been around for decades but took prominence in 2006 with the Netflix Prize, where the first model to beat Netflix's baseline recommender system by more than 10% would win 1 million dollars. In that dataset, each row represented a different user and each column a different movie. When user i rated movie j, position ij of the matrix would contain the rating; otherwise it would be a missing value. This is a very particular type of dataset, as every column represented a movie for which only a limited range of ratings was possible (1–5). It is fair to say that the differences between the values in the columns reflect the taste of the user but, in a general sense, each column represents the same concept, i.e., a movie. Most of the research in matrix completion and recommendation systems has been done on datasets of this type, predicting the rating that a user will give to a movie, song, book, or other content. However, most datasets created in the real world are not of this type, as each column may represent a different type of data. The data could be demographic (age, income, etc.), geographical (city, state, etc.), or medical (temperature, blood pressure, etc.), just to name a few. Any dataset may have missing values, and the purpose of this work is to create a general model that imputes these missing values and recommends content in the presence of all possible types of data.

1.2 The State of the Art

Naïve Approaches

The most basic approach is to fill the missing values with the mean or median (for continuous variables) or the mode (for categorical variables). This method presents two clear problems: first, it changes the distribution of the variable by giving the imputed value more prominence and representation than it really has in the data; second, it introduces bias into the model, as the same value is imputed for every missing entry in a given column. This is especially a problem for highly sparse datasets. A variation of this method imputes the mean or median of the row (instead of the column), but it only works for continuous variables. The mode could be used for both continuous and categorical variables but still presents the problems described above. Further models can be found in [1, 6, 48, 66,67,68].
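For illustration, a minimal pandas sketch of this naïve imputation (the function name naive_impute and the use of pandas are ours, not part of any cited package):

```python
import pandas as pd

def naive_impute(df):
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Continuous column: fill with the column mean (the median works the same way).
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Categorical column: fill with the most frequent level (mode).
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```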

Collaborative Filtering and Content-Based Filtering

Collaborative filtering is one of the main methods for completing Netflix-style datasets. In collaborative filtering, a similarity between rows (or columns) is calculated and used to compute a weighted average of the known values, which is then used to impute the missing values. This method only works for numerical datasets and does not scale well, as the similarity must be computed for every pair of rows (or columns), which is computationally expensive.
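As an illustration, a minimal row-based (user-based) variant using cosine similarity, sketched in Python (the function collaborative_impute and its details are ours, not a specific package implementation):

```python
import numpy as np

def collaborative_impute(X):
    """X: 2-D float array with np.nan for missing entries (numerical only)."""
    known = ~np.isnan(X)
    known_f = known.astype(float)
    filled = np.nan_to_num(X, nan=0.0)

    # Cosine similarity between every pair of rows (expensive for large matrices).
    norms = np.linalg.norm(filled, axis=1, keepdims=True) + 1e-12
    sim = (filled / norms) @ (filled / norms).T
    np.fill_diagonal(sim, 0.0)

    # Weighted average of the known values of the other rows.
    weights = sim @ known_f                  # total similarity backing each estimate
    estimates = (sim @ filled) / (weights + 1e-12)

    out = X.copy()
    out[~known] = estimates[~known]
    return out
```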

Content-based filtering uses attributes of the columns to find the similarity between them and then computes the same kind of weighted average to impute the missing values. This method also only works for numerical datasets.

SVD Based

The Singular Value Decomposition works by finding the latent factors of the matrix by factorizing it into 3 matrices:

$$ X = U\varSigma V^{T} $$

Where U is an m x m unitary matrix, Σ is an m x n diagonal matrix, and V is an n x n unitary matrix. The diagonal of Σ contains the singular values of X, and the columns of U and V are orthonormal. The method reconstructs X by finding its low-rank approximation. A preprocessing step is required to pre-impute the missing values, usually with the mean of the column, as missing values are not permitted. This method is one of the most popular ones, as SVD-based factorization was at the core of the winning solution of the Netflix Prize, but it has the drawbacks of only working on numerical datasets, of inferring a linear combination of the columns, and of usually being fit only for Netflix-style datasets.
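A minimal sketch of SVD-based imputation on a numerical matrix: pre-impute with the column means, compute a rank-k truncated SVD, and take the low-rank reconstruction at the missing positions (a single pass; the function name and rank parameter are ours, and iterative variants would refine this):

```python
import numpy as np

def svd_impute(X, k=10):
    """X: 2-D float array with np.nan for missing entries (numerical only)."""
    missing = np.isnan(X)

    # Pre-impute missing entries with the column mean (SVD cannot handle NaN).
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Rank-k truncated SVD: X ≈ U_k Σ_k V_k^T.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]

    # Keep the observed values; use the low-rank reconstruction for missing ones.
    return np.where(missing, low_rank, X)
```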

More Complex Approaches

More complex approaches create one model for each variable with missing values, using the rows with known values in that column as the training set. A model is trained using all the variables except that one column as the input, and that column as the output. Once the model is trained, the missing values are estimated by predicting the output for the remaining rows. The principal drawback of these methods is that the number of models that have to be trained increases with the number of columns of the dataset, making them very computationally expensive for large datasets. Depending on the model used, this framework can work for mixed datasets or for numerical-only datasets. Pre-imputing missing values, usually with the mean of the column, is still needed, as missing values are not permitted in the inputs.
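A minimal sketch of this framework for a numerical matrix, using a random forest regressor as the per-column model (the function name and the single-pass scheme are ours; packages such as missForest iterate this until convergence):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def per_column_impute(X):
    """X: 2-D float array with np.nan for missing entries (numerical only)."""
    # Pre-impute with column means so every model sees complete inputs.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(np.isnan(X), col_means, X)

    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if not missing.any():
            continue
        other = np.delete(filled, j, axis=1)
        # One model per column: train on the rows where column j is observed...
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(other[~missing], X[~missing, j])
        # ...and predict column j for the rows where it is missing.
        filled[missing, j] = model.predict(other[missing])
    return filled
```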

Some implementations of these models use random forests (missForest, works for mixed datasets), chained equations (mice, works for numerical data only), EMB (Amelia, works for mixed datasets in theory, but in this paper only the numerical part worked), and FAMD (missMDA, works for mixed datasets).

2 Our Deep Learning Model

2.1 The General Framework

When designing the model, three main objectives were considered:

  • Minimize reconstruction error for continuous variables

  • Minimize reconstruction error for categorical variables

  • Eliminate the effect of missing values in the model

Our proposed method uses autoencoders to reconstruct the dataset and impute the missing values. The concept extends the idea behind the SVD method, finding a low-dimensional latent representation, to a deep learning model. An autoencoder is an unsupervised method that tries to reconstruct its input at its output using a neural network trained with backpropagation.

A general overview of the model is shown in Fig. 1.

Fig. 1. The general overview of the model.

2.2 The Pre-processing of the Dataset

The dataset can be of three types: all continuous, all categorical, or mixed (some columns continuous and some categorical). Therefore, the first step in pre-processing the data is determining which columns are numerical and which are categorical. The procedure followed in this work to achieve this is shown in Fig. 2 below.

Fig. 2. The column type definition.

Once the column types are known, each continuous column (if any exist) is normalized using min-max scaling, so that every numerical column is scaled between 0 and 1. Normalizing the data is a necessary step when applying neural networks. The minimum and maximum values of each column are saved so that the reconstructed matrix can later be rescaled to the original scale.

After normalizing the continuous columns, the next step is encoding the categorical columns. For simplicity, and because the order of the columns is not relevant to the model, all the continuous columns are moved to the beginning of the matrix and the categorical columns to the end. Then, each categorical column is encoded using one-hot encoding, where one new column is created for each level of each categorical variable: the column corresponding to the label has a value of 1 and the rest a value of 0.
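As an illustration, a minimal sketch of the scaling and encoding steps, assuming a pandas DataFrame df and known lists of numeric and categorical column names (the function preprocess and its details are ours):

```python
import numpy as np
import pandas as pd

def preprocess(df, numeric_cols, categorical_cols):
    # Min-max scale each continuous column to [0, 1]; keep min/max for rescaling later.
    scale_info = {}
    scaled = pd.DataFrame(index=df.index)
    for col in numeric_cols:
        col_min, col_max = df[col].min(), df[col].max()
        scale_info[col] = (col_min, col_max)
        scaled[col] = (df[col] - col_min) / (col_max - col_min)

    # One-hot encode each categorical column (one 0/1 column per level),
    # keeping NaN where the original value is missing so the masks can find it.
    dummies = []
    for col in categorical_cols:
        d = pd.get_dummies(df[col], prefix=col).astype(float)
        d[df[col].isna()] = np.nan
        dummies.append(d)

    # Continuous columns first, categorical (one-hot) columns last.
    encoded = pd.concat([scaled] + dummies, axis=1)
    return encoded, scale_info
```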

At this point, the matrix is all numerical and every value is between 0 and 1. For the reasons explained in Sect. 2.3, three masks are extracted from the encoded dataset:

  • Missing Value Mask: same shape as the encoded matrix, where the missing values are encoded as 0 and the non-missing values as 1.

  • Numerical Mask: a vector of the same length as the number of columns, where the continuous columns (if any) are encoded as 1 and the categorical columns (if any) as 0.

  • Categorical Mask: the complement of the numerical mask, where the continuous columns are encoded as 0 and the categorical as 1.

The last step in encoding the matrix is converting all missing values to 0. This serves two purposes: first, neural networks cannot handle missing values; second, it removes the effect of the missing entries on the corresponding nodes of the network. Once the encoded matrix and the three masks are created, the training step can begin.
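A minimal sketch of building the three masks and zero-filling the missing entries, assuming the encoded matrix from the sketch above (the function build_masks is ours):

```python
import numpy as np

def build_masks(encoded, n_numeric):
    """encoded: 2-D array (e.g. encoded_df.to_numpy()) with NaN for missing
    entries; the first n_numeric columns are continuous, the rest one-hot."""
    X = np.asarray(encoded, dtype=float)

    # Missing-value mask: 0 where the entry is missing, 1 otherwise.
    miss_mask = (~np.isnan(X)).astype(float)

    # Numerical mask: 1 for continuous columns, 0 for categorical ones.
    num_mask = np.zeros(X.shape[1])
    num_mask[:n_numeric] = 1.0

    # Categorical mask: the complement of the numerical mask.
    cat_mask = 1.0 - num_mask

    # Replace missing entries with 0 so the network receives only numbers.
    X = np.nan_to_num(X, nan=0.0)
    return X, miss_mask, num_mask, cat_mask
```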

2.3 Training the Autoencoder

To train the autoencoder, each row of the encoded matrix is treated as both the input and the output. Therefore, the number of nodes in the input layer (n_input) and in the output layer equals the number of columns in the encoded matrix.

The architecture consists of 3 hidden layers. The design is symmetrical, with the number of nodes in each hidden layer as follows:

  • Hidden Layer 1: n_input/2

  • Hidden Layer 2: n_input/4

  • Hidden Layer 3: n_input/2

There are two encoding layers and two decoding layers. The number of nodes in the hidden layers is smaller than in the input layer because the idea is to project the data onto a lower-dimensional space, find the latent factors, and reconstruct the dataset from there.

Figure 3 shows the autoencoder neural network architecture, with the dimensions of each encoding/decoding layer. The “+1” in the first dimension of each encoder/decoder is the bias term that was added.

Fig. 3. The network architecture.

The activation function used for each node is the sigmoid, given as

$$ \sigma \left( x \right) = \frac{1}{{1 + e^{ - x} }} $$

The outputs of the encoders and decoders are computed as follows:

$$ Encoder\,1 = \sigma \left( {X*W_{E1} + B_{E1} } \right) $$

Where * denotes matrix multiplication, \( W_{E1} \) are the weights of encoder 1 learned by the network (initialized randomly), and \( B_{E1} \) is the bias of encoder 1, also learned by the network (initialized randomly). This result is fed to the second encoder,

$$ Encoder\,2 = \sigma \left( {Encoder\,1*W_{E2} + B_{E2} } \right) $$

Similarly, for the Decoders:

$$ Decoder\,1 = \sigma \left( {Encoder\,2*W_{D1} + B_{D1} } \right) $$
$$ X^{{\prime }} = Decoder\,2 = \sigma \left( {Decoder\,1*W_{D2} + B_{D2} } \right) $$

The output of decoder 2 has the same dimensions as the input and is the output against which the weights are trained.
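A minimal numpy sketch of this forward pass, with the layer sizes from Sect. 2.3 (the initialization details are assumptions, and the bias is kept as a separate vector, equivalent to the “+1” column in Fig. 3):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_weights(n_input, rng=np.random.default_rng(0)):
    # Layer sizes: input -> n/2 -> n/4 -> n/2 -> output, one (W, B) pair per layer.
    sizes = [n_input, n_input // 2, n_input // 4, n_input // 2, n_input]
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(X, params):
    """X: (batch, n_input) zero-filled encoded matrix; returns the reconstruction X'."""
    h = X
    for W, B in params:          # encoder 1, encoder 2, decoder 1, decoder 2
        h = sigmoid(h @ W + B)
    return h
```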

2.4 The Cost Functions

As stated previously, there are three main objectives in this work: to minimize the reconstruction error for both continuous and categorical variables, and to eliminate the effect of missing values on the model.

Continuous and categorical variables are different in nature and should therefore be treated differently in any model. In most neural network applications there is only one type of output variable (either continuous or categorical), but in this case the output nodes may be mixed. This work proposes a mixed cost function that is the sum of two separate cost functions, one for the continuous variables and one for the categorical variables.

$$ cost_{total} = cost_{continuous} + cost_{categorical} ,\qquad W^{*} ,B^{*} = \mathop {\text{argmin}}\limits_{W,B} \;cost_{total} $$

To distinguish between continuous and categorical variables, the numerical and categorical masks created earlier are used.

For the third objective, the missing-value mask is used so that only the error on values that are not missing is considered. With this approach there is no need to pre-impute missing values, as they have no effect on the overall cost function.

Mathematically, the continuous cost function is as follows:

$$ cost_{continuous} = \sum\nolimits_{i,j} {\left( {\left( {X_{ij}^{{\prime }} - X_{ij} } \right)\delta_{{num_{j} }} \delta_{{miss_{ij} }} } \right)^{2} } $$

Where \( X_{ij}^{{\prime }} \) is the output of decoder 2 for position ij, \( X_{ij} \) is the corresponding value in the original encoded matrix, \( \delta_{{num_{j} }} \) is the value of the numerical mask for column j, and \( \delta_{{miss_{ij} }} \) is the value of the missing-value mask for position ij. This cost therefore only considers values that are in numerical columns \( (\delta_{{num_{j} }} = 1) \) and that are not missing in the original matrix \( (\delta_{{miss_{ij} }} = 1) \).

The categorical cost function is given by the cross entropy:

$$ cost_{categorical} = - \sum\limits_{i,j} {\left( {X_{ij} \ln \left( {X_{ij}^{{\prime }} } \right) + \left( {1 - X_{ij} } \right){ \ln }\left( {1 - X_{ij}^{{\prime }} } \right)} \right)\delta_{{cat_{j} }} \delta_{{miss_{ij} }} } $$

Similarly, \( X_{ij}^{{\prime }} \) is the output of decoder 2 for position ij, \( X_{ij} \) is the corresponding value in the original encoded matrix, \( \delta_{{cat_{j} }} \) is the value of the categorical mask for column j, and \( \delta_{{miss_{ij} }} \) is the value of the missing-value mask for position ij. This cost therefore only considers values that are in categorical columns \( (\delta_{{cat_{j} }} = 1) \) and that are not missing in the original matrix \( (\delta_{{miss_{ij} }} = 1) \). The total cost function is minimized using gradient descent, with the learning rate set to a default of 0.01 for these tests.
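A minimal numpy sketch of the mixed, masked cost (the function name is ours, and a small epsilon is added inside the logarithms for numerical stability, which the paper does not discuss):

```python
import numpy as np

def mixed_cost(X, X_rec, miss_mask, num_mask, cat_mask, eps=1e-9):
    """X: encoded matrix with missing entries zero-filled;
    X_rec: reconstruction from decoder 2; masks as defined in Sect. 2.2."""
    # Continuous part: squared error on observed numerical entries only.
    cont = np.sum(((X_rec - X) * num_mask * miss_mask) ** 2)

    # Categorical part: cross-entropy on observed one-hot entries only.
    ce = -(X * np.log(X_rec + eps) + (1 - X) * np.log(1 - X_rec + eps))
    cat = np.sum(ce * cat_mask * miss_mask)

    return cont + cat
```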

2.5 The Post-processing of the Dataset

The output of the autoencoder is a matrix in which all the numerical columns come first and each categorical variable is split across several columns at the end (one per level), each with a value between 0 and 1. The goal is to reconstruct the original matrix, with the columns in their original order and each categorical variable as a single column containing its levels.

The first step is computing the “prediction” for each categorical variable, that is, the level that obtained the highest score at the output of decoder 2. Once that level is found, the name of the corresponding one-hot column is assigned as the category, or level, of that variable. This is repeated for all categorical variables.

Once each categorical column is decoded back to its original form and levels, the columns are reordered to match the order of the original dataset. The numerical variables are then scaled back using the minimum and maximum values saved for each column during the pre-processing step.
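A minimal sketch of this post-processing step, assuming the pandas encoding sketched in Sect. 2.2, where one-hot columns are named “&lt;variable&gt;_&lt;level&gt;” (the function postprocess is ours):

```python
import pandas as pd

def postprocess(X_rec, scale_info, categorical_cols, original_order):
    """X_rec: DataFrame of the reconstruction, with the same columns as the
    encoded matrix (scaled continuous columns first, one-hot columns last)."""
    out = pd.DataFrame(index=X_rec.index)

    # Rescale each continuous column back to its original range.
    for col, (col_min, col_max) in scale_info.items():
        out[col] = X_rec[col] * (col_max - col_min) + col_min

    # For each categorical variable, pick the level with the highest score.
    for col in categorical_cols:
        level_cols = [c for c in X_rec.columns if c.startswith(col + "_")]
        best = X_rec[level_cols].idxmax(axis=1)
        out[col] = best.str[len(col) + 1:]   # strip the "<variable>_" prefix

    # Restore the original column order.
    return out[original_order]
```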

At this point, the matrix has the same shape and scale as the original matrix, with all the missing values imputed.

The model in this work is a deep learning model that uses an autoencoder for content recommendation based on solving the matrix completion problem. The main idea this work proposes is extending the state of the art to impute missing values in any type of dataset, not just numerical ones. A principal contribution is a new, mixed cost function which, to the best of our knowledge, has not been used before: it detects which columns are continuous and which are categorical and computes the appropriate error for each type of data. This considerably improves the performance of the model and can be extended to any neural network application that requires output nodes of mixed types.

3 The Results and Conclusion

3.1 The Datasets and the Results

For this analysis, 15 publicly available datasets [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] were used. The datasets were selected to be diverse with respect to sparsity level, domain or application, proportion of numerical vs. categorical data, and number of rows and columns.

To create a more varied selection of data, 100 bootstrap samples were created from each dataset by selecting a random number of rows, a random number of columns, and a random number of missing values.

To measure the performance on continuous variables, the Normalized Root Mean Squared Error (NRMSE) is used. This metric allows comparing performance across different datasets regardless of their range or variance. The lower the NRMSE score, the better.

$$ NRMSE = \sqrt {\frac{{mean\left( {\left( {x_{true} - x_{pred} } \right)^{2} } \right)}}{{var\left( {x_{true} } \right)}}} $$

To measure the performance on categorical variables, the accuracy is used. The higher the accuracy score, the better.

$$ Accuracy = mean\left( {x_{true} = x_{pred} } \right) $$
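For reference, a minimal numpy sketch of the two metrics as defined above (the function names are ours):

```python
import numpy as np

def nrmse(x_true, x_pred):
    # Root mean squared error normalized by the variance of the true values.
    return np.sqrt(np.mean((x_true - x_pred) ** 2) / np.var(x_true))

def accuracy(x_true, x_pred):
    # Fraction of imputed categorical values that match the true ones.
    return np.mean(x_true == x_pred)
```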

The execution time is measured in seconds. The lower the execution time, the better.

To compare the performance of our model against other state-of-the-art models, seven R packages were used as baselines: Amelia [51], impute [49], mice [72], missForest [70], missMDA [59], rrecsys [11], and softImpute [48]. The models in these packages are state-of-the-art solutions to the matrix completion problem and cover all the approaches described in the introduction.

The number of missing values ranged from 0 to 100%, but limitations of the other packages allowed only up to 80% for most models and only up to 20% for the Amelia model.

Figure 4 shows the performance of the models over the 1500 bootstrap samples (100 per dataset), measured by the NRMSE. The model proposed in this paper outperforms all of the other models, with less variation in the results. The closest model, Amelia, was only tested with up to 20% sparsity, yet our autoencoder still improves the median NRMSE by 11% (0.09293 vs. 0.10395).

Fig. 4. Comparing the performance using NRMSE.

Figure 5 shows the accuracy on categorical variables for all packages that are able to handle them. Of the seven packages tested, only four can impute categorical variables. The model proposed in this paper sits in the middle in terms of median performance, with large variation in the results.

Fig. 5. Comparing the accuracy of different models.

Figure 6 shows the execution time in seconds for all the packages. The tests were run on a MacBook Pro with a 2.5 GHz Intel Core i7 processor. The autoencoder model is the third slowest; however, its median computational cost is still reasonable at about 0.5 s per model. Among the models that can handle categorical values, the two that outperform it in accuracy take about 5 times as long to execute, while our model achieves the best NRMSE of all models tested. Thus, among the models that can handle mixed datasets, ours offers the best trade-off between accuracy and execution time.

Fig. 6. Comparing the execution time of different models.

The results indicate that our model outperforms existing models: it has the best NRMSE of all models tested and, among the models that can handle mixed datasets, it offers the best trade-off between accuracy and computational cost.