1 Introduction

With the prevalence of intelligent mobile devices, a huge volume of online transaction and browsing data has become available. Given the huge number of candidate items, Click-Through Rate (CTR) prediction has become a dominant component of these platforms. Specifically, CTR prediction estimates the likelihood that a user will click on an item, so that items with higher predicted probability can be displayed to users.

As CTR prediction is usually cast as a classification problem [4, 8, 9, 11, 14, 16], current solutions can be divided into two categories: traditional machine learning based models [3, 6, 21] and deep learning based models [5, 9, 13, 17, 22]. As one of the most effective traditional machine learning approaches, GBDT [7] constructs tree structures by iteratively selecting the feature with the most significant statistical information gain; it is well suited to automatically combining dense numerical features, but it struggles to learn from high-dimensional sparse categorical features [16]. Recently, DeepFM [9], a representative deep learning based model, captures the complex and hidden correlations between features for prediction; nevertheless, its learning performance on dense numerical features is not good enough. In fact, these two kinds of models are complementary, and jointly considering them would facilitate learning both the linear and non-linear features, further enhancing the performance of either model alone.

Current solutions for combining different models can be mainly classified into two categories [1, 11, 18, 24, 25]: either feed the intermediate results learned by one model into a second model [11, 18, 24, 25], or rely on ensemble techniques to fuse the outputs of two independently trained models [1, 18]. These models show the advantage of modeling both linear and non-linear features together and lead to better performance in practice. However, we argue that, since different models capture different characteristics of the data, instead of fusing them simply, we could design a better fusion model that more explicitly utilizes the different prediction power of these two kinds of models. Given these characteristics, it is critical to explore how to exploit the respective advantages of GBDT and NN and to merge the two types of models effectively, so as to overcome the one-sidedness of each in feature learning.

In this paper, we propose a fusion learning framework, based on the idea of ResNet [10], to improve CTR prediction on real-world data. The key idea is that we first train one model (e.g., GBDT), and let the second model (e.g., DeepFM) learn the residual part that cannot be accurately predicted by the first model. In fact, our proposed framework extends the residual learning idea from deep architecture design to model fusion for CTR prediction. We then analyze the soundness of this framework: as the prediction power of these two kinds of models is complementary, it is easier to let the second model learn the residual that is not well captured by the first model. We also show that our proposed framework is flexible and easier to train, with faster convergence. Finally, extensive experimental results on three real-world datasets clearly show the effectiveness of our proposed model for CTR prediction tasks.

2 Problem Definition and Related Works

2.1 Problem Definition

In a CTR prediction system, we can usually obtain the users' historical click records \(\textit{D}=\{ (x_i, y_i)\}\) about the products. Let \(x_i \in \mathbb {R}^d\) denote each sample with d features, including numerical features and categorical features, and \(y_i \in \{0,1\}\) denote the observed label representing whether the user clicks the item. The CTR prediction task can be formulated as a supervised classification problem as follows:

Definition 1 (Click-Through Rate Prediction)

Given the training dataset \({D_{train} = \{(x_{train} , y_{train})\}}\), our goal is to learn a mapping function f(x) such that \(\hat{y}_{train}=f(x_{train})\) is as close to \(y_{train}\) as possible. Then, for the test dataset \({D_{test} = \{x_{test}\}}\), we compute \(\hat{y}_{test}=f(x_{test})\) for each test sample \(x_{test}\) to predict whether the user will click on the item.

2.2 Related Works

In this section, we introduce the related works on current CTR prediction tasks from the following three aspects: Traditional Machine Learning Models, Deep Neural Network Models, and current Fusion Models.

Traditional Machine Learning Models. Logistic Regression (LR) is a generalized CTR prediction model that linearly combines individual features; it has been widely used in large-scale classification tasks due to its simplicity and low time complexity [20]. Beyond the linear combination of individual features, Factorization Machine (FM) [19] additionally enumerates the second-order cross information of all features and feeds it into the model on the basis of LR. Field-aware Factorization Machines (FFM) [13] introduce the concept of field and assume that each feature has a different embedding for each cross field. Some other prevalent CTR prediction models derive from ensemble approaches, which fall into three categories: 1) Boosting works for under-fitting models with high bias and low variance. For instance, AdaBoost [6] is one of the earliest classical implementations of boosting, which essentially constructs a strong classifier as a linear combination of multiple weak classifiers. Subsequently, more effective gradient boosting methods such as GBDT [7] were proposed as extensions of AdaBoost, optimizing the loss function via negative gradient descent. 2) Bagging works for data with high variance but low bias, and alleviates the high-variance problem by bootstrap sampling from the data. 3) Stacking can instead address both variance and bias problems. It introduces a meta learner to aggregate heterogeneous strong component classifiers, which most distinguishes it from the former two approaches.
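For reference, the second-order FM model mentioned above takes the standard form (following [19]; \(v_i\) denotes the learned embedding vector of feature i):

$$\hat{y}_{FM}(x) = w_0 + \sum _{i=1}^{d} w_i x_i + \sum _{i=1}^{d}\sum _{j=i+1}^{d} \langle v_i, v_j \rangle \, x_i x_j$$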

Deep Neural Network Models. Recently, many deep learning based CTR models have been proposed [5, 9, 17]. These models, which focus on modeling non-linear feature interactions more effectively, have been successfully applied in many industrial scenarios. Among them, Wide & Deep [5] jointly trains a wide linear model and a deep neural network, capturing both low-order and high-order cross features. DeepFM [9] goes beyond FM by extracting high-order combined features with an additional DNN part, automatically combining high-order features without manual intervention in an end-to-end manner. Last but not least, xDeepFM [17] introduces a Compressed Interaction Network (CIN) to generate feature interactions in an explicit fashion. Graph Convolutional Networks (GCNs) [2, 23] iteratively encode graph structure and node features for node representation, which can capture hidden feature interactions for CTR prediction.

Fusion Models. Since GBDT and NN models are respectively suited to numerical features and categorical features, a growing number of methods have emerged on how to fuse these two kinds of models for higher prediction accuracy. These fusion models can be divided into the following two categories:

1) Feature Fusion. This kind of fusion model utilizes the first model's results as additional features to train the second model. Some works directly combine GBDT with an NN at the feature level: one model's outputs are fed, as additional input features, into a second model together with the same original data. For instance, we can extract the leaf nodes of a pre-trained GBDT as a series of input features and feed them into a new model. Many works [11, 18, 25] have proved the effectiveness of this method; for example, GBDT+LR [11] uses the leaf-node information produced by GBDT as combined features for LR training, and GBDT2DNN [18] is a cascading fusion model that first trains a GBDT and then feeds its prediction score as an input feature into a DNN. This kind of fusion can thus be understood as a cascade of feature engineering plus model learning.

2) Prediction Fusion. Another kind of fusion model combines the two models' predictions by learning ensemble weights. For example, \(\overline{DNN+GBDT}\) [18] takes a weighted average of the prediction scores learned separately by DNN and GBDT models sharing common training data, and outputs the final probability after an activation function. Another model, MTRecS-DLT [1], directly fuses the output scores of the two single models in a 1:1 ratio without a sigmoid function. A minimal sketch of both fusion styles follows this list.
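To make the two fusion styles concrete, here is a minimal, runnable sketch using LightGBM for the GBDT part; the toy arrays and the logistic-regression stand-in for the NN are placeholders, so this is an illustration rather than the exact pipelines of [11], [18] or [1]:

```python
import lightgbm as lgb
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for a real CTR dataset.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)
X_test = rng.normal(size=(200, 10))

# Shared stage: a GBDT trained on the raw features.
gbdt = lgb.train({"objective": "binary", "num_leaves": 31, "verbose": -1},
                 lgb.Dataset(X_train, label=y_train), num_boost_round=50)

# 1) Feature fusion (GBDT+LR style): the leaf index of each tree becomes a
#    categorical feature that a second model (here LR) consumes.
leaf_train = gbdt.predict(X_train, pred_leaf=True)            # shape (n_samples, n_trees)
enc = OneHotEncoder(handle_unknown="ignore").fit(leaf_train)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaf_train), y_train)
leaf_test = gbdt.predict(X_test, pred_leaf=True)
p_feature_fusion = lr.predict_proba(enc.transform(leaf_test))[:, 1]

# 2) Prediction fusion (weighted-average style): two independently trained
#    models are combined at the output level; LR stands in for the NN here.
nn_stub = LogisticRegression(max_iter=1000).fit(X_train, y_train)
alpha = 0.5                                                   # 1:1 ratio as in MTRecS-DLT
p_prediction_fusion = alpha * gbdt.predict(X_test) + (1 - alpha) * nn_stub.predict_proba(X_test)[:, 1]
```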

In summary, traditional machine learning based models handle linear feature combinations, while NN models use embedding strategies to model complex feature interactions. Nevertheless, when faced with large-scale, heterogeneous data, a single model is no longer effective because of its respective weaknesses. In addition, the existing fusion methods also have notable shortcomings. In feature fusion models, one model is used only as a feature extractor, failing to directly combine the complementary advantages of the two types of models. In prediction fusion models, an additional ensemble stage is needed to fuse the results of the two single models, so the quality of the final prediction depends excessively on the fusion ability of this additional ensemble stage. Based on these observations, we propose a residual learning based fusion framework to alleviate the limitations of existing fusion methods. Figure 1 shows the differences between the existing fusion models and our proposed ResFusion framework.

Fig. 1. The differences between our model and other fusion models

3 The Proposed Framework

In this section, we introduce our proposed ResFusion framework for CTR prediction tasks in detail. We begin with the overall architecture, followed by the details of the model components. At the end of this section, we describe the model training process and discuss our ResFusion framework.

3.1 Overall ResFusion Framework Architecture

Figure 2 shows the overall architecture of ResFusion. Taking a feature set \(X=(x_{1},x_{2},x_{3},\ldots ,x_{d})\) as input, it outputs the probability \(\hat{y}\) that the user will click the item (e.g., web pages or ads). The architecture contains two main parts: the GBDT component and the DeepFM component. Specifically, given the input features, GBDT outputs a probability \(\hat{y}_{t}\). We then calculate the residual between the true label y and the GBDT prediction \(\hat{y}_{t}\); we call this value \(res_{t}\), and it is the key of our model. Next, the residuals are used as the new learning target of the DeepFM component, which takes the same input features as GBDT. The DeepFM part then outputs a residual prediction \(\hat{y}_{D}\), complementing the GBDT component. The model finally obtains the predicted value, which can be expressed as \(\hat{y}=\hat{y}_{D}+\hat{y}_{t}\). We detail each part of our fusion model as follows:

Fig. 2. The overall architecture of ResFusion

GBDT. GBDT is a decision tree algorithm based on the gradient boosting framework, shown on the right of Fig. 2. "Gradient boosting" means that each iteration reduces the residual of the previous iteration: a new weak classifier is built in the direction of the gradient that reduces the residual. The essence of the GBDT algorithm can therefore be expressed as a boosting method over decision trees:

$$\begin{aligned} F_{M} (x)=\sum _{m=1}^M T(x ,\gamma _{m}) \end{aligned}$$
(1)

where \(T(x,\gamma _{m})\) denotes a decision tree, \(\gamma _{m}\) denotes the parameters of the tree, and M is the number of trees. The strong classifier \(F_{M} (x)\) is composed of multiple weak classifiers \(T(x,\gamma _{m})\) added linearly. By training a GBDT, we obtain the prediction score \(\hat{y}_{t}\).
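For completeness, the standard gradient boosting step behind this formulation fits each new tree to the pseudo-residuals, i.e., the negative gradient of the loss at the current model, and adds it to the ensemble:

$$r_{nm} = -\left[\frac{\partial \mathcal {L}(y_{n}, F(x_{n}))}{\partial F(x_{n})}\right]_{F=F_{m-1}}, \qquad F_{m}(x)=F_{m-1}(x)+T(x,\gamma _{m})$$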

Then we can compute the residual as:

$$\begin{aligned} res_{t}&= y-\hat{y}_{t} \end{aligned}$$
(2)
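A minimal, runnable sketch of this first stage, assuming LightGBM as the GBDT implementation (as in Sect. 3.2); the arrays X and y below are toy stand-ins for a real dataset:

```python
import lightgbm as lgb
import numpy as np

# Toy stand-ins for a real CTR dataset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)

# Stage 1: train the GBDT component and keep its predicted click probability.
gbdt = lgb.train({"objective": "binary", "verbose": -1},
                 lgb.Dataset(X, label=y), num_boost_round=100)
y_t = gbdt.predict(X)     # \hat{y}_t from Eq. (1)

# Eq. (2): the residual becomes the learning target of the DeepFM component.
res_t = y - y_t
```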

DeepFM. DeepFM is a neural network based factorization machine (FM); its structure is shown on the right of Fig. 2. It contains two inner parts: an FM part and a DNN part. The FM part mainly learns first-order and second-order cross features as low-order features, while the deep part is a feed-forward neural network that extracts high-order cross features. The FM and DNN parts share the same feature input through common input and embedding layers. Finally, we combine the DNN result \(y_{DNN}\) and the FM result \(y_{FM}\) and feed them into an activation function. The prediction of the DeepFM component is summed as:

$$\begin{aligned} \hat{y}_{D}(x_{n})=y_{DNN}(x_{n})+y_{FM}(x_{n}) \end{aligned}$$
(3)

Then we sum the outputs of the two components, \(\hat{y}_{t}\) and \(\hat{y}_{D}\), and obtain the final prediction score \(\hat{y}=\hat{y}_{t}+\hat{y}_{D}\).

3.2 Model Training

As our model contains two components, we first train a strong classifier, GBDT, by iteratively fitting residuals over multiple iterations. The loss function for optimizing the tree model at each iteration is:

$$\begin{aligned} \gamma _{m}=\mathop {\arg \min }_{\gamma } \sum _{n=1}^N \mathcal {L}(y_{n} ,F_{m-1}(x_{n})+T(x_{n} ,\gamma )) \end{aligned}$$
(4)

The prediction score of GBDT, \(\hat{y}_{t}\), is used to calculate the residual with respect to the true label y, and then DeepFM tries to fit this residual by optimizing the following objective:

$$\begin{aligned} W=\mathop {\arg \min }_{W} \sum _{n=1}^N \mathcal {L}(y_{n} ,s(\hat{y}_{D}(x_{n})+\hat{y}_{t}(x_{n}))) \end{aligned}$$
(5)

where s(x) is the sigmoid function; for both losses we use the logloss function for the binary classification task. \(\gamma \) and W denote the model parameters of GBDT and DeepFM, respectively. In practice, we use LightGBM [15] to train the GBDT model, and we implement the DeepFM model in PyTorch, training its parameters with mini-batch Adam.
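A minimal, runnable sketch of this stage-2 optimization, continuing the stage-1 snippet in Sect. 3.1; the small MLP below is only a stand-in for the actual DeepFM component, and the hyper-parameters are illustrative:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the DeepFM component: any module producing the logit \hat{y}_D works.
deepfm = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 1))
loader = DataLoader(TensorDataset(torch.tensor(X, dtype=torch.float32),
                                  torch.tensor(y, dtype=torch.float32),
                                  torch.tensor(y_t, dtype=torch.float32)),
                    batch_size=256, shuffle=True)
optimizer = torch.optim.Adam(deepfm.parameters(), lr=1e-3)

for epoch in range(5):
    for xb, yb, ytb in loader:
        y_d = deepfm(xb).squeeze(-1)                      # residual prediction \hat{y}_D
        # Eq. (5): logloss of s(\hat{y}_D + \hat{y}_t) against the true label y.
        loss = F.binary_cross_entropy_with_logits(y_d + ytb, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```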

3.3 Discussions

In this section, we discuss our proposed framework from three aspects: convergence speed, model generalization ability, and model flexibility.

Rapid Convergence. Our proposed model is trained on the basis of the complementarity of the two single models. By fitting the new model on the residuals between the true label and the other model's output, i.e., on what the former model failed to learn, the new model has less to learn and therefore reaches convergence relatively faster.

Model Generalization. ResFusion is designed for the problem setting with the combined feature matrix F as input. When GBDT and DeepFM learn separately from the same input, they focus on different aspects of the same data according to their different learning mechanisms. ResFusion's learning ability is therefore no longer one-sided, and it can capture the hidden information in the input data more comprehensively. Through the joint, alternating learning of two entirely different learning mechanisms, a better solution can be obtained. Our fusion model consequently exhibits better generalization and scalability.

Model Flexibility. ResFusion can also be viewed as a result fusion model. In contrast to the result fusion methods mentioned above, our model does not rely on an additional external aggregation model; rather, it sequentially links the two models so that fusion happens during the training process itself. ResFusion can therefore also be combined with feature fusion methods, which highlights the flexibility of our model.

In general, unlike plain feature fusion models, ResFusion uses the different learning capabilities of the two types of models more directly, utilizing the advantages of both. Additionally, in our model the latter model (e.g., DeepFM) learns the remaining residual based on what the former (e.g., GBDT) has not learned, so model convergence is relatively accelerated; in particular, ResFusion has better generalization ability. Compared with result fusion models, we fuse the two methods naturally as one joint learning process and do not depend on a specific external aggregation method. ResFusion is therefore also very flexible and can be used in combination with the other fusion methods mentioned above.

4 Experiments

In this section, we conduct extensive experiments on three real-world datasets to evaluate the effectiveness of our proposed fusion models.

Table 1. The statistics of the three datasets

4.1 Experimental Settings

Datasets. To evaluate the effectiveness of our proposed fusion model, we conduct experiments on three public datasets: Avazu, Criteo and ZhiHu:

1) Avazu. Avazu comes from a Kaggle CTR prediction competition [12, 14]. It consists of 40 M click logs arranged in chronological order over ten days.

2) Criteo. Criteo is a famous and accessible benchmark dataset widely used in CTR model evaluation [9, 12]. It includes 45 M click records.

3) ZhiHu. ZhiHu derives from the ZhiYuan 2019 artificial intelligence competition. The provided data consists of 2 M instances of inviting users to answer questions.

For the above three datasets, we first fill the null values of numerical features with 0 and those of categorical features with −1. After data pre-processing, we randomly split all records into training and test sets with a ratio of 9:1. The numbers of numerical and categorical features for each dataset, as well as other detailed statistics, are shown in Table 1.
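A minimal sketch of this pre-processing, with a toy DataFrame standing in for the real records (column names are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for Avazu/Criteo/ZhiHu records.
df = pd.DataFrame({"num_feat": [1.0, np.nan, 3.0, 4.0],
                   "cat_feat": ["a", None, "b", "a"],
                   "label": [0, 1, 0, 1]})
num_cols, cat_cols = ["num_feat"], ["cat_feat"]

# Fill missing values: 0 for numerical features, -1 for categorical features.
df[num_cols] = df[num_cols].fillna(0)
df[cat_cols] = df[cat_cols].fillna(-1)

# Random 9:1 split into train and test sets.
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
```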

Evaluation Metrics. We adopt two widely used evaluation metrics: AUC (Area Under the ROC Curve) and Logloss (cross-entropy). AUC evaluates the probability of ranking positive samples ahead of negative ones, while Logloss measures the difference between predictions and true labels.
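Both metrics have standard implementations; a minimal sketch with scikit-learn (the arrays below are placeholders):

```python
from sklearn.metrics import log_loss, roc_auc_score

y_true = [0, 1, 1, 0]           # placeholder ground-truth click labels
y_pred = [0.2, 0.8, 0.6, 0.3]   # placeholder predicted click probabilities

auc = roc_auc_score(y_true, y_pred)   # probability a positive ranks above a negative
logloss = log_loss(y_true, y_pred)    # cross-entropy between labels and predictions
```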

Baselines. We compare our proposed model with several state-of-the-art baselines for CTR prediction, split into four groups: 1) traditional machine learning models: LR [20], FM [19] and GBDT [7]; 2) deep learning based models: DeepFM [9]; 3) feature fusion models: GBDT+LR [11], GBDT2DNN [18], GBDT2DeepFM [18]; 4) prediction fusion models: \(\overline{GBDT+DeepFM}\) [1].

Table 2. AUC and Logloss comparisons for different models

4.2 Overall Comparisons

In this section, we compare the overall performance of our proposed framework with the baselines. Table 2 summarizes the AUC and Logloss values of the various models on the three datasets. We first analyze the single models: LR only considers linear combinations of individual features for CTR prediction; FM exceeds LR by combining pairs of features and capturing second-order cross information; GBDT captures effective features and linear feature combinations more efficiently than LR by combining multiple weak classifiers; DeepFM performs better than FM, showing the effectiveness of combining a DNN with FM. We find that DeepFM outperforms GBDT on the Avazu dataset but not on the Criteo and ZhiHu datasets. The reason is, as mentioned before, that the shallow model is more suitable for dense numerical features and the deep model for sparse categorical features, and among the three datasets Avazu contains only categorical features. We then compare our model with the other fusion models: for GBDT+LR, we take the leaf nodes output by GBDT as extra features fed into LR; GBDT2DNN and GBDT2DeepFM feed GBDT's predictions as extra features into DNN and DeepFM, respectively. These three models fuse GBDT with other models at the feature level and all exceed GBDT. In contrast to feature-level fusion, the \(\overline{{\mathrm{GBDT + DeepFM}}}\) model fuses GBDT and DeepFM at the output level by learning weights over the two outputs for the final prediction. Compared with the other fusion models, our GBDTRes+NN model, which is based on the ResFusion framework, consistently achieves the best performance on both evaluation metrics. On the ZhiHu dataset, our model improves over the best fusion baselines by 2.59% on AUC and 0.2% on Logloss. Based on the above experimental results, we conclude empirically that our proposed ResFusion framework outperforms all baselines.

4.3 Detailed Model Analysis

In this subsection, we give a detailed analysis of our proposed ResFusion framework and show the effectiveness of our fusion strategy.

Convergence Speed Analysis. We logged the convergence process of our GBDTRes+NN and other NN (DeepFM)-based models to verify that our model converges faster. Figure 3 shows the convergence of the AUC and Logloss values on the Avazu dataset. Our model reaches convergence at the second epoch; compared with the DeepFM model, which needs nearly six epochs to converge, it is faster by nearly four epochs, and it is also about two epochs faster than the other two fusion models. The reason is that our fusion model is based on residual learning: the DeepFM module only needs to fit the residual part that GBDT did not learn well, so it converges rapidly.

Fig. 3. The convergence speed comparison of various fusion models

Table 3. AUC and Logloss comparisons with different number of iterations K.

Model Generalization Analysis. In this part, we verify the generalization of our proposed framework by varying the number of residual learning iterations K. The results are shown in Table 3. In this experiment, we choose GBDT as the initial model, so our fusion model reduces to a single GBDT when \(K=0\). The result for \(K=1\) corresponds to using DeepFM to fit the residuals between the real labels and the GBDT predictions, i.e., GBDTRes+NN; its AUC improvement over the single GBDT is 3.13%. After that, for \(K=2\), we feed the predictions of the first fusion model GBDTRes+NN into a GBDT to fit the residual again, and we can continue this residual learning iteratively for larger K (the sketch below illustrates the loop). According to the experimental results, our strategy performs best at \(K=2\), which means that after the first two rounds of residual learning each model has fully learned the residual part that the other model did not capture. As K increases from 2 to 3, the performance of the fusion model no longer improves and may even suffer from over-fitting, leading to suboptimal results.
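The iterative scheme behind Table 3 can be sketched as the following simplified loop (a hedged illustration, not the exact experimental code; `fit_gbdt` and `fit_deepfm` are hypothetical trainers that return fitted models with a `predict` method):

```python
def resfusion_iterative(X, y, K, fit_gbdt, fit_deepfm):
    """Alternate GBDT and DeepFM, each fitting the residual left by the running prediction."""
    models, running_pred = [], 0.0
    for k in range(K + 1):
        target = y - running_pred                        # residual left by previous models
        fit = fit_gbdt if k % 2 == 0 else fit_deepfm     # K=0: GBDT only; K=1: GBDTRes+NN
        model = fit(X, target)
        models.append(model)
        running_pred = running_pred + model.predict(X)
    return models   # final score at inference: sum of all component predictions
```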

5 Conclusion

To alleviate the challenge that existing CTR models cannot fully learn from data with both sparse categorical and dense numerical features, we propose the ResFusion framework, which integrates GBDT and NN through residual learning. It gains performance improvements for the following reasons: 1) compared with existing fusion models, it directly utilizes the complementary advantages of the component models; 2) during fusion it does not depend on a specific external fusion method, so it is more general; 3) residual based fusion accelerates model convergence. We conduct extensive experiments on three real-world datasets and demonstrate the effectiveness and efficiency of our model over current state-of-the-art models on the two main evaluation metrics, AUC and Logloss.