
Knowledge-Based Systems

Volume 71, November 2014, Pages 339-344

Kernel ridge regression using truncated Newton method

https://doi.org/10.1016/j.knosys.2014.08.012

Abstract

Kernel Ridge Regression (KRR) is a powerful nonlinear regression method. The combination of KRR and the truncated-regularized Newton method, which is based on the conjugate gradient (CG) method, leads to a powerful regression method. The proposed method (algorithm) is called Truncated-Regularized Kernel Ridge Regression (TR-KRR). Compared with the closed-form solution of KRR and with the Support Vector Machine (SVM) and Least-Squares Support Vector Machine (LS-SVM) algorithms on six data sets, the proposed TR-KRR algorithm is as accurate as, and much faster than, all of the other algorithms.

Introduction

Regression and classification are fundamental machine learning techniques for finding patterns in data. Predictive tasks whose outcomes are quantitative (real numbers) are called regression, and tasks whose outcomes are qualitative (binary, categorical, or discrete) are called classification. The most fundamental method for addressing regression problems is the least squares (LS) method, while logistic regression (LR) is the fundamental method for classification. Some disadvantages of the LS method (over-fitting and multicollinearity) are addressed by the method of ridge regression (RR) [1], which is based on the LS method. Kernel methods are among the most successful machine learning techniques of recent years. One of their advantages is that they extend linear algorithms to non-linear problems through the use of kernels. Support vector machines (SVM), developed originally by Vapnik [2], are considered state-of-the-art for both classification (SVC) and regression (SVR) [3] through their implementation of kernels. Least squares support vector machines (LS-SVM), developed by Suykens and Vandewalle [4], have also been extended to solve regression problems. The LS-SVM method is easier to train, as it converts the inequality constraints of SVM into equality constraints [5]. Kernel ridge regression (KRR) [6] extends the RR method to non-linear problems and is now an established data mining tool [7].

Each one of the aforementioned methods has a limitation. The linearity of LS may be an obstacle to handling highly nonlinear small-to-medium size data sets [8]. The SVM method requires solving a constrained quadratic optimization problem with a time complexity of O(N^3), where N is the number of training instances. The KRR method, being a form of ridge regression, is not sparse and requires all of the training instances in its model [8]. Like SVM, KRR has a time complexity of O(N^3). Its computation can be slow due to the density of its matrices [8], [5].

Komarek and Moore [9] were the first to show that the truncated-regularized iteratively re-weighted least squares (TR-IRLS) algorithm can be effectively applied to LR to classify large and high-dimensional data sets, and that it can outperform the support vector machine (SVM) algorithm. The TR-IRLS algorithm is based on the linear CG method, as described by Komarek [9]. Maalouf and Siddiqi [10] apply the LR truncated Newton method to large-scale imbalanced and rare-events data using the rare events weighted logistic regression (RE-WLR) algorithm. Maalouf et al. [11] show the effectiveness of the linear CG method in solving the kernel logistic regression (KLR) model through the truncated regularized kernel logistic regression (TR-KLR) algorithm. Furthermore, Maalouf and Trafalis [12] extend the TR-KLR model to imbalanced data through the rare-event weighted kernel logistic regression (RE-WKLR) algorithm. To the authors’ knowledge, truncated Newton methods have not been fully utilized to solve KRR problems. A possible reason could be the notion that the stability of the CG method is not guaranteed when the data matrix is dense [8], [13].

Our motivation for this study is based on the success and effectiveness of truncated Newton methods when applied to KLR classification problems [11], [12]. In this study we combine the speed of truncated Newton techniques with the accuracy afforded by kernels for solving nonlinear KRR problems. As with our TR-KLR classification method, our proposed regression method, the TR-KRR algorithm, is easy to implement and requires solving only an unconstrained regularized optimization problem, thus providing a computationally more efficient alternative to SVM. The combination of regularization, approximate numerical methods, kernelization, and efficient implementation is essential to making TR-KRR at once an effective and powerful regression method. We test the performance of TR-KRR on six data sets, one of which is simulated and the rest of which are real-life data sets. Inasmuch as truncated Newton methods have not been fully exploited in solving KRR models, it is our intention to provide a further contribution.

In Section 2, we provide a brief description of the LS method. In Section 3, we derive the RR model. Sections 4 and 5 discuss the KRR model and the TR-KRR algorithm, respectively. Numerical results are presented in Section 6, and Section 7 states the conclusion.

Section snippets

Least squares method

Let X ∈ R^{N×d} be a data matrix, where N is the number of training instances (examples) and d is the number of features (parameters or attributes), and let y be a real-valued outcome vector. Let the set of training data be {(x_1, y_1), …, (x_N, y_N)}, where each x_i ∈ R^d (a row vector in X) denotes a sample (instance) in the input space with a corresponding output y_i ∈ R, for i = 1, 2, …, N. The goal is to find a functional approximation, f(x), for inputs outside of the training sample but hypothetically follow
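To make the notation concrete, here is a minimal NumPy sketch (not from the paper; the toy data, sizes, and variable names are purely illustrative) of fitting the least-squares model via the normal equations:

```python
import numpy as np

# Toy data: N = 50 instances, d = 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

# Least-squares fit via the normal equations: (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```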

Ridge regression in the primal

One of the drawbacks of the method of least squares is poor estimation of the regression coefficients, which could make the absolute values of the least-squares estimates too large and unstable [16]. Ridge regression "shrinks" the least-squares coefficients through the addition of a regularization parameter, thus minimizing the following objective function [17]:

f(β) = (1/2)(y − Xβ)^T(y − Xβ) + (λ/2) β^T β,

where λ ≥ 0 is the regularization parameter, and it is usually user-defined. The parameter λ is important in
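A minimal sketch of the primal ridge estimate implied by this objective, assuming the standard closed form obtained by setting the gradient of f(β) to zero, (X^T X + λ I_d) β = X^T y; the function name and NumPy implementation are ours, not the authors':

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Primal ridge estimate: solve (X^T X + lam * I_d) beta = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```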

Kernel Ridge Regression (KRR)

The linear transformation in (7) can be replaced with a more general non-linear mapping function, φ(·), which maps the data from a lower-dimensional space into a higher one, such that

φ: x ∈ R^d → φ(x) ∈ F ⊆ R^Λ.

The goal in choosing the mapping φ is to convert nonlinear relations between the response variable and the independent variables into linear relations. The transformations φ(·) are often unknown. However, the solution to the regression problem depends only on the dot product in the
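Because only dot products (kernel values) between inputs are needed, the mapping φ never has to be evaluated explicitly. A minimal NumPy sketch of this idea, using the Gaussian RBF kernel of Section 6; the function names and parameterization are our own assumptions, not the paper's:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def krr_predict(X_train, alpha, X_new, gamma):
    """KRR prediction f(x) = sum_i alpha_i K(x_i, x): only kernel values are
    needed, so the mapping phi(x) itself is never computed."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```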

TR-KRR algorithm

The KRR solution above can be rearranged as

(K + λI_N)α = y,

which is a system of linear equations with a kernel matrix K and a response vector y. Solving for α can be done iteratively using any of a number of Krylov-subspace methods, giving a sequence of estimates that converges to α̂. One of the most effective methods is the linear CG method. Recent studies show that the CG method provides better results than any other numerical method [23]. The CG method in KRR has a time complexity of O(N^3
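A hedged sketch of the core computation described here, namely solving (K + λI_N)α = y with the linear CG method truncated by an iteration cap and a residual tolerance; this is our own NumPy illustration, not the authors' MATLAB implementation:

```python
import numpy as np

def truncated_cg(K, y, lam, max_iter=100, tol=1e-6):
    """Approximately solve (K + lam * I_N) alpha = y by linear conjugate gradient.

    The iteration is 'truncated': it stops at max_iter or once the relative
    residual drops below tol, whichever comes first.
    """
    A = K + lam * np.eye(len(y))       # system matrix (symmetric positive definite)
    alpha = np.zeros_like(y)           # initial guess
    r = y - A @ alpha                  # residual
    p = r.copy()                       # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        step = rs_old / (p @ Ap)
        alpha += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(y):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return alpha
```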

Computational results and discussion

The performance of the TR-KRR algorithm is examined using simulated as well as real-life regression data sets (see Table 1). The algorithm's performance is then compared to that of the direct KRR, SVM, and LS-SVM methods. Direct KRR refers to using MATLAB matrix inversion (Gaussian elimination) to evaluate the closed-form solution of KRR. The Gaussian Radial Basis Function (RBF) kernel

K(x_i, x_j) = exp(−‖x_i − x_j‖^2 / (2σ^2)) = exp(−γ‖x_i − x_j‖^2)

is used for all of these methods, where σ is the width of the kernel.
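As a small usage illustration tying the pieces together (reusing the hypothetical rbf_kernel and truncated_cg helpers sketched in the earlier snippets, with γ = 1/(2σ^2) and entirely made-up data and parameter values, not the paper's experimental setup), the direct closed-form solve and the CG solve can be compared as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=200)   # simple simulated regression data

sigma, lam = 0.5, 1e-2
gamma = 1.0 / (2.0 * sigma**2)                        # gamma = 1 / (2 sigma^2)

K = rbf_kernel(X, X, gamma)
alpha_direct = np.linalg.solve(K + lam * np.eye(len(y)), y)  # direct (closed-form) KRR
alpha_cg = truncated_cg(K, y, lam, max_iter=500)             # CG solve, TR-KRR style
print(np.max(np.abs(alpha_direct - alpha_cg)))               # the two solutions typically agree closely
```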

Conclusions

We have thus presented the TR-KRR algorithm and have shown that it is easy to implement and is as accurate as the SVM and LS-SVM methods, yet much faster. We have further demonstrated that the TR-KRR algorithm takes advantage of the speed of truncated Newton techniques and the power of kernel methods. Another benefit of TR-KRR is that it relies on unconstrained optimization, whose algorithms are less complex than the constrained optimization methods used by algorithms such as SVM. Our

Acknowledgments

The authors would like to thank Dr. Naji Khoury of Notre Dame University (Lebanon), and Mr. Bilal Krayem of Khalifa University for their valuable input in this study.

References (31)

  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)
  • P. Komarek, Logistic Regression For Data Mining and High-Dimensional Classification, Ph.D. thesis, Carnegie Mellon...
  • M. Maalouf et al., Kernel logistic regression using truncated Newton method, Comput. Manage. Sci. (2011)
  • H. Kashima et al., Recent advances and trends in large-scale kernel methods, IEICE Trans. Inform. Syst. (2009)
  • J.M. Lewis et al., Dynamic Data Assimilation: A Least Squares Approach (2006)