Robust relevance vector machine for classification with variational inference

Abstract

The relevance vector machine (RVM) is a widely used statistical method for classification that provides probabilistic outputs and a sparse solution. However, the RVM can be very sensitive to outliers that lie far from the decision boundary separating the two classes. In this paper, we propose a robust RVM based on a weighting scheme that is insensitive to outliers while retaining the advantages of the original RVM. Given a prior distribution on the weights, the weight values are determined probabilistically and computed automatically during training. Our theoretical results show that the influence of outliers is bounded through the probabilistic weights. A guideline for determining the hyperparameters governing the prior is also discussed. Experimental results on synthetic and real data sets show that the proposed method performs consistently better than the RVM when the training data set is contaminated by outliers.
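The weighting mechanism can be made concrete with a few lines of code. The sketch below is only an illustration, not the authors' implementation: it assumes the expected-weight formula E(w) = c / (d - ln E(h(beta, xi))) derived in Appendix 1, uses an observation's log-likelihood as a stand-in for ln E(h(beta, xi)), and takes c and d to be the hyperparameters of the Gamma prior on the weight.

```python
import numpy as np

def expected_weight(log_lik, c=1.0, d=1.0):
    """Illustrative expected weight E(w) = c / (d - log_lik) for one observation.

    log_lik stands in for ln E(h(beta, xi)), the (non-positive) expected log of
    the variational lower bound on the likelihood; c and d are the Gamma prior
    hyperparameters.
    """
    return c / (d - log_lik)

# A well-classified point (likelihood near 1) keeps a weight near c/d = 1,
# while a gross outlier (likelihood near 0) is heavily down-weighted.
for p in [0.99, 0.5, 0.01, 1e-6]:
    print(f"p = {p:g} -> weight = {expected_weight(np.log(p)):.4f}")
```

With c = d, the weight of a perfectly fitted point stays at 1 while the weight of a gross outlier shrinks toward 0, which is the bounded-influence behaviour claimed above.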

Notes

  1. It can be found at http://www.stats.ox.ac.uk/pub/PRNN/.

  2. http://www.miketipping.com/.

  3. This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.

  4. http://archive.ics.uci.edu/ml/.

Acknowledgments

The authors thank the anonymous reviewers and editors for their helpful and constructive comments that greatly contributed to improving the paper.

Author information

Corresponding author

Correspondence to Myong K. Jeong.

Appendices

Appendix 1: Proof of Proposition 1

The weight value \({{\mathbb {E}}}(w)\) is computed as the mean of \(Gamma\left( {w|\tilde{c},\tilde{d}}\right) \), that is

$$\begin{aligned} {{\mathbb {E}}}(w)=\frac{\tilde{c}}{\tilde{d}}=\frac{c}{d-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }. \end{aligned}$$

Since the logistic loss function is equivalent to a negative log likelihood function, the weighted logistic loss function can be written as

$$\begin{aligned} {{\mathbb {E}}}(w)l\left\{ {\left( {2t-1}\right) {\varvec{\upbeta }}^{T}{\varvec{\phi }} (\mathbf{x})} \right\}= & {} -{{\mathbb {E}}}(w)\ln p\left( {t|{\varvec{\upbeta }}}\right) \le -{{\mathbb {E}}}(w)\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \\= & {} -\frac{c\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }{d-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }\le c. \end{aligned}$$

Thus, the weighted logistic loss function is bounded by c.
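
A quick numerical check of this bound, purely as a sketch (ln E(h(beta, xi)) is treated here as a free non-positive quantity rather than computed from the model), shows that the weighted loss -c ln E(h) / (d - ln E(h)) approaches but never exceeds c as the outlier becomes more extreme:

```python
import numpy as np

c, d = 1.0, 2.0
# ln E(h) ranges over non-positive values; large magnitudes mimic gross outliers.
log_Eh = -np.logspace(-3, 3, 7)
weighted_loss = -c * log_Eh / (d - log_Eh)
print(weighted_loss)                # increases toward c but never exceeds it
assert np.all(weighted_loss <= c)
```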

Appendix 2: Proof of Proposition 2

Recall that \(p\left( {t|{\varvec{\upbeta }}}\right) \ge h\left( {{\varvec{\upbeta }},\xi } \right) \). Taking expectations of both sides with respect to \({\varvec{\upbeta }}\) yields the following result:

$$\begin{aligned} {{\mathbb {E}}}\left( {p\left( {t|{\varvec{\upbeta }}}\right) }\right) =p\left( {t|{\hat{\varvec{{\beta }}}}}\right) \ge {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \end{aligned}$$

where \({\hat{\varvec{{\beta }}}}\) denotes the expectation of \({\varvec{\upbeta }}\). Since \(0\le p\left( {t|{\hat{\varvec{{\beta }}}}}\right) \le 1\), it is always true that \({{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \le 1\Leftrightarrow \ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \le 0\).

\((\Rightarrow )\) As \(p\left( {t|{\hat{\varvec{{\beta }}}}}\right) \) goes to 1, \(\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \) approaches 0, and in that limit the weight \({{\mathbb {E}}}(w)\) should equal 1. Therefore, if

$$\begin{aligned} 0\le {{\mathbb {E}}}(w)=\frac{c}{d-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }\le 1 \end{aligned}$$

then c should be equal to d.

\((\Leftarrow )\) If \(c=d\equiv r\), then

$$\begin{aligned} 0\le {{\mathbb {E}}}(w)=\frac{r}{r-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }\le 1 \end{aligned}$$

since \(\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \) is always non-positive.
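
The same kind of illustrative check (again not part of the paper's derivation; ln E(h(beta, xi)) is treated as an arbitrary non-positive number) confirms that setting c = d ≡ r keeps the expected weight within [0, 1]:

```python
import numpy as np

r = 2.0                              # c = d = r
log_Eh = np.concatenate(([0.0], -np.logspace(-3, 3, 7)))
weights = r / (r - log_Eh)
print(weights)                       # equals 1 at log_Eh = 0 and decays toward 0
assert np.all((weights >= 0.0) & (weights <= 1.0))
```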

Cite this article

Hwang, S., Jeong, M.K. Robust relevance vector machine for classification with variational inference. Ann Oper Res 263, 21–43 (2018). https://doi.org/10.1007/s10479-015-1890-9
