Abstract
The relevance vector machine (RVM) is a widely used statistical method for classification that provides probabilistic outputs and a sparse solution. However, the RVM can be very sensitive to outliers that lie far from the decision boundary separating the two classes. In this paper, we propose a robust RVM based on a weighting scheme that is insensitive to such outliers while retaining the advantages of the original RVM. Given a prior distribution on the weights, the weight values are determined probabilistically and computed automatically during training. Our theoretical result shows that the influence of outliers is bounded through the probabilistic weights. We also discuss a guideline for choosing the hyperparameters that govern the prior. Experimental results on synthetic and real data sets show that the proposed method performs consistently better than the RVM when the training data set is contaminated by outliers.
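To make the weighting scheme concrete, here is a minimal sketch (our illustration, not the authors' code) of the per-observation weight derived in the appendices: under a shape–rate \(Gamma(c,d)\) prior on a weight that exponentiates the likelihood, the posterior-mean weight is \(c/(d-\ln p)\), so well-fitted observations keep a weight near \(c/d\) while gross outliers are automatically down-weighted. The helper name observation_weight is hypothetical.

```python
import numpy as np

def observation_weight(log_lik, c=1.0, d=1.0):
    """Posterior-mean weight E(w) = c / (d - log_lik) for one observation.

    Sketch only: assumes a Gamma(c, d) prior (shape c, rate d) on the
    weight w and a weighted likelihood p(t | beta)^w, for which the
    variational posterior of w is Gamma(c, d - log_lik).
    Since log_lik = ln p(t | beta) <= 0, the weight lies in (0, c/d].
    """
    return c / (d - log_lik)

# A well-classified point (likelihood near 1) keeps weight ~ c/d = 1;
# a gross outlier (likelihood near 0) is automatically down-weighted.
print(observation_weight(np.log(0.99)))  # ~0.99
print(observation_weight(np.log(1e-6)))  # ~0.07
```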
Notes
It can be found at http://www.stats.ox.ac.uk/pub/PRNN/.
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
Acknowledgments
The authors thank the anonymous reviewers and editors for their helpful and constructive comments that greatly contributed to improving the paper.
Appendices
Appendix 1: Proof of Proposition 1
The weight value \({{\mathbb {E}}}(w)\) is computed as the mean of \(Gamma\left( {w|\tilde{c},\tilde{d}}\right) \), that is,
$$\begin{aligned} {{\mathbb {E}}}(w)=\frac{\tilde{c}}{\tilde{d}}=\frac{c}{d-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }. \end{aligned}$$
Since the logistic loss function is equivalent to a negative log-likelihood, the weighted logistic loss function can be written as
$$\begin{aligned} {{\mathbb {E}}}(w)\left( {-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }\right) =\frac{-c\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }{d-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }\le c, \end{aligned}$$
where the inequality holds because \(-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \ge 0\) and \(d>0\). Thus, the weighted logistic loss function is bounded by \(c\), no matter how large the loss of an individual observation becomes.
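As a quick numerical sanity check of this bound (our own sketch; here loss stands for \(-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \ge 0\)):

```python
import numpy as np

# Proposition 1, numerically: the weighted logistic loss
# E(w) * loss = c * loss / (d + loss) approaches but never exceeds c,
# no matter how large the loss of a single observation becomes.
c, d = 2.0, 2.0
loss = np.array([0.1, 1.0, 10.0, 1e3, 1e9])  # hypothetical loss values
weighted = c * loss / (d + loss)
print(weighted)  # increases toward c = 2 but stays below it
assert np.all(weighted <= c)
```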
Appendix 2: Proof of Proposition 2
Recall that \(p\left( {t|{\varvec{\upbeta }}}\right) \ge h\left( {{\varvec{\upbeta }},\xi } \right) \). Taking the expectations on both sides with respect to \({\varvec{\upbeta }}\) yields the following result:
$$\begin{aligned} p\left( {t|{\hat{\varvec{{\beta }}}}}\right) \ge {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) , \end{aligned}$$
where \({\hat{\varvec{{\beta }}}}\) denotes the expectation of \({\varvec{\upbeta }}\). Since \(0\le p\left( {t|{\hat{\varvec{{\beta }}}}}\right) \le 1\), it is always true that \({{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \le 1\Leftrightarrow \ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \le 0\).
\((\Rightarrow )\) If \(\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi _i} \right) }\right) \) is 0, then the weight \({{\mathbb {E}}}(w)\) should be 1, since \(\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi _i} \right) }\right) \) approaches 0 as \(p\left( {t|{\hat{\varvec{{\beta }}}}}\right) \) goes to 1. Therefore, if
$$\begin{aligned} {{\mathbb {E}}}(w)=\frac{c}{d-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi _i} \right) }\right) }=\frac{c}{d}=1, \end{aligned}$$
then \(c\) should be equal to \(d\).
\((\Leftarrow )\) If \(c=d\equiv r\), then
$$\begin{aligned} {{\mathbb {E}}}(w)=\frac{r}{r-\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) }\le 1, \end{aligned}$$
since \(\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) \) is always nonpositive.
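The following sketch (our illustration, not the paper's code) checks Proposition 2 numerically: with \(c=d\equiv r\), the weight never exceeds 1 and equals 1 exactly when \(\ln {{\mathbb {E}}}\left( {h\left( {{\varvec{\upbeta }},\xi } \right) }\right) =0\).

```python
import numpy as np

# With c = d = r, the weight E(w) = r / (r - ln E(h)) is at most 1,
# with equality only for a perfectly fitted observation (ln E(h) = 0).
r = 1.5
log_Eh = np.array([0.0, -0.5, -2.0, -20.0])  # ln E(h) <= 0
weights = r / (r - log_Eh)
print(weights)  # [1.0, 0.75, 0.4286, 0.0698]
assert np.all(weights <= 1.0)
```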
About this article
Cite this article
Hwang, S., Jeong, M.K. Robust relevance vector machine for classification with variational inference. Ann Oper Res 263, 21–43 (2018). https://doi.org/10.1007/s10479-015-1890-9