Abstract
A prediction of a chemical property or activity is subject to uncertainty. Which types of uncertainty to consider, whether to account for them in a differentiated manner, and with which methods, depends on the practical context. In chemical modelling, general guidance on the assessment of uncertainty is hindered by the wide variety of underlying modelling algorithms, high-dimensionality problems, the acknowledgement of both qualitative and quantitative dimensions of uncertainty, and the fact that statistics offers alternative principles for uncertainty quantification. Here, a view of the assessment of uncertainty in predictions is presented with the aim of overcoming these issues. The assessment sets out to quantify uncertainty representing error in predictions and is based on probability modelling of errors, where uncertainty is measured by Bayesian probabilities. Even though well motivated, the choice to use Bayesian probabilities is a challenge to statistics and chemical modelling. Fully Bayesian modelling, Bayesian meta-modelling and bootstrapping are discussed as possible approaches. Deciding how to assess uncertainty is an active choice, and should not be constrained by traditions or by a lack of validated and reliable ways of doing it.
Notes
We provide both ways to express the model to demonstrate the transition from the classical statistical model specification, where the probabilistic model is applied to the errors, to the general model specification, where the whole model is probabilistic.
The Bayesian framework is usually presented with parametric models, but it can also be applied to non-parametric models.
It does not have to be the classifier. Later we give an example where C is a variable expressing reliability in prediction, given by the number of times a compound is classified as active in a set of ensemble predictions.
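As an illustration of the last note, the following is a minimal sketch of how a vote count C could be computed from ensemble predictions and turned into a Bayesian probability of activity. It is not the article's own implementation: the function names, the toy data and the uniform Beta(1, 1) prior are assumptions made only for this example.

```python
from collections import defaultdict


def vote_counts(ensemble_predictions):
    """Return C for each compound: how many ensemble members predict 'active' (1).

    ensemble_predictions[m][i] is member m's 0/1 prediction for compound i.
    """
    return [sum(votes) for votes in zip(*ensemble_predictions)]


def posterior_active_given_c(counts, labels, prior_a=1.0, prior_b=1.0):
    """Posterior mean of P(active | C = c) under a Beta(prior_a, prior_b) prior.

    The Beta(1, 1) default is an assumption of this sketch, not a recommendation.
    """
    tally = defaultdict(lambda: [0, 0])  # c -> [n_active, n_inactive]
    for c, y in zip(counts, labels):
        tally[c][0 if y == 1 else 1] += 1
    return {
        c: (prior_a + a) / (prior_a + prior_b + a + b)
        for c, (a, b) in sorted(tally.items())
    }


# Hypothetical toy data: 3 ensemble members, 6 compounds with known activity.
ensemble = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1],
]
observed = [1, 1, 0, 0, 1, 0]

C = vote_counts(ensemble)
reliability = posterior_active_given_c(C, observed)
for c, p in reliability.items():
    print(f"C = {c}: posterior mean P(active) = {p:.2f}")
```

With so few compounds the prior dominates the result. In practice the probabilities would be estimated on a held-out set that is large enough for each value of C to be observed many times.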
Acknowledgments
This work has been funded by the Swedish Research Council Formas through the project 219-2013-1271 “Scaling up uncertain environmental evidence-Quality assurance in ecosystem service predictions”, through the strategic research area Biodiversity and Ecosystems in a Changing Climate (BECC), and by the European Seventh Framework Programme through the CADASTER (CAse studies on the Development and Application of in-Silico Techniques for Environmental hazard and Risk assessment) project FP7-ENV-2007-1-212668. The author wishes to thank Rasmus Bååth and Tom Aldenberg for nice discussions on Bayesian concepts and Niklas Vareman and Yann Clough for valuable comments.
Cite this article
Sahlin, U. Assessment of uncertainty in chemical models by Bayesian probabilities: Why, when, how? J Comput Aided Mol Des 29, 583–594 (2015). https://doi.org/10.1007/s10822-014-9822-3
DOI: https://doi.org/10.1007/s10822-014-9822-3