Abstract
Mathematical solvers use parametrized Optimization Problems (OPs) as inputs to yield optimal decisions. In many real-world settings, some of these parameters are unknown or uncertain. Recent research focuses on predicting the value of these unknown parameters from available contextual features, aiming to decrease decision regret through end-to-end learning approaches. However, these approaches disregard prediction uncertainty and therefore leave the mathematical solver susceptible to producing erroneous decisions when predictions have low confidence. We propose a novel framework that models prediction uncertainty with Bayesian Neural Networks (BNNs) and propagates this uncertainty into the mathematical solver with a Stochastic Programming technique. The differentiable nature of BNNs and differentiable mathematical solvers allows for two learning approaches: in the Decoupled learning approach, we update the BNN weights to improve the quality of the predicted distribution of the OP parameters, while in the Combined learning approach, we update the weights to directly minimize the OP’s expected cost function in a stochastic end-to-end fashion. We perform an extensive evaluation using synthetic data with various noise properties and a real dataset, showing that decision regret is generally lower (better) with both proposed methods. The code is available at https://github.com/AlanLahoud/BNNSOP.
References
Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., Kolter, J.Z.: Differentiable Convex Optimization Layers, vol. 32. Curran Associates Inc. (2019)
Amos, B., Kolter, J.Z.: OptNet: differentiable optimization as a layer in neural networks. In: International Conference on Machine Learning, pp. 136–145. PMLR (2017)
Ban, G.Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2019)
Bayraksan, G., Love, D.K.: Data-driven stochastic programming using phi-divergences. In: The Operations Research Revolution, pp. 1–19. INFORMS (2015)
Bell, D.E.: Regret in decision making under uncertainty. Oper. Res. 30(5), 961–981 (1982)
Birge, J.R., Louveaux, F.: Introduction to stochastic programming. Springer Science & Business Media (2011)
Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: International Conference on Machine Learning, pp. 1613–1622. PMLR (2015)
Demirović, E., et al.: An investigation into prediction + optimisation for the knapsack problem. In: Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR 2019). LNCS, vol. 11494. Springer, Cham (2019)
Donti, P., Amos, B., Kolter, J.Z.: Task-based end-to-end model learning in stochastic optimization. Adv. Neural Inform. Process. Syst. 30 (2017)
Elmachtoub, A.N., Grigas, P.: Smart “predict, then optimize”. Manage. Sci. 68(1), 9–26 (2022)
Ferber, A., Wilder, B., Dilkina, B., Tambe, M.: MIPaaL: mixed integer program as a layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 1504–1511 (2020)
Grimes, D., Ifrim, G., O’Sullivan, B., Simonis, H.: Analyzing the impact of electricity price forecasting on energy cost-aware scheduling. Sustainable Comput. Inform. Syst. 4(4), 276–291 (2014). Special Issue on Energy Aware Resource Management and Scheduling (EARMS)
Hannah, L.A.: Stochastic optimization. Inter. Encycl. Soc. Behav. Sci. 2, 473–481 (2015)
Hoseinzade, E., Haratizadeh, S.: CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst. Appl. 129, 273–285 (2019)
Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach. Learn. 110(3), 457–506 (2021)
Ifrim, G., O’Sullivan, B., Simonis, H.: Properties of energy-price forecasts for scheduling. In: Milano, M. (ed.) CP 2012. LNCS, pp. 957–972. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33558-7_68
Jospin, L.V., Laga, H., Boussaid, F., Buntine, W., Bennamoun, M.: Hands-on Bayesian neural networks - a tutorial for deep learning users. IEEE Comput. Intell. Mag. 17(2), 29–48 (2022)
Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural Inform. Process. Syst. 30 (2017)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14-16 April 2014, Conference Track Proceedings (2014)
Kong, L., Cui, J., Zhuang, Y., Feng, R., Prakash, B.A., Zhang, C.: End-to-end stochastic optimization with energy-based model. Adv. Neural. Inf. Process. Syst. 35, 11341–11354 (2022)
Lahoud, A.A., Schaffernicht, E., Stork, J.A.: DataSP: a differential all-to-all shortest path algorithm for learning costs and predicting paths with context. arXiv preprint arXiv:2405.04923 (2024)
Li, X., Shou, B., Qin, Z.: An expected regret minimization portfolio selection model. Eur. J. Oper. Res. 218(2), 484–492 (2012)
Mandi, J., Guns, T.: Interior point solving for LP-based prediction + optimisation. Adv. Neural. Inf. Process. Syst. 33, 7272–7282 (2020)
Pearce, T., Leibfried, F., Brintrup, A.: Uncertainty in neural networks: Approximately bayesian ensembling. In: International Conference on Artificial Intelligence and Statistics, pp. 234–244. PMLR (2020)
Powell, W.B.: A unified framework for stochastic optimization. Eur. J. Oper. Res. 275(3), 795–821 (2019)
Rockafellar, R.T., Uryasev, S., et al.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
Wilder, B., Dilkina, B., Tambe, M.: Melding the data-decisions pipeline: decision-focused learning for combinatorial optimization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 1658–1665 (2019)
Acknowledgement
This work has been supported by the Industrial Graduate School Collaborative AI & Robotics funded by the Swedish Knowledge Foundation Dnr:20190128, and the Knut and Alice Wallenberg Foundation through Wallenberg AI, Autonomous Systems and Software Program (WASP).
Appendices
Appendix A. Limitations of Uncertainty Propagation
This paper focuses on minimizing \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}f(\boldsymbol{z}, \hat{\boldsymbol{y}})\). This problem simplifies to \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} f(\boldsymbol{z}, {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}\hat{\boldsymbol{y}})\) when the expected value of the objective function is replaced by the objective evaluated at the expected value of the predictions, but this simplification is only valid under certain conditions. If these conditions are met, we recommend solving the argmin by plugging the expected value of the predictions directly into the objective (Decoupled).
Appendix A.1. Linear Objective Functions with Respect to the Unknown Variable
If \(f(\boldsymbol{z}, \boldsymbol{y})\) is linear with respect to \(\boldsymbol{y}\), then \({{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}f(\boldsymbol{z}, \hat{\boldsymbol{y}}) = f(\boldsymbol{z}, {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}\hat{\boldsymbol{y}})\). Applying the argmin with respect to \(\boldsymbol{z}\) on both sides we have \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}f(\boldsymbol{z}, \hat{\boldsymbol{y}}) = {{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} f(\boldsymbol{z}, {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}\hat{\boldsymbol{y}}).\)
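As a numerical sanity check of this equality, the toy example below uses a hypothetical cost \(f(z, y) = yz + \frac{1}{2}z^2\) (linear in \(y\); the quadratic term involves only \(z\)) and compares the decision obtained by averaging the cost over Monte Carlo samples of \(y\) against the decision obtained by plugging in the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y_samples = rng.normal(3.0, 1.0, size=1000)   # Monte Carlo draws of the unknown parameter
z_grid = np.linspace(-5.0, 5.0, 501)          # candidate decisions

def f(z, y):
    return y * z + 0.5 * z**2                  # linear in y

# argmin_z E_y[f(z, y)] via Monte Carlo averaging
exp_costs = np.array([f(z, y_samples).mean() for z in z_grid])
z_star_exp = z_grid[exp_costs.argmin()]

# argmin_z f(z, E[y]) via plugging in the mean prediction
mean_costs = f(z_grid, y_samples.mean())
z_star_mean = z_grid[mean_costs.argmin()]

# Linearity in y makes both decisions coincide
assert np.isclose(z_star_exp, z_star_mean)
```

For a cost that is nonlinear in \(y\), these two decisions generally differ, which is exactly when propagating the full predictive distribution matters.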
Appendix A.2. Balanced Newsvendor Problem
When \(c_s = c_e\) in the NV problem, the optimal order quantity \({{\,\mathrm{arg\,min}\,}}_{z} {{\,\mathrm{\mathbb {E}}\,}}_{\hat{y}}f(z, \hat{y})\) corresponds to the median of \(\hat{y}\)’s distribution, given by the \(\frac{c_s}{c_s+c_e} = 0.5\) quantile. If \(\hat{y}\)’s distribution is Gaussian, this median equals the mean, simplifying the argmin to the mean of \(\hat{y}\). This observation extends to both Gaussian models and the NVQP, highlighting that propagating uncertainty becomes more beneficial with more imbalance between \(c_s\) and \(c_e\).
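A minimal sketch of this quantile solution, assuming the scalar newsvendor cost \(c_s \max\{y - z, 0\} + c_e \max\{z - y, 0\}\) and a sampled predictive distribution of demand:

```python
import numpy as np

rng = np.random.default_rng(1)
y_samples = rng.lognormal(2.0, 0.4, size=100_000)  # sampled demand predictions

def optimal_order(y, c_s, c_e):
    # The minimizer of E[c_s * max(y - z, 0) + c_e * max(z - y, 0)]
    # is the c_s / (c_s + c_e) quantile of y's distribution.
    return np.quantile(y, c_s / (c_s + c_e))

# Balanced costs: the 0.5 quantile, i.e. the median
balanced = optimal_order(y_samples, c_s=1.0, c_e=1.0)
assert np.isclose(balanced, np.median(y_samples))

# Imbalanced shortage/excess costs move the order away from the median,
# which is where propagating the predictive distribution pays off.
skewed = optimal_order(y_samples, c_s=4.0, c_e=1.0)  # 0.8 quantile
assert skewed > balanced
```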
Appendix B. Newsvendor Problem as Quadratic Programming
Following [9, 12], we reformulate Eq. 8 by introducing new decision variables \(\boldsymbol{z_s} = \boldsymbol{y}-\boldsymbol{z}\) and \(\boldsymbol{z_e} = \boldsymbol{z}-\boldsymbol{y}\), with added constraints to align with the original problem’s bounds. This leads to a QP formulation: \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{v}} \frac{1}{2}\boldsymbol{v}^\intercal \boldsymbol{H} \boldsymbol{v} + \boldsymbol{k}^\intercal \boldsymbol{v}\) subject to \(\boldsymbol{A}\boldsymbol{v}\preceq \boldsymbol{b}\), where \(\boldsymbol{H} = 2\text {diag}[\boldsymbol{Q}, \boldsymbol{Q_s}, \boldsymbol{Q_e}]\), \(\boldsymbol{v} = [\boldsymbol{z}, \boldsymbol{z_s}, \boldsymbol{z_e}]\), \(\boldsymbol{k} = [\boldsymbol{c}, \boldsymbol{c_s}, \boldsymbol{c_e}]\), \(\boldsymbol{A} = [-\boldsymbol{I_{3d_z}}, [-\boldsymbol{I_{d_z}}, -\boldsymbol{I_{d_z}}, 0], [\boldsymbol{I_{d_z}}, 0, -\boldsymbol{I_{d_z}}], [\boldsymbol{p}, 0, 0]]^{\intercal }\), and \(\boldsymbol{b}=[0, 0, 0, -\boldsymbol{y}, \boldsymbol{y}, B]\). \(\boldsymbol{z}^*(\boldsymbol{y})\) is the primary variable of interest. Assuming \(\boldsymbol{H}\) is positive-definite ensures convexity. The formulation’s efficiency depends on the item count. It is suitable for a single predicted vector \(\boldsymbol{y}\); we generalize it to multiple prediction samples with a Stochastic Programming method.
Appendix B.1. Newsvendor Problem as Stochastic Quadratic Programming
When propagating the uncertainty of \(\boldsymbol{y}\) in a Monte Carlo fashion with M samples, the formulation above becomes \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{v}} \frac{1}{2}\boldsymbol{v}^\intercal \boldsymbol{H} \boldsymbol{v} + \boldsymbol{k}^\intercal \boldsymbol{v}\) s.t. \(\boldsymbol{A}\boldsymbol{v}\preceq \boldsymbol{b}\), where \(\boldsymbol{H} = 2\,\text {diag}([\boldsymbol{Q} \quad \frac{\boldsymbol{Q_s}}{M} \quad ... \quad \frac{\boldsymbol{Q_s}}{M} \quad \frac{\boldsymbol{Q_e}}{M} \quad ... \quad \frac{\boldsymbol{Q_e}}{M}])\); \(\boldsymbol{k} = [\boldsymbol{c} \quad \frac{\boldsymbol{c_s}}{M} \quad ... \quad \frac{\boldsymbol{c_s}}{M} \quad \frac{\boldsymbol{c_e}}{M} \quad ... \quad \frac{\boldsymbol{c_e}}{M}]\); \( \boldsymbol{v} = [\boldsymbol{z} \quad \boldsymbol{z_s}^{(1)} \quad ... \quad \boldsymbol{z_s}^{(M)} \quad \boldsymbol{z_e}^{(1)} \quad ... \quad \boldsymbol{z_e}^{(M)}]\); \(\boldsymbol{A} = [-\boldsymbol{I_F}, [-\boldsymbol{I_{B1}}, -\boldsymbol{I_{BB}}, \boldsymbol{0_{BB}}], [\boldsymbol{I_{B1}} ,\boldsymbol{0_{BB}} , -\boldsymbol{I_{BB}}], [\boldsymbol{p} ,0 , 0]]^\intercal \); and
\(\boldsymbol{b} = [\boldsymbol{0_{F}}, -\boldsymbol{y}^{(1)}, ..., -\boldsymbol{y}^{(M)}, \boldsymbol{y}^{(1)},..., \boldsymbol{y}^{(M)}, B]\), where \(\boldsymbol{I_F} = \boldsymbol{I_{d_z + 2Md_z}}\), \(\boldsymbol{0_F} = \boldsymbol{0_{d_z + 2Md_z}}\) (a 1D vector), \(\boldsymbol{I_{BB}} = \boldsymbol{I_{Md_z}}\), \(\boldsymbol{0_{BB}} = \boldsymbol{0_{Md_z}}\), and \(\boldsymbol{I_{B1}}\) is \(\boldsymbol{I_{d_z}}\) stacked M times. This is a generalization of the quadratic newsvendor experiment proposed in [15]. Note that \(\boldsymbol{v} \in \mathbb {R}^{d_z+2Md_z}\), \(\boldsymbol{H} \in \mathbb {R}^{(d_z+2Md_z) \times (d_z+2Md_z)}\), \(\boldsymbol{k} \in \mathbb {R}^{d_z+2Md_z}\), \(\boldsymbol{A} \in \mathbb {R}^{(d_z + 4Md_z+1) \times (d_z+2Md_z)}\), and \(\boldsymbol{b} \in \mathbb {R}^{d_z + 4Md_z+1}\). Therefore, both the number of items and the prediction sample size play an important and approximately equal role in the time needed to solve each instance of the OP. In practice, the complexity of the QP depends on the decision variable dimension, \(d_z+2Md_z\).
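The matrix assembly above can be sketched in NumPy with toy sizes. We assume diagonal \(\boldsymbol{Q}\), \(\boldsymbol{Q_s}\), \(\boldsymbol{Q_e}\) and arbitrary costs, prices, and budget (all hypothetical values), and verify that the assembled blocks have the stated dimensions:

```python
import numpy as np

d_z, M = 3, 5                        # items and Monte Carlo samples (toy sizes)
rng = np.random.default_rng(2)

# Assumed problem data: diagonal quadratic costs, linear costs, prices, budget
q = np.full(d_z, 1.0)                # diagonal of Q
qs = np.full(d_z, 1.0)               # diagonal of Q_s
qe = np.full(d_z, 1.0)               # diagonal of Q_e
c = np.full(d_z, 0.1)
cs = np.full(d_z, 2.0)               # shortage costs
ce = np.full(d_z, 0.5)               # excess costs
p = rng.uniform(1.0, 2.0, d_z)       # item prices
B = 10.0                             # budget
Y = rng.uniform(1.0, 5.0, (M, d_z))  # sampled demand vectors y^(j)

n = d_z + 2 * M * d_z                # dimension of v = [z, z_s^(1..M), z_e^(1..M)]

H = 2.0 * np.diag(np.concatenate([q, np.tile(qs / M, M), np.tile(qe / M, M)]))
k = np.concatenate([c, np.tile(cs / M, M), np.tile(ce / M, M)])

I_B1 = np.tile(np.eye(d_z), (M, 1))  # I_{d_z} stacked M times
I_BB = np.eye(M * d_z)
Z_BB = np.zeros((M * d_z, M * d_z))

A = np.vstack([
    -np.eye(n),                                           # v >= 0
    np.hstack([-I_B1, -I_BB, Z_BB]),                      # z + z_s^(j) >= y^(j)
    np.hstack([I_B1, Z_BB, -I_BB]),                       # z_e^(j) >= z - y^(j)
    np.concatenate([p, np.zeros(2 * M * d_z)])[None, :],  # p^T z <= B
])
b = np.concatenate([np.zeros(n), -Y.ravel(), Y.ravel(), [B]])

# Dimensions match the text: k in R^n, H in R^{n x n},
# A in R^{(d_z + 4Md_z + 1) x n}, b in R^{d_z + 4Md_z + 1}
assert H.shape == (n, n) and k.shape == (n,)
assert A.shape == (d_z + 4 * M * d_z + 1, n)
assert b.shape == (d_z + 4 * M * d_z + 1,)
```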
Appendix C. Portfolio Risk Minimization as an LP
With the same strategy as in Appendix B, we use the auxiliary variable \(\boldsymbol{u} = \max \{-\boldsymbol{y}^\intercal \boldsymbol{z}, 0\}\) to rewrite the POP formulation from the main text to
Note that the zero constant in the objective function is only to reinforce that \(\boldsymbol{z}\) is also part of the set of decision variables.
Appendix C.1. Portfolio Risk Minimization as a Stochastic LP
By giving a set of samples \(\boldsymbol{y}^{(j)}\) as input, as suggested in [33], the equation above can be rewritten in a stochastic programming fashion as
For implementation purposes, we followed [28] and added a small quadratic term to the LP so that the OP fits into the Amos & Kolter QP solver.
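A minimal stochastic-LP sketch with SciPy’s linprog, assuming the auxiliary-variable constraints \(u_j \ge -\boldsymbol{y}^{(j)\intercal}\boldsymbol{z}\), \(u_j \ge 0\), and (for illustration only) a fully invested long-only portfolio, \(\sum _i z_i = 1\), \(\boldsymbol{z} \succeq 0\):

```python
import numpy as np
from scipy.optimize import linprog

d, M = 4, 200                        # assets and return samples (toy sizes)
rng = np.random.default_rng(3)
Y = rng.normal(0.01, 0.05, (M, d))   # sampled asset returns y^(j)

# Variables x = [z (weights), u (per-sample losses)]; minimize the mean loss
obj = np.concatenate([np.zeros(d), np.full(M, 1.0 / M)])

# u_j >= -y^(j)^T z  <=>  -Y z - u <= 0  (u >= 0 is handled by bounds)
A_ub = np.hstack([-Y, -np.eye(M)])
b_ub = np.zeros(M)

# Assumed fully invested portfolio: sum(z) = 1
A_eq = np.concatenate([np.ones(d), np.zeros(M)])[None, :]
b_eq = np.array([1.0])

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (d + M), method="highs")
assert res.success
z_opt, u_opt = res.x[:d], res.x[d:]

# At the optimum, each u_j tightens to max(-y^(j)^T z, 0)
assert np.isclose(z_opt.sum(), 1.0)
assert np.allclose(u_opt, np.maximum(-Y @ z_opt, 0.0), atol=1e-6)
```

This is a plain LP; the small quadratic regularization term mentioned above would be added to the objective when using a QP solver instead.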
Appendix D. Implementation Details
NNs were implemented in PyTorch with the Adam optimizer, using learning rates of 0.0015 for NV, 0.002 for NVQP, and 0.001 for POP experiments. The Decoupled Bayesian Neural Network (BNN) used a learning rate of 0.0007, whereas the Combined BNN’s rate ranged between 0.0004 and 0.0007. An exponential scheduler decayed the learning rate by a factor of 0.99. The hyperparameter K, which balances data loss and regularization, was selected without hyperparameter optimization. Training ran on Nvidia RTX 2080 GPUs, and models were evaluated on the validation set before reporting test results. Gaussian Process baselines were implemented with Scikit-learn using a radial basis function kernel, optimizing the length scale and white-noise level. For multi-output tasks (NVQP and POP), fitting a separate Gaussian Process per output proved more effective.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lahoud, A.A., Schaffernicht, E., Stork, J.A. (2024). Learning Solutions of Stochastic Optimization Problems with Bayesian Neural Networks. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15016. Springer, Cham. https://doi.org/10.1007/978-3-031-72332-2_11
Print ISBN: 978-3-031-72331-5
Online ISBN: 978-3-031-72332-2