Abstract
Mathematical solvers use parametrized Optimization Problems (OPs) as inputs to yield optimal decisions. In many real-world settings, some of these parameters are unknown or uncertain. Recent research focuses on predicting the value of these unknown parameters from available contextual features, aiming to decrease decision regret through end-to-end learning approaches. However, these approaches disregard prediction uncertainty and therefore leave the mathematical solver susceptible to producing erroneous decisions when predictions have low confidence. We propose a novel framework that models prediction uncertainty with Bayesian Neural Networks (BNNs) and propagates this uncertainty into the mathematical solver with a Stochastic Programming technique. The differentiable nature of BNNs and differentiable mathematical solvers allows for two learning approaches: in the Decoupled learning approach, we update the BNN weights to improve the quality of the predicted distribution of the OP parameters, while in the Combined learning approach, we update the weights to directly minimize the OP’s expected cost function in a stochastic end-to-end fashion. We perform an extensive evaluation using synthetic data with various noise properties and a real dataset, showing that decision regret is generally lower (better) with both proposed methods. The code is available at https://github.com/AlanLahoud/BNNSOP.
References
Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., Kolter, J.Z.: Differentiable Convex Optimization Layers, vol. 32. Curran Associates Inc. (2019)
Amos, B., Kolter, J.Z.: OptNet: differentiable optimization as a layer in neural networks. In: International Conference on Machine Learning, pp. 136–145. PMLR (2017)
Ban, G.Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2019)
Bayraksan, G., Love, D.K.: Data-driven stochastic programming using phi-divergences. In: The Operations Research Revolution, pp. 1–19. INFORMS (2015)
Bell, D.E.: Regret in decision making under uncertainty. Oper. Res. 30(5), 961–981 (1982)
Birge, J.R., Louveaux, F.: Introduction to stochastic programming. Springer Science & Business Media (2011)
Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: International Conference on Machine Learning, pp. 1613–1622. PMLR (2015)
Demirović, E., et al.: An investigation into prediction + optimisation for the knapsack problem. In: Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR 2019). LNCS, vol. 11494. Springer, Cham (2019)
Donti, P., Amos, B., Kolter, J.Z.: Task-based end-to-end model learning in stochastic optimization. Adv. Neural Inform. Process. Syst. 30 (2017)
Elmachtoub, A.N., Grigas, P.: Smart “predict, then optimize”. Manage. Sci. 68(1), 9–26 (2022)
Ferber, A., Wilder, B., Dilkina, B., Tambe, M.: MIPaaL: mixed integer program as a layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 1504–1511 (2020)
Grimes, D., Ifrim, G., O’Sullivan, B., Simonis, H.: Analyzing the impact of electricity price forecasting on energy cost-aware scheduling. Sustainable Comput. Inform. Syst. 4(4), 276–291 (2014). Special Issue on Energy Aware Resource Management and Scheduling (EARMS)
Hannah, L.A.: Stochastic optimization. Inter. Encycl. Soc. Behav. Sci. 2, 473–481 (2015)
Hoseinzade, E., Haratizadeh, S.: CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst. Appl. 129, 273–285 (2019)
Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach. Learn. 110(3), 457–506 (2021)
Ifrim, G., O’Sullivan, B., Simonis, H.: Properties of energy-price forecasts for scheduling. In: Milano, M. (ed.) CP 2012. LNCS, pp. 957–972. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33558-7_68
Jospin, L.V., Laga, H., Boussaid, F., Buntine, W., Bennamoun, M.: Hands-on Bayesian neural networks - a tutorial for deep learning users. IEEE Comput. Intell. Mag. 17(2), 29–48 (2022)
Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural Inform. Process. Syst. 30 (2017)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14-16 April 2014, Conference Track Proceedings (2014)
Kong, L., Cui, J., Zhuang, Y., Feng, R., Prakash, B.A., Zhang, C.: End-to-end stochastic optimization with energy-based model. Adv. Neural. Inf. Process. Syst. 35, 11341–11354 (2022)
Lahoud, A.A., Schaffernicht, E., Stork, J.A.: DataSP: a differential all-to-all shortest path algorithm for learning costs and predicting paths with context. arXiv preprint arXiv:2405.04923 (2024)
Li, X., Shou, B., Qin, Z.: An expected regret minimization portfolio selection model. Eur. J. Oper. Res. 218(2), 484–492 (2012)
Mandi, J., Guns, T.: Interior point solving for LP-based prediction + optimisation. Adv. Neural. Inf. Process. Syst. 33, 7272–7282 (2020)
Pearce, T., Leibfried, F., Brintrup, A.: Uncertainty in neural networks: Approximately bayesian ensembling. In: International Conference on Artificial Intelligence and Statistics, pp. 234–244. PMLR (2020)
Powell, W.B.: A unified framework for stochastic optimization. Eur. J. Oper. Res. 275(3), 795–821 (2019)
Rockafellar, R.T., Uryasev, S., et al.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
Wilder, B., Dilkina, B., Tambe, M.: Melding the data-decisions pipeline: decision-focused learning for combinatorial optimization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 1658–1665 (2019)
Acknowledgement
This work has been supported by the Industrial Graduate School Collaborative AI & Robotics funded by the Swedish Knowledge Foundation Dnr:20190128, and the Knut and Alice Wallenberg Foundation through Wallenberg AI, Autonomous Systems and Software Program (WASP).
Appendices
Appendix A. Limitations of Uncertainty Propagation
This paper focuses on minimizing \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}f(\boldsymbol{z}, \hat{\boldsymbol{y}})\). This problem simplifies to \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} f(\boldsymbol{z}, {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}\hat{\boldsymbol{y}})\) when the expected value of the objective function is replaced by the objective evaluated at the expected value of the predictions, but this simplification is only valid under certain conditions. If these conditions are met, we recommend solving the argmin by plugging the expected value of the predictions directly into the objective (Decoupled).
Appendix A.1. Linear Objective Functions with Respect to the Unknown Variable
If \(f(\boldsymbol{z}, \boldsymbol{y})\) is linear with respect to \(\boldsymbol{y}\), then \({{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}f(\boldsymbol{z}, \hat{\boldsymbol{y}}) = f(\boldsymbol{z}, {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}\hat{\boldsymbol{y}})\). Applying the argmin with respect to \(\boldsymbol{z}\) on both sides we have \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}f(\boldsymbol{z}, \hat{\boldsymbol{y}}) = {{\,\mathrm{arg\,min}\,}}_{\boldsymbol{z}} f(\boldsymbol{z}, {{\,\mathrm{\mathbb {E}}\,}}_{\hat{\boldsymbol{y}}}\hat{\boldsymbol{y}}).\)
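As a numerical sanity check of this equality, the toy example below uses a hypothetical cost \(f(z, y) = yz + \frac{1}{2}z^2\) (linear in \(y\); the quadratic term involves only \(z\)) and compares the decision obtained by averaging the cost over Monte Carlo samples of \(y\) against the decision obtained by plugging in the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y_samples = rng.normal(3.0, 1.0, size=1000)   # Monte Carlo draws of the unknown parameter
z_grid = np.linspace(-5.0, 5.0, 501)          # candidate decisions

def f(z, y):
    return y * z + 0.5 * z**2                  # linear in y

# argmin_z E_y[f(z, y)] via Monte Carlo averaging
exp_costs = np.array([f(z, y_samples).mean() for z in z_grid])
z_star_exp = z_grid[exp_costs.argmin()]

# argmin_z f(z, E[y]) via plugging in the mean prediction
mean_costs = f(z_grid, y_samples.mean())
z_star_mean = z_grid[mean_costs.argmin()]

# Linearity in y makes both decisions coincide
assert np.isclose(z_star_exp, z_star_mean)
```

For a cost that is nonlinear in \(y\), these two decisions generally differ, which is exactly when propagating the full predictive distribution matters.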
Appendix A.2. Balanced Newsvendor Problem
When \(c_s = c_e\) in the NV problem, the optimal order quantity \({{\,\mathrm{arg\,min}\,}}_{z} {{\,\mathrm{\mathbb {E}}\,}}_{\hat{y}}f(z, \hat{y})\) corresponds to the median of \(\hat{y}\)’s distribution, given by the \(\frac{c_s}{c_s+c_e} = 0.5\) quantile. If \(\hat{y}\)’s distribution is Gaussian, this median equals the mean, simplifying the argmin to the mean of \(\hat{y}\). This observation extends to both Gaussian models and the NVQP, highlighting that propagating uncertainty becomes more beneficial with more imbalance between \(c_s\) and \(c_e\).
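A minimal sketch of this quantile solution, assuming the scalar newsvendor cost \(c_s \max\{y - z, 0\} + c_e \max\{z - y, 0\}\) and a sampled predictive distribution of demand:

```python
import numpy as np

rng = np.random.default_rng(1)
y_samples = rng.lognormal(2.0, 0.4, size=100_000)  # sampled demand predictions

def optimal_order(y, c_s, c_e):
    # The minimizer of E[c_s * max(y - z, 0) + c_e * max(z - y, 0)]
    # is the c_s / (c_s + c_e) quantile of y's distribution.
    return np.quantile(y, c_s / (c_s + c_e))

# Balanced costs: the 0.5 quantile, i.e. the median
balanced = optimal_order(y_samples, c_s=1.0, c_e=1.0)
assert np.isclose(balanced, np.median(y_samples))

# Imbalanced shortage/excess costs move the order away from the median,
# which is where propagating the predictive distribution pays off.
skewed = optimal_order(y_samples, c_s=4.0, c_e=1.0)  # 0.8 quantile
assert skewed > balanced
```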
Appendix B. Newsvendor Problem as Quadratic Programming
Following [9, 12], we reformulate Eq. 8 by introducing new decision variables \(\boldsymbol{z_s} = \boldsymbol{y}-\boldsymbol{z}\) and \(\boldsymbol{z_e} = \boldsymbol{z}-\boldsymbol{y}\), with added constraints to align with the original problem’s bounds. This leads to a QP formulation: \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{v}} \frac{1}{2}\boldsymbol{v}^\intercal \boldsymbol{H} \boldsymbol{v} + \boldsymbol{k}^\intercal \boldsymbol{v}\) subject to \(\boldsymbol{A}\boldsymbol{v}\preceq \boldsymbol{b}\), where \(\boldsymbol{H} = 2\text {diag}[\boldsymbol{Q}, \boldsymbol{Q_s}, \boldsymbol{Q_e}]\), \(\boldsymbol{v} = [\boldsymbol{z}, \boldsymbol{z_s}, \boldsymbol{z_e}]\), \(\boldsymbol{k} = [\boldsymbol{c}, \boldsymbol{c_s}, \boldsymbol{c_e}]\), \(\boldsymbol{A} = [-\boldsymbol{I_{3d_z}}, [-\boldsymbol{I_{d_z}}, -\boldsymbol{I_{d_z}}, 0], [\boldsymbol{I_{d_z}}, 0, -\boldsymbol{I_{d_z}}], [\boldsymbol{p}, 0, 0]]^{\intercal }\), and \(\boldsymbol{b}=[0, 0, 0, -\boldsymbol{y}, \boldsymbol{y}, B]\). \(\boldsymbol{z}^*(\boldsymbol{y})\) is the primary variable of interest. Assuming \(\boldsymbol{H}\) is positive-definite ensures convexity. The formulation’s efficiency depends on the item count. It is suitable for a single predicted vector \(\boldsymbol{y}\); we generalize it to multiple prediction samples with a Stochastic Programming method.
Appendix B.1. Newsvendor Problem as Stochastic Quadratic Programming
When propagating the uncertainty of \(\boldsymbol{y}\) in a Monte Carlo fashion with M samples, the formulation above becomes \({{\,\mathrm{arg\,min}\,}}_{\boldsymbol{v}} \frac{1}{2}\boldsymbol{v}^\intercal \boldsymbol{H} \boldsymbol{v} + \boldsymbol{k}^\intercal \boldsymbol{v}\) s.t. \(\boldsymbol{A}\boldsymbol{v}\preceq \boldsymbol{b}\), where \(\boldsymbol{H} = 2\,\text {diag}([\boldsymbol{Q} \quad \frac{\boldsymbol{Q_s}}{M} \quad ... \quad \frac{\boldsymbol{Q_s}}{M} \quad \frac{\boldsymbol{Q_e}}{M} \quad ... \quad \frac{\boldsymbol{Q_e}}{M}])\); \(\boldsymbol{k} = [\boldsymbol{c} \quad \frac{\boldsymbol{c_s}}{M} \quad ... \quad \frac{\boldsymbol{c_s}}{M} \quad \frac{\boldsymbol{c_e}}{M} \quad ... \quad \frac{\boldsymbol{c_e}}{M}]\); \( \boldsymbol{v} = [\boldsymbol{z} \quad \boldsymbol{z_s}^{(1)} \quad ... \quad \boldsymbol{z_s}^{(M)} \quad \boldsymbol{z_e}^{(1)} \quad ... \quad \boldsymbol{z_e}^{(M)}]\); \(\boldsymbol{A} = [-\boldsymbol{I_F}, [-\boldsymbol{I_{B1}}, -\boldsymbol{I_{BB}}, \boldsymbol{0_{BB}}], [\boldsymbol{I_{B1}} ,\boldsymbol{0_{BB}} , -\boldsymbol{I_{BB}}], [\boldsymbol{p} ,0 , 0]]^\intercal \); and
\(\boldsymbol{b} = [\boldsymbol{0_{F}}, -\boldsymbol{y}^{(1)}, ..., -\boldsymbol{y}^{(M)}, \boldsymbol{y}^{(1)},..., \boldsymbol{y}^{(M)}, B]\), where \(\boldsymbol{I_F} = \boldsymbol{I_{d_z + 2Md_z}}\), \(\boldsymbol{0_F} = \boldsymbol{0_{d_z + 2Md_z}}\) (a 1D vector), \(\boldsymbol{I_{BB}} = \boldsymbol{I_{Md_z}}\), \(\boldsymbol{0_{BB}} = \boldsymbol{0_{Md_z}}\), and \(\boldsymbol{I_{B1}}\) is \(\boldsymbol{I_{d_z}}\) stacked M times. This is a generalization of the quadratic newsvendor experiment proposed in [15]. Note that \(\boldsymbol{v} \in \mathbb {R}^{d_z+2Md_z}\), \(\boldsymbol{H} \in \mathbb {R}^{(d_z+2Md_z) \times (d_z+2Md_z)}\), \(\boldsymbol{k} \in \mathbb {R}^{d_z+2Md_z}\), \(\boldsymbol{A} \in \mathbb {R}^{(d_z + 4Md_z+1) \times (d_z+2Md_z)}\), and \(\boldsymbol{b} \in \mathbb {R}^{d_z + 4Md_z+1}\). Therefore, both the number of items and the prediction sample size play an important and approximately equal role in the time needed to solve each instance of the OP. In practice, the complexity of the QP depends on the decision variable dimension, \(d_z+2Md_z\).
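The matrix assembly above can be sketched in NumPy with toy sizes. We assume diagonal \(\boldsymbol{Q}\), \(\boldsymbol{Q_s}\), \(\boldsymbol{Q_e}\) and arbitrary costs, prices, and budget (all hypothetical values), and verify that the assembled blocks have the stated dimensions:

```python
import numpy as np

d_z, M = 3, 5                        # items and Monte Carlo samples (toy sizes)
rng = np.random.default_rng(2)

# Assumed problem data: diagonal quadratic costs, linear costs, prices, budget
q = np.full(d_z, 1.0)                # diagonal of Q
qs = np.full(d_z, 1.0)               # diagonal of Q_s
qe = np.full(d_z, 1.0)               # diagonal of Q_e
c = np.full(d_z, 0.1)
cs = np.full(d_z, 2.0)               # shortage costs
ce = np.full(d_z, 0.5)               # excess costs
p = rng.uniform(1.0, 2.0, d_z)       # item prices
B = 10.0                             # budget
Y = rng.uniform(1.0, 5.0, (M, d_z))  # sampled demand vectors y^(j)

n = d_z + 2 * M * d_z                # dimension of v = [z, z_s^(1..M), z_e^(1..M)]

H = 2.0 * np.diag(np.concatenate([q, np.tile(qs / M, M), np.tile(qe / M, M)]))
k = np.concatenate([c, np.tile(cs / M, M), np.tile(ce / M, M)])

I_B1 = np.tile(np.eye(d_z), (M, 1))  # I_{d_z} stacked M times
I_BB = np.eye(M * d_z)
Z_BB = np.zeros((M * d_z, M * d_z))

A = np.vstack([
    -np.eye(n),                                           # v >= 0
    np.hstack([-I_B1, -I_BB, Z_BB]),                      # z + z_s^(j) >= y^(j)
    np.hstack([I_B1, Z_BB, -I_BB]),                       # z_e^(j) >= z - y^(j)
    np.concatenate([p, np.zeros(2 * M * d_z)])[None, :],  # p^T z <= B
])
b = np.concatenate([np.zeros(n), -Y.ravel(), Y.ravel(), [B]])

# Dimensions match the text: k in R^n, H in R^{n x n},
# A in R^{(d_z + 4Md_z + 1) x n}, b in R^{d_z + 4Md_z + 1}
assert H.shape == (n, n) and k.shape == (n,)
assert A.shape == (d_z + 4 * M * d_z + 1, n)
assert b.shape == (d_z + 4 * M * d_z + 1,)
```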
Appendix C. Portfolio Risk Minimization as an LP
With the same strategy as in Appendix B, we use the auxiliary variable \(\boldsymbol{u} = \max \{-\boldsymbol{y}^\intercal \boldsymbol{z}, 0\}\) to rewrite the POP formulation from the main text to
Note that the zero constant in the objective function is only to reinforce that \(\boldsymbol{z}\) is also part of the set of decision variables.
Appendix C.1. Portfolio Risk Minimization as a Stochastic LP
By giving a set of samples \(\boldsymbol{y}^{(j)}\) as input, as suggested in [33], the equation above can be rewritten in a stochastic programming fashion as
For implementation purposes, we followed [28] and added a small quadratic term to the LP so that the OP fits into the Amos & Kolter QP solver.
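A minimal stochastic-LP sketch with SciPy’s linprog, assuming the auxiliary-variable constraints \(u_j \ge -\boldsymbol{y}^{(j)\intercal}\boldsymbol{z}\), \(u_j \ge 0\), and (for illustration only) a fully invested long-only portfolio, \(\sum _i z_i = 1\), \(\boldsymbol{z} \succeq 0\):

```python
import numpy as np
from scipy.optimize import linprog

d, M = 4, 200                        # assets and return samples (toy sizes)
rng = np.random.default_rng(3)
Y = rng.normal(0.01, 0.05, (M, d))   # sampled asset returns y^(j)

# Variables x = [z (weights), u (per-sample losses)]; minimize the mean loss
obj = np.concatenate([np.zeros(d), np.full(M, 1.0 / M)])

# u_j >= -y^(j)^T z  <=>  -Y z - u <= 0  (u >= 0 is handled by bounds)
A_ub = np.hstack([-Y, -np.eye(M)])
b_ub = np.zeros(M)

# Assumed fully invested portfolio: sum(z) = 1
A_eq = np.concatenate([np.ones(d), np.zeros(M)])[None, :]
b_eq = np.array([1.0])

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (d + M), method="highs")
assert res.success
z_opt, u_opt = res.x[:d], res.x[d:]

# At the optimum, each u_j tightens to max(-y^(j)^T z, 0)
assert np.isclose(z_opt.sum(), 1.0)
assert np.allclose(u_opt, np.maximum(-Y @ z_opt, 0.0), atol=1e-6)
```

This is a plain LP; the small quadratic regularization term mentioned above would be added to the objective when using a QP solver instead.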
Appendix D. Implementation Details
NNs were implemented in PyTorch with the Adam optimizer, using learning rates of 0.0015 for NV, 0.002 for NVQP, and 0.001 for POP experiments. The Decoupled Bayesian Neural Network (BNN) used a learning rate of 0.0007, whereas the Combined BNN’s rate ranged between 0.0004 and 0.0007. An exponential scheduler decayed the learning rate by a factor of 0.99. The hyperparameter K, which balances data loss and regularization, was selected without hyperparameter optimization. Training ran on Nvidia RTX 2080 GPUs, and models were evaluated on the validation set before reporting test results. Gaussian Process baselines were implemented with Scikit-learn using a radial basis function kernel, optimizing the length scale and white-noise level. For multi-output tasks (NVQP and POP), fitting a separate Gaussian Process per output proved more effective.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lahoud, A.A., Schaffernicht, E., Stork, J.A. (2024). Learning Solutions of Stochastic Optimization Problems with Bayesian Neural Networks. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15016. Springer, Cham. https://doi.org/10.1007/978-3-031-72332-2_11
Print ISBN: 978-3-031-72331-5
Online ISBN: 978-3-031-72332-2