
Predictive stochastic programming

Original Paper · Computational Management Science

Abstract

Several emerging applications call for a fusion of statistical learning and stochastic programming (SP). We introduce a new class of models which we refer to as Predictive Stochastic Programming (PSP). Unlike ordinary SP, PSP models work with datasets which represent random covariates, often referred to as predictors (or features) and responses (or labels) in the machine learning literature. As a result, PSP models call for methodologies which borrow relevant concepts from both learning and optimization. We refer to such a methodology as Learning Enabled Optimization (LEO). This paper sets forth the foundation for such a framework by introducing several novel concepts, such as statistical optimality, hypothesis tests for model-fidelity, the generalization error of PSP, and finally, a non-parametric methodology for model selection. These new concepts, collectively referred to as LEO, provide a formal framework for modeling, solving, validating, and reporting solutions of PSP models. We illustrate the LEO framework by applying it to a production-marketing coordination model, which combines a pedagogical production planning model with an advertising dataset intended for sales prediction.


Notes

  1. To see the distinction between PSP and LEO, note that the former is a model, while the latter is a workflow/algorithm.

  2. The actual dataset provided in James et al. (2013) also includes newspaper advertising. However, we have dropped it here to preserve the simplicity of the example.

  3. The numbers are the same as in James et al. (2013).

  4. The term “Acceptable” is the label for the diamond in both parts (a) and (b) of the figure.

  5. Standard acronyms like LP and DP will mean Linear Programming and Dynamic Programming. Similarly SLP means Stochastic LP.

  6. This paper appears to have been completed around the same time as the first submission of our paper.

  7. A simpler version of a PSP model is one where \(J = \emptyset \), and in that case, the predictors Z do not assume values with the same coordinates as the optimization variables (x).

  8. Readers unfamiliar with the context of Appendix II are well advised to familiarize themselves with it before reading the remainder of this section.

  9. We are grateful to Simge Kucukyavuz for this observation.

  10. Combining multiple linear regression and stochastic linear programming.

References

  • Atakan S, Gangammanavar H, Sen S (2021) Operations planning experiments for power systems with high renewable resources. USC Epstein Dept. Working Paper

  • Ban GY, El Karoui N, Lim AEB (2019) Machine learning and portfolio optimization. Manag Sci 64:1136–1154

  • Ban GY, Rudin C (2019) The big data newsvendor: practical insights from machine learning. Manag Sci 67:90–108

  • Bayraksan G, Morton DP (2011) A sequential sampling procedure for stochastic programming. Operations Res 59(4):898–913

  • Bayraksan G, Pierre-Louis P (2012) Fixed-width sequential stopping rules for a class of stochastic programs. SIAM J Optim 22(4):1518–1548

  • Ben-Tal A, Nemirovski A (2001) Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM, Philadelphia

  • Bertsekas DP (2012) Dynamic programming and optimal control, vol 2. Athena Scientific optimization and computation series. Athena Scientific

  • Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Manag Sci 66(3):1025–1044

  • Bertsimas D, King A (2016) OR Forum: an algorithmic approach to linear regression. Operations Res 64(1):2–16

  • Bertsimas D, Shioda R (2007) Classification and regression via integer optimization. Operations Res 55(2):252–271

  • Bertsimas D, Sim M (2004) The price of robustness. Operations Res 52(1):35–53

  • Birge JR, Louveaux F (2011) Introduction to stochastic programming. Springer Science+Business Media, Berlin

  • Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424

  • Boonmee C, Arimura M, Asada T (2017) Facility location optimization model for emergency humanitarian logistics. Int J Disaster Risk Reduct 24:485–498

  • de Farias DP, Van Roy B (2004) On constraint sampling in the linear programming approach to approximate dynamic programming. Math Operations Res 29(3):462–478

  • Deng Y, Liu J, Sen S (2018) Coalescing data and decisions for analytics. In: Doug S (ed) Tutorials in operations research: recent advances in optimization and modeling of contemporary problems, chap 2. INFORMS, Catonsville, MD, pp 20–49

  • Dye S (2009) Simple recourse problem. Springer, Boston, MA, pp 3571–3575

  • Frazier P (2012) Optimization via simulation with Bayesian statistics and dynamic programming. In: Proceedings of the Winter Simulation Conference, p 7

  • Freimer MB, Linderoth JT, Thomas DJ (2012) The impact of sampling methods on bias and variance in stochastic linear programs. Comput Optim Appl 51(1):51–75

  • Frey J (2013) Data-driven nonparametric prediction intervals. J Stat Plan Inference 143(6):1039–1048

  • Glynn PW, Infanger G (2013) Simulation-based confidence bounds for two-stage stochastic programs. Math Program 138(1–2):15–42

  • Hastie TJ, Tibshirani RJ, Friedman JH (2011) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin

  • Higle JL, Sen S (1991) Stochastic decomposition: an algorithm for two-stage linear programs with recourse. Math Operations Res 16(3):650–669

  • Higle JL, Sen S (1994) Finite master programs in regularized stochastic decomposition. Math Program 67(1–3):143–168

  • Higle JL, Sen S (1996a) Duality and statistical tests of optimality for two stage stochastic programs. Math Program 75(2):257–275

  • Higle JL, Sen S (1996b) Stochastic decomposition. Springer, Berlin

  • Hillier FS, Lieberman GJ (2012) Introduction to operations research. Tata McGraw-Hill Education, New York

  • Homem-de-Mello T, Bayraksan G (2014) Monte Carlo sampling-based methods for stochastic optimization. Surv Operations Res Manag Sci 19(1):56–85

  • James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning. Springer, Berlin

  • Jarque CM, Bera AK (1980) Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Econ Lett 6(3):255–259

  • Kao YH, Van Roy B, Yan X (2009) Directed regression. In: Proceedings of the Neural Information Processing Systems Conference (NIPS), pp 889–897

  • Kleywegt AJ, Shapiro A, Homem-de-Mello T (2002) The sample average approximation method for stochastic discrete optimization. SIAM J Optim 12(2):479–502

  • Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47(260):583–621

  • Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364

  • Linderoth J, Shapiro A, Wright S (2006) The empirical behavior of sampling methods for stochastic programming. Ann Operations Res 142(1):215–241

  • Liu J, Li G, Sen S (2021) Coupled learning enabled stochastic programming with endogenous uncertainty. Math Operations Res (accepted)

  • Liu J, Sen S (2020) Asymptotic results for two-stage stochastic quadratic programming. SIAM J Optim 30:823–854

  • Liyanage LH, Shanthikumar JG (2005) A practical inventory control policy using operational statistics. Operations Res Lett 33(4):341–348

  • Mak WK, Morton DP, Wood RK (1999) Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Res Lett 24(1):47–56

  • Miller N, Ruszczyński A (2011) Risk-averse two-stage stochastic linear programming: modeling and decomposition. Operations Res 59(1):125–132

  • Pawlowski N, Rajchl M, Glocker B (2017) Implicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297

  • Powell WB (2011) Approximate dynamic programming: solving the curses of dimensionality, vol 842. John Wiley & Sons, Hoboken

  • Rencher AC, Schaalje GB (2008) Linear models in statistics. Wiley Interscience, Hoboken

  • Rios I, Wets RJB, Woodruff DL (2015) Multi-period forecasting and scenario generation with limited data. Comput Manag Sci 12(2):267–295

  • Royset JO, Szechtman R (2013) Optimal budget allocation for sample average approximation. Operations Res 61(3):762–776

  • Royset JO, Wets RJB (2014) From data to assessments and decisions: epi-spline technology. In: Tutorials in operations research: bridging data and decisions. INFORMS, pp 27–53

  • Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Operations Res 60(1):180–195

  • Sen S, Liu Y (2016) Mitigating uncertainty via compromise decisions in two-stage stochastic linear programming: variance reduction. Operations Res 64(6):1422–1437

  • Sen S, Zhou Z (2014) Multistage stochastic decomposition: a bridge between stochastic programming and approximate dynamic programming. SIAM J Optim 24(1):127–153

  • Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming, vol 10. SIAM, Philadelphia

  • Shapiro A, Homem-de-Mello T (1998) A simulation-based approach to two-stage stochastic programming with recourse. Math Program 81(3):301–325

  • Van Parys BPG, Esfahani PM, Kuhn D (2020) From data to decisions: distributionally robust optimization is optimal. Manag Sci. https://doi.org/10.1287/mnsc.2020.3678

  • Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J Optim 24(4):2057–2075


Acknowledgements

This research was supported by grants from AFOSR (FA9550-20-1-0006), ONR (N00014-20-1-2077) and NSF (CMMI 1538605). This paper was presented at ICSP 2019. We thank Stein Wallace for hosting an entire session on PSP and LEO at the above-mentioned conference. We also thank Gabe Hackebeil for extending PySP functionality to allow SMPS output, which is necessary as input to our SD code (available from the authors, as well as the GitHub repository). We are grateful to the referees for their detailed reports, which have made the exposition much clearer than the previous version. We also appreciate the encouragement from Francesca Maggioni and Stein-Erik Fleten to include this paper as part of the CMS special issue. Finally, we thank Jiajun Xu for his feedback on this paper, and Yihang Zhang for suggesting (19) and confirming the computational results.

Author information

Correspondence to Suvrajeet Sen.


Appendices

Appendix I: Details of LEO-Wyndor and computational results

We consider the following statistical models in the computational results. The errors used for training (T) as well as validation (V) depend on the model index q, which in this case takes values \(q \in \{1, \ldots ,4 \}\) as shown below.

  1.

    (DF) For deterministic forecasts, we simply use the sales given by \({\hat{m}}_1(z) = {\hat{\beta }}_0 + {\hat{\beta }}_1 z_1 + {\hat{\beta }}_2 z_2\). Thus, for deterministic predictions, we do not account for errors in the forecast; i.e., training errors for DF are assumed to be zero. The remaining cases below will use errors based on the data.

  2.

    (NDU) One approximation of the sales forecast is one in which the correlations between the coefficients are ignored. The resulting model takes the form \(m_2(z, \xi ) := ({\hat{\beta }}_0 +\xi _0) + ({\hat{\beta }}_1 + \xi _1)z_1 + ({\hat{\beta }}_2 + \xi _2) z_2\), where \(z=(z_1, z_2)\) represents a generic outcome vector, and \((\xi _0, \xi _1, \xi _2)\) are generic error outcomes associated with normally distributed uncorrelated (NDU) random variables \(({\tilde{\xi }}_0, {\tilde{\xi }}_1, {\tilde{\xi }}_2)\) with mean zero and a diagonal variance matrix \(\varSigma _\beta \) as computed by the linear regression. The outcomes of errors associated with specific data points \((Z_i, W_i)\) are denoted \(\xi _i = (\xi _{i0}, \xi _{i1}, \xi _{i2})\) and can be obtained by solving the following problem.

    $$\begin{aligned}&\min _{\xi _i} ~ \xi _i^\top \varSigma _\beta ^{-1}\xi _i \end{aligned}$$
    (19a)
    $$\begin{aligned}&\left( {\hat{\beta }}_0+\xi _{i0}\right) +\left( {\hat{\beta }}_1 +\xi _{i1}\right) Z_{i1} + \left( {\hat{\beta }}_2 +\xi _{i2}\right) Z_{i2} = W_i \end{aligned}$$
    (19b)
  3.

    (NDC) This sales-forecast model allows correlations between the coefficients of \(m_3(z,\xi ) := ({\hat{\beta }}_0 + \xi _0) + ({\hat{\beta }}_1 + \xi _1)z_1 + ({\hat{\beta }}_2 + \xi _2) z_2\). Again, \((\xi _0, \xi _1, \xi _2)\) represent generic errors associated with generic outcomes z, whereas identifying the specific errors associated with each data point \((Z_i, W_i)\) requires determining \((\xi _{i0}, \xi _{i1}, \xi _{i2})\) by solving (19), although the variance-covariance matrix \(\varSigma _{\beta }\) is not diagonal in the case of NDC.

  4.

    (EAE) This is the Empirical Additive Error (EAE) model, where \( m_4(z,\xi _0)= ({\hat{\beta }}_0 + \xi _{0}) + {\hat{\beta }}_1 z_{1} + {\hat{\beta }}_2 z_{2}\), and \(\xi _0\) denotes a generic outcome of the error random variable. The specific outcomes are \(\xi _{i0} = W_i - m_4(Z_i, 0)\).

As in the setup for (5), the index q refers to the alternative error models (DF, NDU, NDC and EAE). Since all decision variables x (in this example) belong to the same space as outcomes of the random variable Z, we remind the reader to explicitly use the recommended decision \(x_q\) in defining the collection of validation errors by substituting \(Z=z=x_q\) for the model indexed by q.
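To make the recovery of error outcomes concrete, note that (19) is an equality-constrained quadratic program with the closed-form solution \(\xi _i = \varSigma _\beta a_i r_i / (a_i^\top \varSigma _\beta a_i)\), where \(a_i = (1, Z_{i1}, Z_{i2})\) and \(r_i = W_i - a_i^\top {\hat{\beta }}\). The sketch below illustrates this under hypothetical coefficient estimates and a diagonal covariance (the NDU case); none of the numbers come from the paper's instance.

```python
import numpy as np

# Hypothetical regression output (illustrative only; not the paper's
# estimates): coefficients and a diagonal covariance (the NDU case).
beta_hat = np.array([2.9, 0.046, 0.188])               # (beta0, beta1, beta2)
Sigma_beta = np.diag([0.3**2, 0.0014**2, 0.008**2])

def error_outcome(Z, W, beta_hat, Sigma):
    """Solve (19): min xi' Sigma^{-1} xi  s.t.  a'(beta_hat + xi) = W,
    with a = (1, Z1, Z2).  The KKT conditions give the closed form
    xi = Sigma a r / (a' Sigma a), where r = W - a' beta_hat."""
    a = np.array([1.0, Z[0], Z[1]])
    r = W - a @ beta_hat
    return Sigma @ a * (r / (a @ Sigma @ a))

a = np.array([1.0, 150.0, 25.0])
xi = error_outcome((150.0, 25.0), 15.0, beta_hat, Sigma_beta)
# The constraint (19b) holds at the returned xi.
```

For NDC the same formula applies with a non-diagonal \(\varSigma _\beta \).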

\({{\hbox {LEO-Wyndor Model:}}}\) Index Sets and Variables

\(i \equiv \) index of product, \(i\in \{A, B\}. \) \(y_i \equiv \) number of batches of product i produced.

$$\begin{aligned} f(x)&=-0.1x_1-0.5x_2+{\mathbb {E}}\left[ H\left( x, ~ {\tilde{\xi }}_q ~|~Z=z=x\right) \right] \end{aligned}$$
(20a)
$$\begin{aligned}&\text {s.t.} \quad x_1 + x_2 \le 200 \end{aligned}$$
(20b)
$$\begin{aligned}&x_1 - 0.5 x_2 \ge 0 \end{aligned}$$
(20c)
$$\begin{aligned}&L_1 := 0.7 \le x_1 \le U_1:=200, L_2 := 0 \le x_2 \le U_2 := 50 \end{aligned}$$
(20d)
$$\begin{aligned} h({x}, \xi _q ~|~ Z=z=x) \quad&= \quad \text {Max} \quad 3 y_A + 5 y_B \end{aligned}$$
(21a)
$$\begin{aligned} \text {s.t.} \quad y_A \quad \qquad&\le 8 \end{aligned}$$
(21b)
$$\begin{aligned} \quad \quad 2 y_B&\le 24 \end{aligned}$$
(21c)
$$\begin{aligned} \quad 3y_A + 2y_B&\le 36 \end{aligned}$$
(21d)
$$\begin{aligned} \quad y_A + y_B&\le m_q(z,\xi _q) \end{aligned}$$
(21e)
$$\begin{aligned} y_A, y_B&\ge 0. \end{aligned}$$
(21f)

As discussed in Sect. 5.1, \([L_1, U_1]\) and \([L_2, U_2]\) are available from the lower and upper limits of the regression data on covariates, and are assumed to allow decisions which are consistent with assumption A2. These values are the same as the upper and lower bounds for model (3a). Moreover, this instance is stated as a “maximization” model, whereas our previous discussions were set in the context of “minimization”. When interpreting the results, it helps to keep this distinction in mind. The PSP models presented above are relatively general, allowing very general regression models such as kernel-based methods, projection pursuit, and others. However, our current computational infrastructure is limited to stochastic linear programming (SLP), and as a result the regressions are restricted to MLR models, which are allowed to have one or more coefficients that are random and possibly correlated.
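To see how a single recourse evaluation works, the sketch below solves the second-stage LP (21) for a given realized demand using SciPy's LP solver. This illustrates only one scenario evaluation, not the full SAA/SD machinery, and the demand values passed in are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(m):
    """Second-stage LP (21) for a realized demand m:
    max 3 y_A + 5 y_B  s.t.  y_A <= 8,  2 y_B <= 24,
    3 y_A + 2 y_B <= 36,  y_A + y_B <= m,  y >= 0."""
    c = [-3.0, -5.0]                        # linprog minimizes, so negate
    A_ub = [[1, 0], [0, 2], [3, 2], [1, 1]]
    b_ub = [8, 24, 36, m]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
    return -res.fun

tight = recourse_value(10.0)    # demand constraint (21e) binds
slack = recourse_value(50.0)    # demand constraint is slack
```

When the realized demand exceeds the capacity implied by (21b)-(21d), constraint (21e) is slack and the recourse value saturates at its capacity-limited maximum.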

1.1 Computational results

We now present computations for the LEO-Wyndor data under alternative models. DF/LP represents predictive stochastic programming using deterministic forecasts, in which we use the expected value of the linear regression as the demand model. Because such a setup assumes all coefficients to be deterministic, the resulting model is a linear program. Of course, one can also study other models where linear regression suggests alternative parameters: (a) the additive scalar error model (\({\tilde{\xi }}_0\)) using the empirical additive errors (EAE) and deterministic model coefficients \({\hat{\beta }}_0, {\hat{\beta }}_1, {\hat{\beta }}_2\), where \({\hat{\beta }}_1\) is the coefficient for TV expenditures, and \({\hat{\beta }}_2\) is the coefficient for radio expenditures; (b) a linear regression whose coefficients are random variables \(\{{\tilde{\beta }}_j\}\), which are normally distributed and uncorrelated (NDU); (c) a linear regression whose coefficients are multi-variate random variables \(\{{\tilde{\beta }}_\tau \}\), which are possibly correlated (NDC). We reiterate that all three models EAE, NDU, NDC correspond to specific types of errors (which are indexed by q in the presentation in Sect. 3). Note that for models NDU and NDC, we have continuous random variables, and as a result we adopted SD as the solution methodology because it manages discrete and continuous random variables with equal ease. We refer to these results by NDU/SD and NDC/SD. However, note that for the case of EAE, the dataset is finite and reasonably manageable. Hence we use both SAA and SD for this model, and refer to them by EAE/SAA and EAE/SD.

1.1.1 Results for alternative models (Error Terms)

The calculations begin with the first test, shown as the top diamond block in Fig. 1b. Table 4 shows the p-values and results of the \(\chi ^2\) test for NDU/SD, NDC/SD and EAE. From the values reported in Table 4, the fit appears to improve when a few of the data points near the boundary are eliminated. Since we retain more than 95% of the available training data, we accept the revised dataset as being appropriate for training.
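A minimal sketch of this kind of model-fidelity check, using synthetic errors and SciPy's \(\chi ^2\) test; the equal-mass binning scheme below is our own illustrative choice, not the paper's.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
train_err = rng.normal(0.0, 1.0, 160)   # synthetic training-set errors
valid_err = rng.normal(0.0, 1.0, 40)    # synthetic validation-set errors

# Bin edges at training-error quantiles, so each bin holds equal training mass.
edges = np.quantile(train_err, np.linspace(0.0, 1.0, 6))
edges[0], edges[-1] = -np.inf, np.inf
obs = np.histogram(valid_err, bins=edges)[0]
exp = np.full(5, len(valid_err) / 5.0)  # expected counts under model fidelity

stat, p = chisquare(obs, exp)
# A small p-value would reject the hypothesis that the validation errors
# follow the error distribution fitted on the training set.
```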

Table 4 LEO-Wyndor: comparison of Chi-square test

1.1.2 Results for decisions and optimal value estimates

The decisions and various metrics discussed earlier are shown in Table 5. Although the prediction interval is listed in a symmetric way, the actual data points are asymmetric with respect to the estimated mean. The last two rows report the probability \(\gamma \) and the corresponding tolerance level \(\delta \), which are provided by the SD algorithm based on the theorems in Sect. 4. We choose 1% of the mean value of the validated objective to be \(\varDelta \) in Theorem 2. Once again, notice that for both DF/LP and EAE/SAA, we do not report any probability because we use a deterministic solver, as in (5).

The hypothesis test results for the recourse function objectives (the lowest diamond in Fig. 1b) for each model are reported in Table 6. The t-test rejected the DF/LP model as being appropriate for the given data, while the others were not rejected. The next two rows give the results of the variance test based on the F-statistic, and we conclude that none of the models can be rejected. We also performed a \(\chi ^2\) test for the recourse function objectives using the training and validation sets. Again, the DF/LP model was rejected, whereas the others were not.

Remark 3

The concept of cross-validation (\(\kappa \)-fold) is uncommon in the stochastic programming literature. With \(\kappa >1\), this tool is a computational embodiment of (15), and provides a prediction of error (Hastie et al. 2011). Without such \(\kappa \)-fold cross-validation (\(\kappa >1\)), model assessment can easily go awry. Although we have not explicitly reported on the case of \(\kappa =1\), we have observed that the EAE/SAA model can get rejected, even though with \(\kappa =5\) the EAE/SAA model is no longer rejected. This can be attributed to the fact that the variance reduction due to 5-fold cross-validation reduces the Type I error (when compared with \(\kappa =1\)).
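The \(\kappa \)-fold split used in such cross-validation can be sketched in plain NumPy; the fitting and validation steps are left as comments since they depend on the model q, and the sizes below are illustrative.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint validation folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n, kappa = 200, 5
folds = kfold_indices(n, kappa)
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    # ...fit the regression on `train`, then validate the PSP decision
    # on `fold`; the kappa fold-wise error estimates are averaged to
    # estimate the generalization error...
```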

Table 7 reports the estimated optimization error, as well as the generalization error, for all models. DF/LP shows the largest estimated optimization error, which indicates that it is not an appropriate model to recommend for this application. On the other hand, NDU/SD and NDC/SD have comparable and relatively small generalization errors. However, their estimated optimization errors appear to be significant; therefore NDU/SD and NDC/SD do not yield the most profitable decisions. Had the criterion been simply the generalization error (as in SL), we might have chosen non-profit-maximizing decisions.

In Table 8 we present the pairwise comparisons due to the Kruskal-Wallis test, which is a non-parametric method for testing similarities based on comparing counts in common bins for any pair of distributions. For the tests of DF/LP against the other methodologies, the p-values are all smaller than 0.01, which implies that there are significant differences between DF/LP and each of the other four approaches. The stepped curve in Fig. 2 illustrates the ordering discovered by the Kruskal-Wallis test. Moreover, the curves for NDU/SD and NDC/SD are relatively close, whereas EAE/SAA and EAE/SD are indistinguishable. These similarities are quantified in Table 8 by the fact that the p-values for these comparisons are greater than 0.01. Finally, EAE/SAA and EAE/SD give the largest objective value, which is also reported in Table 5. The Kruskal-Wallis test suggests that the difference between EAE/SAA and EAE/SD is not significant; therefore both EAE/SAA and EAE/SD provide the most profitable decisions.
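As an illustration, the Kruskal-Wallis test is available as `scipy.stats.kruskal`; the sketch below applies it to two synthetic samples of validated objectives (the numbers are stand-ins, not the paper's Table 8 data).

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Synthetic validated-objective samples for two hypothetical models.
obj_a = rng.normal(1000.0, 30.0, 100)
obj_b = rng.normal(1150.0, 30.0, 100)

stat, p = kruskal(obj_a, obj_b)
# Here p < 0.01, indicating a significant difference between the models.
```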

Table 5 LEO-Wyndor: comparison of solutions for alternative models
Table 6 LEO-Wyndor: hypothesis test results for alternative models
Fig. 2

LEO-Wyndor: stochastic dominance of validated objectives under alternative models

Table 7 LEO-Wyndor: errors for alternative models
Table 8 LEO-Wyndor: Kruskal-Wallis test results (\(p>0.01\))

Appendix II: algorithmic stochastic programming background

This Appendix is included for the sake of completeness, especially for those readers who are specialists in SL, but not SP.

1.1 Sample Average Approximation (SAA)

Sample average approximation is a standard sampling-based SP methodology, in which the expectation in the objective function is replaced by a sample average over a finite number of data points. Given a sample of size K, the SAA problem takes the form:

$$\begin{aligned} \min _{x\in {\mathbf {X}}} F_K(x)=c^\top x+\frac{1}{K}\sum _{j=1}^K h(x,\xi ^j). \end{aligned}$$
(22)
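A minimal runnable instance of (22), with a toy newsvendor recourse standing in for the paper's second-stage LP (all parameters below are our own illustrative choices): the SAA minimizer is located by a grid search.

```python
import numpy as np

rng = np.random.default_rng(0)
c, p = 1.0, 4.0                       # unit order cost, unit selling price
K = 5000
demand = rng.exponential(10.0, K)     # i.i.d. sample xi^1, ..., xi^K

def F_K(x):
    """SAA objective (22) for a toy newsvendor: first-stage cost c*x plus
    the sample average of the recourse h(x, xi) = -p * min(x, xi)."""
    return c * x + np.mean(-p * np.minimum(x, demand))

grid = np.linspace(0.0, 60.0, 2001)
x_saa = grid[np.argmin([F_K(x) for x in grid])]
```

For this recourse the true minimizer is the \((1-c/p)\)-quantile of demand (a standard newsvendor fact, included only as a sanity check), so the SAA minimizer lands near the corresponding sample quantile and converges to the true solution as K grows.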

As an overview, the SAA process may be summarized as follows.

[SAA algorithm summary figure]

Assumption SAA-a. The expectation function f(x) remains finite and well defined for all \(x\in {\mathbf {X}}\). For \(\delta > 0\) we denote by

$$\begin{aligned} S({\delta }) := \left\{ x\in {\mathbf {X}} : f(x)\le f^*+\delta \right\} ~~~\text {and}~~~ {\hat{S}}_K({\delta }) := \left\{ x\in {\mathbf {X}} : {\hat{f}}_K(x)\le {\hat{f}}_K^*+\delta \right\} , \end{aligned}$$

where \(f^*\) denotes the true optimal objective value, and \({\hat{f}}_K^*\) denotes the optimal objective value of the SAA problem with sample size K.

Assumption SAA-b. There exists a function \(\kappa : \varXi \rightarrow {\mathbb {R}}_+\) such that its moment-generating function \(M_\kappa (t)\) is finite valued for all t in a neighborhood of zero and

$$| F(x',\xi ) - F(x,\xi )| \le \kappa (\xi ) ||x'-x||$$

for a.e. \(\xi \in \varXi \) and all \(x', x \in {\mathbf {X}}\).

Assumption SAA-c. There exists a constant \(\lambda > 0\) such that for any \(x',x \in {\mathbf {X}} \) the moment-generating function \(M_{x',x}(t)\) of random variable \([F(x',\xi )-f(x')]-[F(x,\xi )-f(x)]\), satisfies

$$\begin{aligned} M_{x',x}(t)\le \exp (\lambda ^2 ||x'-x||^2 t^2 /2), \forall t\in {\mathbb {R}}. \end{aligned}$$
(23)

From assumption SAA-b, \(\Big |[F(x',\xi )-f(x')]-[F(x,\xi )-f(x)]\Big |\le 2L||x'-x||\) w.p. 1, in which case one may take \(\lambda =2L\).

Proposition 2

Suppose that assumptions SAA(a-c) hold, the feasible set \({\mathbf {X}}\) has a finite diameter D, and let \(\delta >0, \varDelta \in [0,\delta ), \epsilon \in (0,1)\), and \(L={\mathbb {E}}[\kappa (\xi )]\). Then for the sample size K satisfying

$$\begin{aligned} K(\epsilon , \delta )\ge \frac{8\lambda ^2D^2}{(\delta - \varDelta )^2 }\left[ n \ln \left( \frac{8 LD}{\delta -\varDelta }\right) +\ln \left( \frac{1}{1-\epsilon }\right) \right] , \end{aligned}$$
(24)

we have

$$\begin{aligned} \text {Pr}\left( {\hat{S}}_K(\varDelta ) \subset S(\delta ) \right) \ge \epsilon . \end{aligned}$$

Proof

This is Corollary 5.19 of Shapiro et al. (2009), together with the assumption that the sample size K is larger than that required by large deviations theory (see (5.122) of Shapiro et al. (2009)). \(\square \)
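To get a feel for (24), the sketch below evaluates the bound for illustrative constants (chosen by us, not taken from the paper's instance). Even modest values produce sample sizes in the millions, which is why such bounds are usually treated as qualitative guidance rather than practical prescriptions.

```python
import math

def saa_sample_size(eps, delta, Delta, lam, L, D, n):
    """Evaluate the right-hand side of the sample-size bound (24)."""
    gap = delta - Delta
    return (8.0 * lam**2 * D**2 / gap**2) * (
        n * math.log(8.0 * L * D / gap) + math.log(1.0 / (1.0 - eps))
    )

# Illustrative constants: probability eps = 0.95, tolerance delta = 0.1,
# Delta = 0, Lipschitz-type constants lam = 2, L = 1, diameter D = 10, n = 2.
K_bound = saa_sample_size(eps=0.95, delta=0.1, Delta=0.0,
                          lam=2.0, L=1.0, D=10.0, n=2)
```

Note that the bound grows quadratically in the diameter D (and further through the logarithm), so enlarging the feasible region sharply increases the prescribed sample size.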

1.2 Stochastic Decomposition (SD)

Unlike SAA, which separates sampling from optimization, SD is based on sequential sampling, and the method discovers sample sizes “on-the-fly” (Higle and Sen 1991, 1994). Because of sampling, any stochastic algorithm must contend with variance reduction in both objective values and solutions. SD uses M independent replications of value functions denoted \(f^\nu \), \(\nu = 1,\ldots ,M\). Each of these functions is a max-function whose affine pieces represent Sample Average Subgradient Approximations (SASA). Because SD uses proximal point iterations, Higle and Sen (1994) show that the maximum number of affine pieces is \(n_1 + 3\), where \(n_1\) is the number of first-stage decision variables, and these pieces are automatically identified using positive Lagrange multiplier estimates during the iterations. In theory, one can control the number of affine pieces to be smaller, and this too can be chosen “on-the-fly” depending on \(n_1\). When the number of affine pieces is reduced to one, the SD method reduces to a proximal stochastic variance reduction method (prox-SVRG) (Xiao and Zhang 2014). Among other strengths, such as parallelizability and variance reduction, one of the main usability hurdles which SD overcomes arises when a model has a large number of random variables (as in the case of multi-dimensional random variables): SD does not require users to choose a sample size, because the sequential process automatically discovers an appropriate one. Together with the notion of statistical optimality as set forth in Sect. 4, SD provides a very convenient optimization tool for PSP models.

For SLP models, Sen and Liu (2016) have already presented significant computational evidence of the advantage of SD over plain SAA. The reduced computational effort also facilitates replications for variance reduction (VR). VR in SD is achieved by creating the so-called compromise decision, denoted \({\mathbf {x}}^c\), which minimizes a grand-mean approximation \(\bar{F}_M (x) := \frac{1}{M}\sum _{\nu =1}^M f^\nu (x)\), where \(\{f^\nu \}_{\nu =1}^M\) denotes the terminal value function approximations of the replications. Suppose that the solutions satisfy \(x^\nu (\varepsilon )\in \varepsilon \text {-argmin}~\{f^\nu (x) ~|~ x \in {\mathbf {X}} \}\) and \({\mathbf {x}}^c(\delta )\in \delta \text {-argmin} \{{\bar{F}}_M(x) ~|~ x \in {\mathbf {X}}\}\). Then Sen and Liu (2016) have proved consistency in the sense that \({\bar{F}}_M({\mathbf {x}}^c(\delta )) \rightarrow f^*\) in probability as \(\delta \rightarrow 0\). Here are the critical assumptions of SD (Higle and Sen 1996b).

Assumption SD-a. The objective functions in the first and second stage models are either linear or quadratic, and all constraints are linear. Moreover, the set of first stage solutions is compact.

Assumption SD-b. The second stage optimization problem is feasible, and possesses a finite optimal value for all  \(x \in {\mathbf {X}}\), and outcomes \(\xi \) (i.e., the relatively complete recourse assumption holds).

Assumption SD-c. The second stage constraint functions are deterministic (i.e., fixed recourse), although the right hand side can be governed by random variables. The set of outcomes of the random variables is compact.

Assumption SD-d. The recourse function h is non-negative. So long as a lower bound on the optimal value is known, this assumption can be relaxed (Higle and Sen 1996b).

A high-level structure of the SD algorithm can be summarized as follows. For greater specifics regarding two-stage SD, we refer to Section 3.1 of Sen and Liu (2016), and for the multi-stage version we refer to Sen and Zhou (2014).

[SD algorithm summary figure]

The value function approximation for replication \(\nu \) is denoted \(f^\nu \), and the terminal solution for that replication is \({x}^\nu \). Note that we generate SASA functions using \(K_\nu (\varepsilon )\) observations. Since these observations are i.i.d., the in-sample stopping rule ensures that an unbiased estimate of the second-stage objective is used for the objective function estimate at \({x}^\nu \). Hence, the Central Limit Theorem (CLT) implies that \([K_\nu (\varepsilon )]^{\frac{1}{2}} [f({x}^\nu ) - f^\nu ({x}^\nu )]\) is asymptotically normal \({\mathbf {N}}(0, \sigma _{\nu }^2)\), where \(\sigma _{\nu }^2 < \infty \) denotes the variance of \(f^\nu ({x}^\nu )\). Since

$$\begin{aligned} N = \min _\nu K_\nu (\varepsilon ), \end{aligned}$$
(25)

it follows that the error \([f({x}^\nu ) - f^\nu ({x}^\nu )]\) is no greater than \(O_{p} (N^{-\frac{1}{2}})\). The following result provides the basis for compromise solutions \(\mathbf{x }^c\) as proved in Sen and Liu (2016).

Proposition 3

Suppose that assumptions SD(a-d) stated above hold. Suppose \(\bar{{\mathbf {x}}}\) is defined as in Theorem 1, and \({\mathbf {x}}^c = \bar{{\mathbf {x}}}\). Then,

(a) \({\mathbf {x}}^c\) solves

$$\begin{aligned} \underset{x \in {\mathbf {X}}}{\text {Min}} \quad \bar{F}_M (x) := \frac{1}{M} \sum _{\nu =1}^M f^\nu (x), \end{aligned}$$
(26)

(b)

$$\begin{aligned} f\left( {\mathbf {x}}^c\right) \le \bar{F}_M\left( {\mathbf {x}}^c\right) + O_p\left( (NM)^{-\frac{1}{2}}\right) , \end{aligned}$$
(27)

Proof

See Sen and Liu (2016). \(\square \)
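The grand-mean minimization (26) can be sketched with a toy one-dimensional example, in which each replication's terminal approximation \(f^\nu \) is a max of affine pieces, randomly perturbed to mimic replication noise; every number below is our own illustrative choice, not SD output.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5   # number of replications

# Each replication's terminal approximation f^nu is a max of affine
# pieces (intercept, slope); here noisy copies of a common function.
cuts = [[(-2.0 + rng.normal(0.0, 0.1), -1.0),
         ( 0.0 + rng.normal(0.0, 0.1),  1.0)] for _ in range(M)]

def f_nu(x, pieces):
    return max(a + b * x for a, b in pieces)

def F_bar(x):
    """Grand-mean approximation (26), averaged over the M replications."""
    return np.mean([f_nu(x, pieces) for pieces in cuts])

grid = np.linspace(-3.0, 3.0, 601)
x_c = grid[np.argmin([F_bar(x) for x in grid])]   # compromise decision
```

Averaging the replications before minimizing dampens the replication-to-replication noise in the cut intercepts, which is the variance-reduction effect behind the compromise decision.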


About this article


Cite this article

Deng, Y., Sen, S. Predictive stochastic programming. Comput Manag Sci 19, 65–98 (2022). https://doi.org/10.1007/s10287-021-00400-0

