
Predictive stochastic programming

Original Paper · Computational Management Science

Abstract

Several emerging applications call for a fusion of statistical learning and stochastic programming (SP). We introduce a new class of models which we refer to as Predictive Stochastic Programming (PSP). Unlike ordinary SP, PSP models work with datasets which represent random covariates, often referred to as predictors (or features) and responses (or labels) in the machine learning literature. As a result, PSP models call for methodologies which borrow relevant concepts from both learning and optimization. We refer to such a methodology as Learning Enabled Optimization (LEO). This paper sets forth the foundation for such a framework by introducing several novel concepts, such as statistical optimality, hypothesis tests for model-fidelity, the generalization error of PSP, and finally, a non-parametric methodology for model selection. These new concepts, collectively referred to as LEO, provide a formal framework for modeling, solving, validating, and reporting solutions of PSP models. We illustrate the LEO framework by applying it to a production-marketing coordination model, which combines a pedagogical production planning model with an advertising dataset intended for sales prediction.


Notes

  1. To see the distinction between PSP and LEO, note that the former is a model, while the latter is a workflow/algorithm.

  2. The actual dataset provided in James et al. (2013) also includes newspaper advertising. However, we have dropped it here to preserve the simplicity of the example.

  3. The numbers are the same as in James et al. (2013).

  4. The term “Acceptable” is the label for the diamond in both parts (a) and (b) of the figure.

  5. Standard acronyms like LP and DP will mean Linear Programming and Dynamic Programming. Similarly SLP means Stochastic LP.

  6. This paper appears to have been completed around the same time as the first submission of our paper.

  7. A simpler version of a PSP model is one where \(J = \emptyset \), and in that case, the predictors Z do not assume values with the same coordinates as the optimization variables (x).

  8. Readers unfamiliar with the context of Appendix II are well advised to familiarize themselves with it before reading the remainder of this section.

  9. We are grateful to Simge Kucukyavuz for this observation.

  10. Combining multiple linear regression and stochastic linear programming.

References

  • Atakan S, Gangammanavar H, Sen S (2021) Operations planning experiments for power systems with high renewable resources. USC Epstein Dept. Working Paper

  • Ban GY, El Karoui N, Lim AEB (2019) Machine learning and portfolio optimization. Manag Sci 64:1136–1154

  • Ban GY, Rudin C (2019) The big data newsvendor: practical insights from machine learning. Manag Sci 67:90–108

  • Bayraksan G, Morton DP (2011) A sequential sampling procedure for stochastic programming. Operations Res 59(4):898–913

  • Bayraksan G, Pierre-Louis P (2012) Fixed-width sequential stopping rules for a class of stochastic programs. SIAM J Optim 22(4):1518–1548

  • Ben-Tal A, Nemirovski A (2001) Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM, Philadelphia

  • Bertsekas DP (2012) Dynamic programming and optimal control, vol 2. Athena Scientific optimization and computation series. Athena Scientific

  • Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Manag Sci 66(3):1025–1044

  • Bertsimas D, King A (2016) OR Forum: an algorithmic approach to linear regression. Operations Res 64(1):2–16

  • Bertsimas D, Shioda R (2007) Classification and regression via integer optimization. Operations Res 55(2):252–271

  • Bertsimas D, Sim M (2004) The price of robustness. Operations Res 52(1):35–53

  • Birge JR, Louveaux F (2011) Introduction to stochastic programming. Springer Science+Business Media, Berlin

  • Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424

  • Boonmee C, Arimura M, Asada T (2017) Facility location optimization model for emergency humanitarian logistics. Int J Disaster Risk Reduct 24:485–498

  • de Farias DP, Van Roy B (2004) On constraint sampling in the linear programming approach to approximate dynamic programming. Math Operations Res 29(3):462–478

  • Deng Y, Liu J, Sen S (2018) Coalescing data and decisions for analytics. In: Doug S (ed) Tutorials in operations research: recent advances in optimization and modeling of contemporary problems, chap 2. INFORMS, Catonsville, MD, pp 20–49

  • Dye S (2009) Simple recourse problem. Springer, Boston, MA, pp 3571–3575

  • Frazier P (2012) Optimization via simulation with Bayesian statistics and dynamic programming. In: Proceedings of the Winter Simulation Conference, p 7

  • Freimer MB, Linderoth JT, Thomas DJ (2012) The impact of sampling methods on bias and variance in stochastic linear programs. Comput Optim Appl 51(1):51–75

  • Frey J (2013) Data-driven nonparametric prediction intervals. J Stat Plan Inference 143(6):1039–1048

  • Glynn PW, Infanger G (2013) Simulation-based confidence bounds for two-stage stochastic programs. Math Program 138(1–2):15–42

  • Hastie TJ, Tibshirani RJ, Friedman JH (2011) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin

  • Higle JL, Sen S (1991) Stochastic decomposition: an algorithm for two-stage linear programs with recourse. Math Operations Res 16(3):650–669

  • Higle JL, Sen S (1994) Finite master programs in regularized stochastic decomposition. Math Program 67(1–3):143–168

  • Higle JL, Sen S (1996a) Duality and statistical tests of optimality for two stage stochastic programs. Math Program 75(2):257–275

  • Higle JL, Sen S (1996b) Stochastic decomposition. Springer, Berlin

  • Hillier FS, Lieberman GJ (2012) Introduction to operations research. Tata McGraw-Hill Education, New York

  • Homem-de-Mello T, Bayraksan G (2014) Monte Carlo sampling-based methods for stochastic optimization. Surv Operations Res Manag Sci 19(1):56–85

  • James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning. Springer, Berlin

  • Jarque CM, Bera AK (1980) Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Econ Lett 6(3):255–259

  • Kao YH, Van Roy B, Yan X (2009) Directed regression. In: Proceedings of the Neural Information Processing Systems Conference (NIPS), pp 889–897

  • Kleywegt AJ, Shapiro A, Homem-de-Mello T (2002) The sample average approximation method for stochastic discrete optimization. SIAM J Optim 12(2):479–502

  • Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47(260):583–621

  • Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364

  • Linderoth J, Shapiro A, Wright S (2006) The empirical behavior of sampling methods for stochastic programming. Ann Operations Res 142(1):215–241

  • Liu J, Li G, Sen S (2021) Coupled learning enabled stochastic programming with endogenous uncertainty. Math Operations Res (accepted)

  • Liu J, Sen S (2020) Asymptotic results for two-stage stochastic quadratic programming. SIAM J Optim 30:823–854

  • Liyanage LH, Shanthikumar JG (2005) A practical inventory control policy using operational statistics. Operations Res Lett 33(4):341–348

  • Mak WK, Morton DP, Wood RK (1999) Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Res Lett 24(1):47–56

  • Miller N, Ruszczyński A (2011) Risk-averse two-stage stochastic linear programming: modeling and decomposition. Operations Res 59(1):125–132

  • Pawlowski N, Rajchl M, Glocker B (2017) Implicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297

  • Powell WB (2011) Approximate dynamic programming: solving the curses of dimensionality, vol 842. John Wiley & Sons, Hoboken

  • Rencher AC, Schaalje GB (2008) Linear models in statistics. Wiley Interscience, Hoboken

  • Rios I, Wets RJB, Woodruff DL (2015) Multi-period forecasting and scenario generation with limited data. Comput Manag Sci 12(2):267–295

  • Royset JO, Szechtman R (2013) Optimal budget allocation for sample average approximation. Operations Res 61(3):762–776

  • Royset JO, Wets RJB (2014) From data to assessments and decisions: epi-spline technology. In: Tutorials in operations research: bridging data and decisions. INFORMS, pp 27–53

  • Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Operations Res 60(1):180–195

  • Sen S, Liu Y (2016) Mitigating uncertainty via compromise decisions in two-stage stochastic linear programming: variance reduction. Operations Res 64(6):1422–1437

  • Sen S, Zhou Z (2014) Multistage stochastic decomposition: a bridge between stochastic programming and approximate dynamic programming. SIAM J Optim 24(1):127–153

  • Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming, vol 10. SIAM, Philadelphia

  • Shapiro A, Homem-de-Mello T (1998) A simulation-based approach to two-stage stochastic programming with recourse. Math Program 81(3):301–325

  • Van Parys BPG, Esfahani PM, Kuhn D (2020) From data to decisions: distributionally robust optimization is optimal. Manag Sci. https://doi.org/10.1287/mnsc.2020.3678

  • Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J Optim 24(4):2057–2075


Acknowledgements

This research was supported by grants from AFOSR (FA9550-20-1-0006), ONR (N00014-20-1-2077) and NSF (CMMI 1538605). This paper was presented at ICSP 2019. We thank Stein Wallace for hosting an entire session on PSP and LEO at the above-mentioned conference. We also thank Gabe Hackebeil for extending PySP functionality to allow SMPS output, which is necessary as input to our SD code (available from the authors, as well as the GitHub repository). We are grateful to the referees for their detailed reports, which have made the exposition much clearer than the previous version. We also appreciate the encouragement from Francesca Maggioni and Stein-Erik Fleten to include this paper as part of the CMS special issue. Finally, we thank Jiajun Xu for his feedback on this paper, and Yihang Zhang for suggesting (19) and confirming the computational results.

Author information

Correspondence to Suvrajeet Sen.


Appendices

Appendix I: Details of LEO-Wyndor and computational results

We consider the following statistical models in the computational results. The errors used for training (T) as well as validation (V) depend on the model index q, which in this case takes values \(q \in \{1, \ldots ,4 \}\) as shown below.

  1.

    (DF) For deterministic forecasts, we simply use the sales given by \({\hat{m}}_1(z) = {\hat{\beta }}_0 + {\hat{\beta }}_1 z_1 + {\hat{\beta }}_2 z_2\). Thus, for deterministic predictions, we do not account for errors in the forecast; i.e., training errors for DF are assumed to be zero. The remaining cases below will use errors based on the data.

  2.

    (NDU) One approximation of the sales forecast is one in which the correlations between the coefficients are ignored. The resulting model takes the form \(m_2(z, \xi ) := ({\hat{\beta }}_0 +\xi _0) + ({\hat{\beta }}_1 + \xi _1)z_1 + ({\hat{\beta }}_2 + \xi _2) z_2\), where \(z=(z_1, z_2)\) represents a generic outcome vector, and \((\xi _0, \xi _1, \xi _2)\) are generic error outcomes associated with normally distributed uncorrelated (NDU) random variables \(({\tilde{\xi }}_0, {\tilde{\xi }}_1, {\tilde{\xi }}_2)\) with mean zero and a diagonal variance matrix \(\varSigma _\beta \) as computed by the linear regression. The outcomes of errors associated with specific data points \((Z_i, W_i)\) are denoted \(\xi _i = (\xi _{i0}, \xi _{i1}, \xi _{i2})\) and can be obtained by solving the following problem.

    $$\begin{aligned}&\min _{\xi _i} ~ \xi _i^\top \varSigma _\beta ^{-1}\xi _i \end{aligned}$$
    (19a)
    $$\begin{aligned}&\left( {\hat{\beta }}_0+\xi _{i0}\right) +\left( {\hat{\beta }}_1 +\xi _{i1}\right) Z_{i1} + \left( {\hat{\beta }}_2 +\xi _{i2}\right) Z_{i2} = W_i \end{aligned}$$
    (19b)
  3.

    (NDC) This sales-forecast model allows correlations between the coefficients of \(m_3(z,\xi ) := ({\hat{\beta }}_0 + \xi _0) + ({\hat{\beta }}_1 + \xi _1)z_1 + ({\hat{\beta }}_2 + \xi _2) z_2\). Again, \((\xi _0, \xi _1, \xi _2)\) represent generic errors associated with generic outcomes z, whereas identifying the specific errors associated with each data point \((Z_i, W_i)\) requires determining \((\xi _{i0}, \xi _{i1}, \xi _{i2})\) by solving (19), although the variance-covariance matrix \(\varSigma _{\beta }\) is not diagonal in the case of NDC.

  4.

    (EAE) This is the Empirical Additive Error (EAE) model, where \( m_4(z,\xi _0)= ({\hat{\beta }}_0 + \xi _{0}) + {\hat{\beta }}_1 z_{1} + {\hat{\beta }}_2 z_{2}\), and \(\xi _0\) denotes a generic outcome of the error random variable. The specific outcomes are \(\xi _{i0} = W_i - m_4(Z_i, 0)\).

As in the setup for (5), the index q refers to the alternative error models (DF, NDU, NDC and EAE). Since all decision variables x (in this example) belong to the same space as outcomes of the random variable Z, we remind the reader to explicitly use the recommended decision \(x_q\) in defining the collection of validation errors by substituting \(Z=z=x_q\) for the model indexed by q.
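To make the recovery of error outcomes concrete, note that (19) is an equality-constrained quadratic program with the closed-form solution \(\xi _i = \varSigma _\beta a_i r_i / (a_i^\top \varSigma _\beta a_i)\), where \(a_i = (1, Z_{i1}, Z_{i2})\) and \(r_i = W_i - a_i^\top {\hat{\beta }}\). The sketch below illustrates this under hypothetical coefficient estimates and a diagonal covariance (the NDU case); none of the numbers come from the paper's instance.

```python
import numpy as np

# Hypothetical regression output (illustrative only; not the paper's
# estimates): coefficients and a diagonal covariance (the NDU case).
beta_hat = np.array([2.9, 0.046, 0.188])               # (beta0, beta1, beta2)
Sigma_beta = np.diag([0.3**2, 0.0014**2, 0.008**2])

def error_outcome(Z, W, beta_hat, Sigma):
    """Solve (19): min xi' Sigma^{-1} xi  s.t.  a'(beta_hat + xi) = W,
    with a = (1, Z1, Z2).  The KKT conditions give the closed form
    xi = Sigma a r / (a' Sigma a), where r = W - a' beta_hat."""
    a = np.array([1.0, Z[0], Z[1]])
    r = W - a @ beta_hat
    return Sigma @ a * (r / (a @ Sigma @ a))

a = np.array([1.0, 150.0, 25.0])
xi = error_outcome((150.0, 25.0), 15.0, beta_hat, Sigma_beta)
# The constraint (19b) holds at the returned xi.
```

For NDC the same formula applies with a non-diagonal \(\varSigma _\beta \).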

\({{\hbox {LEO-Wyndor Model:}}}\) Index Sets and Variables

\(i \equiv \) index of product, \(i\in \{A, B\}. \) \(y_i \equiv \) number of batches of product i produced.

$$\begin{aligned} f(x)&=-0.1x_1-0.5x_2+{\mathbb {E}}\left[ H\left( x, ~ {\tilde{\xi }}_q ~|~Z=z=x\right) \right] \end{aligned}$$
(20a)
$$\begin{aligned}&\text {s.t.} \quad x_1 + x_2 \le 200 \end{aligned}$$
(20b)
$$\begin{aligned}&x_1 - 0.5 x_2 \ge 0 \end{aligned}$$
(20c)
$$\begin{aligned}&L_1 := 0.7 \le x_1 \le U_1:=200, L_2 := 0 \le x_2 \le U_2 := 50 \end{aligned}$$
(20d)
$$\begin{aligned} h({x}, \xi _q ~|~ Z=z=x) \quad&= \quad \text {Max} \quad 3 y_A + 5 y_B \end{aligned}$$
(21a)
$$\begin{aligned} \text {s.t.} \quad y_A \quad \qquad&\le 8 \end{aligned}$$
(21b)
$$\begin{aligned} \quad \quad 2 y_B&\le 24 \end{aligned}$$
(21c)
$$\begin{aligned} \quad 3y_A + 2y_B&\le 36 \end{aligned}$$
(21d)
$$\begin{aligned} \quad y_A + y_B&\le m_q(z,\xi _q) \end{aligned}$$
(21e)
$$\begin{aligned} y_A, y_B&\ge 0. \end{aligned}$$
(21f)

As discussed in Sect. 5.1, \([L_1, U_1]\) and \([L_2, U_2]\) are available from the lower and upper limits of the regression data on covariates, and are assumed to allow decisions which are consistent with assumption A2. These values are the same as the upper and lower bounds for model (3a). Moreover, this instance is stated as a “maximization” model, whereas our previous discussions were set in the context of “minimization”. When interpreting the results, it helps to keep this distinction in mind. The PSP models presented above are relatively general, allowing very general regression models such as kernel-based methods, projection pursuit, and others. However, our current computational infrastructure is limited to stochastic linear programming (SLP), and as a result the regressions are restricted to MLR models, which are allowed to have one or more coefficients that are random and possibly correlated.
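To see how a single recourse evaluation works, the sketch below solves the second-stage LP (21) for a given realized demand using SciPy's LP solver. This illustrates only one scenario evaluation, not the full SAA/SD machinery, and the demand values passed in are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(m):
    """Second-stage LP (21) for a realized demand m:
    max 3 y_A + 5 y_B  s.t.  y_A <= 8,  2 y_B <= 24,
    3 y_A + 2 y_B <= 36,  y_A + y_B <= m,  y >= 0."""
    c = [-3.0, -5.0]                        # linprog minimizes, so negate
    A_ub = [[1, 0], [0, 2], [3, 2], [1, 1]]
    b_ub = [8, 24, 36, m]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
    return -res.fun

tight = recourse_value(10.0)    # demand constraint (21e) binds
slack = recourse_value(50.0)    # demand constraint is slack
```

When the realized demand exceeds the capacity implied by (21b)-(21d), constraint (21e) is slack and the recourse value saturates at its capacity-limited maximum.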

1.1 Computational results

We now present computations for the LEO-Wyndor data under alternative models. DF/LP represents predictive stochastic programming using deterministic forecasts, in which we use the expected value of the linear regression as the demand model. Because such a setup assumes all coefficients to be deterministic, the resulting model is a linear program. Of course, one can also study other models where linear regression suggests alternative parameters: (a) the additive scalar error model (\({\tilde{\xi }}_0\)) using the empirical additive errors (EAE) and deterministic model coefficients \({\hat{\beta }}_0, {\hat{\beta }}_1, {\hat{\beta }}_2\), where \({\hat{\beta }}_1\) is the coefficient for TV expenditures, and \({\hat{\beta }}_2\) is the coefficient for radio expenditures; (b) a linear regression whose coefficients are random variables \(\{{\tilde{\beta }}_j\}\), which are normally distributed and uncorrelated (NDU); (c) a linear regression whose coefficients are multi-variate random variables \(\{{\tilde{\beta }}_\tau \}\), which are possibly correlated (NDC). We reiterate that all three models EAE, NDU, NDC correspond to specific types of errors (which are indexed by q in the presentation in Sect. 3). Note that for models NDU and NDC, we have continuous random variables, and as a result we adopted SD as the solution methodology because it manages discrete and continuous random variables with equal ease. We refer to these results by NDU/SD and NDC/SD. However, note that for the case of EAE, the dataset is finite and reasonably manageable. Hence we use both SAA and SD for this model, and refer to them by EAE/SAA and EAE/SD.

1.1.1 Results for alternative models (Error Terms)

The calculations begin with the first test, shown as the top diamond block in Fig. 1b. Table 4 shows the p-values and results of the \(\chi ^2\) test for NDU/SD, NDC/SD and EAE. From the values reported in Table 4, the fit appears to improve when a few of the data points near the boundary are eliminated. Since we retain more than 95% of the available training data, we accept the revised dataset as being appropriate for training.
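A minimal sketch of this kind of model-fidelity check, using synthetic errors and SciPy's \(\chi ^2\) test; the equal-mass binning scheme below is our own illustrative choice, not the paper's.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
train_err = rng.normal(0.0, 1.0, 160)   # synthetic training-set errors
valid_err = rng.normal(0.0, 1.0, 40)    # synthetic validation-set errors

# Bin edges at training-error quantiles, so each bin holds equal training mass.
edges = np.quantile(train_err, np.linspace(0.0, 1.0, 6))
edges[0], edges[-1] = -np.inf, np.inf
obs = np.histogram(valid_err, bins=edges)[0]
exp = np.full(5, len(valid_err) / 5.0)  # expected counts under model fidelity

stat, p = chisquare(obs, exp)
# A small p-value would reject the hypothesis that the validation errors
# follow the error distribution fitted on the training set.
```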

Table 4 LEO-Wyndor: comparison of Chi-square test

1.1.2 Results for decisions and optimal value estimates

The decisions and various metrics discussed earlier are shown in Table 5. Although the prediction interval is listed in a symmetric way, the actual data points are asymmetric with respect to the estimated mean. The last two rows report the probability \(\gamma \) and the corresponding tolerance level \(\delta \), which are provided by the SD algorithm based on the theorems in Sect. 4. We choose 1% of the mean value of the validated objective to be \(\varDelta \) in Theorem 2. Once again, notice that for both DF/LP and EAE/SAA, we do not report any probability because we use a deterministic solver, as in (5).

The hypothesis test results for the recourse function objectives (the lowest diamond in Fig. 1b) for each model are reported in Table 6. The t-test rejected the DF/LP model as being appropriate for the given data, while the others were not rejected. The next two rows give the results of the variance test based on the F-statistic, and we conclude that none of the models can be rejected. We also performed a \(\chi ^2\) test for the recourse function objectives using the training and validation sets. Again, the DF/LP model was rejected, whereas the others were not.

Remark 3

The concept of cross-validation (\(\kappa \)-fold) is uncommon in the stochastic programming literature. With \(\kappa >1\), this tool is a computational embodiment of (15), and provides a prediction of error (Hastie et al. 2011). Without such \(\kappa \)-fold cross-validation (\(\kappa >1\)), model assessment can easily go awry. Although we have not explicitly reported on the case of \(\kappa =1\), we have observed that the EAE/SAA model can get rejected, even though with \(\kappa =5\) the EAE/SAA model is no longer rejected. This can be attributed to the fact that the variance reduction due to 5-fold cross-validation reduces the Type I error (when compared with \(\kappa =1\)).
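The \(\kappa \)-fold split used in such cross-validation can be sketched in plain NumPy; the fitting and validation steps are left as comments since they depend on the model q, and the sizes below are illustrative.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint validation folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n, kappa = 200, 5
folds = kfold_indices(n, kappa)
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    # ...fit the regression on `train`, then validate the PSP decision
    # on `fold`; the kappa fold-wise error estimates are averaged to
    # estimate the generalization error...
```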

Table 7 reports the estimated optimization error, as well as the generalization error, for all models. DF/LP shows the largest estimated optimization error, which indicates that it is not an appropriate model to recommend for this application. On the other hand, NDU/SD and NDC/SD have comparable and relatively small generalization errors. However, their estimated optimization errors appear to be significant; therefore NDU/SD and NDC/SD do not yield the most profitable decisions. Had the criterion been simply the generalization error (as in SL), we might have chosen non-profit-maximizing decisions.

In Table 8 we present the pairwise comparisons due to the Kruskal-Wallis test, which is a non-parametric method for testing similarities based on comparing counts in common bins for any pair of distributions. For the tests of DF/LP against the other methodologies, the p-values are all smaller than 0.01, which implies that there are significant differences between DF/LP and each of the other four approaches. The stepped curve in Fig. 2 illustrates the ordering discovered by the Kruskal-Wallis test. Moreover, the curves for NDU/SD and NDC/SD are relatively close, whereas EAE/SAA and EAE/SD are indistinguishable. These similarities are quantified in Table 8 by the fact that the p-values for these comparisons are greater than 0.01. Finally, EAE/SAA and EAE/SD give the largest objective value, which is also reported in Table 5. The Kruskal-Wallis test suggests that the difference between EAE/SAA and EAE/SD is not significant; therefore both EAE/SAA and EAE/SD provide the most profitable decisions.
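As an illustration, the Kruskal-Wallis test is available as `scipy.stats.kruskal`; the sketch below applies it to two synthetic samples of validated objectives (the numbers are stand-ins, not the paper's Table 8 data).

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Synthetic validated-objective samples for two hypothetical models.
obj_a = rng.normal(1000.0, 30.0, 100)
obj_b = rng.normal(1150.0, 30.0, 100)

stat, p = kruskal(obj_a, obj_b)
# Here p < 0.01, indicating a significant difference between the models.
```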

Table 5 LEO-Wyndor: comparison of solutions for alternative models
Table 6 LEO-Wyndor: hypothesis test results for alternative models
Fig. 2

LEO-Wyndor: stochastic dominance of validated objectives under alternative models

Table 7 LEO-Wyndor: errors for alternative models
Table 8 LEO-Wyndor: Kruskal-Wallis test results (\(p>0.01\))

Appendix II: algorithmic stochastic programming background

This Appendix is included for the sake of completeness, especially for those readers who are specialists in SL, but not SP.

1.1 Sample Average Approximation (SAA)

Sample average approximation is a standard sampling-based SP methodology, in which the expectation in the objective function is replaced by a sample average over a finite number of data points. Given a sample of size K, the SAA problem takes the form:

$$\begin{aligned} \min _{x\in {\mathbf {X}}} F_K(x)=c^\top x+\frac{1}{K}\sum _{j=1}^K h(x,\xi ^j). \end{aligned}$$
(22)
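A minimal runnable instance of (22), with a toy newsvendor recourse standing in for the paper's second-stage LP (all parameters below are our own illustrative choices): the SAA minimizer is located by a grid search.

```python
import numpy as np

rng = np.random.default_rng(0)
c, p = 1.0, 4.0                       # unit order cost, unit selling price
K = 5000
demand = rng.exponential(10.0, K)     # i.i.d. sample xi^1, ..., xi^K

def F_K(x):
    """SAA objective (22) for a toy newsvendor: first-stage cost c*x plus
    the sample average of the recourse h(x, xi) = -p * min(x, xi)."""
    return c * x + np.mean(-p * np.minimum(x, demand))

grid = np.linspace(0.0, 60.0, 2001)
x_saa = grid[np.argmin([F_K(x) for x in grid])]
```

For this recourse the true minimizer is the \((1-c/p)\)-quantile of demand (a standard newsvendor fact, included only as a sanity check), so the SAA minimizer lands near the corresponding sample quantile and converges to the true solution as K grows.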

As an overview, the SAA process may be summarized as follows.

[SAA algorithm summary figure]

Assumption SAA-a. The expectation function f(x) remains finite and well defined for all \(x\in {\mathbf {X}}\). For \(\delta > 0\) we denote by

$$\begin{aligned} S({\delta }) := \left\{ x\in {\mathbf {X}} : f(x)\le f^*+\delta \right\} ~~~\text {and}~~~ {\hat{S}}_K({\delta }) := \left\{ x\in {\mathbf {X}} : {\hat{f}}_K(x)\le {\hat{f}}_K^*+\delta \right\} , \end{aligned}$$

where \(f^*\) denotes the true optimal objective value, and \({\hat{f}}_K^*\) denotes the optimal objective value of the SAA problem with sample size K.

Assumption SAA-b. There exists a function \(\kappa : \varXi \rightarrow {\mathbb {R}}_+\) such that its moment-generating function \(M_\kappa (t)\) is finite valued for all t in a neighborhood of zero and

$$| F(x',\xi ) - F(x,\xi )| \le \kappa (\xi ) ||x'-x||$$

for a.e. \(\xi \in \varXi \) and all \(x', x \in {\mathbf {X}}\).

Assumption SAA-c. There exists a constant \(\lambda > 0\) such that for any \(x',x \in {\mathbf {X}} \) the moment-generating function \(M_{x',x}(t)\) of random variable \([F(x',\xi )-f(x')]-[F(x,\xi )-f(x)]\), satisfies

$$\begin{aligned} M_{x',x}(t)\le \exp (\lambda ^2 ||x'-x||^2 t^2 /2), \forall t\in {\mathbb {R}}. \end{aligned}$$
(23)

From assumption SAA-b, \(\Big |[F(x',\xi )-f(x')]-[F(x,\xi )-f(x)]\Big |\le 2L||x'-x||\) w.p. 1, in which case one may take \(\lambda =2L\).

Proposition 2

Suppose that assumptions SAA(a-c) hold, the feasible set \({\mathbf {X}}\) has a finite diameter D, and let \(\delta >0, \varDelta \in [0,\delta ), \epsilon \in (0,1)\), and \(L={\mathbb {E}}[\kappa (\xi )]\). Then for the sample size K satisfying

$$\begin{aligned} K(\epsilon , \delta )\ge \frac{8\lambda ^2D^2}{(\delta - \varDelta )^2 }\left[ n \ln \left( \frac{8 LD}{\delta -\varDelta }\right) +\ln \left( \frac{1}{1-\epsilon }\right) \right] , \end{aligned}$$
(24)

we have

$$\begin{aligned} \text {Pr}\left( {\hat{S}}_K(\varDelta ) \subset S(\delta ) \right) \ge \epsilon . \end{aligned}$$

Proof

This is Corollary 5.19 of Shapiro et al. (2009), together with the assumption that the sample size K is larger than that required by large deviations theory (see (5.122) of Shapiro et al. (2009)). \(\square \)
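To get a feel for (24), the sketch below evaluates the bound for illustrative constants (chosen by us, not taken from the paper's instance). Even modest values produce sample sizes in the millions, which is why such bounds are usually treated as qualitative guidance rather than practical prescriptions.

```python
import math

def saa_sample_size(eps, delta, Delta, lam, L, D, n):
    """Evaluate the right-hand side of the sample-size bound (24)."""
    gap = delta - Delta
    return (8.0 * lam**2 * D**2 / gap**2) * (
        n * math.log(8.0 * L * D / gap) + math.log(1.0 / (1.0 - eps))
    )

# Illustrative constants: probability eps = 0.95, tolerance delta = 0.1,
# Delta = 0, Lipschitz-type constants lam = 2, L = 1, diameter D = 10, n = 2.
K_bound = saa_sample_size(eps=0.95, delta=0.1, Delta=0.0,
                          lam=2.0, L=1.0, D=10.0, n=2)
```

Note that the bound grows quadratically in the diameter D (and further through the logarithm), so enlarging the feasible region sharply increases the prescribed sample size.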

1.2 Stochastic Decomposition (SD)

Unlike SAA, which separates sampling from optimization, SD is based on sequential sampling, and the method discovers sample sizes “on-the-fly” (Higle and Sen 1991, 1994). Because of sampling, any stochastic algorithm must contend with variance reduction in both objective values and solutions. SD uses M independent replications of value functions denoted \(f^\nu \), \(\nu = 1,\ldots ,M\). Each of these functions is a max-function whose affine pieces represent Sample Average Subgradient Approximations (SASA). Because SD uses proximal point iterations, Higle and Sen (1994) show that the maximum number of affine pieces is \(n_1 + 3\), where \(n_1\) is the number of first-stage decision variables, and these pieces are automatically identified using positive Lagrange multiplier estimates during the iterations. In theory, one can control the number of affine pieces to be smaller, and this too can be chosen “on-the-fly” depending on \(n_1\). When the number of affine pieces is reduced to one, the SD method reduces to a proximal stochastic variance reduction method (prox-SVRG) (Xiao and Zhang 2014). Among other strengths, such as parallelizability and variance reduction, one of the main usability hurdles which SD overcomes arises when a model has a large number of random variables (as in the case of multi-dimensional random variables): SD does not require users to choose a sample size, because the sequential process automatically discovers an appropriate one. Together with the notion of statistical optimality as set forth in Sect. 4, SD provides a very convenient optimization tool for PSP models.

For SLP models, Sen and Liu (2016) have already presented significant computational evidence of the advantage of SD over plain SAA. The reduced computational effort also facilitates replications for variance reduction (VR). VR in SD is achieved by creating the so-called compromise decision, denoted \({\mathbf {x}}^c\), which minimizes a grand-mean approximation \(\bar{F}_M (x) := \frac{1}{M}\sum _{\nu =1}^M f^\nu (x)\), where \(\{f^\nu \}_{\nu =1}^M\) denotes the terminal value function approximations of the replications. Suppose that the solutions satisfy \(x^\nu (\varepsilon )\in \varepsilon \text {-argmin}~\{f^\nu (x) ~|~ x \in {\mathbf {X}} \}\) and \({\mathbf {x}}^c(\delta )\in \delta \text {-argmin} \{{\bar{F}}_M(x) ~|~ x \in {\mathbf {X}}\}\). Then Sen and Liu (2016) have proved consistency in the sense that \({\bar{F}}_M({\mathbf {x}}^c(\delta )) \rightarrow f^*\) in probability as \(\delta \rightarrow 0\). Here are the critical assumptions of SD (Higle and Sen 1996b).

Assumption SD-a. The objective functions in the first and second stage models are either linear or quadratic, and all constraints are linear. Moreover, the set of first stage solutions is compact.

Assumption SD-b. The second stage optimization problem is feasible, and possesses a finite optimal value for all  \(x \in {\mathbf {X}}\), and outcomes \(\xi \) (i.e., the relatively complete recourse assumption holds).

Assumption SD-c. The second stage constraint functions are deterministic (i.e., fixed recourse), although the right hand side can be governed by random variables. The set of outcomes of the random variables is compact.

Assumption SD-d. The recourse function h is non-negative. So long as a lower bound on the optimal value is known, this assumption can be relaxed (Higle and Sen 1996b).

A high-level structure of the SD algorithm can be summarized as follows. For greater specifics regarding two-stage SD, we refer to Section 3.1 of Sen and Liu (2016), and for the multi-stage version we refer to Sen and Zhou (2014).

[SD algorithm summary figure]

The value function approximation for replication \(\nu \) is denoted \(f^\nu \), and the terminal solution for that replication is \({x}^\nu \). Note that we generate SASA functions using \(K_\nu (\varepsilon )\) observations. Since these observations are i.i.d., the in-sample stopping rule ensures that an unbiased estimate of the second-stage objective is used for the objective function estimate at \({x}^\nu \). Hence, the Central Limit Theorem (CLT) implies that \([K_\nu (\varepsilon )]^{\frac{1}{2}} [f({x}^\nu ) - f^\nu ({x}^\nu )]\) is asymptotically normal \({\mathbf {N}}(0, \sigma _{\nu }^2)\), where \(\sigma _{\nu }^2 < \infty \) denotes the variance of \(f^\nu ({x}^\nu )\). Since

$$\begin{aligned} N = \min _\nu K_\nu (\varepsilon ), \end{aligned}$$
(25)

it follows that the error \([f({x}^\nu ) - f^\nu ({x}^\nu )]\) is no greater than \(O_{p} (N^{-\frac{1}{2}})\). The following result provides the basis for compromise solutions \(\mathbf{x }^c\) as proved in Sen and Liu (2016).

Proposition 3

Suppose that assumptions SD(a-d) stated above hold. Suppose \(\bar{{\mathbf {x}}}\) is defined as in Theorem 1, and \({\mathbf {x}}^c = \bar{{\mathbf {x}}}\). Then,

(a) \({\mathbf {x}}^c\) solves

$$\begin{aligned} \underset{x \in {\mathbf {X}}}{\text {Min}} \quad \bar{F}_M (x) := \frac{1}{M} \sum _{\nu =1}^M f^\nu (x), \end{aligned}$$
(26)

(b)

$$\begin{aligned} f\left( {\mathbf {x}}^c\right) \le \bar{F}_M\left( {\mathbf {x}}^c\right) + O_p\left( (NM)^{-\frac{1}{2}}\right) , \end{aligned}$$
(27)

Proof

See Sen and Liu (2016). \(\square \)
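The grand-mean minimization (26) can be sketched with a toy one-dimensional example, in which each replication's terminal approximation \(f^\nu \) is a max of affine pieces, randomly perturbed to mimic replication noise; every number below is our own illustrative choice, not SD output.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5   # number of replications

# Each replication's terminal approximation f^nu is a max of affine
# pieces (intercept, slope); here noisy copies of a common function.
cuts = [[(-2.0 + rng.normal(0.0, 0.1), -1.0),
         ( 0.0 + rng.normal(0.0, 0.1),  1.0)] for _ in range(M)]

def f_nu(x, pieces):
    return max(a + b * x for a, b in pieces)

def F_bar(x):
    """Grand-mean approximation (26), averaged over the M replications."""
    return np.mean([f_nu(x, pieces) for pieces in cuts])

grid = np.linspace(-3.0, 3.0, 601)
x_c = grid[np.argmin([F_bar(x) for x in grid])]   # compromise decision
```

Averaging the replications before minimizing dampens the replication-to-replication noise in the cut intercepts, which is the variance-reduction effect behind the compromise decision.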


About this article


Cite this article

Deng, Y., Sen, S. Predictive stochastic programming. Comput Manag Sci 19, 65–98 (2022). https://doi.org/10.1007/s10287-021-00400-0

