Stochastics and Statistics
A study of data-driven distributionally robust optimization with incomplete joint data under finite support
Introduction
This paper addresses the issue of missing (incomplete) data in data-driven stochastic programming. Stochastic programming is an important framework for optimization under uncertainty. It generally assumes that a probability distribution of the random variable is available and seeks a solution that is optimal in terms of expected performance. In practice, the distribution of the random variable is often unknown; consequently, data-driven approaches have been proposed to solve this problem. These data-driven methods work well only if well-conditioned historical data for the random variable are available. However, if the random variable is multidimensional, complete joint data are often hard to obtain in many real-world settings due to the following issues:
- Missing data in some dimensions of the random variable.
- Limited sharing of data among dimensions representing different components.
- Different sizes of data in different dimensions.
Missing data is a common issue for practical operations research (OR) problems as data sets for many applications are incomplete. For example, a common challenge in portfolio optimization theory is that the historical data of the returns for assets have missing values (Radulescu, 2013, Taylor, 2006). The data in many other application domains are also incomplete, including large-scale transportation management systems (TMSs) used to monitor traffic conditions to improve traffic congestion. The mobility monitoring program of the Texas Transportation Institute (TTI) reports that after screening erroneous data, TMS data archives can be anywhere from to complete (Smith, Scherer, & Conklin, 2003).
The most popular method for handling missing data is data imputation. In this approach, missing values are imputed before optimization and other analyses are performed on the completed data set. Many variants of data imputation have been proposed in the literature (Dempster, Laird, Rubin, 1977, Little, Rubin, 2019, Stekhoven, Bühlmann, 2012, Troyanskaya, Cantor, Sherlock, Brown, Hastie, Tibshirani, et al., 2001, Wang, Li, Jiang, Feng, 2006). In general, machine learning tools are designed to minimize prediction error and do not consider how the predictions will affect the downstream optimization problem (Elmachtoub & Grigas, 2022). Using data-imputation methods for data-driven stochastic optimization with incomplete data suffers from two major issues. First, many recent studies show that estimate-then-optimize methods can give suboptimal solutions (Delage, Ye, 2010, Esfahani, Kuhn, 2018, Gupta, Rusmevichientong, 2021, Liyanage, Shanthikumar, 2005). For example, Liyanage & Shanthikumar (2005) analyze the newsvendor inventory control problem with ambiguous demand, show that the estimate-then-optimize approach leads to a suboptimal inventory policy, and present a better solution by integrating the estimation and optimization tasks. Second, theoretical guarantees are hard to obtain because the analysis of the missing data and the derivation of the stochastic program's optimal solutions are conducted separately.
We propose an integrated approach, a distributionally robust optimization (DRO) framework, that combines the estimation and optimization steps for data-driven stochastic programming with incomplete data. In contrast to the existing missing-data literature, our approach does not attempt to estimate the missing values. Instead, its goal is to solve the corresponding data-driven stochastic program directly: we quantify the uncertainty introduced by the missing values and incorporate it into the stochastic program through the DRO framework.
This work also extends the DRO literature. DRO (Delage, Ye, 2010, Esfahani, Kuhn, 2018, Gao, Chen, & Kleywegt, Jiang, Guan, 2018, Zhao, Zhang, 2019) is a powerful modeling paradigm that combines the estimation and the optimization steps. This approach first constructs an ambiguity set from the available data set; optimization techniques are then used to solve the model with respect to the worst-case distribution within the ambiguity set. However, to the best of our knowledge, existing DRO work has not considered ambiguity sets built from incomplete data sets. Zhao & Zhang (2019) do consider a missing data problem that arises in incomplete trajectory data, but their main goal is to reconstruct the missing location-duration path choices, and their ambiguity set is still based on complete historical data.
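The two-step DRO recipe described above can be written compactly. The following is a generic sketch rather than this paper's exact formulation: the symbols $x$ (decision), $X$ (feasible region), $\xi$ (random variable), $h$ (cost function), $P$ (distribution), and $\mathcal{A}$ (ambiguity set) are assumed here for illustration and are not taken from the source.

```latex
% Nominal stochastic program: the distribution P of \xi is assumed known.
\min_{x \in X} \; \mathbb{E}_{P}\left[ h(x, \xi) \right]

% DRO counterpart: P is only known to lie in an ambiguity set \mathcal{A}
% built from data, and the decision hedges against the worst case in \mathcal{A}.
\min_{x \in X} \; \sup_{P \in \mathcal{A}} \; \mathbb{E}_{P}\left[ h(x, \xi) \right]
```

When the support is finite, $P$ is a probability vector, so the inner supremum is a finite-dimensional optimization problem over a subset of the probability simplex.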
The main contributions of this paper are as follows. It proposes a DRO framework for data-driven optimization problems with incomplete (partially observable) data sets. We consider data distributions with finite support and assume that only partially observed data are available, meaning that the components of each data point are randomly missing. We present several classes of ambiguity sets based on the incomplete data for our DRO framework and discuss their properties. We provide finite sample guarantees for our DRO model by incorporating the observed information matrix (Efron & Hinkley, 1978) into our analysis. We also prove statistical consistency using the properties of maximum likelihood estimation. Tractable reformulations of the models are presented. Finally, we conduct computational studies on both synthetic and real-world data to evaluate the performance of the proposed approaches against data-imputation-based approaches. Below, we highlight the details of the contributions.
1. A new DRO framework based on incomplete data is proposed. The proposed DRO framework is fundamentally different from the popular data-imputation-based methods. It is an integrated model that addresses the missing data problem and the stochastic program simultaneously instead of following the estimate-then-optimize procedure. By adopting a DRO framework, the presented models are robust to the uncertainty of the missing values. Therefore, they greatly improve the out-of-sample performance in applications where the optimal solutions are sensitive to unknown parameters.
2. Our DRO framework extends the current studies on DRO by proposing ambiguity sets that are constructed directly from the incomplete data set. We construct several types of ambiguity sets based on the $\phi$-divergence, the Wasserstein metric, the L1 norm, and ellipsoids in the probability space. The centers of these ambiguity sets are chosen as the maximum likelihood estimators (MLE) based on the incomplete data set. The first two kinds of ambiguity sets are inspired by two general metrics used in the DRO literature. The last one is inspired by the special structures in our model. We discuss why the ellipsoid ambiguity set is asymptotically optimal in the sense that it contains the complete-data (true) distribution with the highest probability among all ambiguity sets having the same volume.
3. We obtain theoretical guarantees and tractable reformulations for the proposed models. We first derive finite sample guarantees for our model by providing a probabilistic upper bound on its out-of-sample performance. Our analyses are based on asymptotic normality and the observed information matrix (empirical Fisher information). We then prove a statistical consistency guarantee, which means that the solution of our model converges in probability to the true optimal solution as the number of observed data points goes to infinity. Finally, we show that the reformulations can be solved efficiently if the cost functions of the original stochastic program are convex and the feasible regions are convex or mixed-integer linear sets.
4. We provide computational experiments for evaluation using both synthetic and real-world data. Two applications are studied: the multi-item inventory control problem and portfolio optimization. For the multi-item inventory control problem, we illustrate the reductions in total cost achieved by our models on synthetic data. For portfolio optimization, we benchmark our frameworks against the data-imputation-based method using 57 pairs of training sets and test sets. These sets are obtained from real-world historical returns of exchange-traded funds (ETFs) and the US central bank (FED) rate of return from 2006 to 2016 (Boyd, 2019). The proposed models consistently yield better out-of-sample performance, which supports the claimed improvements.
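To make the pipeline in the contributions more concrete, the following minimal Python sketch estimates a joint distribution on a finite support from rows with missing components, then evaluates a worst-case expected cost over an L1-norm ball centered at that estimate. It is an illustration under stated assumptions, not the authors' implementation: the function names `em_joint_pmf` and `worst_case_expectation_l1` are hypothetical, the EM iteration is the textbook one for incomplete categorical data with components missing completely at random, and the radius is a user-supplied input rather than the calibrated value derived in the paper.

```python
import numpy as np

def em_joint_pmf(obs, shape, n_iter=200):
    """EM estimate of a joint pmf on a finite grid from rows with missing entries.

    obs   : list of tuples; each entry is an index into that dimension's
            support, or None when the component is missing (assumed MCAR).
    shape : number of support points per dimension, e.g. (2, 3).
    """
    p = np.full(shape, 1.0 / np.prod(shape))  # start from the uniform pmf
    for _ in range(n_iter):
        counts = np.zeros(shape)
        for row in obs:
            # Cells compatible with the observed components of this row.
            idx = tuple(slice(None) if v is None else v for v in row)
            block = p[idx]
            # E-step: spread one unit of mass over the compatible cells.
            counts[idx] += block / block.sum()
        p = counts / len(obs)  # M-step: normalized expected counts
    return p

def worst_case_expectation_l1(p, cost, radius):
    """Largest expected cost over pmfs q with ||q - p||_1 <= radius.

    Greedy solution: move up to radius / 2 of probability mass onto the
    most expensive scenario, taking it from the cheapest scenarios first.
    """
    q = np.asarray(p, dtype=float).ravel().copy()
    c = np.asarray(cost, dtype=float).ravel()
    top = int(c.argmax())
    budget = min(radius / 2.0, 1.0 - q[top])
    q[top] += budget
    for i in np.argsort(c):  # cheapest scenarios first
        if budget <= 0:
            break
        if i == top:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    return float(q @ c)

# Hypothetical data: two binary components, second one missing in two rows.
obs = [(0, 0), (0, 0), (1, 1), (1, 1), (0, None), (1, None)]
p_hat = em_joint_pmf(obs, shape=(2, 2))
cost = np.array([[1.0, 4.0], [2.0, 8.0]])  # cost of each joint scenario
robust_cost = worst_case_expectation_l1(p_hat, cost, radius=0.2)
```

In a full DRO model this inner worst-case evaluation would be embedded in the outer minimization over decisions; the sketch only evaluates it for one fixed cost vector.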
The rest of the paper is organized as follows. We briefly review related works in Section 2. The background and notations are introduced in Section 3. We present the main results in Section 4. More specifically, in Section 4.1, we propose one ambiguity set, and we prove the corresponding statistical consistency and finite sample guarantees in Sections 4.2 and 4.3, respectively. We also discuss its reformulations in Section 4.4 and different classes of ambiguity sets in Section 4.5. Computational studies are summarized in Section 5. Section 6 concludes the paper. All the proofs and extensions to two-stage stochastic programming are provided in the Appendix.
Related works
This section briefly reviews related works about missing data and data-driven optimization under uncertainty.
Missing data. Missing or incomplete data has been studied widely, especially in the machine learning and statistics literature (García-Laencina, Sancho-Gómez, Figueiras-Vidal, 2010, Goodfellow, Bengio, Courville, 2016, Rubin, 1976). One of the most natural ways to handle missing data is to discard any data points that include missing values. However, this approach may lead to biased
Notations and background
Throughout this paper, we use the following notations and assumptions. A stochastic program can be formulated as the minimization, over the feasible region of the decision, of the expected value of a cost function that depends on the decision and the random variable. In this paper, we study the problem whose random variable has known finite support. Finite discrete support sets are common and studied in many operations research problems (Barbarosoğlu, Arda, 2004, Feng, Ryan, 2013, Rujeerapaiboon, Schindler, Kuhn,
Main model
In this section, we first introduce and discuss our model based on one specific ambiguity set. Then, we introduce the concepts of asymptotic normality and the observed information matrix to obtain the theoretical guarantees, which contain two main points. First, we prove the consistency result in Section 4.2. The consistency follows the definition in Van der Vaart (2000) meaning that the solution of our model converges in probability to the true optimal solution as the observed data size goes
Computational study
In this section, we conduct computational studies to validate the superiority of the proposed model in some real-world applications. We first study a multi-item two-stage inventory control problem based on the synthetic data in Section 5.1. Please refer to Appendix H for extensions of our model to two-stage stochastic programming. In this experiment, we show the improvements in the out-of-sample performances of our models compared to a data-imputation-based approach. In Section 5.2, we study a
Conclusion
This paper develops a new DRO framework for data-driven stochastic optimization facing incomplete data sets. Our model represents an integrated analysis of missing data and stochastic optimization, which is different from the most popular data-imputation-based approaches. We provide theoretical guarantees and utilize the concepts in MLE and the observed information matrix to obtain the bounds for the performance. Several classes of ambiguity sets with their reformulations are discussed. There
References (59)
- EM algorithm in Gaussian copula with missing data. Computational Statistics and Data Analysis (2016)
- Scenario construction and reduction applied to stochastic power generation expansion planning. Computers and Operations Research (2013)
- A practical inventory control policy using operational statistics. Operations Research Letters (2005)
- Probabilistic neural network based categorical data imputation. Neurocomputing (2016)
- Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters (2018)
- A distributionally robust optimization approach to reconstructing missing locations and paths using high-frequency trajectory data. Transportation Research Part C: Emerging Technologies (2019)
- Multi-stage production planning and inventory control (2012)
- A two-stage stochastic programming framework for transportation planning in disaster response. Journal of the Operational Research Society (2004)
- Robust solutions of optimization problems affected by uncertain probabilities. Management Science (2013)
- Data-driven robust optimization. Mathematical Programming (2018)
- Robust sample average approximation. Mathematical Programming
- Convex optimization
- Robust stochastic optimization made easy with RSOME. Management Science
- Distributionally robust optimization with infinitely constrained ambiguity sets. Operations Research
- Integer programming
- Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research
- Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological)
- Maximum likelihood estimation for incomplete multinomial data via the weaver algorithm. Statistics and Computing
- Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika
- Smart “predict, then optimize”. Management Science
- The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling
- Ambiguous chance constrained problems and robust optimization. Mathematical Programming
- Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming
- Pattern classification with missing data: A review. Neural Computing and Applications
- Deep learning
- Small-data, large-scale linear optimization with uncertain objectives. Management Science