Stochastics and Statistics
A study of data-driven distributionally robust optimization with incomplete joint data under finite support

https://doi.org/10.1016/j.ejor.2022.06.032

Highlights

  • Addressing the missing data issue in data-driven stochastic programming problems.

  • Offering a new distributionally robust optimization (DRO) framework that extends the current studies on DRO by proposing ambiguity sets that are constructed based on incomplete data.

  • Obtaining theoretical guarantees (finite sample guarantees and statistical consistency) and tractable reformulations for the proposed distributionally robust optimization models.

  • Validating and providing computational experiments for evaluation using synthetic and real-world data for multi-item inventory control and portfolio optimization problems.

Abstract

Missing data is a common issue in many practical data-driven stochastic programming problems. State-of-the-art approaches first estimate the missing values and then separately solve the corresponding stochastic program. Accurate estimation of missing values is often out of reach because it requires large amounts of data and sophisticated statistical methods. Therefore, this paper proposes an integrated approach, a distributionally robust optimization (DRO) framework, that simultaneously tackles the missing data problem and data-driven stochastic optimization by hedging against the uncertainty in the missing values. This paper adds to the DRO literature by considering the practical scenario where the data can be incomplete and partially observable; it focuses in particular on data distributions with finite support. We construct several classes of ambiguity sets for our DRO model using the incomplete data sets, maximum likelihood estimation, and different metrics. We prove the statistical consistency and finite sample guarantees of the corresponding models and provide tractable reformulations of our model for different scenarios. We perform computational studies on the multi-item inventory control problem and portfolio optimization using synthetic and real-world data. The computational results show that our method outperforms the traditional estimate-then-optimize approaches.

Introduction

This paper aims to address the issue of missing data (incomplete data) in data-driven stochastic programming. Stochastic programming is an important framework for optimization under uncertainty. It generally assumes that a probability distribution of the random variable ξ is available and seeks an optimal solution in terms of expected performance. In practice, the distribution of ξ is often unknown; consequently, data-driven approaches have been proposed to solve this problem. These data-driven methods only work well if well-conditioned historical data for ξ are available. However, if ξ is multidimensional, complete joint data of ξ are often hard to obtain in many real-world settings due to the following issues:

  • Missing data in some dimensions for ξ.

  • Sharing of data is limited among dimensions representing different components.

  • Different sizes of data in different dimensions.

Missing data is a common issue in practical operations research (OR) problems, as the data sets for many applications are incomplete. For example, a common challenge in portfolio optimization is that the historical return data for assets have missing values (Radulescu, 2013, Taylor, 2006). Data in many other application domains are also incomplete, including large-scale transportation management systems (TMSs), which monitor traffic conditions in order to reduce congestion. The mobility monitoring program of the Texas Transportation Institute (TTI) reports that, after screening out erroneous data, TMS data archives can be anywhere from 16% to 93% complete (Smith, Scherer, & Conklin, 2003).

The most popular method for handling missing data is data imputation. In this approach, missing values are imputed before optimization and other analyses are performed on the completed data set. Many variants of data imputation have been proposed in the literature (Dempster, Laird, Rubin, 1977, Little, Rubin, 2019, Stekhoven, Bühlmann, 2012, Troyanskaya, Cantor, Sherlock, Brown, Hastie, Tibshirani, et al., 2001, Wang, Li, Jiang, Feng, 2006). In general, machine learning tools are designed to minimize prediction error and do not consider how the predictions will affect the downstream optimization problem (Elmachtoub & Grigas, 2022). Using data-imputation methods for data-driven stochastic optimization with incomplete data suffers from two major issues. First, many recent studies have shown that estimate-then-optimize methods give suboptimal solutions (Delage, Ye, 2010, Esfahani, Kuhn, 2018, Gupta, Rusmevichientong, 2021, Liyanage, Shanthikumar, 2005). Liyanage & Shanthikumar (2005) analyze the newsvendor inventory control problem with ambiguous demand and show that the estimate-then-optimize approach leads to a suboptimal inventory policy. The authors present a better solution by integrating the estimation and the optimization tasks. Second, theoretical guarantees are hard to obtain, because the analysis of the missing data and the derivation of the stochastic program's optimal solutions are conducted separately.

We propose an integrated approach, a distributionally robust optimization (DRO) framework, that combines the estimation and the optimization steps for data-driven stochastic programming with incomplete data. Unlike the existing missing data literature, our approach does not attempt to estimate the missing values. The goal of the presented method is to solve the corresponding data-driven stochastic program. We quantify the uncertainty introduced by the missing values and incorporate it into the stochastic program through the DRO framework.

This work also extends the DRO literature. DRO (Delage, Ye, 2010, Esfahani, Kuhn, 2018, Gao, Chen, & Kleywegt, Jiang, Guan, 2018, Zhao, Zhang, 2019) is a powerful modeling paradigm that combines the estimation and the optimization steps. This approach first constructs an ambiguity set P based on the available data set; optimization techniques are then used to solve the model with respect to the worst-case distribution within the ambiguity set. However, to the best of our knowledge, existing DRO works have not considered ambiguity sets based on incomplete data sets. Zhao & Zhang (2019) do consider a missing data problem that arises in incomplete trajectory data. However, their main goal is to reconstruct the missing location-duration path choices, and their ambiguity set is still based on complete historical data.
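To make the worst-case step concrete, the following Python sketch computes the largest expected cost over all distributions within an L1-norm ball around a nominal distribution on a finite support. It is only an illustration under stated assumptions: the function name `worst_case_expectation`, the scenario costs, the nominal distribution `p_hat`, and the radius `eps` are hypothetical, and the greedy rule used here is a standard closed form for this special case, not the reformulation derived in the paper.

```python
def worst_case_expectation(costs, p_hat, eps):
    """Maximize sum_k p[k] * costs[k] over probability vectors p
    with ||p - p_hat||_1 <= eps on the same finite support.

    Returns (worst_case_value, worst_case_distribution).
    """
    n = len(costs)
    p = list(p_hat)

    # Scenario with the highest cost receives the extra mass.
    top = max(range(n), key=lambda k: costs[k])

    # Moving mass delta from one scenario to another changes the
    # L1 distance by 2 * delta, so at most eps / 2 can be moved.
    budget = min(eps / 2.0, 1.0 - p[top])
    p[top] += budget

    # Remove the same amount of mass from the cheapest scenarios first.
    for k in sorted(range(n), key=lambda k: costs[k]):
        if k == top or budget <= 0.0:
            continue
        take = min(p[k], budget)
        p[k] -= take
        budget -= take

    value = sum(pk * ck for pk, ck in zip(p, costs))
    return value, p


if __name__ == "__main__":
    # Hypothetical costs Q(x, xi_k) of a fixed decision x in three scenarios.
    scenario_costs = [1.0, 2.0, 4.0]
    nominal = [0.5, 0.3, 0.2]
    print(worst_case_expectation(scenario_costs, nominal, eps=0.4))
```

A DRO model would then minimize this worst-case value over the decision x. Larger values of `eps` make the decision more conservative, while `eps = 0` recovers the nominal expected cost.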

The main contributions of this paper are as follows. It proposes a DRO framework for data-driven optimization problems with incomplete (partially observable) data sets. We consider data distributions with finite support and assume that only partially observed data are available, meaning that the components of each data point are randomly missing. We present several classes of ambiguity sets based on the incomplete data for our DRO framework and discuss their properties. We provide finite sample guarantees for our DRO model by incorporating the observed information matrix (Efron & Hinkley, 1978) into our analysis. We also prove statistical consistency using the properties of maximum likelihood estimation. Tractable reformulations of the models are presented. Finally, we conduct computational studies on both synthetic and real-world data to compare the performance of the proposed approaches with that of data-imputation-based approaches. Below, we highlight the details of the contributions.

  1. A new DRO framework based on incomplete data is proposed. The proposed DRO framework is fundamentally different from the popular data-imputation-based methods. It is an integrated model that solves the missing data problem and the stochastic program simultaneously instead of following the estimate-then-optimize procedure. By adopting a DRO framework, the presented models are robust to the uncertainty in the missing values. Therefore, they greatly improve out-of-sample performance in applications where the optimal solutions are sensitive to unknown parameters.

  2. Our DRO framework extends the current studies on DRO by proposing ambiguity sets that are constructed directly from the incomplete data set. We construct several types of ambiguity sets based on f-divergence, the Wasserstein metric, the L1 norm, and ellipsoids in the probability space. The centers of these ambiguity sets are chosen as the maximum likelihood estimators (MLE) computed from the incomplete data set. The first two kinds of ambiguity sets are inspired by two general metrics used in the DRO literature. The last one is inspired by the special structure of our model. We discuss that the ellipsoid ambiguity set is asymptotically optimal in the sense that it contains the complete data (true) distribution with the highest probability among all ambiguity sets having the same volume.

  3. We obtain theoretical guarantees and tractable reformulations for the proposed models. We first derive the finite sample guarantees of our models by providing a probabilistic upper bound on their out-of-sample performance. Our analyses are based on asymptotic normality and the observed information matrix (empirical Fisher information). We then prove the statistical consistency guarantee, which means that the solution of our model converges in probability to the true optimal solution as the number of observed data points goes to infinity. Finally, we show that the reformulations can be solved efficiently if the cost functions of the original stochastic program are convex and the feasible regions are convex or mixed-integer linear sets.

  4. We provide computational experiments for evaluation using both synthetic and real-world data. Two applications are studied: the multi-item inventory control problem and portfolio optimization. For the multi-item inventory control problem, we illustrate the reductions in total cost achieved by our models on synthetic data. For portfolio optimization, we benchmark our frameworks against the data-imputation-based method using 57 pairs of training sets and test sets. These sets are obtained from real-world historical returns of exchange-traded funds (ETFs) and the US central bank (FED) rate of return from 2006 to 2016 (Boyd, 2019). We support the claimed improvements by showing that the proposed models consistently yield better out-of-sample performance.
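The ambiguity sets in the second contribution are centered at a maximum likelihood estimate computed from incomplete data. As a rough illustration of what such an estimate can look like, the sketch below runs the EM algorithm (Dempster et al., 1977) for a two-dimensional random vector with finite support. It assumes that components are missing at random in a way that can be ignored for likelihood purposes, represents a missing component by `None`, and uses the hypothetical function name `em_joint_mle`; it is not the estimation procedure used in the paper.

```python
def em_joint_mle(records, n_a, n_b, iters=200):
    """EM estimate of a joint distribution p[a][b] on {0..n_a-1} x {0..n_b-1}.

    Each record is a pair (a, b); a missing component is given as None.
    Records with both components missing carry no information and are skipped.
    """
    records = [r for r in records if not (r[0] is None and r[1] is None)]
    n = len(records)

    # Start from the uniform distribution so every cell has positive mass.
    p = [[1.0 / (n_a * n_b)] * n_b for _ in range(n_a)]

    for _ in range(iters):
        counts = [[0.0] * n_b for _ in range(n_a)]

        # E-step: spread each record over the cells compatible with its
        # observed components, proportionally to the current estimate.
        for a, b in records:
            cells = [
                (i, j)
                for i in range(n_a)
                for j in range(n_b)
                if (a is None or i == a) and (b is None or j == b)
            ]
            total = sum(p[i][j] for i, j in cells)
            for i, j in cells:
                counts[i][j] += p[i][j] / total

        # M-step: normalize the expected counts.
        p = [[c / n for c in row] for row in counts]

    return p


if __name__ == "__main__":
    # Hypothetical sample: the third record is missing its second component.
    sample = [(0, 0), (1, 1), (0, None)]
    print(em_joint_mle(sample, n_a=2, n_b=2))
```

With fully observed records the estimate reduces to the empirical frequencies after one iteration. Partially observed records shift mass toward the cells that the complete records already support.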

The rest of the paper is organized as follows. We briefly review related works in Section 2. The background and notation are introduced in Section 3. We present the main results in Section 4. More specifically, in Section 4.1, we propose one ambiguity set, and we prove the corresponding statistical consistency and finite sample guarantees in Sections 4.2 and 4.3, respectively. We also discuss its reformulations in Section 4.4 and different classes of ambiguity sets in Section 4.5. Computational studies are summarized in Section 5. Section 6 concludes the paper. All the proofs and the extensions to two-stage stochastic programming are provided in the Appendix.

Section snippets

Related works

This section briefly reviews related works about missing data and data-driven optimization under uncertainty.

Missing data. Missing or incomplete data has been studied widely, especially in the machine learning and statistics literature (García-Laencina, Sancho-Gómez, Figueiras-Vidal, 2010, Goodfellow, Bengio, Courville, 2016, Rubin, 1976). One of the most natural ways to handle missing data is to discard any data points that include missing values. However, this approach may lead to biased

Notations and background

Throughout this paper, we use the following notations and assumptions. A stochastic program can be formulated as $\min_{x \in X} \mathbb{E}[Q(x,\xi)]$, where Q(x,ξ) represents a cost function with respect to decision x, and X represents the feasible region of x. In this paper, we study the problem whose random variable ξ has known finite support. Finite discrete support sets are common and studied in many operations research problems (Barbarosoğlu, Arda, 2004, Feng, Ryan, 2013, Rujeerapaiboon, Schindler, Kuhn,

Main model

In this section, we first introduce and discuss our model based on one specific ambiguity set. Then, we introduce the concepts of asymptotic normality and the observed information matrix to obtain the theoretical guarantees, which contain two main points. First, we prove the consistency result in Section 4.2. The consistency follows the definition in Van der Vaart (2000) meaning that the solution of our model converges in probability to the true optimal solution as the observed data size N goes

Computational study

In this section, we conduct computational studies to validate the superiority of the proposed model in some real-world applications. We first study a multi-item two-stage inventory control problem based on the synthetic data in Section 5.1. Please refer to Appendix H for extensions of our model to two-stage stochastic programming. In this experiment, we show the improvements in the out-of-sample performances of our models compared to a data-imputation-based approach. In Section 5.2, we study a

Conclusion

This paper develops a new DRO framework for data-driven stochastic optimization facing incomplete data sets. Our model represents an integrated analysis of missing data and stochastic optimization, which is different from the most popular data-imputation-based approaches. We provide theoretical guarantees and utilize the concepts in MLE and the observed information matrix to obtain the bounds for the performance. Several classes of ambiguity sets with their reformulations are discussed. There

References (59)

  • D. Bertsimas et al. Robust sample average approximation. Mathematical Programming (2018)
  • Boyd, S. (2019). Data for finance and portfolio optimization....
  • S. Boyd et al. Convex optimization (2004)
  • Z. Chen et al. Robust stochastic optimization made easy with RSOME. Management Science (2020)
  • Z. Chen et al. Distributionally robust optimization with infinitely constrained ambiguity sets. Operations Research (2019)
  • M. Conforti et al. Integer programming (2014)
  • E. Delage et al. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research (2010)
  • A.P. Dempster et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) (1977)
  • F. Dong et al. Maximum likelihood estimation for incomplete multinomial data via the weaver algorithm. Statistics and Computing (2018)
  • B. Efron et al. Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika (1978)
  • A.N. Elmachtoub et al. Smart “predict, then optimize”. Management Science (2022)
  • C.K. Enders et al. The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling (2001)
  • E. Erdoğan et al. Ambiguous chance constrained problems and robust optimization. Mathematical Programming (2006)
  • P.M. Esfahani et al. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming (2018)
  • Gao, R., Chen, X., & Kleywegt, A. J. (2017). Wasserstein distributional robustness and regularization in statistical...
  • Gao, R., & Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv...
  • P.J. García-Laencina et al. Pattern classification with missing data: A review. Neural Computing and Applications (2010)
  • I. Goodfellow et al. Deep learning (2016)
  • V. Gupta et al. Small-data, large-scale linear optimization with uncertain objectives. Management Science (2021)