A data-driven robust optimization algorithm for black-box cases: An application to hyper-parameter optimization of machine learning algorithms
Graphical abstract
Introduction
One of the methods to cope with the uncertainty in the data/parameters of optimization problems, which can lead to violations of feasibility and optimality, is the so-called robust optimization (RO). Soyster (1973) proposed a linear programming (LP) model in which the noisy input data belong to a convex set. This approach is too conservative, since it ensures feasibility for all uncertain realizations and therefore incurs a large cost in optimality. Significant progress toward less conservative approaches was made independently by El Ghaoui and Lebret (1997) and Ben-Tal and Nemirovski (1998). In their approach, the uncertainty sets are assumed to be ellipsoidal, and a counterpart model with deterministic parameters is solved. Ben-Tal and Nemirovski (1998) proposed replacing an uncertain LP problem by its robust counterpart and showed that the robust counterpart of an LP problem with an ellipsoidal uncertainty set is a conic quadratic program that can be solved in polynomial time. However, such a method cannot be directly applied to discrete optimization (Bertsimas & Sim, 2003). Another drawback is that it leads to nonlinear, albeit convex, models, which are more computationally demanding than the earlier linear models. Later, Bertsimas and Sim (2004) suggested using intervals as uncertainty sets, for which the uncertain LP is transformed into a deterministic linear program through duality theory. The main advantage of their proposal is the ability to budget the uncertainty, i.e., to ensure feasibility while controlling the number of uncertain parameters that are allowed to deviate simultaneously, as recalled below.
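For a single constraint $\sum_j \tilde a_{ij} x_j \le b_i$ whose coefficients vary in intervals $\tilde a_{ij} \in [a_{ij}-\hat a_{ij},\, a_{ij}+\hat a_{ij}]$ for $j \in J_i$, with at most $\Gamma_i$ of them deviating simultaneously, the Bertsimas–Sim robust counterpart is the linear system (standard material from Bertsimas & Sim, 2004, recalled here for reference rather than a result of this paper):

$$\sum_j a_{ij} x_j + z_i \Gamma_i + \sum_{j \in J_i} p_{ij} \le b_i, \qquad z_i + p_{ij} \ge \hat a_{ij}\, y_j \;\; \forall j \in J_i, \qquad -y_j \le x_j \le y_j \;\; \forall j, \qquad p_{ij},\, y_j,\, z_i \ge 0,$$

so the budget $\Gamma_i$ directly controls how many coefficients may take their worst-case values.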
Stochastic programming (SP) is another powerful modelling paradigm for optimization under uncertainty. A generic single-stage SP is $\min_{x \in X} \mathbb{E}_{\mathbb{P}}[h(x,\xi)]$, where the expectation is taken with respect to the distribution $\mathbb{P}$ of the random vector $\xi$ and $h(x,\xi)$ is a cost function that depends on the decision $x$ as well. However, classical SP is not well suited to large-scale decision-making problems (Esfahani & Kuhn, 2018). As a remedy, an intermediate approach between SP and RO, called distributionally robust optimization (DRO), was proposed in the literature, in which the uncertain data are governed by a distribution that is itself subject to uncertainty. This distribution belongs to an ambiguity set comprising all distributions that are compatible with the prior knowledge (Wiesemann et al., 2014). The motivation for this approach is the availability of rich and extensive historical data in recent years. The first study in this field was that of Scarf (1957) in the context of an inventory control problem. Esteban-Pérez and Morales (2019) categorized the available DRO methods into the following three major classes (a generic worst-case formulation common to all three is recalled after the list):
- Studies such as Dupačová (1987), Prékopa (1995), Bertsimas and Sethuraman (2000), Delage and Ye (2010), Zymler et al. (2013), Xin and Goldberg (2013), Mehrotra and Papp (2014), Gao and Kleywegt (2016), Nakao et al. (2017), and Liu et al. (2018), which consider ambiguity sets based on the distribution moments.
- Works in which the ambiguity set is defined as the set of all distributions whose dissimilarity to a prescribed distribution is less than or equal to a given value. This class has the following three subclasses:
  - (I) The Wasserstein ambiguity set, used in Shafieezadeh-Abadeh et al. (2017), Gao and Kleywegt (2016, 2017), Blanchet et al. (2017a, 2017b), and Esfahani and Kuhn (2018).
  - (II) The φ-divergence, utilized in Ben-Tal et al. (2013), Bayraksan and Love (2015), Moghaddam and Mahlooji (2016), and Namkoong and Duchi (2016).
  - (III) The likelihood ratio with respect to the historical data, used in Nilim and El Ghaoui (2005), Iyengar (2005), Wang et al. (2016), and Duchi et al. (2016).
- In the third class, the ambiguity set is based on all distributions that, given a sample, pass a prescribed hypothesis test; examples are Marla et al. (2018), Bertsimas et al. (2018), and Chen et al. (2019), who used this approach.
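Regardless of which of the three classes above is used to build the ambiguity set $\mathcal{P}$, the resulting distributionally robust program takes the standard worst-case form (standard notation, recalled here for reference):

$$\min_{x \in X} \; \sup_{\mathbb{Q} \in \mathcal{P}} \; \mathbb{E}_{\mathbb{Q}}\big[h(x,\xi)\big],$$

where $\mathcal{P}$ collects all distributions consistent with the moment information, the dissimilarity ball, or the hypothesis test, respectively.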
Bertsimas et al. (2018) proposed using statistical hypothesis tests, an approach that is flexible, widely applicable, and tractable, both theoretically and practically. In addition, their optimal solution enjoys a strong, finite-sample guarantee when the constraints and the objective function are concave in the uncertainty. They described how to choose an appropriate set and applied their approach to multiple uncertain constraints. Nevertheless, their method requires a closed-form objective function, which is rarely available in real-world problems. As such, in this paper their approach is extended to handle objective functions that are black boxes and not given in closed form. To do this, a Gaussian meta-model is used in lieu of the true function, and the model proposed by Bertsimas et al. (2018) is then applied to it.
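To make the surrogate idea concrete, the following is a minimal Python sketch (not the authors' code): a Gaussian-process model is fitted to a handful of evaluations of a black-box objective and can then stand in for the true function. The names `black_box` and `surrogate_error`, the Matérn kernel, and the toy function are all illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): a Gaussian-process surrogate
# replaces an expensive black-box objective.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def black_box(x):
    # Stand-in for an expensive simulation or cross-validation run.
    return np.sin(3.0 * x[0]) + 0.5 * (x[1] - 0.2) ** 2

rng = np.random.default_rng(0)
X_design = rng.uniform(0.0, 1.0, size=(20, 2))           # design points
y_design = np.array([black_box(x) for x in X_design])    # expensive evaluations

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_design, y_design)                                # cheap meta-model of the black box

def surrogate_error(gp, f, rng, n_test=200):
    # Check how well the cheap meta-model mimics the expensive function.
    X_test = rng.uniform(0.0, 1.0, size=(n_test, 2))
    return np.max(np.abs(gp.predict(X_test) - np.array([f(x) for x in X_test])))

print("max surrogate error on test points:", surrogate_error(gp, black_box, rng))
```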
The rest of the paper is organized as follows. Section 2 provides a short review of robust optimization, Gaussian processes, data-driven robust optimization, and the Bayesian optimization method. The proposed data-driven robust optimization approach and the computational complexity of the proposed algorithm are explored in detail in Section 3. In Section 4, the performance of the suggested approach is demonstrated using test functions. Robust hyper-parameter optimization of machine learning algorithms, as one of the applications of the proposed algorithm, is addressed in Section 5. Finally, conclusions and future research directions are provided in Section 6.
Section snippets
Background
This section provides a brief background on robust optimization, Gaussian processes, data-driven robust optimization, and Bayesian optimization method.
Proposed algorithm
Putting all of the above preliminaries together, the following algorithm is proposed for minimization:
- Set the problem parameters.
- While the cross-validation criterion is not satisfied, do:
  - Set n (the number of design points).
  - Set the locations of the design points (X) by a Latin Hypercube Sampling (LHS) design.
  - Normalize X.
  - Calculate the responses at the design points.
  - For i = 1, …, n: perform a Kolmogorov–Smirnov test and find a distribution for each uncertain parameter at a predefined significance level.
  - Define a…
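A compact Python sketch of the loop described above is given below, assuming SciPy and scikit-learn; it strings together an LHS design, a Kolmogorov–Smirnov check of a candidate noise distribution at each design point, a Gaussian-process meta-model, and a worst-case minimization of the surrogate. All names are illustrative, and the cross-validation stopping rule and the exact construction of the data-driven uncertainty set in the actual DRSO algorithm are only hinted at.

```python
# Illustrative sketch of the steps listed above (not the authors' implementation).
import numpy as np
from scipy.stats import qmc, kstest
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def noisy_objective(x, rng=np.random.default_rng()):
    # Placeholder black box with additive noise; replace with the real simulator.
    return float(np.sum((x - 0.3) ** 2) + rng.normal(0.0, 0.05))

dim, n, reps = 2, 30, 15
X = qmc.LatinHypercube(d=dim, seed=0).random(n)           # LHS design, already in [0, 1]^dim
Y = np.array([[noisy_objective(x) for _ in range(reps)] for x in X])

# Kolmogorov-Smirnov test of a candidate (here normal) distribution at each design point.
for i in range(n):
    mu, sigma = Y[i].mean(), Y[i].std(ddof=1)
    result = kstest(Y[i], "norm", args=(mu, sigma))
    accepted = result.pvalue > 0.05                        # predefined significance level

# Gaussian-process meta-model of the mean response.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, Y.mean(axis=1))

def robust_surrogate(x, radius=0.05, n_pert=50):
    # Worst predicted value over a sampled neighbourhood of x (a crude uncertainty set).
    pert = np.random.default_rng(1).uniform(-radius, radius, size=(n_pert, dim))
    return gp.predict(np.clip(x + pert, 0.0, 1.0)).max()

res = minimize(robust_surrogate, x0=np.full(dim, 0.5), method="Nelder-Mead")
print("robust minimizer (scaled):", res.x, "worst-case prediction:", res.fun)
```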
Test functions
In order to evaluate the performance of the proposed approach, benchmark cases for data-driven robust optimization are required. The designs and the noise values of the test cases in this paper are similar to those used in Azizi et al. (2019), who employed the 12 test functions of Marzat et al. (2013).
Case studies
Machine learning algorithms have achieved a prominent position in many scientific and practical applications, which has led to an ever-growing demand for machine learning systems. This prominence is owed to their good performance, yet that performance relies heavily on choosing proper internal hyper-parameters (Feurer et al., 2015; Falkner et al., 2018). So, we design a way to automatically set these hyper-parameters to optimize the
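As a baseline illustration of this setting, the sketch below tunes two hyper-parameters of a random forest by plain Bayesian optimization over the cross-validated error, using a Gaussian-process surrogate and a lower-confidence-bound rule; the dataset, model, search ranges, and acquisition rule are assumptions made for illustration, and the robustness layer described in the paper is omitted.

```python
# Minimal sketch of automatic hyper-parameter tuning via a Gaussian-process surrogate
# over the cross-validated error (plain Bayesian optimization, no robust layer).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def cv_error(params):
    # params in [0, 1]^2 mapped to (n_estimators, max_depth); ranges are illustrative.
    n_estimators = int(10 + params[0] * 290)
    max_depth = int(2 + params[1] * 18)
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    return 1.0 - cross_val_score(clf, X, y, cv=5).mean()

# Initial design, GP surrogate, and a simple "minimize the predicted error" loop.
P = rng.uniform(size=(8, 2))
E = np.array([cv_error(p) for p in P])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(P, E)
    cand = rng.uniform(size=(500, 2))
    mean, std = gp.predict(cand, return_std=True)
    nxt = cand[np.argmin(mean - 1.0 * std)]   # lower-confidence-bound acquisition
    P = np.vstack([P, nxt])
    E = np.append(E, cv_error(nxt))

best = P[np.argmin(E)]
print("best params (scaled):", best, "cv error:", E.min())
```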
Conclusion
The prevalence of high-quality data and the lack of a closed-form objective function in many problems led us to design a robust data-driven optimization method in this paper. This method is founded on the uncertainty sets proposed by Bertsimas et al. (2018) and uses a Gaussian meta-model. One of the advantages of the designed method is that the user can specify the desired degree of robustness of the solution. The DRSO algorithm is computationally tractable and was shown to have a complexity
CRediT authorship contribution statement
Farshad Seifi: Conceptualization, Validation, Investigation, Data curation, Writing – original draft, Visualization. Mohammad Javad Azizi: Methodology, Software, Formal analysis. Seyed Taghi Akhavan Niaki: Resources, Writing – review & editing, Supervision, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (55)
- Robust optimization: Lessons learned from aircraft routing. Computers & Operations Research (2018).
- Network design in scarce data environment using moment-based distributionally robust optimization. Computers & Operations Research (2017).
- A robust simulation optimization algorithm using kriging and particle swarm optimization: Application to surgery room optimization. Communications in Statistics-Simulation and Computation (2019).
- Bayraksan, G., & Love, D. K. (2015). Data-driven stochastic programming using phi-divergences. In The Operations Research...
- Robust convex optimization. Mathematics of Operations Research (1998).
- Robust optimization (2009).
- Robust solutions of optimization problems affected by uncertain probabilities. Management Science (2013).
- Moment problems and semidefinite optimization. Handbook of Semidefinite Programming (2000).
- Robust discrete optimization and network flows. Mathematical Programming (2003).
- The price of robustness. Operations Research (2004).
- Data-driven robust optimization. Mathematical Programming.
- Convex optimization.
- Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research.
- The minimax approach to stochastic programming and an illustrative application. Stochastics: An International Journal of Probability and Stochastic Processes.
- Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications.
- Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming.
- Efficient and robust automated machine learning. Advances in Neural Information Processing Systems.
- Hyperparameter optimization. Automated Machine Learning.
- Multi-objective model selection for support vector machines.