Abstract
Approximate dynamic programming has evolved, initially independently, within operations research, computer science and the engineering controls community, all searching for practical tools for solving sequential stochastic optimization problems. More so than other communities, operations research continued to develop the theory behind the basic model introduced by Bellman with discrete states and actions, even while authors as early as Bellman himself recognized its limits due to the “curse of dimensionality” inherent in discrete state spaces. In response to these limitations, subcommunities in computer science, control theory and operations research have developed a variety of methods for solving different classes of stochastic, dynamic optimization problems, creating the appearance of a jungle of competing approaches. In this article, we show that there is actually a common theme to these strategies, and that underpinning the entire field remain the fundamental algorithmic strategies of value iteration and policy iteration that were first introduced in the 1950s and 1960s.
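To make the abstract's central claim concrete, the following is a minimal sketch (not taken from the article) of classical value iteration on a tiny discrete Markov decision process, the Bellman-era algorithm the article argues still underpins modern approximate methods. The two-state example, its rewards, and its transition probabilities are invented purely for illustration.

```python
def value_iteration(n_states, actions, P, R, gamma=0.9, tol=1e-8):
    """Classical value iteration for a discrete MDP.

    P[s][a] is a list of (probability, next_state) pairs,
    R[s][a] is the expected one-step reward, gamma the discount factor.
    """
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman update: best one-step reward plus discounted future value.
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# A made-up two-state example: action 0 stays put, action 1 moves
# to the other state with probability 0.8.
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 1)], 1: [(0.8, 0), (0.2, 1)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
V = value_iteration(2, [0, 1], P, R)
```

The loop enumerates every state explicitly, which is exactly the "curse of dimensionality" the abstract describes: the sweep becomes intractable once the state space is large or multidimensional, and approximate methods replace the table `V` with a statistical approximation.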
References
Barto, A. G., Sutton, R. S., & Brouwer, P. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40(3), 201–211.
Barto, A., Sutton, R. S., & Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.
Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.
Bellman, R. E. (1971). Introduction to the mathematical theory of control processes (Vol. II). New York: Academic Press.
Bellman, R. E., & Dreyfus, S. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids To Computation, 13, 247–251.
Bertsekas, D. P. (2011a). Approximate dynamic programming. In Dynamic programming and optimal control (Vol. II, 3rd ed.). Belmont: Athena Scientific, Chap. 6.
Bertsekas, D. P. (2011b). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3), 310–335.
Bertsekas, D. P., & Castanon, D. A. (1999). Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5, 89–108.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
Birge, J. R., & Louveaux, F. (1997). Introduction to stochastic programming. New York: Springer.
Boesel, J., Nelson, B., & Kim, S. (2003). Using ranking and selection to “clean up” after simulation optimization. Operations Research, 51(5), 814–825.
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1), 33–57.
Burnetas, A., & Katehakis, M. N. (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1), 222–225.
Cheung, R. K.-M., & Powell, W. B. (1996). An algorithm for multistage dynamic networks with random arc capacities, with an application to dynamic fleet management. Operations Research, 44, 951–963.
Chick, S. E., & Gans, N. (2009). Economic analysis of simulation selection problems. Management Science, 55(3), 421–437.
Dantzig, G. (1955). Linear programming under uncertainty. Management Science, 1, 197–206.
Dantzig, G., & Ferguson, A. (1956). The allocation of aircraft to routes: An example of linear programming under uncertain demand. Management Science, 3, 45–73.
Denardo, E. V. (1982). Dynamic programming. Englewood Cliffs: Prentice-Hall.
Derman, C. (1962). On sequential decisions and Markov chains. Management Science, 9(1), 16–24.
Derman, C. (1966). Denumerable state Markovian decision processes-average cost criterion. Annals of Mathematical Statistics, 37(6), 1545–1553.
Derman, C. (1970). Finite state Markovian decision processes. New York: Academic Press.
Dreyfus, S., & Law, A. M. (1977). The art and theory of dynamic programming. New York: Academic Press.
Dupačová, J., Consigli, G., & Wallace, S. W. (2000). Scenarios for multistage stochastic programs. Annals of Operations Research, 100, 25–53.
Dupačová, J. (1995). Multistage stochastic programs—the state of the art and selected bibliography. Kybernetika, 31, 151–174.
Dynkin, E. B., & Yushkevich, A. A. (1979). Controlled Markov processes. Grundlehren der mathematischen Wissenschaften: Vol. 235. New York: Springer.
Enders, J., Powell, W. B., & Egan, D. M. (2010). Robust policies for the transformer acquisition and allocation problem. Energy Systems, 1(3), 245–272.
Frazier, P. I., Powell, W. B., & Dayanik, S. (2008). A knowledge gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5), 2410–2439.
Frazier, P. I., Powell, W. B., & Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4), 599–613.
George, A., Powell, W. B., & Kulkarni, S. (2008). Value function approximation using multiple aggregation for multiattribute resource management. Journal of Machine Learning Research, 9, 2079–2111.
Gittins, J., Glazebrook, K., & Weber, R. R. (2011). Multi-armed bandit allocation indices. New York: Wiley.
Gröwe-Kuska, N., Heitsch, H., & Römisch, W. (2003). Scenario reduction and scenario tree construction for power management problems. In A. Borghetti, C. A. Nucci, & M. Paolone (Eds.), IEEE Bologna power tech proceedings.
Gupta, S., & Miescke, K. (1996). Bayesian look ahead one-stage sampling allocations for selection of the best population. Journal of Statistical Planning and Inference, 54(2), 229–244.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction. New York: Springer.
Haykin, S. (1999). Neural networks: A comprehensive foundation. New York: Prentice Hall.
Heyman, D. P., & Sobel, M. (1984). Stochastic models in operations research: Vol. II. Stochastic optimization. New York: McGraw-Hill.
Higle, J., & Sen, S. (1996). Stochastic decomposition: A statistical method for large scale stochastic linear programming. Dordrecht: Kluwer Academic.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1185–1201.
Judd, K. L. (1998). Numerical methods in economics. Cambridge: MIT Press.
Kall, P., & Wallace, S. (1994). Stochastic programming. New York: Wiley.
Katehakis, M. N., & Derman, C. (1986). Computing optimal sequential allocation rules in clinical trials. In Lecture notes monograph series (Vol. 8, pp. 29–39). Hayward: Institute of Mathematical Statistics.
Katehakis, M. N., & Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92, 8584–8585.
Katehakis, M. N., & Veinott, A. F. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12(2), 262–268.
Kaut, M., & Wallace, S. W. (2003). Evaluation of scenario-generation methods for stochastic programming. Stochastic programming e-print series.
Kushner, H. J., & Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications. Berlin: Springer.
Law, A., & Kelton, W. (1991). Simulation modeling and analysis (Vol. 2). New York: McGraw-Hill.
Lewis, F., Jagannathan, S., & Yesildirek, A. (1999). Neural network control of robot manipulators and nonlinear systems. New York: CRC Press.
Lewis, F. L., & Syrmos, V. L. (1995). Optimal control. Hoboken: Wiley-Interscience.
Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Maei, H. R., Szepesvari, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In ICML-2010.
Negoescu, D. M., Frazier, P. I., & Powell, W. B. (2011). The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3), 346–363.
Nemhauser, G. L. (1966). Introduction to dynamic programming. New York: Wiley.
Powell, W., & Ryzhov, I. (2012). Optimal learning. Hoboken: Wiley.
Powell, W. B. (1987). An operational planning model for the dynamic vehicle allocation problem with uncertain demands. Transportation Research, 21B, 217–232.
Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality. Hoboken: Wiley.
Powell, W. B. (2010). Merging AI and OR to solve high-dimensional stochastic optimization problems using approximate dynamic programming. INFORMS Journal on Computing, 22(1), 2–17.
Powell, W. B. (2011). Approximate dynamic programming: Solving the curses of dimensionality (2nd ed.). Hoboken: Wiley.
Powell, W. B., & Frantzeskakis, L. F. (1990). A successive linear approximation procedure for stochastic dynamic vehicle allocation problems. Transportation Science, 24, 40–57.
Powell, W. B., & Godfrey, G. (2002). An adaptive dynamic programming algorithm for dynamic fleet management, I: Single period travel times. Transportation Science, 36(1), 21–39.
Powell, W. B., & Ma, J. (2011). A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9(3), 336–352.
Powell, W. B., & Simão, H. (2009). Approximate dynamic programming for management of high-value spare parts. Journal of Manufacturing Technology and Management, 20(2), 147–160.
Powell, W. B., & Topaloglu, H. (2005). Fleet management. In S. Wallace & W. Ziemba (Eds.), SIAM series in optimization. Applications of stochastic programming (pp. 185–216). Philadelphia: Math Programming Society.
Powell, W. B., & Van Roy, B. (2004). Approximate dynamic programming for high dimensional resource allocation problems. In J. Si, A. G. Barto, W. B. Powell, & D. Wunsch II (Eds.), Handbook of learning and approximate dynamic programming. New York: IEEE Press.
Powell, W. B., George, A., Lamont, A., & Stewart, J. (2011). SMART: A stochastic multiscale model for the analysis of energy resources, technology and policy. INFORMS Journal on Computing. http://dx.doi.org/10.1287/ijoc.1110.0470.
Puterman, M. L. (1994). Markov decision processes (1st ed.). Hoboken: Wiley.
Puterman, M. L. (2005). Markov decision processes (2nd ed.). Hoboken: Wiley.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400–407.
Römisch, W., & Heitsch, H. (2009). Scenario tree modeling for multistage stochastic programs. Mathematical Programming, 118, 371–406.
Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
Ryzhov, I., & Powell, W. B. (2011). Bayesian active learning with basis functions. In 2011 IEEE symposium series on computational intelligence, No 3. Paris: IEEE Press.
Ryzhov, I., Frazier, P. I., & Powell, W. B. (2012). Stepsize selection for approximate value iteration and a new optimal stepsize rule (Technical report). Department of Operations Research and Financial Engineering, Princeton University.
Ryzhov, I. O., Powell, W. B., & Frazier, P. I. (n.d.). The knowledge gradient algorithm for a general class of online learning problems.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229.
Sen, S., & Higle, J. (1999). An introductory tutorial on stochastic linear programming models. Interfaces, 29(2), 33–61.
Si, J., & Wang, Y. T. (2001). Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264–276.
Si, J., Barto, A. G., Powell, W. B., & Wunsch, D. (2004). Handbook of learning and approximate dynamic programming. New York: Wiley-IEEE Press.
Silver, D. (2009). Reinforcement learning and simulation-based search in computer go. PhD thesis, University of Alberta.
Simão, H. P., Day, J., George, A. P., Gifford, T., Powell, W. B., & Nienow, J. (2009). An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Science, 43(2), 178–197.
Simão, H. P., George, A., Powell, W. B., Gifford, T., Nienow, J., & Day, J. (2010). Approximate dynamic programming captures fleet operations for Schneider national. Interfaces, 40(5), 1–11.
Spall, J. C. (2003). Introduction to stochastic search and optimization: Estimation, simulation and control. Hoboken: Wiley.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks. Psychological Review, 88(2), 135–170.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., & Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning—ICML’09 (pp. 1–8). New York: ACM Press.
Sutton, R. S., Szepesvari, C., & Maei, H. (2009b). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in neural information processing systems (Vol. 21, pp. 1609–1616).
Topaloglu, H., & Powell, W. B. (2006). Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems. INFORMS Journal on Computing, 18, 31–42.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.
Van Roy, B., Bertsekas, D. P., Lee, Y., & Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the IEEE conference on decision and control (Vol. 4, pp. 4052–4057).
Venayagamoorthy, G., & Harley, R. (2002). Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator. IEEE Transactions on Neural Networks, 13(3), 764–773.
Wang, F.-Y., Zhang, H., & Liu, D. (2009). Adaptive dynamic programming: An introduction. IEEE Computational Intelligence Magazine, May, 39–47.
Watkins, C. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge, England.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.
Werbos, P. J. (1989). Backpropagation and neurocontrol: A review and prospectus. Neural Networks, 209–216.
Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3, 179–189.
Werbos, P. J. (1992a). Approximate dynamic programming for real-time control and neural modelling. In D. J. White & D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches.
Werbos, P. J. (1992b). Neurocontrol and supervised learning: An overview and valuation. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches.
Werbos, P. J., Miller, W. T., & Sutton, R. S. (Eds.) (1990). Neural networks for control. Cambridge: MIT Press.
White, D. J. (1969). Dynamic programming. San Francisco: Holden-Day.
Wu, T., Powell, W. B., & Whisman, A. (2009). The optimizing-simulator: An illustration using the military airlift problem. ACM Transactions on Modeling and Simulation, 19(3), 1–31.
Powell, W.B. Perspectives of approximate dynamic programming. Ann Oper Res 241, 319–356 (2016). https://doi.org/10.1007/s10479-012-1077-6