Perspectives of approximate dynamic programming

Annals of Operations Research 241, 319–356 (2016)

Abstract

Approximate dynamic programming has evolved, initially independently, within operations research, computer science and the engineering controls community, all searching for practical tools for solving sequential stochastic optimization problems. More so than other communities, operations research continued to develop the theory behind the basic model introduced by Bellman with discrete states and actions, even while authors as early as Bellman himself recognized its limits due to the “curse of dimensionality” inherent in discrete state spaces. In response to these limitations, subcommunities in computer science, control theory and operations research have developed a variety of methods for solving different classes of stochastic, dynamic optimization problems, creating the appearance of a jungle of competing approaches. In this article, we show that there is actually a common theme to these strategies, and that underpinning the entire field remain the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950s and 1960s.
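For readers coming to this from outside the dynamic programming literature, the value and policy iteration algorithms referred to above are the classical procedures of Bellman and Howard. The sketch below is purely illustrative and is not taken from the article: it runs textbook value iteration on a tiny, hypothetical discrete Markov decision process (the sizes, rewards and transition probabilities are invented), and the explicit sweep over every discrete state is exactly the step that the “curse of dimensionality” makes intractable for realistic problems.

```python
# Illustrative sketch only: textbook value iteration on a small, hypothetical MDP.
# All problem data (states, actions, rewards, transition probabilities) are made up.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9

rng = np.random.default_rng(0)
# P[a, s, s'] = probability of landing in state s' after taking action a in state s
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# R[a, s] = expected one-step reward for taking action a in state s
R = rng.uniform(0.0, 1.0, size=(n_actions, n_states))

V = np.zeros(n_states)  # value function, one entry per discrete state
for _ in range(1000):
    # Bellman update: V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    Q = R + gamma * (P @ V)        # Q[a, s]
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the update has converged
        V = V_new
        break
    V = V_new

policy = (R + gamma * (P @ V)).argmax(axis=0)  # greedy policy from the converged values
print("V* =", np.round(V, 3), "greedy policy =", policy)
```

Approximate dynamic programming keeps this basic iterative structure but replaces the exact lookup table V with a statistical approximation, precisely because enumerating the state space (and computing the expectation inside the update) is impossible for the high-dimensional problems the article discusses.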

Author information

Corresponding author

Correspondence to Warren B. Powell.

About this article

Cite this article

Powell, W.B. Perspectives of approximate dynamic programming. Ann Oper Res 241, 319–356 (2016). https://doi.org/10.1007/s10479-012-1077-6
