Abstract
The behavior of reinforcement learning (RL) algorithms is best understood in completely observable, discrete-time controlled Markov chains with finite state and action spaces. In contrast, robot-learning domains are inherently continuous in both time and space, and moreover are partially observable. Here we suggest a systematic approach to solving such problems, in which the available qualitative and quantitative knowledge is used to reduce the complexity of the learning task. The steps of the design process are to: (i) decompose the task into subtasks using the qualitative knowledge at hand; (ii) design local controllers to solve the subtasks using the available quantitative knowledge; and (iii) learn to coordinate these controllers by means of reinforcement learning. It is argued that the approach enables fast, semi-automatic, yet still high-quality robot control, since no fine-tuning of the local controllers is needed. The approach was verified on a non-trivial real-life robot task. Several RL algorithms were compared by ANOVA, and the model-based approach was found to work significantly better than the model-free approach. The learnt switching strategy performed comparably to a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment that were not foreseen in advance, thus supporting the view that adaptive algorithms are advantageous over non-adaptive ones in complex environments.
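To make step (iii) concrete, the following Python sketch shows one possible way to learn such a switching strategy with tabular Q-learning, where the learner's actions are the indices of the local controllers. It is only an illustration under assumed names (ModuleCoordinator, select_module, update) and an assumed discrete, abstract state description; it is not the implementation used in the paper.

```python
# A minimal sketch, not the authors' implementation, of learning a switching
# (coordination) strategy over a fixed set of hand-designed local controllers
# ("modules") with tabular Q-learning.  The abstract state description and the
# reward signal stand in for the qualitative/quantitative knowledge supplied
# by steps (i) and (ii).
import random
from collections import defaultdict


class ModuleCoordinator:
    def __init__(self, n_modules, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_modules = n_modules                       # number of local controllers
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(lambda: [0.0] * n_modules)  # Q(abstract state, module)

    def select_module(self, state):
        # Epsilon-greedy choice of which controller to switch on in `state`.
        if random.random() < self.epsilon:
            return random.randrange(self.n_modules)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, module, reward, next_state):
        # One-step Q-learning update, applied when the chosen module terminates.
        best_next = max(self.q[next_state])
        td_error = reward + self.gamma * best_next - self.q[state][module]
        self.q[state][module] += self.alpha * td_error
```

In this sketch a decision step is the execution of one module until its own termination condition fires, so learning operates on the coarse, abstract state space produced by the decomposition rather than on raw sensor readings; a model-based variant would additionally estimate the transition probabilities between abstract states.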
Kalmár, Z., Szepesvári, C. & Lőrincz, A. Module-Based Reinforcement Learning: Experiments with a Real Robot. Autonomous Robots 5, 273–295 (1998). https://doi.org/10.1023/A:1008858222277