
JACIII Vol.21 No.5, pp. 930-938 (2017)
doi: 10.20965/jaciii.2017.p0930

Paper:

Proposal of PSwithEFP and its Evaluation in Multi-Agent Reinforcement Learning

Kazuteru Miyazaki*, Koudai Furukawa**, and Hiroaki Kobayashi***

*National Institution for Academic Degrees and Quality Enhancement of Higher Education
1-29-1 Gakuennishimachi, Kodaira, Tokyo 185-8587, Japan

**IHI Transport Machinery Co., Ltd.
8-1 Akashi-cho, Chuo-ku, Tokyo 104-0044, Japan

***Meiji University
1-1-1 Higashimita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan

Received: March 14, 2017
Accepted: July 21, 2017
Published: September 20, 2017
Keywords: multi-agent learning, reinforcement learning, exploitation-oriented learning, profit sharing, expected failure probability
Abstract

When multiple agents learn a task simultaneously in an environment, the learning results often become unstable. This problem is known as the concurrent learning problem, and several methods have been proposed to resolve it. In this paper, we propose a new method that incorporates the expected failure probability (EFP) into the action selection strategy to give agents a kind of mutual adaptability. The effectiveness of the proposed method is confirmed on the Keepaway task.
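
For readers unfamiliar with the idea summarized above, the following minimal Python sketch illustrates one way an EFP estimate could modulate Profit-Sharing-style action selection. The class name, the geometric credit assignment, and the (1 - EFP) weighting are illustrative assumptions only, not the PSwithEFP algorithm proposed in the paper.

    # Illustrative sketch only: Profit Sharing (PS) weights combined with an
    # expected-failure-probability (EFP) estimate at action-selection time.
    # The names, the geometric credit assignment, and the (1 - EFP) weighting
    # are assumptions for illustration, not the paper's PSwithEFP algorithm.
    import random
    from collections import defaultdict

    class PSWithEFPSketch:
        def __init__(self, n_actions, credit_ratio=0.5, efp_lr=0.1):
            self.n_actions = n_actions
            self.credit_ratio = credit_ratio  # geometric decay of credit along an episode
            self.efp_lr = efp_lr              # learning rate for the EFP estimate
            self.weight = defaultdict(lambda: [1.0] * n_actions)  # PS rule weights
            self.efp = defaultdict(lambda: [0.0] * n_actions)     # estimated failure probability
            self.episode = []                 # (state, action) pairs of the current episode

        def select_action(self, state):
            # Roulette selection on PS weights discounted by (1 - EFP):
            # actions expected to lead to failure are chosen less often.
            prefs = [w * (1.0 - p) for w, p in zip(self.weight[state], self.efp[state])]
            total = sum(prefs)
            if total <= 0.0:
                return random.randrange(self.n_actions)
            r = random.uniform(0.0, total)
            acc = 0.0
            for a, pref in enumerate(prefs):
                acc += pref
                if r <= acc:
                    return a
            return self.n_actions - 1

        def store(self, state, action):
            self.episode.append((state, action))

        def end_episode(self, reward, failed):
            # On reward: Profit-Sharing-style geometric credit over the episode.
            credit = reward
            for state, action in reversed(self.episode):
                self.weight[state][action] += credit
                credit *= self.credit_ratio
            # On failure: push the EFP estimate of visited pairs toward 1,
            # otherwise toward 0 (a simple exponential-average placeholder).
            target = 1.0 if failed else 0.0
            for state, action in self.episode:
                self.efp[state][action] += self.efp_lr * (target - self.efp[state][action])
            self.episode = []

The intent of such a combination is that each agent avoids actions that, according to its own failure statistics, are likely to break cooperation, which gives the group of learners a degree of mutual adaptability without explicit communication.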

Cite this article as:
K. Miyazaki, K. Furukawa, and H. Kobayashi, “Proposal of PSwithEFP and its Evaluation in Multi-Agent Reinforcement Learning,” J. Adv. Comput. Intell. Intell. Inform., Vol.21 No.5, pp. 930-938, 2017.
