Abstract:
Online learning is the process of providing online control decisions in sequential decision-making problems given (possibly partial) knowledge about the optimal controls for the past decision epochs. The purpose of this paper is to apply online learning techniques to finite-state, finite-action Markov Decision Processes (finite MDPs). We consider a multi-agent system composed of a learning agent and observed agents. The learning agent observes from the other agents the state probability distribution (pd) resulting from a stationary policy, but not the policy itself. The state pd is observed either directly from an observed agent or through the density distribution of the multi-agent system. We show that, using online learning, the learned policy performs at least as well as that of the observed agents. Specifically, this paper shows that if the observed agents are running an optimal policy, the learning agent can learn the optimal average-expected-cost MDP policies via online learning, by applying a gradient descent algorithm to the observed agents' pd data.
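As a concrete illustration of the idea (a minimal sketch, not the paper's algorithm), the following code sets up a hypothetical 2-state, 2-action MDP, computes the stationary state distribution induced by a softmax-parameterized policy, and runs gradient descent (with finite-difference gradients) on the policy logits so that the induced distribution matches an observed expert's stationary distribution. The transition matrix `P`, the expert policy, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative, not taken from the paper).
# P[s, a, s'] = probability of moving to state s' after taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # transitions from state 0, actions 0 and 1
    [[0.9, 0.1], [0.1, 0.9]],   # transitions from state 1, actions 0 and 1
])

def stationary_dist(theta, P, iters=100):
    """Stationary state distribution induced by the softmax policy with logits theta."""
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)  # softmax policy
    P_pi = np.einsum('sa,sat->st', pi, P)       # policy-induced transition matrix
    d = np.full(P.shape[0], 1.0 / P.shape[0])   # start from the uniform distribution
    for _ in range(iters):
        d = d @ P_pi                            # power iteration toward the fixed point
    return d

def loss(theta, P, d_obs):
    """Squared distance between the induced and the observed state distributions."""
    return np.sum((stationary_dist(theta, P) - d_obs) ** 2)

def learn(P, d_obs, lr=2.0, steps=300, eps=1e-5):
    """Gradient descent on policy logits using finite-difference gradients."""
    theta = np.zeros(P.shape[:2])
    for _ in range(steps):
        base = loss(theta, P, d_obs)
        grad = np.zeros_like(theta)
        for idx in np.ndindex(*theta.shape):
            t = theta.copy()
            t[idx] += eps
            grad[idx] = (loss(t, P, d_obs) - base) / eps
        theta -= lr * grad
    return theta

# The "observed" distribution comes from a (hypothetical) expert policy that
# strongly prefers action 1 in both states; only d_obs is seen by the learner.
expert_theta = np.array([[0.0, 5.0], [0.0, 5.0]])
d_obs = stationary_dist(expert_theta, P)
```

In this toy setting the learner never sees the expert's policy, only its stationary distribution `d_obs`; after training, the learned policy induces a distribution close to the observed one, matching the abstract's claim that the learned policy performs at least as well as the observed agents'.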
Date of Conference: 12-15 December 2017
Date Added to IEEE Xplore: 22 January 2018