Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations

https://doi.org/10.1016/j.asoc.2020.106795

Highlights

  • An off-policy actor-critic architecture is used in adversarial imitation learning (AIL).

  • The hindsight idea of variable reward (VR) is incorporated into our off-policy AIL framework.

  • The strategy of hindsight copy (HC) is designed for sampling demonstrations.

  • The convergence analysis of the proposed method is provided.

Abstract

The goal of imitation learning (IL) is to enable a robot to imitate expert behavior given expert demonstrations. Adversarial imitation learning (AIL) is a recent, successful IL architecture that has shown significant progress on complex continuous tasks, particularly robotic tasks. However, in most cases the acquisition of high-quality demonstrations is costly and laborious, which poses a significant challenge for AIL methods. Although generative adversarial imitation learning (GAIL) and its extensions have been shown to be robust to sub-optimal experts, it is difficult for them to surpass the performance of the expert by a large margin. To address this issue, we propose a novel off-policy AIL method called robust adversarial imitation learning (RAIL). To enable the agent to significantly outperform the sub-optimal expert providing the demonstrations, the hindsight idea of a variable reward (VR) is first incorporated into the off-policy AIL framework. Then, a strategy called hindsight copy (HC) of demonstrations is designed to provide the discriminator and the trained policy in the AIL framework with different demonstrations, maximizing the use of the demonstrations and speeding up learning. Experiments were conducted on two multi-goal robotic tasks to test the proposed method. The results show that our method is not limited by the quality of the expert demonstrations and can outperform other IL approaches.

Introduction

Reinforcement learning (RL) is a powerful and general framework that enables an agent to tackle complex continuous control tasks [1]. Over the past few years, RL has achieved performance surpassing that of humans in several application domains, such as video games [2], board games [3], robot manipulation [4], [5], and autonomous driving [6]. In the tasks where RL has been applied successfully, it is usually not difficult to design a reward function that indicates favorable behavior for the agent. For harder tasks [7], however, designing an appropriate reward function is time-consuming and difficult. Although a sparse reward is extremely easy to specify, in most complex continuous control tasks it cannot guide the agent toward effective exploration; instead, the learned policy tends to fall into local optima and rarely achieves the desired objective [8]. Imitation learning (IL) is an effective way to solve real-world problems for which it is difficult to design a reward function. The goal of IL is to enable the agent to imitate expert behavior given expert demonstrations, without a reward signal.
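As a concrete illustration (a common convention in multi-goal robotic benchmarks, not a formula taken from this paper), a sparse reward typically only signals whether the commanded goal g has been reached within some tolerance; here φ(·) denotes an assumed mapping from a state to the goal it achieves and ε an assumed task-specific tolerance:

    \[
    r(s_t, a_t, g) =
    \begin{cases}
    0, & \lVert \phi(s_{t+1}) - g \rVert \le \varepsilon, \\
    -1, & \text{otherwise}.
    \end{cases}
    \]

Such a reward is trivial to specify but provides no signal of progress until the goal is actually reached, which is exactly the exploration difficulty described above.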

A wide variety of IL methods have been proposed over the last few decades. The simplest is behavior cloning (BC) [9], [10], which learns an expert policy in a supervised fashion without interacting with the environment during training. BC can be the first choice for IL when a large number of high-quality demonstrations is available. However, when only a small number of demonstrations can be obtained, or when the quality of the demonstrations is low, this method fails to imitate expert behavior owing to compounding errors [11]. Because it is often difficult to provide sufficient high-quality demonstrations in a real-world environment, the applicability of BC is limited.
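For concreteness, the following is a minimal behavior-cloning sketch: a small policy network is regressed onto expert state-action pairs by supervised learning. It assumes a continuous-action task; the network sizes and the synthetic demonstration tensors are placeholders rather than the setup used in this paper.

    # Minimal behavior-cloning sketch: supervised regression of actions onto
    # expert (state, action) pairs. Dimensions and data are placeholders.
    import torch
    import torch.nn as nn

    state_dim, action_dim = 10, 4
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, action_dim), nn.Tanh(),   # actions assumed to lie in [-1, 1]
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Placeholder demonstrations; in practice these come from the expert.
    expert_states = torch.randn(1000, state_dim)
    expert_actions = torch.randn(1000, action_dim).clamp(-1.0, 1.0)

    for epoch in range(100):
        predicted = policy(expert_states)
        loss = nn.functional.mse_loss(predicted, expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because such a policy is only ever trained on expert-visited states, small prediction errors push it into states the expert never visited, where its errors compound; this is the failure mode noted above.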

Another widely used IL approach is inverse reinforcement learning (IRL) [12], [13], [14], [15]. Instead of copying the expert behavior directly, IRL iteratively learns a reward function by assuming that the expert policy is optimal. Compared with BC, IRL alleviates the compounding-error problem and uses expert demonstrations more sample-efficiently [16], [17]. However, because the IRL problem is known to be ill-posed and multiple reward functions can explain the same observed expert behavior, careful hand-engineering of the reward function is required. Furthermore, IRL algorithms need to solve an RL problem in their inner loop; this heavy computational cost makes them difficult to apply to complex tasks with high-dimensional spaces.
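The ill-posedness has a simple formal expression. In a standard IRL formulation (not a formula quoted from this paper), the goal is to find a reward r under which the expert policy π_E is optimal, i.e.

    \[
    \mathbb{E}_{\pi_E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]
    \;\ge\;
    \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]
    \quad \text{for all policies } \pi,
    \]

and the degenerate reward r ≡ 0 satisfies this condition for any expert, so additional structure or regularization on r is needed to obtain a meaningful solution.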

In recent years, a new IL method based on generative adversarial networks (GANs) [18] has emerged, namely generative adversarial imitation learning (GAIL) [19]. This method incorporates RL into the GAN framework: the generator network uses RL to produce a policy, and the discriminator network is trained to distinguish generated behavior from expert behavior, so that the generated policy converges to the expert policy. Because GAIL has achieved state-of-the-art performance on numerous complex robotic tasks [20], [21], [22], researchers have shown significant interest in adversarial imitation learning (AIL) algorithms. To improve the sample efficiency of AIL, numerous studies have leveraged off-policy RL algorithms instead of the original on-policy RL algorithms for policy generation [23], [24], [25], [26]. In addition, GASIL [27], [28] combines the idea of self-imitation with GAIL to realize imitation learning without demonstration data. HGAIL [29] introduced hindsight experience replay (HER) [30] into GAIL to successfully deal with goal-conditioned tasks. Clearly, the AIL framework has become a popular choice for IL.
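The following is a minimal sketch of this adversarial loop: the discriminator is trained to separate expert state-action pairs from policy-generated ones, and its output is converted into a surrogate reward for the RL generator. The network sizes and the particular surrogate reward -log(1 - D(s, a)) are common choices from the GAIL literature and are assumptions here, not the exact configuration of this paper.

    # Sketch of an AIL discriminator and the surrogate reward it provides.
    import torch
    import torch.nn as nn

    state_dim, action_dim = 10, 4
    discriminator = nn.Sequential(
        nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
        nn.Linear(64, 1), nn.Sigmoid(),           # D(s, a) in (0, 1)
    )
    d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

    def discriminator_step(expert_sa, policy_sa):
        """Train D to label expert pairs as 1 and policy pairs as 0."""
        d_expert = discriminator(expert_sa)
        d_policy = discriminator(policy_sa)
        loss = nn.functional.binary_cross_entropy(d_expert, torch.ones_like(d_expert)) \
             + nn.functional.binary_cross_entropy(d_policy, torch.zeros_like(d_policy))
        d_optimizer.zero_grad()
        loss.backward()
        d_optimizer.step()

    def surrogate_reward(policy_sa):
        """Reward passed to the RL generator: high when D mistakes the policy for the expert."""
        with torch.no_grad():
            d = discriminator(policy_sa)
        return -torch.log(1.0 - d + 1e-8)

In an off-policy variant, the (s, a) pairs fed to surrogate_reward are drawn from a replay buffer rather than from fresh on-policy rollouts, which is the source of the sample-efficiency gain referred to above.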

Unfortunately, the available expert is not always perfectly optimal, and existing AIL methods cannot outperform a sub-optimal expert, which poses a challenge to GAIL and its extensions. In [31], the author showed that GAIL can learn the optimal policy from a sub-optimal expert; however, the sub-optimal expert there is derived from an optimal expert, and the distribution of its demonstrations is not far from that of the optimal demonstrations. Furthermore, when there are no good experiences for the agent to exploit, GASIL cannot learn a good policy through self-imitation either.

In this paper, to address the challenges mentioned above, we propose a novel AIL method, called robust adversarial imitation learning (RAIL), which is a good choice for tasks in which high-quality demonstrations cannot be obtained. As shown in Fig. 1, the proposed method adopts a new off-policy AIL framework that improves sample efficiency. To avoid limiting the agent by the quality of the demonstrations and to allow it to outperform the sub-optimal expert providing them, we incorporate the hindsight idea of a variable reward into our off-policy AIL framework. Furthermore, through a new technique called hindsight copy, two different forms of demonstrations are leveraged to speed up learning. To test the proposed method, three experiments were conducted on two multi-goal robotic tasks. The results show that the proposed method can efficiently accomplish both robotic tasks and achieves better performance than other methods.
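To make the hindsight idea concrete, the sketch below shows a generic HER-style relabeling step on an off-policy replay buffer: a stored transition is copied with its goal replaced by a goal that was actually achieved later in the same episode, so that even failed episodes yield useful learning signal. This only illustrates the general relabeling mechanism under an assumed multi-goal transition format; it is not the exact variable-reward (VR) or hindsight-copy (HC) procedure proposed in this paper.

    # Generic hindsight relabeling of one episode (illustrative, not the paper's VR/HC).
    import random

    def relabel_episode(episode):
        """episode: list of dicts with keys 'obs', 'action', 'achieved_goal', 'goal'."""
        relabeled = []
        for t, transition in enumerate(episode):
            future = random.choice(episode[t:])   # a goal actually achieved later on
            copy = dict(transition)
            copy["goal"] = future["achieved_goal"]
            relabeled.append(copy)
        return relabeled

    # Both the original and the relabeled transitions go into the off-policy
    # replay buffer sampled by the actor-critic and the discriminator.
    episode = [
        {"obs": [0.0], "action": [0.1], "achieved_goal": [0.1], "goal": [1.0]},
        {"obs": [0.1], "action": [0.2], "achieved_goal": [0.3], "goal": [1.0]},
    ]
    replay_buffer = list(episode) + relabel_episode(episode)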

The remainder of this paper is organized as follows. Section 2 reviews the background. Section 3 introduces the proposed method in three parts in detail and provides a convergence analysis of the algorithm. Section 4 describes the experimental setup, network architecture, and hyperparameters, and discusses the experimental results. Finally, Section 5 gives a summary and outlines future work.

Section snippets

Preliminaries

Given an agent interacting with the environment and assuming that the environment is fully observable, a Markov decision process is defined as a tuple (S, A, p, r, γ), where S is a set of states, A is a set of actions, p(s_{t+1} | s_t, a_t) are the transition probabilities, r : S × A → ℝ is a reward function, and γ ∈ [0, 1] is a discount factor. A policy π maps a state to an action, π : S → A. At the beginning of each episode, the initial state s_0 is sampled from the distribution p(s_0). At each timestep t, the agent

Method

In this section, to enable the agent in adversarial imitation learning (AIL) to outperform the sub-optimal expert providing demonstrations, we use a new experience replay method, called hindsight experience replay of variable reward, to provide additional rewards for the agent. Then, a strategy called hindsight copy of expert demonstrations is designed to maximize the utilization of the demonstrations and accelerate agent learning in AIL. Next, we propose an efficient off-policy AIL method,

Experiments

In this section, we first describe the experimental setup in detail. Then, the network architecture and hyperparameters of our method are given. Finally, to test our method, three experiments were conducted on two tasks, as shown in Fig. 3, to answer the following questions:

  • Can our method be robust to the quality of demonstrations and learn effectively from low-quality demonstrations?

  • Are a variable reward (VR) and hindsight copy (HC) two effective techniques to improve the performance of our

Conclusion

In this paper, we proposed a novel adversarial imitation learning method, RAIL, that can learn from low-quality demonstrations and eventually outperform a sub-optimal expert by a large margin. To improve sample efficiency, we proposed a new off-policy framework for adversarial imitation learning and designed a new experience replay method, VRHER, which provides an additional reward for the agent so that it can outperform the sub-optimal expert. Meanwhile, a

CRediT authorship contribution statement

Guoyu Zuo: Conceptualization, Methodology, Writing - review & editing, Funding acquisition. Qishen Zhao: Methodology, Software, Writing - original draft, Investigation. Kexin Chen: Formal analysis, Investigation, Data curation. Jiangeng Li: Writing - review & editing, Supervision. Daoxiong Gong: Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (42)

  • Sutton R.S. et al., Reinforcement learning: An introduction, IEEE Trans. Neural Netw. (1998)
  • Volodymyr M. et al., Human-level control through deep reinforcement learning, Nature (2015)
  • Silver D. et al., Mastering the game of Go without human knowledge, Nature (2017)
  • Andrychowicz M. et al., Learning dexterous in-hand manipulation (2018)
  • Levine S. et al., End-to-end training of deep visuomotor policies, J. Mach. Learn. Res. (2015)
  • M. Kuderer, S. Gulati, W. Burgard, Learning driving styles for autonomous vehicles from demonstration, in: IEEE...
  • A.H. Qureshi, B. Boots, M.C. Yip, Adversarial imitation via variational inverse reinforcement learning, in:...
  • Večerík M. et al., Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards (2017)
  • D.A. Pomerleau, Alvinn: An autonomous land vehicle in a neural network, in: Advances in Neural Information Processing...
  • Pomerleau D.A., Efficient training of artificial neural networks for autonomous navigation, Neural Comput. (1991)
  • S. Ross, D. Bagnell, Efficient reductions for imitation learning, in: Proceedings of the Thirteenth International...
  • S.J. Russell, Learning agents for uncertain environments, in: COLT, Vol. 98, 1998, pp....
  • A.Y. Ng, S. Russell, Algorithms for inverse reinforcement learning, in: International Conference on Machine Learning,...
  • P. Abbeel, A.Y. Ng, Apprenticeship learning via inverse reinforcement learning, in: International Conference on Machine...
  • Ziebart B.D. et al., Maximum Entropy Inverse Reinforcement Learning, Vol. 36 (5) (2008)
  • P. Abbeel, D. Dolgov, A.Y. Ng, S. Thrun, Apprenticeship learning for motion planning with application to parking lot...
  • M. Kuderer, H. Kretzschmar, W. Burgard, Teaching mobile robots to cooperatively navigate in populated environments, in:...
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative...
  • J. Ho, S. Ermon, Generative adversarial imitation learning, in: Advances in Neural Information Processing Systems,...
  • Stadie B.C. et al., Third-person imitation learning (2017)
  • J. Merel, Y. Tassa, T.B. Dhruva, S. Srinivasan, N. Heess, Learning human behaviors from motion capture by adversarial...

This work is the result of a research project funded by the National Natural Science Foundation of China (61873008) and the Beijing Natural Science Foundation (4182008).
