Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations
Introduction
Reinforcement learning (RL) is a powerful and general framework that enables an agent to tackle complex continuous control tasks [1]. Over the past few years, RL has achieved performance surpassing that of humans in several application domains, such as video games [2], board games [3], robot manipulation [4], [5], and autonomous driving [6]. In the tasks where RL has been applied successfully, it is not difficult to design a reward function that indicates favorable behaviors for the agent. For harder tasks [7], however, designing an appropriate reward function is difficult and time-consuming. Although a sparse reward is extremely easy to specify, in most complex continuous control tasks it cannot guide the agent toward effective exploration; instead, the learned policy tends to fall into local optima and fails to achieve the desired objective [8]. Imitation learning (IL) is an effective way to solve real-world problems for which it is difficult to design a reward function. The goal of IL is to enable the agent to imitate expert behavior given expert demonstrations, without a reward signal.
A wide variety of IL methods have been proposed over the last few decades. The simplest is behavior cloning (BC) [9], [10], which learns an expert policy in a supervised fashion without environmental interaction during training. BC can be the first IL option when a large number of high-quality demonstrations is available. However, when only a few demonstrations can be obtained, or when the demonstrations are of low quality, BC fails to imitate expert behavior owing to compounding errors [11]. Because sufficient high-quality demonstrations are often difficult to collect in real-world environments, the applicability of BC is limited.
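For concreteness, BC reduces to ordinary supervised regression from states to expert actions. The following is a minimal sketch in PyTorch; the network shape, dimensions, and synthetic "demonstrations" are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Minimal behavior-cloning sketch: fit a policy to expert (state, action)
# pairs by supervised regression, with no environment interaction.
state_dim, action_dim = 10, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

demo_states = torch.randn(1000, state_dim)                  # stand-in expert states
demo_actions = torch.randn(1000, action_dim).clamp(-1, 1)   # stand-in expert actions

for epoch in range(100):
    pred = policy(demo_states)                  # predicted actions
    loss = ((pred - demo_actions) ** 2).mean()  # MSE to expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Compounding errors arise precisely because this objective only matches actions on states the expert visited; once the learned policy drifts to unfamiliar states, its errors accumulate.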
Another widely used IL method is inverse reinforcement learning (IRL) [12], [13], [14], [15]. Instead of copying the expert behavior directly, IRL learns a reward function under the assumption that the expert policy is optimal. Compared with BC, IRL overcomes the problem of compounding errors and improves sample efficiency with respect to expert demonstrations [16], [17]. However, the IRL problem is known to be ill-posed: multiple reward functions can explain a given observed expert behavior, so careful hand-engineering of the reward function is required. Furthermore, an IRL algorithm must solve a full RL problem in its inner loop; this large computational cost makes IRL difficult to apply to complex tasks with high-dimensional spaces.
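To make the inner-loop cost concrete, here is a toy MaxEnt-style feature-matching loop; it sketches the generic nested structure of IRL, not the specific algorithms of [12], [13], [14], [15]. All quantities are synthetic, and the closed-form softmax "policy" stands in for what would normally be a full RL solve on every outer iteration.

```python
import numpy as np

# Toy sketch of the nested IRL loop: each outer reward update would
# normally require solving a complete RL problem under the current reward.
n_states, n_features = 5, 3
phi = np.random.rand(n_states, n_features)                  # state features (hypothetical)
expert_visitation = np.random.dirichlet(np.ones(n_states))  # from expert demos

w = np.zeros(n_features)                                    # linear reward weights
for _ in range(50):                                         # outer loop: reward update
    reward = phi @ w                                        # r(s) = w . phi(s)
    # Inner "RL solve" placeholder: softmax state visitation under reward.
    # A real IRL method would run value iteration or policy optimization here.
    policy_visitation = np.exp(reward) / np.exp(reward).sum()
    # Gradient step: match expert feature expectations (MaxEnt-style).
    w += 0.1 * phi.T @ (expert_visitation - policy_visitation)
```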
In recent years, building on generative adversarial networks (GANs) [18], a new IL method has emerged: generative adversarial imitation learning (GAIL) [19]. This method incorporates RL into the GAN framework: the generator network uses RL to produce a policy, and the discriminator network learns to distinguish the generated policy from the expert policy, such that the generated policy converges toward the expert policy. Because GAIL has achieved state-of-the-art performance on numerous complex robotics tasks [20], [21], [22], researchers have shown significant interest in adversarial imitation learning (AIL) algorithms. To improve the sample efficiency of AIL, numerous studies have leveraged off-policy RL algorithms for policy generation in place of the original on-policy ones [23], [24], [25], [26]. In addition, GASIL [27], [28] combines the idea of self-imitation with GAIL to realize imitation learning without demonstration data, and HGAIL [29] introduced hindsight experience replay (HER) [30] into GAIL to successfully handle goal-conditioned tasks. Clearly, the AIL framework has become a popular choice for IL.
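The following minimal sketch shows one GAIL-style update step: the discriminator is trained to separate expert state-action pairs from policy-generated ones, and its output is converted into a surrogate reward for the RL generator. Dimensions, batch contents, and the specific reward form are illustrative assumptions, not details of this paper's method.

```python
import torch
import torch.nn as nn

# One GAIL-style discriminator update on concatenated (state, action) pairs.
state_dim, action_dim, batch = 10, 4, 128
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                     nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

expert_sa = torch.randn(batch, state_dim + action_dim)  # stand-in expert batch
policy_sa = torch.randn(batch, state_dim + action_dim)  # stand-in policy batch

# Discriminator step: expert pairs labeled 1, policy pairs labeled 0.
logits = torch.cat([disc(expert_sa), disc(policy_sa)])
labels = torch.cat([torch.ones(batch, 1), torch.zeros(batch, 1)])
loss = bce(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()

# Surrogate reward for the RL generator (one common GAIL form):
with torch.no_grad():
    reward = -torch.log(1.0 - torch.sigmoid(disc(policy_sa)) + 1e-8)
```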
Unfortunately, demonstrations cannot always be obtained from a perfectly optimal expert, and existing AIL methods cannot outperform the sub-optimal expert that provides them, which poses a challenge to GAIL and its extensions. In [31], the authors showed that GAIL can learn the optimal policy from a sub-optimal expert, but only when the sub-optimal expert is derived from an optimal expert and the distribution of the demonstrations it generates is not far from that of the optimal demonstrations. Furthermore, when there are no good experiences for the agent to exploit, it is likewise impossible for GASIL to learn a good policy through self-imitation.
In this paper, to address the challenges mentioned above, we propose a novel AIL method, called robust adversarial imitation learning (RAIL), which is well suited to tasks for which high-quality demonstrations cannot be obtained. As shown in Fig. 1, the proposed method adopts a new off-policy AIL framework that improves sample efficiency. To prevent the agent from being limited by the quality of the demonstrations and to allow it to outperform the sub-optimal expert providing them, we incorporate the hindsight idea of a variable reward into our off-policy AIL framework. Furthermore, through a new technique called hindsight copy, two different forms of demonstrations are leveraged to speed up learning. To test the proposed method, three experiments were conducted in a multi-goal environment of robotic tasks. The results show that the proposed method can efficiently accomplish the two robotic tasks and achieves better performance than other methods.
The remainder of this paper is organized as follows. Section 2 reviews the background. Section 3 introduces the proposed method in three parts and presents a convergence analysis of the algorithm. Section 4 describes the experimental setup, network architecture, and hyperparameters, and discusses the experimental results. Finally, Section 5 presents a summary and directions for future work.
Section snippets
Preliminaries
Given an agent interacting with the environment and assuming that the environment is fully observable, a Markov decision process is defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ are the transition probabilities, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1]$ is a discount factor. A policy $\pi$ maps a state to an action, $a_t = \pi(s_t)$. At the beginning of each episode, the initial state $s_0$ is sampled from the distribution $p(s_0)$. At each timestep $t$, the agent
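The snippet breaks off before stating the objective; the standard discounted-return objective consistent with the definitions above (a generic RL formulation, not a claim about this paper's exact notation) is:

```latex
% Expected discounted return; the agent seeks a policy \pi maximizing J(\pi).
J(\pi) = \mathbb{E}_{s_0 \sim p(s_0)}
         \left[ \sum_{t=0}^{\infty} \gamma^{t}\, r\bigl(s_t, \pi(s_t)\bigr) \right]
```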
Method
In this section, to enable the agent in adversarial imitation learning (AIL) to outperform the sub-optimal expert providing the demonstrations, we use a new experience replay method, called hindsight experience replay of variable reward, to provide additional rewards for the agent. Then, a strategy called hindsight copy of expert demonstrations is designed to maximize the utilization of demonstrations and accelerate the agent's learning in AIL. Next, we propose an efficient off-policy AIL method,
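For orientation, here is a hedged sketch of HER-style goal relabeling [30], the mechanism on which the paper's variable-reward replay builds. The exact variable-reward rule of VRHER is not given in this snippet, so `reward_fn` below is a generic sparse goal-reaching placeholder, and all names are illustrative.

```python
import numpy as np

def reward_fn(achieved_goal, goal, tol=0.05):
    # Sparse reward: 0 when the goal is achieved, -1 otherwise.
    return 0.0 if np.linalg.norm(achieved_goal - goal) < tol else -1.0

def relabel_episode(episode, replay_buffer, k=4):
    """Store each transition with its original goal, plus k extra copies
    whose goal is replaced by a goal actually achieved later in the episode
    (the "future" strategy), with the reward recomputed accordingly."""
    T = len(episode)
    for t, (s, a, s_next, goal, achieved) in enumerate(episode):
        replay_buffer.append((s, a, reward_fn(achieved, goal), s_next, goal))
        for future in np.random.randint(t, T, size=k):
            new_goal = episode[future][4]        # achieved goal, in hindsight
            r = reward_fn(achieved, new_goal)    # often 0 after relabeling
            replay_buffer.append((s, a, r, s_next, new_goal))
```

Relabeling turns failed episodes into useful learning signal, which is what allows hindsight-based replay to supply rewards beyond those implied by the (possibly low-quality) demonstrations.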
Experiments
In this section, we first describe the experimental setup in detail. Then, the network architecture and hyperparameters of our method are given. Finally, to test our method, three experiments were conducted on two tasks, as shown in Fig. 3, to answer the following questions:
- Can our method be robust to the quality of demonstrations and learn effectively from low-quality demonstrations?
- Are variable reward (VR) and hindsight copy (HC) effective techniques for improving the performance of our
Conclusion
In this paper, we proposed a novel adversarial imitation learning method, RAIL, which can learn from low-quality demonstrations and eventually outperform a sub-optimal expert by a large margin. To improve the sample efficiency of the method, we proposed a new off-policy framework for adversarial imitation learning and designed a new experience replay method, VRHER, to provide an additional reward for the agent, so that the agent can outperform the sub-optimal expert. Meanwhile, a
CRediT authorship contribution statement
Guoyu Zuo: Conceptualization, Methodology, Writing - review & editing, Funding acquisition. Qishen Zhao: Methodology, Software, Writing - original draft, Investigation. Kexin Chen: Formal analysis, Investigation, Data curation. Jiangeng Li: Writing - review & editing, Supervision. Daoxiong Gong: Project administration, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (42)
- R.S. Sutton, A.G. Barto, Reinforcement learning: An introduction, IEEE Trans. Neural Netw. (1998)
- V. Mnih, et al., Human-level control through deep reinforcement learning, Nature (2015)
- D. Silver, et al., Mastering the game of Go without human knowledge, Nature (2017)
- M. Andrychowicz, et al., Learning dexterous in-hand manipulation (2018)
- S. Levine, et al., End-to-end training of deep visuomotor policies, J. Mach. Learn. Res. (2015)
- M. Kuderer, S. Gulati, W. Burgard, Learning driving styles for autonomous vehicles from demonstration, in: IEEE…
- A.H. Qureshi, B. Boots, M.C. Yip, Adversarial imitation via variational inverse reinforcement learning, in:…
- M. Vecerik, et al., Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards (2017)
- D.A. Pomerleau, ALVINN: An autonomous land vehicle in a neural network, in: Advances in Neural Information Processing…
- D.A. Pomerleau, Efficient training of artificial neural networks for autonomous navigation, Neural Comput. (1991)
- B.D. Ziebart, et al., Maximum entropy inverse reinforcement learning, Vol. 36 (5)
- B.C. Stadie, et al., Third-person imitation learning