Transfer reinforcement learning via meta-knowledge extraction using auto-pruned decision trees
Introduction
Reinforcement learning (RL) aims to learn a policy that optimizes a long-term performance index in sequential decision tasks, which are usually modeled as Markov decision processes (MDPs) [1]. Earlier research on RL algorithms focused on MDPs with discrete state and action spaces, but many real-world problems involve MDPs with continuous or high-dimensional state and action spaces. Therefore, value function approximation (VFA) and policy function approximation (PFA) have become major research topics in RL, aiming to enhance the learning efficiency and generalization ability of traditional tabular RL algorithms [2]. To date, various feature representation methods have been proposed for VFA and PFA, such as neural networks, kernel methods [3] and manifold methods [2]. In the past decade, deep reinforcement learning (DRL) algorithms, which use deep neural networks as value or policy function approximators, have been widely studied. DRL algorithms have achieved state-of-the-art performance in computer games (e.g. Atari [4] and Go [5]), and promising results have been obtained in more complex control tasks [6]. However, the lack of transferability and interpretability remains a key obstacle to the wide application of RL algorithms in real-world tasks.
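To make the contrast with function approximation concrete, the following is a minimal sketch of tabular RL: a Q-learning update on a toy deterministic chain MDP. This is an illustrative example only (the environment and all constants are our own), not a task from the paper; the point is that the tabular value estimate cannot generalize beyond the enumerated states, which is what motivates VFA/PFA.

```python
import numpy as np

# Minimal tabular Q-learning on a toy 5-state chain (illustrative sketch only).
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9             # learning rate, discount factor

def step(s, a):
    """Deterministic chain: reward 1.0 whenever the last state is reached."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

rng = np.random.default_rng(0)
for _ in range(500):                # episodes of bounded length
    s = 0
    for _ in range(20):
        a = int(rng.integers(n_actions))   # uniform exploration
        s_next, r = step(s, a)
        # Q-learning temporal-difference update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[0].argmax())                # greedy action at the start state
```

At convergence the greedy action at every state is "right", since Q(s, left) equals a discounted copy of the best value at the previous state.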
The notion of transfer capability in learning algorithms originates from research on transfer learning in pattern recognition, where a classifier or regression model is trained in a source domain and the learned features and model are expected to transfer to a target domain. In RL, by contrast, transfer capability requires that a control policy learned in a source MDP can accelerate the learning control process in target MDPs with different dynamics. This kind of transfer RL is usually called inter-task transfer RL [7]. Transfer capability is important not only for applying RL across learning control tasks that share similarities, but also for transferring learned policies from simulated tasks to real-world problems. In practice, a policy trained by RL in a simulator often fails to work well in a real-world control problem whose dynamics differ even slightly. Knowledge transfer from policies pre-trained in source MDPs did not gain much attention until recently [8], [9], [10]. However, many transfer RL algorithms are designed to pre-train a policy on a group of source MDPs, or even on a whole distribution of MDPs, which is impractical for some real problems. Moreover, previous algorithms usually treat all knowledge obtained in a source MDP equally, rather than distinguishing the importance of different pieces of knowledge.
Beyond the above issues in improving the transfer capability of RL, policies learned by RL with generic nonlinear function approximators such as neural networks are in fact black-box models. In many safety-critical applications such as autonomous vehicles, it is necessary to know how decisions are generated by the learned control policies of RL algorithms [11]. Some recent works have stepped toward making policies interpretable and transparent [12], [13], but these works did not improve policy performance alongside interpretability, and some even harmed performance.
By comparison, humans are good at summarizing the experience they have accumulated. When facing a similar task again, humans can make decisions with intuitive explanations. Such transferable explanations represent general rules for a certain type of task, which can be regarded as meta-knowledge. In fact, humans can make decisions through a few simple judgments, without requiring many features or accurate values for each feature, whereas an RL-trained policy requires complex numerical computations. Therefore, it is promising to develop human-like meta-knowledge extraction methods for RL by retraining a transparent model, so that meta-knowledge can be learned by mimicking a trained policy.
Motivated by the above, in this paper we propose a novel transfer reinforcement learning approach via meta-knowledge extraction using auto-pruned decision trees. Based on data samples generated by policies pre-trained in source MDPs via an RL algorithm, a meta-knowledge extraction method is developed that retrains an auto-pruned decision tree. By estimating the uncertainty of state–action pairs in the pre-trained policies, based on the entropy of leaf nodes and the similarity between the source and target MDPs, the state spaces of the meta-knowledge are determined. Then, a hybrid policy is generated by integrating the extracted meta-knowledge with the policies learned on the target MDPs, judging whether the observed state lies in the state spaces of the meta-knowledge. Based on the proposed transfer RL approach, two meta-knowledge-based transfer reinforcement learning (MKRL) algorithms are developed for MDPs with discrete and continuous action spaces. We demonstrate the effectiveness of our approach on seven learning tasks and their variants in Gym and MuJoCo.
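A minimal sketch of the extraction step described above, under stated assumptions: we fit a decision tree to (state, action) pairs sampled from a pre-trained policy and keep only low-entropy leaves as "confident" meta-knowledge regions. scikit-learn's cost-complexity pruning (`ccp_alpha`) stands in for the paper's auto-pruning, and the mock expert policy, the data, and the 0.3 entropy threshold are hypothetical choices of ours, not values from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Fit a pruned tree to (state, action) samples from a (mock) expert policy.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(2000, 4))                # mock 4-D states
actions = (states[:, 2] + 0.5 * states[:, 3] > 0).astype(int)  # mock expert labels

tree = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=0).fit(states, actions)

def leaf_entropy(tree, X):
    """Shannon entropy of the class distribution in each sample's leaf."""
    leaves = tree.apply(X)
    counts = tree.tree_.value[leaves, 0]          # per-leaf class statistics
    p = counts / counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.nansum(np.where(p > 0, p * np.log(p), 0.0), axis=1)

entropy = leaf_entropy(tree, states)
confident = entropy < 0.3                         # hypothetical threshold
print(f"{confident.mean():.0%} of states fall in confident leaves")
```

States in confident leaves would be answered by the tree (the meta-knowledge); the remaining states would be left to the RL policy trained on the target MDP.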
The main contributions of this paper can be summarized in two aspects. First, a novel transfer RL approach is proposed based on meta-knowledge extraction, which can be easily integrated with different RL algorithms. It is verified that the meta-knowledge extracted using auto-pruned decision trees can represent transferable and interpretable policies that are suitable for both source and target MDPs. Furthermore, it is shown that the learning process in target MDPs with different dynamics can be significantly accelerated. Second, two novel transfer RL algorithms, MKA3C and MKPPO, are proposed for MDPs with discrete and continuous action spaces, respectively. Comprehensive experiments were conducted on several learning tasks and their variants in Gym and MuJoCo. The experimental results show that the proposed MKA3C and MKPPO outperform previous baselines in both transfer learning efficiency and interpretability.
The remainder of the paper is organized as follows. In Section 2, related work is discussed. In Section 3, the RL problem and the transfer RL setting are formulated. The proposed algorithms for learning the meta-knowledge and improving both the transfer capability and the interpretability of RL are presented in Section 4. In Section 5, comprehensive performance evaluations and comparisons are conducted on seven typical tasks and their variants. Section 6 concludes the paper and suggests future work.
Related work
In this paper, we focus on the inter-task transfer RL setting with changing dynamics. Solutions for such transfer RL can be summarized into three categories: representation-based, instance-based and multi-task-based.
The representation-based transfer RL adapts the feature of knowledge representation of a target MDP by leveraging the high-level representation from the source MDPs. Policy distillation (PD) [14], [15] is a popular method for learning the representation of the source MDPs. A typical
RL and transfer RL
Before introducing the transfer RL problem, some basic definitions of RL are given first. RL algorithms learn a near-optimal policy by interacting with the environment. The learning process can be modeled as a Markov decision process (MDP), which comprises a state space S and an action space A, where action spaces fall into two categories: continuous and discrete. The transition dynamics are described by a distribution with conditional probability density p(s' | s, a) satisfying the Markov property.
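In standard notation, the objective implied by these definitions can be written as the expected discounted return (this is the standard formulation, reconstructed rather than quoted from the paper):

```latex
J(\pi) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)
  \;\middle|\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t) \right],
\qquad \gamma \in [0, 1)
```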
Meta-knowledge extraction for transfer reinforcement learning
In this section, a novel meta-knowledge extraction method based on auto-pruned decision trees is first proposed for MDPs. Then, two meta-knowledge-based transfer reinforcement learning (MKRL) algorithms are designed for MDPs with discrete and continuous action spaces, which take advantage of both the meta-knowledge and RL algorithms to generate an integrated policy. Fig. 2 illustrates the overall setup. Finally, some discussion and performance analysis of MKRL are given.
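The integration step described above can be sketched as a simple dispatch rule: if the observed state lands in a confident (low-entropy) leaf of the extracted tree, act from the tree; otherwise defer to the RL policy being trained on the target MDP. This is a hedged sketch, not the paper's exact algorithm; `tree` is assumed to be a fitted scikit-learn classifier, and `rl_policy` and the 0.3 threshold are placeholders.

```python
import numpy as np

def hybrid_action(state, tree, rl_policy, entropy_threshold=0.3):
    """Pick the tree's action in confident leaves, else the RL policy's."""
    state = np.asarray(state, dtype=float).reshape(1, -1)
    leaf = tree.apply(state)[0]
    counts = tree.tree_.value[leaf, 0]        # class statistics of this leaf
    p = counts / counts.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.nansum(np.where(p > 0, p * np.log(p), 0.0))
    if entropy < entropy_threshold:           # state lies in a meta-knowledge region
        return int(tree.predict(state)[0])
    return rl_policy(state)                   # defer to the target-MDP policy
```

In a training loop, `rl_policy` would be the current actor network; only the states it actually handles contribute gradients, while confident regions are covered by the transferred tree.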
Performance evaluation and comparisons
The aim of transfer RL is to improve the learning performance of the target MDPs in three metrics including the jump-start improvement, learning speed improvement and asymptotic improvement [41]. According to the three metrics, we conduct experiments for answering the following questions. (1) When the dynamics of the target MDPs are similar to the source MDPs, can MKRL with auto-pruned decision trees perform well? (2) When the similarity between the target MDPs and the source MDPs is low, can
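The three metrics above can be computed from per-episode return curves of a transfer agent and a from-scratch baseline. A minimal sketch, assuming equal-length curves; the window sizes and the area-between-curves proxy for learning-speed improvement are our own choices, not values from [41]:

```python
import numpy as np

def transfer_metrics(transfer_curve, scratch_curve, head=10, tail=10):
    """Jump-start, learning-speed, and asymptotic improvement from return curves."""
    t = np.asarray(transfer_curve, dtype=float)
    s = np.asarray(scratch_curve, dtype=float)
    jump_start = t[:head].mean() - s[:head].mean()   # initial advantage
    learning_speed = t.sum() - s.sum()               # area between the curves
    asymptotic = t[-tail:].mean() - s[-tail:].mean() # final advantage
    return jump_start, learning_speed, asymptotic
```

For example, a transfer agent that starts higher but converges to the same return as the baseline yields positive jump-start and learning-speed improvements and a near-zero asymptotic improvement.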
Conclusion
How to improve the transfer capability and interpretability of RL algorithms is an important and challenging problem, especially for MDPs with changing state transition dynamics. In this paper, we propose a novel transfer RL approach based on meta-knowledge extraction with auto-pruned decision trees, which can extract transferable and interpretable policies. Then, the meta-knowledge-based transfer reinforcement learning (MKRL) algorithm is proposed for target MDPs in generic cases of
CRediT authorship contribution statement
Yixing Lan: Conceptualization, Methodology, Software. Xin Xu: Conceptualization, Funding acquisition, Writing – review & editing. Qiang Fang: Writing – original draft, Methodology. Yujun Zeng: Writing – original draft, Methodology. Xinwang Liu: Writing – review & editing. Xianjian Zhang: Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (46)
- Explainability in deep reinforcement learning, Knowl.-Based Syst. (2021)
- Interpretable policies for reinforcement learning by genetic programming, Eng. Appl. Artif. Intell. (2018)
- Reinforcement Learning: An Introduction (2018)
- Manifold-based reinforcement learning via locally linear reconstruction, IEEE Trans. Neural Netw. Learn. Syst. (2017)
- Online learning control using adaptive critic designs with sparse kernel machines, IEEE Trans. Neural Netw. Learn. Syst. (2013)
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, Koray ...
- Mastering the game of Go without human knowledge, Nature (2017)
- Proximal policy optimization algorithms (2017)
- Transfer Learning (2020)
- Universal successor features approximators
- Sequential transfer in reinforcement learning with a generative model
- REPAINT: knowledge transfer in deep actor-critic reinforcement learning
- Verifiable reinforcement learning via policy extraction
- Towards interpretable reinforcement learning with state abstraction driven by external knowledge, IEICE Trans. Inf. Syst.
- Knowledge transfer for deep reinforcement learning with hierarchical experience replay
- Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards
- Transfer in deep reinforcement learning using successor features and generalised policy improvement
- Transfer learning for user adaptation in spoken dialogue systems
- Model-agnostic meta-learning for fast adaptation of deep networks