Abstract
Deep Reinforcement Learning (DRL) algorithms applied to autonomous driving tasks often perform unsatisfactorily because they lack robustness and a means to escape local optima. In this article, we design a Survival-Oriented Reinforcement Learning (SORL) model that tackles these problems by making survival, rather than maximizing total reward, the first priority. In the SORL model, we formulate the autonomous driving task as a Constrained Markov Decision Process (CMDP) and introduce a Negative-Avoidance Function to learn from previous failures. The SORL model greatly speeds up the training process and improves the robustness of standard Deep Reinforcement Learning algorithms.
1 Introduction
Deep Reinforcement Learning (DRL), an approach that uses deep neural networks inside Reinforcement Learning (RL) methods, has been very successful in control problems in recent years. With a well-designed reward, programs can learn to tackle complex tasks at human level, such as playing Atari games from image input [1] and playing the game of Go [2]. Subsequently, a number of new reinforcement learning algorithms [3,4,5] have been developed that further improve the performance and robustness of the learning program. Inspired by the success of DRL in virtual environments [1, 2] and by these widely developed DRL control algorithms, it seems promising to apply similar methods to real-world problems such as autonomous driving.
Autonomous driving is a challenging and complicated task for a program to learn. With camera and sensor data as inputs, the program needs to learn to select an appropriate driving policy that stays on the right track and avoids accidents. Some progress has been made in this field. Shalev-Shwartz et al. [6] use reinforcement learning to maximize the “desire” of abstract driving policy choices such as overtaking or merging. Sallab et al. [7] use the Deep Deterministic Actor-Critic algorithm and the Deep Q-Network to learn lane keeping.
However, several challenges still lie in the way. Problem 1: DRL training is highly sensitive to noise and to the reward function; the convergence time can vary significantly depending on the designed reward function and the random noise. To train the algorithm efficiently, one needs to carefully select the reward function and control the random noise. Problem 2: DRL algorithms have a poor mechanism for exploring the state-action space and are easily trapped in local optima. RL algorithms are usually gradient based and cannot guarantee a global maximum unless data from the global-maximum region are well treated. Therefore, in order to teach the agent the right driving policy, one not only needs to accurately define the global optimum but also to carefully avoid other, undesired local optima.
In this article, we propose a Survival-Oriented Reinforcement Learning (SORL) model to tackle the sensitivity problem and the local-optimum problem. The core idea of the model is: an optimized policy should prefer surviving for the given number of time steps over “dying” somewhere in between with a better total reward. In other words, the DRL algorithm should try another policy if the current policy cannot reach the max-allowed step.
In the SORL model, the DRL learning process is modelled as a Constrained Markov Decision Process (CMDP) with a continuous state-action space. We introduce a new structure called the Negative-Avoidance Function (NA Function) into the DRL algorithm, which can learn from failures in previous training. The SORL model combines a normal DRL algorithm that optimizes total reward with the NA Function, using the NA Function as a constraint, and this helps normal DRL algorithms escape faster from undesired local optima. Because the structure is independent of the underlying DRL algorithm, the SORL model can use different DRL algorithms for different situations.
We test the model on lane keeping tasks in TORCS, a car racing simulator. We compare the training procedure and training time of the DRL algorithm Deep Deterministic Policy Gradient (DDPG) [3] and of our Survival-Oriented DDPG (SO-DDPG), the SORL model using DDPG as its DRL algorithm. Our SO-DDPG algorithm shows a significant increase in speed and robustness: for the same environment parameters, the number of episodes DDPG takes to converge varies from 1000 to more than 2000, while SO-DDPG needs only about 600 episodes with a maximum deviation of roughly 400 episodes. SO-DDPG is also insensitive to the design of the NA Function, which makes the NA Function easy to design.
2 Background
In this section, some basic mechanisms of Reinforcement Learning and Deep Reinforcement Learning are introduced. These mechanisms are used by our Survival-Oriented Reinforcement Learning (SORL) model.
Most Reinforcement Learning algorithms are based on the Markov Decision Process (MDP). A standard MDP is a 5-tuple \((S,A,P(.|.,.),R(.,.),\gamma )\), where S is a set of continuous states, A is a set of continuous actions, \(P(s_{+1}|s,a)\) denotes the probability that the environment transitions to the new state \(s_{+1}\) given state s and action a, R(s, a) is the reward assigned to action a taken under state s, and \(\gamma \) is the discount factor.
The formal learning setup is as follows. The agent acquires the state s from the environment. Using the policy \(\pi :\mathcal {S}\rightarrow \mathcal {A}\), the agent selects an action \(a=\rho _{\pi }(s)\) and applies it to the environment; the environment returns a reward r(s, a) and a new state \(s_{+1}\) according to \(P(s_{+1}|s,a)\). Using the \((s, a, r, s_{+1})\) pairs, the agent trains itself and produces a new action \(a_{+1}=\rho _{\pi }(s_{+1})\) to be applied to the environment.
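To make this loop concrete, the following is a minimal Python sketch of the interaction cycle; the `env` object with a gym-style `reset()`/`step()` interface and the `policy` callable are assumptions for illustration, not part of the paper.

```python
# Minimal sketch of the MDP interaction loop described above.
# `env` follows a common reset()/step() interface and `policy`
# maps a state to an action; both are assumed for illustration.

def run_episode(env, policy, max_steps=1000):
    transitions = []                      # collected (s, a, r, s_next) pairs
    s = env.reset()                       # initial state s_{t0}
    for _ in range(max_steps):
        a = policy(s)                     # a = rho_pi(s)
        s_next, r, done = env.step(a)     # environment returns r(s, a) and s_{+1}
        transitions.append((s, a, r, s_next))
        if done:
            break
        s = s_next
    return transitions                    # used afterwards to train the agent
```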
The goal of the algorithm is to find an optimized policy \(\pi \) that maximizes the long-run total reward \(R_{total}=\sum _{t=t_{0}+1}^{\infty }\gamma ^t r(s,a)\) given the initial state \(s_{t_0}\). Here \(\gamma \in [0,1]\) is the discount factor that keeps the summation finite. Many reinforcement learning algorithms such as DDPG and RDPG adopt the action-value function Q(s, a) as a means of representing the total reward. The Q-function gives the expected long-run total reward for a given state and the action chosen by the policy in that state. It can be written as follows:
\(Q^{\pi }(s,a)=\mathbb {E}_{\pi }\left[ \sum _{k=0}^{\infty }\gamma ^{k}r(s_{t+k},a_{t+k})\,\Big |\,s_{t}=s,a_{t}=a\right] \)
The Q-function, like other functions used to represent the long-run total reward, obeys the recursive relation called the Bellman Equation:
\(Q^{\pi }(s,a)=\mathbb {E}_{s_{+1}\sim P(\cdot |s,a)}\left[ r(s,a)+\gamma Q^{\pi }(s_{+1},\rho _{\pi }(s_{+1}))\right] \)
If the Q-function is known, one can directly obtain the optimum policy by looking for the action that maximizes the Q-function in the given state:
\(\rho _{\pi }(s)=\mathop {\mathrm {argmax}}_{a}Q(s,a)\)
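As an illustration (not taken from the paper), the sketch below shows this argmax selection over a finite set of candidate actions; continuous-action methods such as DDPG instead train an actor network to approximate the argmax.

```python
def greedy_action(q_function, state, candidate_actions):
    """Pick the action maximizing Q(s, a) over a finite candidate set.
    A simple illustration of the argmax rule above; with continuous
    actions, DDPG-style methods learn an actor to approximate this."""
    return max(candidate_actions, key=lambda a: q_function(state, a))
```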
In Deep Reinforcement Learning (DRL) algorithms, deep neural networks are used as function approximators for the values in the RL algorithm. For example, the total reward can be approximated as \(Q(s,a;\theta )\), and the policy can be parameterized likewise. We consider a family of DRL algorithms [3,4,5] with similar actor-critic architectures, which use function approximators for both the long-run total reward function and the policy function. The common structure is shown in Figs. 1 and 2 below.
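As a rough illustration of this actor-critic structure, the sketch below defines a policy (actor) network and a Q-function (critic) network in PyTorch; the layer sizes and activations are assumptions, not the architectures used in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network rho_pi(s): maps a state to a continuous action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded continuous actions
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q-function approximator Q(s, a): estimates the long-run total reward."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```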
3 Autonomous Driving Problem Analysis
In this section, we analyse the autonomous driving problem in detail. We show why the tendency of DRL to converge to local optima matters when training the model, and why an extra structure that helps escape local optima is also needed to avoid dangerous conditions. We then examine several methods from related work that might overcome these problems, which motivates the construction of the Survival-Oriented Reinforcement Learning model in the next section.
Consider the lane keeping task in autonomous driving: the program is required to learn to keep the car in its lane at a reasonable speed. This control process can be modelled as a Markov Decision Process. At every time step, the driver obtains a state describing the outside world and some information about the car itself. Based on the state, the driver decides whether to turn left or right and whether to step on the brake or the throttle. The reward is designed to reward actions and states that keep the car on the track and to penalize others.
But like other similar real-world control problems, this lane keeping task has characteristics that a simple MDP-based DRL model cannot account for. There are two intrinsic differences.
- Hard-to-Define Reward
First of all, it’s not a good idea to define a optimum policy for autonomous driving tasks. The reason is that what we actually want the program to learn is a large set of policy aim at follow the lane, control the speed, not follow the lane in certain position, control the car with certain optimum speed. Although [8] have shown that neural networks can learn abstract rules, these progress haven’t been developed enough to use in the control problems.
- Safety Issues
Second, for the autonomous driving problem, safety is absolutely the first concern. It appears possible to address this by assigning a low or negative reward to the actions and states that lead to an accident. But Shalev-Shwartz et al. [6] have shown that, for rare accidents with few samples available, the reward would have to be set extremely low for the program to learn to avoid them.
Apart from the difficulties of modelling the autonomous driving problem as an MDP, there are problems with MDP-based algorithms themselves. DRL algorithms such as DDPG, RDPG and A3C do not have a good architecture for escaping from local optima, and the discount factor \(\gamma \) used in these algorithms biases them towards nearby local optima over global ones.
Several RL approaches have been proposed to avoid these problems. [9, 10] provide ways to learn the global optimum directly from demonstrations. [11] suggests dividing the reward into multiple rewards for different RL programs to learn, and producing the action from a combination of the learned policies. [12, 13] provide approaches that learn from teachers or demonstrations to avoid local optima to some extent.
To make the learned policy respect constraints such as dangerous conditions, and to help avoid local optima, one apparent approach is to model the problem as a Constrained Markov Decision Process (CMDP). CMDPs have been widely studied in the RL literature [14,15,16] for constrained optimization problems; [15] proposed an actor-critic RL algorithm for problems with discrete states. However, none of those approaches uses non-linear function approximators such as neural networks.
To balance exploration and exploitation and to avoid local optima, an appealing method is the \(\epsilon \)-greedy approach, in which the algorithm accepts a temporarily worse action with some probability in order to better explore the space. This requires some non-gradient-based components in the algorithm.
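A minimal sketch of such an \(\epsilon \)-greedy rule is shown below (an assumption for illustration; the exploration noise actually used in this work is described in Sect. 5).

```python
import random

def epsilon_greedy(policy_action, sample_random_action, epsilon=0.1):
    """With probability epsilon take a random exploratory action instead of
    the (possibly locally optimal) policy action."""
    if random.random() < epsilon:
        return sample_random_action()   # temporarily accept a worse action to explore
    return policy_action
```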
In order to add a structure to normal DRL algorithms that lets the program learn to avoid dangerous conditions as well as explore the state-action space better, we design a Survival-Oriented Reinforcement Learning model. The details are described in the next section.
4 Survival-Oriented Reinforcement Learning Model
In order to allow the DRL algorithm to escape local optima and, moreover, to detect and avoid dangerous conditions or accidents, we introduce the Survival-Oriented Reinforcement Learning (SORL) model. Unlike the normal DRL model, which simply aims to maximize the total designed reward, we consider the real-world setting in which the program should treat safety as the first priority. For this reason, we state a proposition:
Proposition 1
(Survival Proposition). For the program, learning to survive in the environment (i.e. the agent reaches the max-allowed step) is more important than maximizing total reward.
To achieve this, we add a new function called the Negative-Avoidance Function (NA Function) D(s, a) to the DRL algorithm in order to help the program learn to survive. The CMDP system is then defined as a 6-tuple \((S,A,P(.|.,.),R(.,.),\gamma , D(s,a))\). The extra D(s, a) gives the danger index of a given state and action, assessing whether the action chosen by policy \(\pi \) is “safe” enough under state s.
Like the reward, the NA Function is not given directly by the environment, but there are clues: early termination means danger. If the environment terminates at some time step \(n<n_{max}\), there must be some reason that caused the early termination. The cause may lie in the series of actions taken under certain states, hence one can use an NA Function D(s, a) to assess the degree of danger.
Some properties of the NA Function can be inferred easily. At the start \((s_0,a_0)\), it is rational to set \(D(s_0,a_0)=0\). As the agent takes actions and goes further, the danger may increase or decrease. Finally, at termination, the danger reaches its maximum \(D(s_n,a_n)=1\), which is what causes the environment to terminate. Hence, we use the following proposition:
Proposition 2
(NA Function Proposition). For an n-step interaction episode in which the agent observes states \(\{s_0,s_1,...,s_n\}\) and takes actions \(\{a_0,a_1,...,a_{n-1}\}\), the degree of danger should start from zero and statistically increase as the agent goes further, reaching its maximum when the environment terminates. This can be defined as:
\(D(s_i,a_i)=f_d(i,n)\)
which satisfies:
\(f_d(0,n)=0,\quad f_d(n,n)=1,\quad f_d(i,n)\le f_d(i+1,n)\)
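For illustration, one hypothetical \(f_d\) satisfying these conditions is a linear ramp over the last part of the episode; this is an assumed example and not the function used in the experiments, which is given in Sect. 6.

```python
def f_d_linear(i, n, tail_fraction=0.2):
    """Hypothetical danger label satisfying the NA Function Proposition:
    zero for most of the episode, ramping linearly up to 1 at the
    termination step n (the function used in the experiments differs
    and is given in Sect. 6)."""
    ramp_start = n * (1.0 - tail_fraction)
    if i <= ramp_start:
        return 0.0
    return min(1.0, (i - ramp_start) / (n - ramp_start))
```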
If the DRL algorithm can learn a good NA Function, the program can detect dangerous situations and avoid them using a simple mechanism: if the NA Function exceeds some threshold, \(D(s_k,a_k)\ge D_{threshold}\), and is higher than the temporal reward the environment provides, \(D(s_k,a_k) > r(s_k,a_k)\), one can change the action to some \(a'_k=f(a_k)\) to avoid early termination.
Neural networks are used as function approximators for the policy \(a = \rho _{\pi }(s|\theta ^1)\), the Q-function \(Q(s,a) = Q(s,a|\theta ^2)\) and the danger-assessment function \(D(s,a)=D(s,a|\theta ^3)\). Under the MDP setting, the Survival Proposition mathematically adds a negative-avoidance constraint, and the MDP optimization problem becomes a CMDP problem, stated in the following lemma:
Lemma 1
(Survival Proposition for CMDP). For the DRL algorithm, it is more important to reach the max-allowed step than to maximize reward; the optimized policy should choose the action that maximizes the total reward among actions whose temporal reward is larger than the temporal danger. The optimized policy can be written as:
\(\rho _{\pi }(s)=\mathop {\mathrm {argmax}}_{a:\,r(s,a)>D(s,a)}Q(s,a)\)
The learning process of this model differs from that of a simple MDP and can be written as the following steps (a code sketch of this loop is given after the list):
1. The agent observes a state \(s_{t_0}\) from the environment.

2. The normal learning program (DDPG, RDPG, etc.) gives a reward-based action \(ar_{t_0}\).

3. The danger-assessment function gives the danger index for the previous state and action, \(d_{t_0} = D(s_{t_{-1}},ar_{t_{-1}})\).

4. If the danger index is larger than the reward of the previous time step, \(d_{t_0} > r(s_{t_{-1}},a_{t_{-1}})\), the agent is considered to be in “danger” and the danger-avoidance action is chosen as the real action, \(a_{t_0} = f(ar_{t_0})\), based on a certain function f(a); otherwise the reward-based action is used as the real action, \(a_{t_0} = ar_{t_0}\).

5. The environment receives the action \(a_{t_0}\) and returns the reward \(r(s_{t_0},a_{t_0})\) and the next state \(s_{t_1}\). If the environment terminates, each (s, a) pair is assigned a danger index D(s, a) according to a certain rule.

6. The learning program uses \((s,a,r(s,a),s_{+1})\) to train itself and uses D(s, a) to train the danger-assessment program with supervised training.
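A compact Python sketch of one episode under this learning process is given below. The `reward_agent`, `danger_net`, `avoid` and `f_d` interfaces are assumptions used for illustration, not the authors' implementation.

```python
def sorl_episode(env, reward_agent, danger_net, avoid, f_d, max_steps=1000):
    """Sketch of one episode of the SORL learning process (steps 1-6 above).
    reward_agent: any reward-based DRL learner (e.g. DDPG) with act()/train();
    danger_net:   NA Function approximator D(s, a) with a fit() method;
    avoid:        danger-avoidance transform f(a);
    f_d:          rule assigning danger labels after termination.
    All interfaces are assumed for illustration."""
    trajectory = []
    s = env.reset()                                   # step 1: observe s_{t0}
    prev_s, prev_ar, prev_r = None, None, None
    for _ in range(max_steps):
        ar = reward_agent.act(s)                      # step 2: reward-based action ar_t
        if prev_s is not None:
            d = danger_net(prev_s, prev_ar)           # step 3: danger of previous (s, ar) pair
            a = avoid(ar) if d > prev_r else ar       # step 4: avoidance action if in "danger"
        else:
            a = ar
        s_next, r, done = env.step(a)                 # step 5: apply action, observe reward/state
        trajectory.append((s, a, r, s_next))
        prev_s, prev_ar, prev_r = s, ar, r
        s = s_next
        if done:
            break
    n = len(trajectory)
    if n < max_steps:                                 # early termination: label danger (assumption)
        danger_labels = [f_d(i, n) for i in range(n)]
        danger_net.fit(trajectory, danger_labels)     # step 6: supervised update of D(s, a)
    reward_agent.train(trajectory)                    # step 6: train the reward-based learner
```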
If this model is used to modify an actor-critic structure, the adjusted structure of the system is shown in Figs. 3 and 4 below.
There are several advantages to defining an extra NA Function. First, the NA Function can help with problems where the optimum is hard to define: by using different types of NA Function, one can adjust the DRL algorithm's sensitivity to dangerous conditions.
Besides, if the DRL algorithm temporarily converges to an optimum that ends in early termination, then as D(s, a) is learned, that early termination will eventually be assigned a high NA Function value, which leads the DRL algorithm to try other policies. This helps the DRL algorithm move away from early termination, since early termination is defined in this model as a worse outcome than finishing the max-allowed step.
Finally, the extra structure of the NA Function does not depend on a specific DRL algorithm. From the adjusted learning process one can see that the SORL model has no specific requirement on which DRL algorithm to use; one can plug in different DRL algorithms to deal with different tasks. The SORL model using DDPG as the reward-based DRL algorithm is described in detail in the next section.
5 Survival-Oriented DDPG Algorithm
The SORL model can be built on different DRL algorithms that aim to maximize total reward. The SORL model built on DDPG is described in Algorithm 1. \(Q(s,a|\theta ^1)\) and \(\rho _{\pi }(s|\theta ^2)\) are trained with the DDPG algorithm: \(Q(s,a|\theta ^1)\) is trained by minimizing the Bellman-error loss based on Eq. 2, and the policy function \(\rho _{\pi }(s|\theta ^2)\) is trained using the gradient of \(J(\theta ^2)\):
\(L(\theta ^1)=\mathbb {E}\left[ \left( Q(s_i,a_i|\theta ^1)-y_i\right) ^2\right] ,\quad y_i=r(s_i,a_i)+\gamma Q(s_{i+1},\rho _{\pi }(s_{i+1}|\theta ^2)|\theta ^1)\)
\(\nabla _{\theta ^2}J(\theta ^2)\approx \mathbb {E}\left[ \nabla _{a}Q(s,a|\theta ^1)\big |_{a=\rho _{\pi }(s)}\,\nabla _{\theta ^2}\rho _{\pi }(s|\theta ^2)\right] \)
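For reference, a hedged PyTorch sketch of this standard DDPG update (following [3]) is shown below; the target networks, batch format and optimizers are assumed to be set up as usual and are not specific to this paper.

```python
import torch

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One standard DDPG update step (cf. [3]) used inside SO-DDPG.
    `batch` holds tensors (s, a, r, s_next); the networks and optimizers
    are assumed to be defined as in the earlier sketches."""
    s, a, r, s_next = batch

    # Critic: minimize the Bellman error against targets from the target networks.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = torch.mean((critic(s, a) - y) ** 2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient by maximizing Q(s, rho(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```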
As DDPG does, we also add Ornstein-Uhlenbeck noise \(\epsilon \) to the action, \(a = \rho _{\pi }(s)+\epsilon \), for the purpose of exploring the state-action space and improving the robustness of the DRL algorithm.
As for the NA Function approximator, \(D(s_i,a_i|\theta ^3)\) is trained with supervised learning, taking the predefined \(f_d(i,n)\) from Proposition 2 as the ground truth. The exact \(f_d(i,n)\) depends on the environment, just as the reward does. Therefore, given \(s_i,a_i\) and the termination step n of the episode, the loss of the NA Function approximator is:
\(L(\theta ^3)=\frac{1}{n}\sum _{i}\left( D(s_i,a_i|\theta ^3)-f_d(i,n)\right) ^2\)
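A hypothetical PyTorch sketch of the NA Function approximator and its supervised update is given below; the network architecture, the squared-error loss and the optimizer are assumptions consistent with the description above, not the authors' code.

```python
import torch
import torch.nn as nn

class DangerNet(nn.Module):
    """NA Function approximator D(s, a | theta^3): danger index in [0, 1]."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),         # bounded danger index
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def train_danger_net(danger_net, danger_opt, episode, f_d):
    """Supervised update of D(s, a | theta^3), treating f_d(i, n) as the
    ground-truth danger label for the i-th pair of a terminated episode."""
    n = len(episode)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, a, r, s_next in episode])
    actions = torch.stack([torch.as_tensor(a, dtype=torch.float32) for s, a, r, s_next in episode])
    labels = torch.tensor([[f_d(i, n)] for i in range(n)], dtype=torch.float32)

    loss = torch.mean((danger_net(states, actions) - labels) ** 2)  # squared-error loss
    danger_opt.zero_grad()
    loss.backward()
    danger_opt.step()
```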
Apart from the loss function, one more thing needs to be considered. If the NA Function gives a high danger index, the original action \(a=\rho _{\pi }(s)+\epsilon \) must be changed into \(a = f(\rho _{\pi }(s)+\epsilon )\) to help the agent escape the local optimum and explore the state-action space. This function also depends on the environment and needs to be defined accordingly.

6 Experiment and Results
In this section, we first use the SO-DDPG and DDPG algorithms to learn the lane keeping task. To test whether the SORL model can increase the learning speed, SO-DDPG and DDPG are trained with the same environment parameters. After that, we use different NA Functions for SO-DDPG to test the sensitivity of the SORL model to the NA Function.
We use The Open Racing Car Simulator (TORCS) as the environment for learning the lane keeping task with the SO-DDPG and DDPG algorithms. [17] provides an Application Programming Interface (API) for exchanging data between the DRL algorithm and TORCS. The DRL algorithm takes a feature vector as input, including sensor data such as obstacle distances and the position of the car on the track. The available actions include brake, throttle and steering.
We set the environment to terminate if a collision happens or the car gets stuck, which is closer to the real-world driving problem. A termination condition is necessary for the SORL model, since the NA Function \(D(s_i,a_i)=f_d(i,n)\) needs the termination step. Hence, unlike [7], the termination condition is always activated during learning.
As mentioned in the SORL model, the NA Function \(D(s_i,a_i)=f_d(i,n)\) and the avoidance policy \(a=f(\rho _{\pi }(s)+\epsilon )\) are, like the reward, defined according to the environment. Here we use \(D(s_i,a_i)=\exp \left( \frac{-(n-i)^2}{2\min (20,n/5)^2}\right) \) and \(f(\rho _{\pi }(s)+\epsilon ) = -\rho _{\pi }(s)-\epsilon \). The \(D(s_i,a_i)\) function borrows the form of a normal distribution purely for convenience. There is a reason to set the avoidance policy to \(f(a)=-a\): if the policy has converged to a local optimum, its actions are driven towards that local optimum, so a rational policy for escaping the local optimum is to choose the opposite action.
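In code, these two choices can be written as follows (a direct transcription of the formulas above; the function and variable names are ours):

```python
import math

def f_d_experiment(i, n):
    """NA Function used in the experiments: a Gaussian-shaped ramp that
    equals 1 at the termination step n and decays for earlier steps."""
    sigma = min(20, n / 5)
    return math.exp(-((n - i) ** 2) / (2 * sigma ** 2))

def avoid(noisy_action):
    """Avoidance policy used in the experiments: take the opposite of the
    noisy reward-based action rho_pi(s) + epsilon."""
    return -noisy_action
```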
6.1 Efficiency and Robustness of SORL Model
The SO-DDPG using the equations above and the normal DDPG algorithm are tested on TORCS. In each comparison, the two algorithms share the same environment settings, reward function and NA Function. We choose the track CG Speedway Number 1 for the test and train SO-DDPG and DDPG a number of times. Four results are selected and shown in Fig. 5.
Figure 5 shows how the maximum achieved total reward changes during the learning process. The four experiments use slightly different reward functions: the left two figures use the reward \(r=v_x \cos (\theta _x)\), and for the right two figures we use \(r=70\tanh (\frac{v_x \cos (\theta _x)}{70})\), where \(v_x\) is the speed of the car along its heading direction and \(\theta _x\) is the angle between this direction and the direction of the track. The NA Function and avoidance policy are kept unchanged. We can see that the DDPG algorithm is highly sensitive to the reward function and to the noise used to explore the state-action space: in the left two sub-figures of Fig. 5, DDPG escapes the local optimum, but as shown on the right side of the figure, DDPG gets trapped in a local optimum and cannot escape.
In contrast, although our SO-DDPG also gets trapped in the local optimum for some time, it escapes the local optimum faster even when the reward function is changed.
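For reference, the two reward functions compared in Fig. 5 can be written as simple Python functions (the names are ours; \(v_x\) and \(\theta _x\) are taken from the TORCS sensor data):

```python
import math

def reward_linear(v_x, theta_x):
    """Reward used in the left two panels of Fig. 5: projected speed along the track."""
    return v_x * math.cos(theta_x)

def reward_saturated(v_x, theta_x):
    """Reward used in the right two panels: the same signal squashed by tanh, saturating near 70."""
    return 70 * math.tanh(v_x * math.cos(theta_x) / 70)
```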
6.2 SORL Model Sensitivity to NA Function
Different choices of NA Function and avoidance policy may influence the convergence speed of the algorithm. To test whether the SORL model is sensitive to the NA Function, in this section we compare SO-DDPG variants using different NA Functions against DDPG. The NA Functions selected are:
where u(x) is the unit step function. Both NA Functions assume that the final \(\frac{1}{5}\) of the \((s_i,a_i)\) pairs may be the real cause of the early termination.
Figure 6 again shows how the maximum achieved total reward changes during learning. From Fig. 6 we can see that, although the NA Function changes considerably, this does not prevent SO-DDPG from converging quickly and stably to the global optimum. Hence the SORL model is not sensitive to the NA Function.
7 Conclusion
In this article, we analyse the difficulties that DRL algorithms face when learning real-world control problems. The DRL algorithm needs a structure that can escape from local optima and remain robust to the reward function and noise.
To tackle this problem, we introduce the Survival-Oriented Reinforcement Learning (SORL) model, which models the autonomous driving problem as a Constrained Markov Decision Process. The SORL model introduces a Negative-Avoidance Function and a danger-avoidance mechanism into the normal DRL model so that the adjusted DRL structure can learn from previous failures during training. The SORL model is not model-based and can use different DRL algorithms, such as DDPG, as the underlying DRL model.
The experiments on learning the lane keeping task in TORCS with SO-DDPG and DDPG show that our SORL model is not sensitive to the reward function and can speed up the convergence of the DRL algorithm. Besides, the experiments with SO-DDPG using different NA Functions also show that the SORL model is not sensitive to the design of the NA Function.
References
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971 (2015)
Heess, N., Hunt, J.J., Lillicrap, T.P., Silver, D.: Memory-based control with recurrent neural networks, arXiv preprint arXiv:1512.04455 (2015)
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)
Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving, arXiv preprint arXiv:1610.03295 (2016)
Sallab, A.E., Abdou, M., Perot, E., Yogamani, S.: End-to-end deep reinforcement learning for lane keeping assist, arXiv preprint arXiv:1612.04340 (2016)
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S.G., Grefenstette, E., Ramalho, T., Agapiou, J., et al.: Hybrid computing using a neural network with dynamic external memory. Nature 538(7626), 471–476 (2016)
Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 1. ACM (2004)
Ho, J., Ermon, S.: Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems, pp. 4565–4573 (2016)
Laroche, R., Fatemi, M., Romoff, J., van Seijen, H.: Multi-advisor reinforcement learning, arXiv preprint arXiv:1704.00756 (2017)
Zhan, Y., Ammar, H.B., et al.: Theoretically-grounded policy advice from multiple teachers in reinforcement learning settings with applications to negative transfer (2016)
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al.: Learning from demonstrations for real world reinforcement learning, arXiv preprint arXiv:1704.03732 (2017)
Altman, E.: Constrained Markov Decision Processes, vol. 7. CRC Press, Boca Raton (1999)
Borkar, V.S.: An actor-critic algorithm for constrained Markov decision processes. Syst. Control Lett. 54(3), 207–213 (2005)
Chow, Y., Ghavamzadeh, M., Janson, L., Pavone, M.: Risk-constrained reinforcement learning with percentile risk criteria, arXiv preprint arXiv:1512.01629 (2015)
Loiacono, D., Cardamone, L., Lanzi, P.L.: Simulated car racing championship: competition software manual, arXiv preprint arXiv:1304.1672 (2013)
Acknowledgements
This work was supported by National Key Basic Research Program of China (No. 2016YFB0100900), National Natural Science Foundation of China (No. 61171113), and Science and Technology Innovation Committee of Shenzhen (No. 20150476).