
Information Sciences

Volume 570, September 2021, Pages 708-721

Deep reinforcement learning with reference system to handle constraints for energy-efficient train control

https://doi.org/10.1016/j.ins.2021.04.088

Abstract

Energy-efficient train control involves complicated optimization processes subject to constraints such as speed, time, position and comfort requirements. Conventional optimization techniques are not adept at accumulating numerous solution instances into decision intelligence that can be reused, through learning, for the new problems confronted in succession. Deep reinforcement learning (DRL), which can directly output control decisions based on current states, has shown great potential for next-generation intelligent control. However, when DRL is applied directly to energy-efficient train control, the results are largely unsatisfactory: the agent becomes confused about how to trade off the constraints and spends considerable computational time on a large number of meaningless explorations. This article proposes an approach of DRL with a reference system (DRL-RS) for proactive constraint handling, where the reference system checks and corrects the agent's learning progress to keep it from straying ever further down an erroneous path. The proposed approach is evaluated by numerical experiments on train control in metro lines. The experimental results demonstrate that DRL-RS achieves faster learning convergence than directly applied DRL, and reduces energy consumption more than the commonly used genetic algorithm.

Introduction

Railway transportation is one of the major energy demand sectors, and various measures have been taken to save energy in this sector [1]. In particular, urban rail transit (URT) has gradually become the backbone of public transportation systems in cities, as a convenient and environmentally friendly mode of transport [2]. With the booming growth of URT, its huge electric energy consumption has become an increasingly prominent problem, making energy saving and emission reduction imperative. The railroad industry continues to look for ways to improve the safety and energy efficiency of railway transport systems to maintain sustainable development, which depends on advanced train control equipment. The automatic train operation (ATO) system equipped in a train is capable of controlling the train automatically by tracking target speed profiles and adjusting train control strategies. To enhance the performance of ATO, the design of high-level energy-efficient speed profiles for train operation control is an important function of the onboard control equipment.

Energy-efficient train control was initially studied on the basis of optimal control theory, in particular the Pontryagin maximum principle [3], [4], [5], [6], [7], [8]. The optimized control sequence is a combination of distinct phases, such as maximum traction, cruising, coasting and maximum braking, during train operations. Cheng and Howlett [3] proposed an approach to generate a feasible driving strategy towards optimality by adjusting adjoint parameters. Khmelnitsky [4] put forward an approach to obtain energy-efficient trajectories on variable gradient profiles subject to arbitrary speed restrictions using complementary optimality conditions. Liu and Golovitcher [5] developed an algorithm to obtain analytical information on optimal operation states and their sequences in optimal speed profiles. Howlett et al. [6] presented key equations and critical speeds to calculate key switching points so as to obtain a satisfactory and feasible solution for energy saving. Albrecht et al. [7], [8] put forward a generalized model for optimal train control and proved the existence of optimal switching points. Considering piecewise constant gradients, they established algebraic formulae for the adjoint variables, identified general bounds on the positions of optimal switching points, and analyzed integral forms of the necessary conditions for optimal switching.

Various intelligent algorithms have been applied to obtain the optimal combination sequence and exact switching points of driving regimes for different operation situations and train types [9]. Chang and Sim [10] attempted to utilize a genetic algorithm (GA) to solve the train trajectory optimization problem, with a fitness function defined as the weighted sum of energy consumption, operation punctuality and riding comfort. Wong and Ho [11] proposed a hierarchical genetic algorithm to determine the appropriate number of coasting points by inspecting their corresponding train movement performance. Acikbas and Soylemez [12] proposed a combined application of artificial neural networks (NNs) and GA to find optimal coasting points for multiple trains on multiple lines, taking regenerative braking energy into account. Sicre et al. [13] presented a GA approach with fuzzy parameters to realize energy-efficient railway traffic regulation when a significant delay occurs, making full use of the regenerative energy generated by train braking. Other intelligent optimization algorithms, such as dual heuristic programming (DHP) [14], fuzzy optimization [15], [16], ant colony optimization (ACO) [17], particle swarm optimization (PSO) [18], [19], and regression trees [20], have also been employed for the design and analysis of energy-efficient train speed profiles.

The true energy consumption depends on the realistic train control tracking performance [15], [16]. How to make the ATO adopt energy-efficient control actions is an intelligent decision problem based on current train running states. In recent years, deep reinforcement learning (DRL), combining the advantages of deep learning and reinforcement learning, has come to be considered one of the most promising technologies in machine learning [21], [22]. It has shown great potential in practical engineering applications, especially in the control of intelligent systems [23]. Belletti et al. [24] presented a DRL approach to achieve adaptive and robust control for traffic management systems, and developed a novel multi-agent control algorithm for large cyber-physical systems. Li et al. [25] proposed an improved deep Q-network (DQN) strategy to learn an effective human–machine cooperative driving scheme, which can avoid potential pedestrian crossing collisions and thereby achieve higher safety; by applying two replay memory buffers, the learning of the optimal driving policy can be shortened. Tong et al. [26], [27] proposed DRL-based algorithms to deal with task scheduling and resource allocation in cloud computing and mobile edge computing environments, respectively. Martinez et al. [28] developed a DRL algorithm to realize adaptive decisions on the early classification of temporal sequences. Wang et al. [29] put forward a DRL-based method for dynamic resource allocation in network slicing to respond to the huge challenges posed by massive data in various applications. Wu et al. [30] proposed a dueling-DQN learning method, autonomous navigation and obstacle avoidance (ANOA), for unmanned surface vehicles. Other practical engineering applications of DRL, in communications [31], power systems [32] and beyond, also demonstrate its inspiring abilities.

However, as far as we know, few works have focused on applying DRL to the energy-efficient trajectory optimization of train ATO systems. Train operation control is a multi-objective optimization problem concerning safety, punctuality, passenger comfort and energy saving, which is not easy for an agent to learn. To support complex functionalities, train control systems may involve even more intricate constraints. Constraint handling in DRL is generally incorporated into the reward function design. For example, Li et al. [33] designed a reward function of DRL for motion planning of free-floating dual-arm space manipulators, to avoid violations of end-effector velocity constraints and self-collision between the two arms, in addition to controlling the distance to the capturing point. Attention has also been paid to safe exploration in DRL, to shorten learning time and prevent damaging outcomes in case of constraint violations in online practical applications. Kou et al. [34] added a safety layer directly on top of the actor network of one kind of DRL, deep deterministic policy gradient (DDPG); this safety layer solves a constrained optimization problem to guarantee voltage constraints in power distribution networks. Andersen et al. [35] proposed a dreaming variational autoencoder (DVAE) for safe learning of DQN, which combines risk evaluation of an action with a safe predictive model. Applying DRL to practical systems may thus entail interventions that modulate the learning process for proactive constraint satisfaction. From a new perspective on learning frameworks, this paper proposes a DRL with a reference system (DRL-RS) to proactively handle constraints for energy-efficient train control. We mainly focus on the extension of two typical DRL approaches, i.e., DQN and DDPG. The reference system acts as an expert intervening in agent–environment interactions to improve the action selection policy.
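To make the role of the reference system concrete, the following Python sketch illustrates one way such a component could sit between a DRL agent and its environment: each proposed action is checked with a one-step lookahead against speed-limit and stopping constraints and, if necessary, replaced with a corrective braking action. The function names, thresholds and correction rule here are illustrative assumptions, not the exact mechanism developed later in the paper.

```python
# Minimal sketch (assumed interface, not the authors' exact algorithm):
# a reference system intervening between a DRL agent and its environment.

def reference_correct(speed, position, action, dt, v_limit, stop_pos, u_min):
    """Check a proposed unit-mass force and return (possibly corrected) action."""
    next_speed = max(0.0, speed + action * dt)               # one-step lookahead
    braking_distance = next_speed ** 2 / (2.0 * abs(u_min))  # distance needed to stop
    if next_speed > v_limit:                                 # speed-limit constraint
        return u_min, True
    if position + next_speed * dt + braking_distance > stop_pos:
        return u_min, True                                   # brake early enough to stop
    return action, False

def run_episode(agent_policy, env_step, dt=1.0, v_limit=22.0,
                stop_pos=1500.0, u_min=-1.0):
    """Interaction loop: agent proposes, reference system checks, environment executes."""
    speed, position, t = 0.0, 0.0, 0.0
    while position < stop_pos:
        proposed = agent_policy(speed, position, t)
        action, corrected = reference_correct(speed, position, proposed,
                                              dt, v_limit, stop_pos, u_min)
        speed, position, reward = env_step(speed, position, action, dt)
        t += dt
        # the 'corrected' flag can additionally be fed back as a reward penalty
```

Here `agent_policy` and `env_step` stand in for the DQN/DDPG policy and the train dynamics model, respectively; the point of the sketch is only that constraint checking happens before, rather than after, an action reaches the environment.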

The remainder of this paper is organized as follows. In Section 2, the problem of energy-efficient train operations is formulated. Section 3 presents the entire methodology of DRL-RS. Section 4 demonstrates the numerical experiments and results. Finally, the conclusions are drawn in Section 5.

Section snippets

Problem statement

The external forces that a train is subject to include tractive or braking force and additional operational resistances. The basic equation of train movement can be described by

$$m\frac{\mathrm{d}^2 x_i}{\mathrm{d}t^2} = U_i - r(v_i) - m g \sin\big(\theta(x_i)\big), \qquad \frac{\mathrm{d}x_i}{\mathrm{d}t} = v_i,$$

where $m$ is the train mass, $x_i$ and $v_i$ are the train position and speed at the current instant $i$, respectively, $U_i$ is the external tractive or braking force at the current instant $i$, and $r(v_i)$ is the basic operational resistance, generally described by the Davis equation $r(v_i) = a + b v_i + c v_i^2$ [36],
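For a computational view of this model, the sketch below integrates the movement equation with a simple explicit Euler scheme. The resistance coefficients, gradient profile and control policy used here are illustrative placeholders rather than the calibration used in the experiments; only the train mass is taken from the simulation conditions reported later.

```python
import math

# Minimal sketch: explicit Euler integration of
#   m * dv/dt = U - r(v) - m * g * sin(theta(x)),   dx/dt = v
# with illustrative (not calibrated) Davis coefficients in newtons.

def davis_resistance(v, a=2000.0, b=40.0, c=0.6):
    """Basic running resistance r(v) = a + b*v + c*v**2 (placeholder values)."""
    return a + b * v + c * v * v

def simulate(control, m=2.8e5, dt=0.5, t_end=120.0, gradient=lambda x: 0.0):
    """Simulate train motion under a control law U = control(t, x, v) in newtons."""
    g = 9.81
    x, v, t = 0.0, 0.0, 0.0
    trajectory = [(t, x, v)]
    while t < t_end:
        U = control(t, x, v)
        dvdt = (U - davis_resistance(v) - m * g * math.sin(gradient(x))) / m
        v = max(0.0, v + dvdt * dt)        # speed cannot become negative
        x += v * dt
        t += dt
        trajectory.append((t, x, v))
    return trajectory

# Example: maximum traction (0.8 m/s^2 per unit mass) for 30 s, then coasting
profile = simulate(lambda t, x, v: 0.8 * 2.8e5 if t < 30.0 else 0.0)
```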

DRL-RS structure

The external force, the current speed, the current position, and the current running time are the four essential variables that are related to and restricted by one another in train operation control processes. The speed and distance are the cumulative results of the external forces acting on a train over time. In turn, the choice of an external force is also influenced by the states of the other three variables. Moreover, when a train reaches the destination, the speed should reach a specified value such as zero,
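As one possible concrete encoding of these four coupled variables, the sketch below normalizes them into a state vector that could be fed to a DQN or DDPG network; the actual features, normalization and network inputs used in the paper are not detailed in this snippet, so this form is only an assumption.

```python
from dataclasses import dataclass

# Illustrative (assumed) state representation built from the four variables
# named above: external force, speed, position and running time.

@dataclass
class TrainState:
    force: float      # current unit-mass tractive/braking force (m/s^2)
    speed: float      # current speed (m/s)
    position: float   # distance travelled from the departure station (m)
    run_time: float   # elapsed running time (s)

def to_features(s: TrainState, u_max: float, v_max: float,
                total_distance: float, schedule_time: float) -> list[float]:
    """Scale the raw state into roughly unit-range features for the network."""
    return [s.force / u_max,
            s.speed / v_max,
            s.position / total_distance,
            s.run_time / schedule_time]
```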

Simulation conditions

In order to evaluate the proposed DRL-RS algorithms, we collected practical data from a metro line to perform numerical simulations. These data include train parameters and railway line data. The train parameters include train mass m = 2.8 × 10⁵ kg, train length Ltr = 110 m, minimum unit-mass braking force Umin = -1 m/s², maximum unit-mass tractive force Umax = 0.8 m/s², and running resistance coefficients a = 2.75, b = 1.4 × 10⁻², c = 7.5 × 10⁻⁴. Railway line data include stations, line length,
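For reference, the train parameters listed above can be gathered into a single configuration structure, as in the sketch below; the railway line data (stations, speed limits, gradients and so on) are truncated in this snippet and are therefore omitted.

```python
# Train parameters from the simulation conditions above, collected into one
# configuration dictionary. Units follow the paper's conventions; the field
# names are illustrative.

train_params = {
    "mass_kg": 2.8e5,       # train mass m
    "length_m": 110.0,      # train length Ltr
    "u_min_mps2": -1.0,     # minimum unit-mass braking force Umin
    "u_max_mps2": 0.8,      # maximum unit-mass tractive force Umax
    "davis_a": 2.75,        # running resistance coefficients a, b, c
    "davis_b": 1.4e-2,
    "davis_c": 7.5e-4,
}
```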

Conclusions

In this study, a DRL-RS has been proposed for energy-efficient train control to deal with complicated constraints. The role of the reference system is to judge whether the actions and their accumulated effects will bring about constraint violations. In its way of dealing with constraints, purely reward-based RL is somewhat similar to the penalty function approach for compromising among multiple objectives. Therefore, a proactive constraint handling system is incorporated into DRL to adjust learning directions and

CRediT authorship contribution statement

Mengying Shang: Software, Investigation, Writing - original draft. Yonghua Zhou: Conceptualization, Writing - review & editing, Supervision. Hamido Fujita: Conceptualization, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was financially supported by the Beijing Natural Science Foundation under Grant L191017, the National Natural Science Foundation of China under Grant 61673049, and the Fundamental Research Funds for the Central Universities of China under Grant 2020YJS007.

References (42)

• J. Yin et al., Smart train operation algorithms based on expert knowledge and ensemble CART for the electric locomotive, Knowl. Based Syst. (2016)
• J. Li et al., Deep reinforcement learning for pedestrian collision avoidance and human-machine cooperative driving, Inf. Sci. (2020)
• Z. Tong et al., A scheduling scheme in the cloud computing environment using deep Q-learning, Inf. Sci. (2020)
• Z. Tong et al., Adaptive computation offloading and resource allocation strategy in a mobile edge computing environment, Inf. Sci. (2020)
• H. Wang et al., Data-driven dynamic resource scheduling for network slicing: A deep reinforcement learning approach, Inf. Sci. (2019)
• P. Andersen et al., Towards safe reinforcement-learning in industrial grid-warehousing, Inf. Sci. (2020)
• Y. Yuan et al., A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning, Knowl. Based Syst. (2019)
• X. Qi et al., Deep reinforcement learning enabled self-learning control for energy efficient driving, Transp. Res. C (2019)
• S. Brandi et al., Deep reinforcement learning to optimise indoor temperature control and heating energy consumption in buildings, Energy Build. (2020)
• Q. Gu et al., Energy-efficient train tracking operation based on multiple optimization models, IEEE Trans. Intell. Transp. Syst. (2016)
• J. Cheng et al., A note on the calculation of optimal strategies for the minimization of fuel consumption in the control of trains, IEEE Trans. Autom. Control (1993)