
Neurocomputing

Volume 501, 28 August 2022, Pages 123-134

Skeleton-based traffic command recognition at road intersections for intelligent vehicles

https://doi.org/10.1016/j.neucom.2022.05.107

Highlights

  • Pioneering research on traffic command recognition distinguishing directions and gestures.

  • A two-stage recognition model exploiting skeletal geometry and co-occurrence features.

  • A specialized dataset for recognizing Chinese traffic commands at road intersections.

Abstract

Understanding traffic officer commands is a fundamental perception task for intelligent vehicles in driver assistance and autonomous driving. Previous studies have emphasized explicit traffic command gesture recognition but have not considered situations where the traffic officer is directing traffic in other directions, which also influences the decision-making of the ego vehicle. To fill this gap, this article investigates visual skeleton-based recognition of traffic commands at road intersections, where both the command direction and the gesture must be determined. Specifically, a two-stage recognition framework for four cross-shaped directions and eight command gestures is proposed. Two kinds of handcrafted features, upper-body geometric features and keypoint co-occurrence features, are constructed from estimated 2D human keypoint coordinates and heatmaps and combined in a deep learning network. The first stage handles human body orientation classification, while the second stage addresses command gesture recognition, additionally using the output of the first stage. Combining the recognized body orientation and command gesture, the type of traffic command can ultimately be inferred. For training and validation, a dataset termed Chinese Traffic Command at Intersections (CTCX) is built. The proposed method achieves an edit accuracy of 89.67% on the CTCX test set, outperforming the compared methods and demonstrating its effectiveness. This work provides a foundation in this area and is expected to inspire further research on traffic command recognition with directions.

Introduction

Just as human drivers can understand regulated traffic command gestures when traffic officers are directing traffic, intelligent vehicles should be capable of recognizing them as well. Traffic command gesture recognition is a fundamental perception task in driver assistance and autonomous driving. This task is particularly critical in mixed traffic scenarios because it can inform drivers or vehicles of driving situations and improve safety.

Recent years have witnessed a revolution in methodology and database research on traffic command gesture recognition [1], [2], [3], [4], [5], owing to the advancement of deep learning in computer vision, especially in human action recognition. Aware of the significance of automatically recognizing traffic commands for intelligent driving, a growing number of researchers have devoted attention to related studies. To date, the number of proposed methods and public databases has been increasing, and the gap between scientific research and real applications has been narrowing. Previous studies mainly aimed to recognize explicit commands, such as the eight kinds of Chinese regulated traffic command gestures. Generally, only the command gestures directed at the ego vehicle have been recognized. However, command gestures in other directions also influence the ego vehicle. For example, as shown in Fig. 1, when a driver notices a traffic officer gesturing for the vehicles coming from the crosswise direction to go straight, he or she is obligated to stop at the same time. Humans are able to decipher these seemingly unrelated gestures. If a vehicle could be informed of the commands directed at other vehicles, it would have a more comprehensive knowledge of the surrounding environment and be more likely to make correct decisions rather than mistakenly continue driving. Therefore, it is necessary for intelligent vehicles to learn these associated skills.

To serve the understanding of implicit traffic commands, we specify an extended traffic command recognition task that requires awareness of both command gestures and directions. As a starting point, direction recognition is simplified to a four-direction classification task, which suits typical road intersections. High accuracy and strong robustness are the primary objectives of this new task for practical on-board applications. To our knowledge, this work is the first to clearly describe the issue and aim to resolve it.

The human skeleton is a compact representation of human action and has been a widely used modality in human action recognition [6], [7], traffic agent intent estimation [8] and trajectory prediction [9], [10]. Previous works on traffic command gesture recognition have also illustrated the advantages of using skeleton information [4], [11], [5], [12]. Considering the compactness, robustness and reusability of the skeleton modality, we propose a two-stage recognition framework based on an estimated 2D skeleton. The discriminative features are generated by combining handcrafted skeletal geometry and co-occurrence with deep learning. The architectures of the two stages are simple, and the output of the first stage is fed into the second stage as part of its features. We also build a dataset containing Chinese traffic command gesture instances in the four directions for training and validation. To summarize, this work makes contributions in three ways:

  • A pioneering study on traffic command recognition distinguishing directions and gestures is carried out to fill this research gap, promoting the perception capability of intelligent vehicles.

  • With estimated 2D human skeletons, a two-stage recognition framework whose discriminative features combine handcrafted skeletal geometry and co-occurrence is presented to tackle the challenge. Our approach is characterized by a simple but effective structure and attains strong performance in the experiments.

  • An extended dataset termed “Chinese Traffic Command at Intersections” (CTCX) is established with videos of traffic command gestures in the four cross-shaped directions for methodology studies. The experimental results verify the effectiveness of the proposed approach.

The rest of the article is organized as follows. Section 2 reviews related work. Section 3 gives a specific definition of the proposed problem, and Section 4 explains the details of the proposed approach. Section 5 describes the dataset establishment and the evaluation metrics, presents the experimental results and comparisons, and discusses the influence of model settings, parameter selection, and generalization. Finally, conclusions are drawn in Section 6.

Section snippets

Related Work

Traffic command gesture recognition is a subcategory and practical application of human action recognition. Because of the specific on-board scenarios, approaches based on wearable sensor signals [13] or depth sensors [2], [3] are of limited practical use. Moreover, it is difficult for RGB-based recognition models [14], [15], [1] to adapt to diverse scenes because the current amount of traffic command gesture data cannot guarantee generalization. Recently, an…

Problem Formulation

When human drivers notice a traffic officer directing traffic at a road intersection, normally the first aim is to identify which direction the officer is controlling and then to recognize the meaning of the gesture. In this paper, considering the case of intersections, the command directions are modeled in terms of an orientation set including self, left, opposite and right. In line with the Chinese traffic rules, there are eight standard traffic control gestures including stop, go straight, …
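The composition of the two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the four orientation names come from the formulation above, but the gesture names beyond the two listed (and all identifiers) are illustrative assumptions drawn from the Chinese traffic control regulation.

```python
from itertools import product

# Hypothetical label sets following the problem formulation above.
# The four orientations are given in the text; the eight gesture names
# are illustrative (the paper's exact labels may differ).
ORIENTATIONS = ["self", "left", "opposite", "right"]
GESTURES = ["stop", "go_straight", "turn_left", "turn_left_waiting",
            "turn_right", "lane_change", "slow_down", "pull_over"]

def infer_command(orientation: str, gesture: str) -> str:
    """Combine the two-stage outputs into one traffic command label."""
    assert orientation in ORIENTATIONS and gesture in GESTURES
    return f"{orientation}/{gesture}"

# The full label space is the Cartesian product: 4 x 8 = 32 commands.
ALL_COMMANDS = [infer_command(o, g) for o, g in product(ORIENTATIONS, GESTURES)]
```

Framing the problem this way explains the two-stage design: orientation and gesture are recognized separately and only their combination yields the final command.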

Methodology

As illustrated in Fig. 2, our proposed method is a two-stage framework whose stages share a partially similar network architecture. Human pose estimation is conducted as preprocessing to extract skeleton information. Then, a spatial process generates the upper-body geometric features and keypoint co-occurrence features for each human instance at each frame. In the temporal process, LSTM layers further process the spatial features across frames. At the last…
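To make the spatial process concrete, the sketch below computes a few upper-body geometric features from 2D keypoints. The conclusion notes that only seven keypoints are used; the particular keypoint set (nose, shoulders, elbows, wrists), the feature definitions, and the image-coordinate sign convention here are all assumptions for illustration, not the paper's exact features.

```python
import math

def angle(a, b, c):
    """Interior angle at joint b (radians) formed by points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

def geometric_features(kp):
    """kp: dict mapping 7 upper-body keypoint names to (x, y) coordinates.

    Illustrative features: elbow angles plus shoulder-width-normalized
    distances, so the features are invariant to scale and translation.
    """
    shoulder_w = math.dist(kp["l_shoulder"], kp["r_shoulder"])  # scale reference
    return [
        angle(kp["l_shoulder"], kp["l_elbow"], kp["l_wrist"]),  # left elbow angle
        angle(kp["r_shoulder"], kp["r_elbow"], kp["r_wrist"]),  # right elbow angle
        math.dist(kp["l_wrist"], kp["r_wrist"]) / shoulder_w,   # wrist spread
        (kp["nose"][1] - kp["l_wrist"][1]) / shoulder_w,        # left wrist height
        (kp["nose"][1] - kp["r_wrist"][1]) / shoulder_w,        # right wrist height
    ]
```

Per-frame feature vectors of this kind would then be stacked over time and fed to the LSTM layers of the temporal process.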

Experiments

This section presents experiments validating the proposed method quantitatively and qualitatively. The dataset, evaluation metrics and implementation details are introduced in turn, followed by the experimental results and discussion.
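The abstract reports performance as edit accuracy. The paper's exact definition is in the full text; a common segmental edit score from the action segmentation literature, which the name suggests, can be sketched as follows (the normalization and function names here are assumptions).

```python
def segments(frame_labels):
    """Collapse per-frame labels into an ordered sequence of segments."""
    segs = []
    for lab in frame_labels:
        if not segs or segs[-1] != lab:
            segs.append(lab)
    return segs

def levenshtein(p, q):
    """Edit distance between two label sequences."""
    m, n = len(p), len(q)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def edit_score(pred_frames, gt_frames):
    """Normalized segmental edit score in [0, 1]; 1.0 is a perfect match."""
    p, q = segments(pred_frames), segments(gt_frames)
    return 1.0 - levenshtein(p, q) / max(len(p), len(q), 1)
```

A segmental score of this kind penalizes out-of-order or spurious command segments while being tolerant of small boundary shifts between predicted and ground-truth frames.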

Conclusion

This article addresses Chinese traffic command recognition at road intersections. It is the first to clarify the issue of implicit traffic commands and formulates the problem as a joint task of orientation classification and gesture recognition. We propose a two-stage recognition framework that takes advantage of a concise LSTM-based network and combines handcrafted features with deep learning. The upper-body geometric features use only seven human keypoints but show remarkable…

CRediT authorship contribution statement

Sijia Wang: Conceptualization, Methodology, Investigation, Writing - original draft. Kun Jiang: Data curation, Resources, Project administration. Junjie Chen: Formal analysis, Writing - review & editing. Mengmeng Yang: Data curation, Writing - review & editing. Zheng Fu: Validation, Visualization. Tuopu Wen: Methodology, Software. Diange Yang: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. U1864203, 52102396, and 52102464), and Sharing-Van Automatic Driving Development Project (Grant No. HT20082302).

Sijia Wang received her B.S. degree in automotive engineering from Tsinghua University, Beijing, China in 2017. She is currently working toward the Ph.D. degree at School of Vehicle and Mobility, Tsinghua University, Beijing, China. Her research interests include pose estimation and activity recognition of vulnerable road users for autonomous driving.

References (42)

  • Z. Fang et al., Intention recognition of pedestrians and cyclists by 2D pose estimation, IEEE Transactions on Intelligent Transportation Systems (2019).

  • R.Q. Mínguez et al., Pedestrian path, pose, and intention prediction through Gaussian process dynamical models and pedestrian activity recognition, IEEE Transactions on Intelligent Transportation Systems (2018).

  • J. Liang et al., Peeking into the future: Predicting future person activities and locations in videos.

  • Z. Fang et al., Traffic police gesture recognition by pose graph convolutional networks.

  • T. Yuan et al., Accelerometer-based Chinese traffic police gesture recognition system, Chinese Journal of Electronics (2010).

  • F. Guo, J. Tang, C. Zhu, Gesture recognition for Chinese traffic police, in: International Conference on Virtual...

  • Z. Cai et al., Max-covering scheme for gesture recognition of Chinese traffic police, Pattern Analysis and Applications (2015).

  • X. Xiong et al., Traffic police gesture recognition based on gesture skeleton extractor and multichannel dilated graph convolution network, Electronics (2021).

  • T.L. Munea et al., The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation, IEEE Access (2020).

  • S. Gupta et al., Conventionalized gestures for the interaction of people in traffic with autonomous vehicles.

  • F. Guo et al., Chinese traffic police gesture recognition in complex scene.

Kun Jiang received his B.S. degree in mechanical and automation engineering from Shanghai Jiao Tong University, China in 2011. Then, he received his master’s degree in the mechatronics system and his Ph.D. degree in information and systems technologies from the University of Technology of Compiègne (UTC), Compiègne, France, in 2013 and 2016, respectively. He is an assistant research professor at the School of Vehicle and Mobility of Tsinghua University, Beijing, China. His research interests include autonomous vehicles, high-precision digital maps, and sensor fusion.

Junjie Chen received his Ph.D. degree in traffic information engineering and control from Beijing Jiaotong University in 2020. He was a research assistant at Carnegie Mellon University (CMU), Pittsburgh, PA, USA, from 2018 to 2020. He currently holds a postdoctoral position at Tsinghua University, Beijing, China. His research interests include nonparametric Bayesian learning, platoon operation control, and recognition and application of human driving characteristics.

Mengmeng Yang received her Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Hubei, China in 2018. She is now conducting postdoctoral research at the School of Vehicle and Mobility of Tsinghua University, Beijing, China. Her research interests include high-definition maps for autonomous driving.

Zheng Fu received her master’s degree in pattern recognition and intelligent systems from Nanjing University of Posts and Telecommunications, Jiangsu, China, in 2019. She is pursuing the Ph.D. degree at School of Vehicle and Mobility, Tsinghua University, Beijing, China. Her current research interests include human 3-D pose estimation and human intention analysis for autonomous driving.

Tuopu Wen received his B.S. degree from Electronic Engineering, Tsinghua University, Beijing, China in 2018. He is currently working toward the Ph.D. degree at School of Vehicle and Mobility of Tsinghua University, Beijing, China. His research interests include computer vision, high definition maps, and high precision localization for autonomous driving.

Diange Yang is a professor at the School of Vehicle and Mobility of Tsinghua University. He received his B.S. and Ph.D. from Tsinghua University in 1996 and 2001, respectively. His research work mainly focuses on intelligent connected vehicles and autonomous driving. He has published over 120 articles, registered more than 60 national patents, and authored over 10 software copyrights. He has received numerous awards during his career, including the Distinguished Young Science Technology talent of Chinese Automobile Industry in 2011 and the Excellent Young Scientist of Beijing in 2010. He was also the recipient of the Second Prize of National Technology Invention Rewards of China in 2010 and in 2013.
