Complex Network-based features extraction in RGB-D human action recognition

https://doi.org/10.1016/j.jvcir.2021.103371

Abstract

Analysis of human behavior through visual information has been one of the active research areas in the computer vision community during the last decade. Vision-based human action recognition (HAR) is a crucial part of human behavior analysis and is in great demand across a wide range of applications. HAR was initially performed on images from conventional cameras; recently, however, depth sensors have been embedded alongside cameras as an additional informative resource. In this paper, we propose a novel approach that substantially improves the performance of human action recognition using complex network-based feature extraction from RGB-D information. The constructed complex network is employed for single-person action recognition from skeletal data consisting of the 3D positions of body joints. These indirect features help the model cope with the majority of challenges in action recognition. We also introduce the meta-path concept for the complex network to mitigate the challenge of unusual action structures and to further boost recognition performance. Extensive experimental results on two widely adopted benchmark datasets, MSR Action Pairs and MSR Daily Activity 3D, indicate the efficiency and validity of the method.

Introduction

Nowadays, human behavior analysis (HBA) has attracted increasing interest in the fields of artificial intelligence and machine learning. Motivated by the wide range of possible applications, significant advances have been made in learning and recognizing human behavior, especially using computer vision techniques [2], [3]. HBA can be applied to various domains, such as surveillance in public areas, shopping centers, and airports; home care for elderly people and children; Human-Computer/Robot Interaction (HCI/HRI); video retrieval; and computer gaming [1], [7], [9], [13]. For instance, to build a human-computer interface that intelligently serves people, a system must not only sense human movements but also understand the actions and intentions behind them. The goal of human activity recognition is to automatically detect and analyze human activities in real time from information acquired by different types of sensors, such as RGB cameras, range sensors, or other sensing modalities [1], [9], [10]. If machines could automatically interpret the activities that people perform in daily life, many tasks would be revolutionized [4]. Depending on their complexity, human motions can be conceptually categorized as gestures, actions, and activities with interactions. Gestures are normally regarded as the atomic elements of human movement, such as “turning the head to the left”, “raising the left leg”, and “crouching”. Actions usually refer to a single human motion consisting of one or more gestures, such as “walking” and “throwing”. In the most complex scenario, the subject interacts with objects or other subjects, for instance, “two persons fighting” or “people playing football” [10].

Recently, progress in sensor technology has led to affordable high-definition depth cameras, such as the Microsoft Kinect [48]. A depth camera exploits structured light to capture a depth map in real time. Pixels in a depth map represent the depth of the scene rather than the intensity of color [7]. Since the release of cost-effective depth sensors, studies on various applications of 3D data have increased significantly [6]. In general, depth imagery is inexpensive and overcomes some of the limitations of RGB images: it provides the 3D structure of the scene, robustness against illumination changes, better segmentation, proper background subtraction, and motion estimation [1], [7], [9], [10], [15], [29]. However, some crucial problems remain for algorithms using various types of data [1], [9]. The human body is an articulated system of rigid segments connected by joints, so human motion is often considered as a continuous evolution of the configuration of these segments, or body posture. If the body joints can be reliably extracted and tracked, action recognition can be performed using the tracked joint positions [11]. The Kinect provides the 3D positions of 20 skeletal joints for each subject tracked in the scene, as shown in Fig. 1 [35].
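For concreteness, the skeleton stream delivered by such a sensor can be viewed as a T × 20 × 3 array (frames × joints × coordinates). The following minimal Python sketch is ours, not part of the paper; the loader and its random placeholder data are purely illustrative of this layout:

```python
import numpy as np

NUM_JOINTS = 20  # Kinect v1 tracks 20 skeletal joints per subject

def load_skeleton_sequence(num_frames=60):
    """Return a (T, NUM_JOINTS, 3) array of 3D joint positions.

    Random data stands in for a real loader, which would instead
    parse the skeleton files shipped with the MSR datasets.
    """
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_frames, NUM_JOINTS, 3))

sequence = load_skeleton_sequence()
print(sequence.shape)  # (60, 20, 3)
```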

In general, human gestures depend on many factors, such as the environment, culture, personality differences, and emotions. Furthermore, people may perform similar actions in different ways, and the task becomes even more complex because a person may perform the same action differently on different occasions. Due to the large diversity of human body sizes, appearances, and shapes, the complexity of human-object interactions, and complicated spatio-temporal structures, automatically recognizing actions is very challenging [5], [7]. The major challenges in vision-based human action recognition are image processing issues that mostly emerge from the environment: occlusions, cluttered backgrounds, shadows, illumination changes, and view changes, which cause similar actions to have different “appearances” from different perspectives. Furthermore, scale variance, caused by subjects appearing at different distances from the camera or by subjects with different body sizes, creates problems of intra-class and inter-class similarity between actions [1], [2], [10], [12], [14]. As a result, individuals may perform an action in different directions with different characteristics of body-part movement, while two actions may be distinguishable only by very subtle spatio-temporal details [1]. The main contributions of this paper are the following:

First, we introduce a novel method based on complex network analysis. With the advent of recent studies in network science, many real-world problems can be modeled as complex networks. Such a network is composed of a large collection of non-trivially interconnected nodes, where the links carry important meanings or behaviors. Complex network analysis attempts to discover important patterns in a network using graph theory and statistical measures. In this study, a spatio-temporal complex network, constructed from RGB-D information of human actions in video sequences, is used to extract crucial structural and functional features that improve the classification rates of different human actions while keeping the computational time low.
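As a rough illustration of this idea, the sketch below builds one plausible spatio-temporal network from a joint sequence with networkx and reads off a few global network measures. The construction rules (a distance threshold for spatial edges, same-joint links for temporal edges) and the chosen measures are our assumptions, not the paper's exact design:

```python
import itertools
import networkx as nx
import numpy as np

def build_spatiotemporal_network(seq, radius=1.0):
    """Build a graph whose nodes are (frame, joint) pairs.

    Spatial edges connect joints that are close within a frame
    (illustrative threshold); temporal edges connect each joint
    to itself in the next frame.
    """
    T, J, _ = seq.shape
    G = nx.Graph()
    for t in range(T):
        for a, b in itertools.combinations(range(J), 2):
            if np.linalg.norm(seq[t, a] - seq[t, b]) < radius:
                G.add_edge((t, a), (t, b), kind="spatial")
        if t + 1 < T:
            for j in range(J):
                G.add_edge((t, j), (t + 1, j), kind="temporal")
    return G

def structural_features(G):
    # A few global complex-network measures usable as action descriptors.
    degrees = [d for _, d in G.degree()]
    return np.array([np.mean(degrees), nx.density(G), nx.average_clustering(G)])

seq = np.random.default_rng(1).standard_normal((30, 20, 3))
print(structural_features(build_spatiotemporal_network(seq)))
```

Any graph statistic computable with standard tools (degree distribution, density, clustering, and so on) can be collected into a compact feature vector in this manner.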

Second, we introduce the concept of meta-paths, defined as paths over the graph of the network schema and generated from the spatio-temporal network, to capture the semantic relationships across the 20 skeletal joints of each subject. We show how meta-path analysis is crucial for mining the network to extract meaningful patterns. Meta-paths also enable us to deal with noise and momentary occlusion in images. Finally, some semantic rules are studied to identify redundant information, that is, useless links between some skeletal joints, in the network. In this regard, the proposed method reduces the computational time of human action classification.
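To make the meta-path notion concrete, the toy sketch below enumerates instances of a type sequence over a small heterogeneous graph. The schema used here (joint-frame-joint) and the node typing are illustrative assumptions on our part, not the paper's actual schema:

```python
import networkx as nx

def metapath_instances(G, metapath):
    """Enumerate simple paths whose node types match `metapath` in order."""
    starts = [n for n, d in G.nodes(data=True) if d.get("type") == metapath[0]]
    paths = [(n,) for n in starts]
    for expected in metapath[1:]:
        paths = [
            p + (nbr,)
            for p in paths
            for nbr in G.neighbors(p[-1])
            if G.nodes[nbr].get("type") == expected and nbr not in p
        ]
    return paths

# Toy heterogeneous network: two joints observed in one frame.
G = nx.Graph()
G.add_node("head", type="joint")
G.add_node("hand", type="joint")
G.add_node("f0", type="frame")
G.add_edges_from([("head", "f0"), ("hand", "f0")])

print(metapath_instances(G, ["joint", "frame", "joint"]))
# [('head', 'f0', 'hand'), ('hand', 'f0', 'head')]
```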

The remainder of this paper is organized as follows. Section 2 summarizes recent related studies in human action recognition. Section 3 outlines the proposed human action recognition method. Section 4 details the experimental datasets. Experimental results and comparisons with the state of the art are presented in Section 5. Finally, concluding remarks are given in Section 6.


Related works

The human body is an articulated system of rigid segments connected by joints; hence, human action can be considered as a continuous evolution of the configuration of these segments [11]. Initial studies on human action recognition in 1975 showed that humans can recognize activities just by looking at light spots attached to the major joints of a person. In computer vision, there are several studies on extracting joints or detecting body parts and tracking them in the temporal domain.

Proposed HAR method

In this section, the proposed human action recognition method is discussed in detail. As shown in Fig. 2, the main steps of the method are keyframe selection, meta-path generation, and complex network construction; finally, structural and semantic features are extracted from the established complex network for action learning. The network schema is derived from the problem domain, where the relationships between frames are meaningful.
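Read as pseudocode, the pipeline of Fig. 2 might be organized as in the runnable sketch below; every function body is a trivial placeholder standing in for the corresponding step, and none of it is the authors' implementation:

```python
import numpy as np

def select_keyframes(seq):
    # Step 1 placeholder: keep every 5th frame.
    return seq[::5]

def build_network_features(keyframes):
    # Steps 2-3 placeholder: meta-path generation + network construction
    # would normally happen here; we just average flattened frames.
    return keyframes.reshape(len(keyframes), -1).mean(axis=0)

def classify(features, prototypes):
    # Step 4 placeholder: nearest-prototype classifier over feature vectors.
    dists = np.linalg.norm(prototypes - features, axis=1)
    return int(np.argmin(dists))

seq = np.random.default_rng(2).standard_normal((60, 20, 3))
features = build_network_features(select_keyframes(seq))
prototypes = np.random.default_rng(3).standard_normal((4, features.size))
print("predicted class:", classify(features, prototypes))
```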

Keyframe selection for reducing computational time
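One common keyframe-selection criterion, keeping the frames with the highest inter-frame joint motion energy, is sketched below purely as an illustration; it is an assumption on our part rather than the paper's actual selection rule:

```python
import numpy as np

def select_keyframes(seq, k=10):
    """Pick the k frames with the highest total joint displacement."""
    # Motion energy of frame t = summed joint displacement from frame t-1.
    motion = np.linalg.norm(np.diff(seq, axis=0), axis=2).sum(axis=1)
    idx = np.sort(np.argsort(motion)[-k:]) + 1  # +1: diff shifts indices by one
    return seq[idx]

seq = np.random.default_rng(4).standard_normal((60, 20, 3))
print(select_keyframes(seq).shape)  # (10, 20, 3)
```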

Experimentation

To demonstrate the validity and efficiency of the proposed method, experiments were carried out on the MSR Action Pairs and MSR Daily Activity 3D datasets.

Conclusion

In this paper, a complex network-based feature extraction method for RGB-D sensor data has been proposed as a new framework for human action recognition. Results obtained from extensive experiments indicate the validity and efficiency of using proper meta-paths in the proposed method. In this study, we have attempted to recognize an action by choosing multiple meta-paths to express different samples as compact feature vectors based on complex network analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (63)

  • Yanli Ji, "One-shot learning based pattern transition map for action early recognition," Signal Process. (2018).
  • Xiaopeng Ji, "Skeleton embedded motion body partition for human action recognition using depth sequences," Signal Process. (2018).
  • Yongxiong Wang et al., "A self-adaptive weighted affinity propagation clustering for key frames extraction on human action recognition," J. Vis. Commun. Image Represent. (2015).
  • Yu Zhou et al., "Human action recognition with skeleton induced discriminative approximate rigid part model," Pattern Recogn. Lett. (2016).
  • Bangli Liu, "RGB-D sensing based human action and interaction analysis: A survey," Pattern Recogn. (2019).
  • Juan C. Nunez, "Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition," Pattern Recogn. (2018).
  • Jun Kong et al., "Collaborative multimodal feature learning for RGB-D action recognition," J. Vis. Commun. Image Represent. (2019).
  • Ye Gu, "Multiple stream deep learning model for human action recognition," Image Vis. Comput. (2020).
  • Amitesh Singh Rajput et al., "Privacy-preserving human action recognition as a remote cloud service using RGB-D sensors and deep CNN," Expert Syst. Appl. (2020).
  • Jiang Wang, Zicheng Liu, and Ying Wu, Human Action Recognition with Depth Cameras, Springer (2014).
  • Y. Zhao, "Combing RGB and depth map features for human activity recognition" (2012).
  • M. Ye, "A survey on human motion analysis from depth data," in Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications (2013).
  • W. Li et al., "Action recognition based on a bag of 3D points," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on (2010).
  • J. Sung, "Unstructured human activity detection from RGBD images," in Robotics and Automation (ICRA), 2012 IEEE International Conference on (2012).
  • J. Wang, "Mining actionlet ensemble for action recognition with depth cameras," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (2012).
  • Y.-Y. Lin, "Depth and skeleton associated action recognition without online accessible RGB-D cameras," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (2014).
  • S. Sempena, Nur Ulfa Maulidevi, and Peb Ruswono Aryan, "Human action recognition using dynamic time warping," in Electrical Engineering and Informatics (ICEEI), 2011 International Conference on (2011).
  • Jing Zhang et al., "RGB-D-based action recognition datasets: A survey," Pattern Recognition 60 (2016).
  • P. Climent-Pérez, "Optimal joint selection for skeletal data from RGB-D devices using a genetic algorithm."
  • A.A. Chaaraoui et al., "Adaptive human action recognition with an evolving bag of key poses," IEEE Transactions on Autonomous Mental Development (2014).
  • Victoria Bloom et al., "G3D: A gaming action dataset and real time action recognition evaluation framework," in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2012).