Neurocomputing

Volume 488, 1 June 2022, Pages 212-225

One-shot Video Graph Generation for Explainable Action Reasoning

https://doi.org/10.1016/j.neucom.2022.02.069

Abstract

Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, to enable explainable reasoning of video actions, a spatio-temporal video graph structure was proposed to represent video state changes at the semantic level. However, its requirement of tedious manual annotation of all the video frames is a serious limitation; the approach would have far broader applicability if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, which requires only a one-time annotation of the objects in the starting frame of the video. We first estimate the locations of the predefined relevant objects along the temporal dimension with a proposed one-shot target-aware tracking strategy, which localizes and links the objects across all video frames simultaneously. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further enhance the reasoning accuracy of the performed actions, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (including attributes and relationships) of the smoothed video graph, the occurring actions can be recognized using pre-defined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms other state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.

Introduction

Human action understanding in videos is an important and challenging task. In recent studies, the achievements of deep learning have motivated a constant emergence of end-to-end neural networks for video action classification [1], [2], [3], [4], [5] with significantly improved accuracy. However, to analyze human actions more comprehensively, the semantic information of an action, such as who (particular objects), when (time), where (object locations) and how (what kind of state changes), also needs to be well explained. Due to the lack of rules for logical reasoning, most existing approaches cannot provide such information during on-the-fly action classification/recognition, which further limits their application in more generic human–machine interaction environments.

An effective solution to bridge this semantic gap is the Explainable Video Action Reasoning (EVAR) framework proposed by Zhuo et al. [6]. In this framework, performed actions can be detected and recognized based on pre-defined rules when attributes or relationships change (shown in Fig. 1). Unlike earlier end-to-end strategies [1], [2], [3], [4], [5], EVAR connects classical logic reasoning with contemporary deep learning models through a video graph, which enables performed actions to be detected and explained by temporally interpreting the semantic state changes of video content with prior knowledge. However, its requirement of tremendous manual annotation prevents EVAR from generating the video graph automatically, which is a crucial weakness for practical applications.

Minimizing the onerous manual annotation work via automatic video graph generation is an extremely challenging problem. A plausible way to automate it is to represent the semantic state changes by linking the same objects across the predicted scene graphs of all video frames. For the scene graph representation of each frame, the relevant attributes [7] and relationships [8], [9], [10] among objects can be estimated based on accurate object localization/recognition [11], [12], [13] during each state transition. To make this process feasible, we turn to object tracking [14], [15], [16] instead of object detection [11], [12], [13]. The main reason is the unreliable detection performance in low-resolution videos, especially when the action involves small objects, e.g., human hands in the CAD-120++ dataset [6].
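To make the per-frame representation concrete, the sketch below assembles a scene graph from a set of object boxes (however those boxes are obtained) by running an attribute detector on each object and a relationship detector on each ordered object pair. It is only an illustrative sketch: attribute_detector and relationship_detector are placeholder callables, and the dictionary layout of the graph is our own assumption, not the exact data structure used in the paper.

    from itertools import permutations

    def build_scene_graph(frame, boxes, attribute_detector, relationship_detector):
        # boxes: {object_name: (x, y, w, h)}.  The graph holds one node per
        # object (its attribute state) and one edge per ordered object pair
        # (its relationship state).
        graph = {"attributes": {}, "relations": {}}
        for obj, box in boxes.items():
            graph["attributes"][obj] = attribute_detector(frame, box)
        for subj, obj in permutations(boxes, 2):
            graph["relations"][(subj, obj)] = relationship_detector(
                frame, boxes[subj], boxes[obj])
        return graph

    # Linking one scene graph per frame through shared object identities yields
    # the video graph on which smoothing and reasoning later operate, e.g.
    # (boxes_at_frame is a hypothetical per-frame box lookup):
    # video_graph = [build_scene_graph(f, boxes_at_frame[t], attr_det, rel_det)
    #                for t, f in enumerate(frames)]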

EVAR [6] manually annotates objects across the whole video in order to link them. In contrast, we focus on automatic video graph generation based on the proposed one-shot target-aware tracking. To obtain the object locations and link objects across all frames simultaneously, the concerned objects are estimated along the temporal dimension based only on first-frame annotations, which can also be regarded as exemplars in one-shot detection. Such a mechanism avoids tedious manual annotation and takes advantage of the initial template to re-detect objects when they disappear temporarily. Then, for the scene graph representation, the relevant attributes and relationships among the detected objects are predicted.
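As a concrete illustration of this mechanism, the sketch below propagates first-frame boxes through a video with a local search around each object's previous location and falls back to re-detection with the initial template when matching fails. It is only a minimal stand-in for the target-aware tracker: the crop, ncc and global_scan helpers and the normalized cross-correlation matching are simplifications for illustration, not the Siamese features used in the paper.

    import numpy as np

    def crop(frame, box):
        # Extract an image patch for a box given in (x, y, w, h) pixel coordinates.
        x, y, w, h = box
        return frame[y:y + h, x:x + w]

    def ncc(a, b):
        # Normalized cross-correlation between two equally sized patches.
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())

    def global_scan(frame, template, stride=16):
        # Re-detection: match the initial template over the whole frame.
        h, w = template.shape[:2]
        best_score, best_box = -1.0, (0, 0, w, h)
        for y in range(0, frame.shape[0] - h, stride):
            for x in range(0, frame.shape[1] - w, stride):
                score = ncc(frame[y:y + h, x:x + w], template)
                if score > best_score:
                    best_score, best_box = score, (x, y, w, h)
        return best_box

    def track_one_shot(frames, first_frame_boxes, search_radius=32, redetect_thr=0.3):
        # Propagate the first-frame boxes through the video; each object keeps
        # its initial-frame template, so identities stay linked across frames.
        templates = {o: crop(frames[0], b) for o, b in first_frame_boxes.items()}
        tracks = {o: [b] for o, b in first_frame_boxes.items()}
        for frame in frames[1:]:
            for obj, template in templates.items():
                x, y, w, h = tracks[obj][-1]
                best_score, best_box = -1.0, (x, y, w, h)
                # Local search around the previous location.
                for dx in range(-search_radius, search_radius + 1, 4):
                    for dy in range(-search_radius, search_radius + 1, 4):
                        cand = (x + dx, y + dy, w, h)
                        if cand[0] < 0 or cand[1] < 0:
                            continue  # candidate falls outside the frame
                        patch = crop(frame, cand)
                        if patch.shape != template.shape:
                            continue
                        score = ncc(patch, template)
                        if score > best_score:
                            best_score, best_box = score, cand
                if best_score < redetect_thr:
                    # The object has likely disappeared; re-detect it with the
                    # initial template instead of drifting away.
                    best_box = global_scan(frame, template)
                tracks[obj].append(best_box)
        return tracks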

Since explainable actions are estimated from state transitions, spurious transitions in the initialized video graph would lead to false actions; therefore, a video graph smoothing approach based on a fully-connected CRF [17] is introduced to smooth the attribute and relationship sequences. Unlike [6], which uses a sliding window to smooth the state sequences over a short term, our smoothing approach exploits the long-term dependencies of the whole state sequence. It maps the original hard state labels into a probabilistic space of states (attributes/relationships), and more accurate explainable action recognition is obtained after the state smoothing operation. For the final explainable video action reasoning, two action reasoning models (attribute-transition based and relationship-transition based) [6] are applied to the video graph to explain how the actions are executed, including who (particular objects), when (time), where (object locations) and how (what kind of state changes).
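The sketch below illustrates this kind of smoothing on a per-frame state probability sequence, using mean-field updates with a temporal Gaussian kernel and a Potts compatibility term in the spirit of the fully-connected CRF of [17]. The kernel bandwidth, Potts weight and iteration count are illustrative assumptions rather than the values used in our experiments.

    import numpy as np

    def smooth_states(probs, sigma=10.0, potts_w=2.0, n_iters=10):
        # probs: (T, L) per-frame probabilities over L states (attributes or
        # relationships).  Returns smoothed probabilities of the same shape.
        T, L = probs.shape
        unary = -np.log(probs + 1e-8)                  # unary potentials
        t = np.arange(T)
        # Fully-connected temporal kernel: every frame talks to every frame,
        # weighted by temporal proximity.
        kernel = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * sigma ** 2))
        np.fill_diagonal(kernel, 0.0)                  # exclude self-connections
        Q = probs.copy()
        for _ in range(n_iters):
            msg = kernel @ (1.0 - Q)                   # Potts compatibility message
            logits = -unary - potts_w * msg
            logits -= logits.max(axis=1, keepdims=True)
            Q = np.exp(logits)
            Q /= Q.sum(axis=1, keepdims=True)
        return Q

    # Usage: convert detector scores per frame to probabilities, smooth, then
    # take the arg-max state per frame before reading off state transitions.
    # smoothed = smooth_states(frame_state_probs)
    # labels = smoothed.argmax(axis=1)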

Our main contributions can be summarized as follows: (1) We propose a One-Shot Video Graph (OSVG) generation approach to achieve automated explainable action reasoning; (2) A video graph smoothing approach based on a fully-connected CRF is introduced to smooth the state sequence with long-term temporal dependencies; (3) The NTU RGBD [18] dataset is re-annotated as NTU RGBD++ with additional objects, attributes, and relationships for open benchmark verification of explainable video action reasoning. Experimental results on the CAD-120++ (daily life) [6] and new NTU RGBD++ (daily and social life) datasets show competitive performance for both state recognition and video action reasoning.

This journal paper extends our previous work EVAR [6] in several aspects. First, the video graph generation is extended from extensive manual annotation to automatic generation with a proposed one-shot target-aware tracking strategy, which enhances practical applicability by avoiding tremendous labeling work. Second, instead of using a sliding window for video graph smoothing, a new CRF smoothing approach is introduced to further improve the accuracy of explainable action recognition by reducing false-positive actions. Third, to enable additional evaluation on a more open benchmark, a new dataset, NTU RGBD++, is constructed based on the NTU RGBD dataset [18], and substantial experiments are conducted to verify the superior performance of the proposed work.

Section snippets

Video Action Recognition

Substantial progress has been made in the study of video action recognition over the past decade [1], [19], [2], [3], [4], [5], [20], [21], [22], [23]. However, beyond predicting a single action label for a long video sequence, explainable action analysis with reasoning has received rather limited attention until recently.

Early logical action recognition [24], [25], [26], [27] usually employed low-level features (motion, location and trajectories) to explain what had happened to the

Explainable Video Action Reasoning (EVAR)

The core of EVAR [6] is to interpret semantic state (attribute and relationship) changes of video content with pre-defined rules. Following the problem definition in [6], we only consider atomic actions, i.e., an action is defined by a single attribute change (open: closed → open) or relationship change (pick: not_holding → holding). Complex activities that involve multiple objects and atomic actions can be reasoned about with first-order logic. For example, the “having_meal” activity can be defined by

One-shot Video Graph Generation

In the previous work [6], video graphs are constructed with manually annotated object locations, which is time-consuming. To reduce the annotation cost, we only use the annotated objects in the first video frame and propose a one-shot tracking strategy for efficient object localization. As illustrated in Fig. 2, the overall framework of the proposed video graph generation method consists of a one-shot object tracking module and a scene graph prediction module. Specifically, the tracking module is

Explainable Action Reasoning

By sequentially observing the video state changes in the video graph, performed actions can be detected and explained with the pre-defined rules. However, because the initialized video graph is not as consistent as the semantic state changes in the real world, we propose a fully-connected CRF method for video graph smoothing; then, an attribute-based action reasoning model and a relationship-based action reasoning model are used for the final action reasoning.
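As a minimal illustration of this reasoning step, the sketch below scans consecutive frames of a (smoothed) video graph and emits an action wherever a rule-matching attribute or relationship transition occurs. The rule tables and the per-frame dictionary layout are illustrative assumptions, not the exact rule set of [6].

    # Each rule maps a single attribute or relationship transition to an action name.
    ATTRIBUTE_RULES = {("closed", "open"): "open", ("open", "closed"): "close"}
    RELATION_RULES = {("not_holding", "holding"): "pick", ("holding", "not_holding"): "place"}

    def reason_actions(video_graph):
        # video_graph: list of per-frame dicts with keys
        # 'attributes': {object: state} and 'relations': {(subject, object): state}.
        actions = []
        for t in range(1, len(video_graph)):
            prev, curr = video_graph[t - 1], video_graph[t]
            for obj, state in curr["attributes"].items():
                before = prev["attributes"].get(obj)
                if (before, state) in ATTRIBUTE_RULES:
                    actions.append((t, ATTRIBUTE_RULES[(before, state)], obj))
            for pair, state in curr["relations"].items():
                before = prev["relations"].get(pair)
                if (before, state) in RELATION_RULES:
                    actions.append((t, RELATION_RULES[(before, state)], pair))
        # Each detected action carries when (frame index), what (action name)
        # and who/where (the object or subject-object pair involved).
        return actions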

Experiments

In this section, the implementation details of the proposed work are first discussed, followed by the introduction of two explainable action reasoning datasets: the CAD-120++ dataset [6] and the newly collected NTU RGBD++ dataset. For a more thorough demonstration, we report the tracking performance, the accuracy of video graph generation, and the accuracy of action recognition. Finally, the effectiveness of the individual modules is discussed and analyzed in an ablation study.

Conclusions

In this paper, we propose a one-shot video graph generation method for explainable action reasoning. Given a video sequence based on one-shot annotation, we first estimate the concerned object locations by employing a proposed one-shot target-aware tracker. Then the scene graph of each video frame can be constructed by an attribute detector and a relationship detector based on the estimated object locations. After that, a fully-connected CRF model is used to smooth the attributes and

CRediT authorship contribution statement

Yamin Han: Conceptualization, Methodology, Software, Writing - original draft. Tao Zhuo: Methodology, Writing - review & editing. Peng Zhang: Data curation, Writing - review & editing. Wei Huang: Visualization. Yufei Zha: Validation. Yanning Zhang: Supervision. Mohan Kankanhalli: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (61971352, 61862043 and 62002188), the Fundamental Research Funds for the Central Universities (3102019ZY1004), the Natural Science Foundation of Shaanxi Province (2019JQ-014 and 2018JM6015), and the Natural Science Foundation of Jiangxi Province (No. 20204BCJ22011).

References (48)

  • X. Liang et al., Deep variation-structured reinforcement learning for visual relationship and attribute detection
  • C. Lu, R. Krishna, M. Bernstein, L. Fei-Fei, Visual relationship detection with language priors, in: European...
  • Z. Cui et al., Context-dependent diffusion network for visual relationship detection
  • K. Liang, Y. Guo, H. Chang, X. Chen, Visual relationship detection with deep structural ranking, in: Thirty-Second AAAI...
  • S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in:...
  • J. Dai, Y. Li, K. He, J. Sun, R-FCN: Object detection via region-based fully convolutional networks, in: Advances in...
  • T.-Y. Lin et al., Feature pyramid networks for object detection
  • L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H. Torr, Fully-convolutional Siamese networks for object...
  • B. Li et al., High performance visual tracking with Siamese region proposal network
  • H. Fan et al., Siamese cascaded region proposal networks for real-time visual tracking
  • P. Krähenbühl, V. Koltun, Efficient inference in fully connected CRFs with Gaussian edge potentials, in: Advances in...
  • A. Shahroudy et al., NTU RGB+D: A large scale dataset for 3D human activity analysis
  • D. Tran et al., Learning spatiotemporal features with 3D convolutional networks
  • S.D. Tran et al., Event modeling and recognition using Markov logic networks

Cited by (6)

  • Video surveillance using deep transfer learning and deep domain adaptation: Towards better generalization
    2023, Engineering Applications of Artificial Intelligence
    Citation Excerpt: Moving on, Zhuo et al. (2019) deploy prior knowledge and state transitions to develop an explainable human action analysis and understanding system. Similarly, in Han et al. (2022), Han et al. propose an explainable action reasoning approach using one-shot video graph generation. In Roy et al. (2021), the accuracy-explainability gap is addressed by adopting an interpretable, tractable probabilistic DL algorithm.

  • GDRL: An interpretable framework for thoracic pathologic prediction
    2023, Pattern Recognition Letters
    Citation Excerpt: Advances in deep learning have resulted in significant performance in medical image analysis tasks [1–3].

  • A Study on the Use of Attention for Explaining Video Summarization
    2023, NarSUM 2023 - Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos, Co-located with: MM 2023

Yamin Han received the B.S. degree from the Northwest Agriculture and Forestry University, China in 2015. She received her PhD degree from Northwestern Polytechnical University, China in 2021. Currently, she is an associate professor with the College of Information Engineering, Northwest Agriculture and Forestry University, China. Her current research interests include object tracking and detection, action recognition and reasoning, machine learning, and computer vision.

Tao Zhuo received the M.S. and Ph.D. degrees in computer science and technology from Northwestern Polytechnical University, Xi’an, China, in 2012 and 2016, respectively. He worked with the School of Computing, National University of Singapore, as a Research Fellow until 2021. He is currently working with the Shandong Artificial Intelligence Institute, Qilu University of Technology. His research interests include image/video processing, computer vision, and machine learning.

Peng Zhang received the B.E. degree from Xi’an Jiaotong University, China, in 2001, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2011. He is now a Professor with the School of Computer Science, Northwestern Polytechnical University, China. He is also the Chief Scientist with Mekitec OY, Finland. He has published more than 80 research papers in various journals and conferences, including CVPR, ACM Multimedia, the IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, and IEEE TRANSACTIONS ON MEDICAL IMAGING, and acts as the PI in three NSFC grants. His current research interests include computer vision, pattern recognition, and machine learning.

Wei Huang received the B.Eng. and M.Eng. degrees from the Harbin Institute of Technology, China, in 2004 and 2006, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2011. He worked with the University of California San Diego, USA, as well as the Agency for Science Technology and Research, Singapore, as a Postdoctoral Research Fellow until 2012. He has published 80+ academic journal/conference papers and has been acting as Principal Investigator in 14 national/provincial grants, including three NSFC grants and two NSF key grants in Jiangxi Province. His research interests include machine learning, pattern recognition, image processing, and computer vision. Dr. Huang received the Jiangxi Provincial Natural Science Award in 2018, the Most Interesting Paper Award of ICME-ASMMC in 2016, the Best Paper Award of MICCAI-MLMI in 2010, and was named a provincial young scientist of Jiangxi Province in 2015.

Yufei Zha received the Ph.D. degree in information and communication engineering from Air Force Engineering University, Xi’an, China, in 2009. He is currently an Associate Professor with Northwestern Polytechnical University, Xi’an, China. His current research interests include object detection, visual tracking, and machine learning.

Yanning Zhang received the B.S. degree from the Department of Electronic Engineering, Dalian University of Technology, Dalian, China, in 1988, the M.S. degree from the School of Electronic Engineering, and the Ph.D. degree from the School of Marine Engineering, Northwestern Polytechnical University, Xi’an, China, in 1993 and 1996, respectively. She is currently a Professor at the School of Computer Science, Northwestern Polytechnical University. Her current research interests include computer vision and pattern recognition, image and video processing, and intelligent information processing. Dr. Zhang was the Organization Chair of the Asian Conference on Computer Vision 2009, and has served as a program committee chair of several international conferences.

Mohan Kankanhalli is the Provost’s Chair Professor at the Department of Computer Science of the National University of Singapore. He is the Director of N-CRiPT and also the Dean, School of Computing at NUS. Mohan obtained his BTech from IIT Kharagpur and MS & PhD from the Rensselaer Polytechnic Institute. His current research interests are in Multimedia Computing, Multimedia Security & Privacy, Image/Video Processing and Social Media Analysis. He is active in the Multimedia Research Community and is on the editorial boards of several journals. Mohan is a Fellow of IEEE.
