
Pattern Recognition

Volume 121, January 2022, 108256

A hierarchical model for learning to understand head gesture videos

https://doi.org/10.1016/j.patcog.2021.108256

Highlights

  • Propose a hierarchical model to understand head gesture videos.

  • Utilize the multi-task learning framework.

  • Compress features using stacked BLSTM.

  • Propose several applications in multiple fields.

Abstract

Head gesture videos of a person bear rich information about the individual. Automatically understanding these videos can empower many useful human-centered applications in areas such as smart health, education, work safety and security. To understand a video’s content, the low-level head gesture signals it carries, which capture characteristics of both human postures and motions, need to be translated into high-level semantic labels. To meet this aim, we propose a hierarchical model for learning to understand head gesture videos. Given a head gesture video of arbitrary length, the model first segments the full-length video into multiple short clips for clip-based feature extraction. Multiple base feature extraction procedures are then independently tuned via a set of peripheral learning tasks without consuming any labels of the goal task. These independently derived base features are subsequently aggregated through a multi-task learning framework, coupled with a feature dimensionality reduction module, to optimally learn to accomplish the end video understanding task in a weakly supervised manner, utilizing the limited amount of video labels available for the goal task. Experimental results show that the hierarchical model is superior to multiple state-of-the-art peer methods in tackling a variety of video understanding tasks.

Introduction

Head gesture videos bear rich information about a captured human subject. Understanding behavioral clues latent in such videos can empower a range of human-centered multimedia applications. Given the variety of potential application areas, such as smart health, education, work safety and security surveillance, it is highly desirable to have a generic method that can learn to interpret head gesture videos for any assigned video understanding task. Ideally, each time the model is applied to such a learning-to-understand task, it should learn an optimal set of video features that most informatively support that task.

In response to this demand, this paper introduces a general-purpose model that learns to understand human head gesture videos of variable lengths, empowering a variety of applications using only a small amount of task-specific video labels. The learning process automatically identifies and extracts an optimal set of features for any assigned video understanding task, utilizing a multi-task learning framework together with transfer learning. The multi-task learning scheme [1] helps the hierarchical model attain satisfactory training performance by learning features from a set of auxiliary tasks that do not consume any labels of the goal task. These features are subsequently leveraged by a deep network trained to generate either numeric or text labels for the video understanding task at hand.

To acquire human head gesture videos, this work utilizes a vision-based approach, which is inspired by prior studies on inferring human skeletal movements using visual sensors [2]. This approach is user-friendly because it eliminates the inconvenience of alternative approaches [3], [4] that rely on body-contact sensors for posture and gesture acquisition. Building upon an extended line of prior investigations on acquiring 3D head gestures using a single commodity color camera, this study advances the state of the art by developing a new video understanding model that optimally learns to interpret head gesture videos for tackling a variety of video comprehension tasks.

The proposed model leverages the bi-directional long short-term memory (BLSTM) network [5] and the multi-task learning scheme [1] to understand head gesture videos according to a set of visual features optimally derived by the model. This choice is due to the network’s satisfactory capability in modeling temporal dependencies among sequential data of varying lengths. Gold standard labels are annotated at the granularity of full-length sequences, which makes human specialists more comfortable and more accurate during annotation. The proposed video understanding model is responsible for interpreting both the meanings suggested by individual head gestures embedded in a sequence and those arising from interactions among adjacent gestures. In this way, the trained model is capable of assessing head gesture videos of any length, which suits a variety of applications. Thanks to this design, the amount of human labeling effort is greatly reduced compared with the alternative of annotating gold standard labels for individual head gestures, which would require costly human labeling.
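
To make this design concrete, the following minimal PyTorch sketch shows how a stacked BLSTM could map a variable-length sequence of clip-level feature vectors to a single sequence-level label. The feature size, hidden size and number of classes are hypothetical placeholders rather than values taken from the paper, and the mean-pooling head is only one plausible way to produce a per-video prediction.

    import torch
    import torch.nn as nn

    class ClipSequenceClassifier(nn.Module):
        """Stacked BLSTM mapping a variable-length sequence of clip features
        to one sequence-level label (all dimensions are illustrative)."""

        def __init__(self, feat_dim=256, hidden=128, num_layers=2, num_classes=4):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                                 batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, num_classes)

        def forward(self, clip_feats, lengths):
            # clip_feats: (batch, max_clips, feat_dim); lengths: valid clip counts
            packed = nn.utils.rnn.pack_padded_sequence(
                clip_feats, lengths.cpu(), batch_first=True, enforce_sorted=False)
            out, _ = self.blstm(packed)
            out, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
            lengths = lengths.to(out.device)
            # Mask out padded clips, mean-pool the rest, and predict one label per video
            mask = (torch.arange(out.size(1), device=out.device)[None, :]
                    < lengths[:, None]).unsqueeze(-1).float()
            pooled = (out * mask).sum(dim=1) / lengths[:, None].float()
            return self.head(pooled)

Training such a classifier on sequence-level gold standard labels only requires a standard cross-entropy loss over the returned logits, which is consistent with annotating full-length sequences rather than individual gestures.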

The main contributions of this paper include:

  • We propose a generic model that learns to understand head gesture videos for tackling a variety of high-level semantic video understanding tasks. Once trained, the model is able to generate both numeric and text labels representing automatic content understanding results for videos of arbitrary lengths in executing any given video comprehension task.

  • The proposed model is designed utilizing a multi-task learning framework, which enables the model to be satisfactorily trained over a limited amount of labeled data through transfer learning. The framework also enables the model to learn to extract a tailored set of features to optimize its performance in tackling any given video understanding task.

  • We further demonstrate the usefulness of the model through several case applications of head gesture video understanding in our experiments, including 1) assessing the health impact of head gestures on cervical vertebrae for preventive care, 2) understanding student concentration levels in front of computers for distance education, and 3) classifying head motion patterns for low-level semantic understanding.

Section snippets

Related work

We will review selected prior studies on a few media computing topics that most closely relate to the present study.

Multi-Task Learning (MTL): When more than one objective function is sought in building a media application, multi-task learning [1] presents an effective solution strategy. Along this line, the Multi-Task Convolutional Neural Network [6] was recently proposed, in which parameters of the lower layers of the network are shared whereas the network is then divided into multiple branches at
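
As a rough, self-contained illustration of the hard parameter sharing idea mentioned in this snippet (lower layers shared across tasks, followed by task-specific branches), the sketch below uses hypothetical layer sizes and two made-up task heads; it is not the architecture of [6].

    import torch.nn as nn

    class SharedTrunkMultiTaskNet(nn.Module):
        """Hard parameter sharing: a shared lower trunk followed by
        task-specific branches (all sizes are illustrative)."""

        def __init__(self, num_classes_a=10, num_classes_b=2):
            super().__init__()
            self.trunk = nn.Sequential(                      # shared lower layers
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head_a = nn.Linear(64, num_classes_a)       # branch for task A
            self.head_b = nn.Linear(64, num_classes_b)       # branch for task B

        def forward(self, x):
            shared = self.trunk(x)                           # features shared by all tasks
            return self.head_a(shared), self.head_b(shared)

During training, the per-task losses from the two heads are typically combined into a weighted sum so that gradients from every task update the shared trunk.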

Method

When designing the hierarchical model, we acknowledge that only a limited number of videos may carry task-relevant labels, which is generally true of videos captured every day by pervasive cameras and especially the case in health care, education and industrial applications. Therefore, we first leverage a set of auxiliary tasks to learn to extract features from head gesture videos and then transfer these features to tackle our video understanding task as a strategy to minimize the model’s
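
To make the transfer step concrete, the hypothetical sketch below freezes a feature extractor assumed to have already been trained on the auxiliary tasks and fits only a lightweight head on the scarce goal-task labels. The function name, dimensions and the simple freezing strategy are illustrative assumptions; the actual model aggregates several base feature sets through its multi-task learning framework rather than reusing a single frozen extractor.

    import torch.nn as nn

    def build_goal_task_model(aux_feature_extractor: nn.Module,
                              feat_dim: int = 256, num_labels: int = 3) -> nn.Module:
        # Reuse auxiliary-task features as-is: freeze the extractor so the
        # few goal-task labels only have to fit the small head below.
        for p in aux_feature_extractor.parameters():
            p.requires_grad = False
        head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                             nn.Linear(64, num_labels))
        # Assumes the extractor maps a video (or batch of clips) to (batch, feat_dim)
        return nn.Sequential(aux_feature_extractor, head)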

Experimentation

To validate the advantage of the proposed video understanding model, we report the performance of the proposed method in comparison with that of multiple state-of-the-art peer methods over a series of video understanding tasks. We use ten-fold cross-validation in all our benchmark experiments. Unless otherwise noted, all experiments are conducted with the dynamic frame skipping mechanism employed. Due to the length constraint, details of our model tuning process are provided in the appendix and

Conclusion

This paper presents a hierarchical model for learning to understand head gesture videos of any length and in a variety of tasks. The model is able to train itself over a small amount of labeled samples thanks to a multi-task learning framework adopted. For each given task, the model also learns to extract an optimal set of features through transferring and augmenting three sets of video features, which are first separately learned from a set of auxiliary tasks and then optimally aggregated for

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors gratefully acknowledge the anonymous reviewers for their comments and their enormous help in revising this paper. This work was partially supported by the NSF of China (Nos. 6217071048, 61672326), and also sponsored by Zhejiang Lab (No. 2020NB0AB02).


References (31)

  • S. Ruder, An overview of multi-task learning in deep neural networks, CoRR (2017).

  • R. Ranjan et al., HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2019).

  • A. Kendall et al., Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, CVPR (2018).

  • J. Cao et al., Partially shared multi-task convolutional neural network with local constraint for face attribute learning, CVPR (2018).

  • M. Haußmann et al., Variational Bayesian multiple instance learning with Gaussian processes, CVPR (2017).

    Jiachen Li is a Ph.D. candidate of School of Software, Shandong University. He received his B.E. from Ocean University of China in 2016. His research interests include object tracking, pose estimation, and head posture analysis, etc.

    Songhua Xu is a computer scientist. He received his M.S., M.Phil., and Ph.D. from Yale University, New Haven, CT, USA, all in computer science. His research interests include healthcare informatics, information retrieval, knowledge management and discovery, intelligent web and social media, visual analytics, user interface design, and multimedia.

    Xueying Qin is a professor of School of Software, Shandong University. She received her Ph.D. from Hiroshima University of Japan in 2001, and M.S. and B.S. from Zhejiang University and Peking University in 1991 and 1988, respectively. Her main research interests are augmented reality, video-based analyzing, and photorealistic rendering, etc.
