Video quality evaluation toward complicated sport activities for clustering analysis

https://doi.org/10.1016/j.future.2021.01.018

Highlights

  • We propose a ranking algorithm to extract the positions of the five fingers of the human hand, based on which the deep hand gesture representation is hierarchically learned.

  • Acoustic features from many human activities also contribute to the quality assessment. We extract multiple acoustic features from the audio associated with each human activity video.

  • Empirical results show that our probabilistic quality model is highly extensible: additional visual/acoustic features can be encoded according to different applications.

Abstract

Automatically clustering various sophisticated human activities (e.g., dancing, martial arts, and gymnastics) based on their quality scores is an indispensable technique in physical training, human–computer interaction, etc. Conventionally, many action recognition models are built upon the visual/semantic appearance of human body movements. Recently, owing to the introduction of the Microsoft Kinect, many skeleton-based human action understanding frameworks have been proposed. In this work, we propose a novel method to cluster complicated human actions by quality for a contactless operative video reading system (COVRS). More specifically, we first extract the skeleton by leveraging the Kinect, which is subsequently fed into an aggregation deep neural network to extract a deep feature for each human action skeleton. In COVRS, the human hand gesture is an informative clue. Thus, we propose a ranking algorithm to extract the positions of the five fingers of the human hand, based on which the deep hand gesture representation is hierarchically learned. Notably, the acoustic features of many human activities also contribute to quality assessment, so we extract multiple acoustic features from the audio associated with each human activity video. Finally, based on the above human skeleton and hand gesture deep features, as well as the shallow acoustic features, we employ a probabilistic model to integrate them for clustering the various human activities according to their quality in COVRS. Comprehensive experiments have demonstrated the effectiveness and efficiency of our method. Besides, empirical results have shown that our probabilistic quality model is highly extensible, where additional visual/acoustic features can be encoded according to different applications.

Introduction

It is generally acknowledged that human activities can express emotions and language. Intelligently understanding the quality of various human activities has many applications. For example, a contactless operative video reading system (COVRS), a special device that captures hand motion images to recognize movements, is an important application in the medical domain. The quality score can guide the performer of a hand movement professionally and skillfully. Besides, for some sport activities such as figure skating, the skating rink is usually huge, and thus it is difficult for the trainer to inspect the action details of each skater. We therefore expect an intelligent system that can automatically assess the quality of human movements and accordingly provide feedback to the trainer for adjusting the activity. In the literature, there are two directions of human activity recognition. The first direction is based directly on images/videos. It involves three main steps: action component discovery, shallow/deep feature extraction, and feature classification. Among them, action component discovery attempts to localize the human body parts that influence the activity attributes. Shallow/deep feature extraction engineers a feature for each action component and subsequently combines them to represent each human activity. Such a representation is finally classified into different categories. The second direction is to employ the Kinect to represent the human skeleton, which provides an informative feature for understanding each human activity. Distinguished from conventional visual appearances, the Kinect-based human skeleton is free from noisy backgrounds and partial occlusion. Moreover, it can be robustly predicted by a hardware-friendly model. In the literature, many Kinect-based human activity understanding frameworks have been proposed, and some have even been commercialized. However, it is still difficult to directly use the Kinect-based skeleton for assessing the aesthetic quality of human activities, for the following reasons:

  • (1)

    Each Kinect-based human skeleton contains multiple human body parts that interact with each other. How to design an end-to-end deep neural network to represent such a human skeleton remains unsolved. Previous deep architectures can only represent the entire image or multiple randomly cropped image patches, so deeply representing these inter-correlated human parts is a challenge. Moreover, for some human activities such as piano playing, characterizing only the relatively coarse human skeleton is far from satisfactory; we need additional fine-grained hand gesture information. However, hand gestures cannot be captured by the Kinect. It is necessary to capture this feature channel and seamlessly integrate it for deeply representing each human activity;

  • (2)

    Each human activity video (such as piano playing and artistic gymnastics) is typically associated with some audio. The audio information also contributes to evaluating the aesthetic quality of human activities. Typically, each audio clip can be described by a series of acoustic features, such as MFCC and LSF (a minimal MFCC extraction sketch is given after this list). In practice, how to optimally fuse these heterogeneous features from both visual and acoustic sources is a real difficulty. Potential challenges include how to exploit the intrinsic inter-correlations among different features. Moreover, how to determine the weights of different feature channels is another difficulty;

  • (3)

    Aesthetically quantifying human activities is a subjective task, which requires the experience of massive-scale users. However, it is difficult to integrate the aesthetic quality opinions of multiple users with different occupations, backgrounds, races, and educational levels. Ideally, we want to model the aesthetic quality experience of multiple professional trainers, but achieving such a goal remains unsolved. Furthermore, toward an extensible framework, we expect a system that can flexibly encode auxiliary feature channels in order to satisfy different applications. How to build such an extensible system is a tough challenge.
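To make the acoustic channel in challenge (2) concrete, the following minimal sketch extracts MFCC descriptors from the audio track of an activity clip and pools them into a fixed-length clip feature. It is an illustrative example rather than the paper's exact feature set: the file path is a placeholder, and further descriptors such as LSF could be appended in the same way.

```python
# Hedged sketch: per-clip acoustic descriptors from the audio of an activity video.
# "activity_clip.wav" is a placeholder path; LSF or other descriptors could be
# concatenated alongside the MFCC statistics.
import numpy as np
import librosa

def acoustic_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)                      # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    # Pool the frame-level MFCCs into one fixed-length descriptor per clip.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

feat = acoustic_features("activity_clip.wav")
print(feat.shape)  # (26,) for 13 MFCCs: per-coefficient means and standard deviations
```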

To handle the aforementioned problems, we propose a novel multi-view learning, shallow/deep features-based human activity quality model. Our quality model can automatically rank test images based on their quality clues. The model captures the coarse global human skeleton by making use of the Kinect, while fine-grained activity details, i.e., different hand gestures, are also integrated. In addition, the environmental audio associated with each activity video is incorporated. An overview of our proposed method is presented in Fig. 1. More specifically, given massive-scale human activity images/videos, we first employ the Kinect to extract the skeleton structure from each image/video. Meanwhile, we extract the hand gesture of each performer as well as the environmental audio. Subsequently, we design two aggregation deep neural networks to represent the human skeleton and the hand gesture, respectively, and multiple acoustic features are extracted from the environmental audio. Afterward, we propose a multi-view feature learning algorithm to seamlessly integrate these shallow/deep features, wherein the weight of each feature channel is dynamically tuned. Based on the fused feature, a probabilistic transferal model is developed to optimally encode the quality experience of human activities for the contactless operative video reading system (COVRS). Extensive experimental results on our collected human activity quality data set have demonstrated the competitiveness of our method. Moreover, our method is validated as highly flexible in supporting auxiliary feature channels.
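The multi-view fusion step described above can be pictured with the simplified sketch below. It is not the paper's learning algorithm; it merely illustrates one plausible way to tune channel weights dynamically, here by scoring how well each view (skeleton, hand gesture, audio) clusters on its own before fusing the weighted views and clustering them jointly. The feature dimensions and the silhouette-based weighting rule are assumptions made for illustration.

```python
# Simplified stand-in for multi-view fusion with dynamically tuned channel weights
# (not the paper's exact algorithm). Each view is weighted by its own clustering
# quality, then the weighted views are concatenated and clustered jointly.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def fuse_and_cluster(views, n_clusters=3, seed=0):
    """views: list of (n_samples, d_i) feature matrices, one per channel."""
    views = [StandardScaler().fit_transform(v) for v in views]
    scores = []
    for v in views:
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(v)
        scores.append(max(silhouette_score(v, labels), 1e-3))   # keep weights positive
    weights = np.array(scores) / np.sum(scores)                 # dynamic channel weights
    fused = np.hstack([w * v for w, v in zip(weights, views)])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(fused)
    return labels, weights

rng = np.random.default_rng(0)
skeleton_feat = rng.normal(size=(120, 64))   # placeholder deep skeleton features
gesture_feat = rng.normal(size=(120, 32))    # placeholder deep hand-gesture features
audio_feat = rng.normal(size=(120, 26))      # placeholder acoustic features
cluster_labels, channel_weights = fuse_and_cluster([skeleton_feat, gesture_feat, audio_feat])
print("channel weights:", channel_weights)
```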

Our method can be applied to a contactless operative video reading system to evaluate the quality of human movements and hand gestures, and to improve the safety and reliability of non-contact surgery. In summary, the key contributions of this work are threefold: (1) a novel computational quality model to quantify human activity aesthetics, whereas previous quality models mainly deal with image/video visual attractiveness; (2) a novel deep aggregation network to integrate deep features from multiple activity components to represent each image/video; and (3) a new multi-view feature learning algorithm to seamlessly and collaboratively combine multiple heterogeneous features, wherein the weights of the multiple features can be adjusted dynamically.

Section snippets

Related work

Our work is closely related to two research topics in image/video analytics and machine learning: information quality assessment and data quality modeling. In the following, we briefly review each of these two topics.

Our proposed approach

As mentioned above, our proposed human activity recognition framework involves three key components: multimodal feature extraction (Kinect-based human skeleton, hand gesture recognition, and acoustic features), deep feature representation, and a probabilistic quality model. In the following, we introduce each module in detail.
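As a rough picture of the deep feature representation module, the sketch below shows a small aggregation network that embeds each body part (or finger position) with a shared encoder and then pools the part embeddings into one activity-level feature. The layer widths, mean pooling, and 25-joint input are illustrative assumptions, not the architecture reported in this paper.

```python
# Hedged sketch of an aggregation network over part-based skeleton features.
# Layer widths, mean pooling, and the 25-joint input are illustrative assumptions.
import torch
import torch.nn as nn

class PartAggregationNet(nn.Module):
    def __init__(self, part_dim=3, hidden=64, out_dim=128):
        super().__init__()
        # Shared encoder applied to every body part / finger position.
        self.part_encoder = nn.Sequential(
            nn.Linear(part_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, parts):
        # parts: (batch, n_parts, part_dim), e.g. 25 Kinect joints with xyz coordinates.
        h = self.part_encoder(parts)      # (batch, n_parts, hidden)
        pooled = h.mean(dim=1)            # aggregate parts by mean pooling
        return self.head(pooled)          # (batch, out_dim) activity-level deep feature

net = PartAggregationNet()
dummy_batch = torch.randn(8, 25, 3)       # 8 clips, 25 joints, xyz coordinates
print(net(dummy_batch).shape)             # torch.Size([8, 128])
```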

Experimental results and analysis

In this section, we conduct experiments on a mobile robot equipped with a Kinect sensor that captures real-time human hand gestures. We define the relationship between hand gestures and robot behaviors, and then design an application of hand gesture recognition in a sign language system.
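The gesture-to-behavior mapping used in this experiment can be pictured with the toy dispatcher below. The gesture labels and robot commands are hypothetical placeholders; a real setup would consume labels from the recognition model and issue commands through the robot's control interface.

```python
# Toy dispatcher from recognized hand-gesture labels to robot behaviors.
# Gesture names and behaviors are hypothetical placeholders.
GESTURE_TO_BEHAVIOR = {
    "open_palm": "stop",
    "fist": "move_forward",
    "point_left": "turn_left",
    "point_right": "turn_right",
    "thumbs_up": "confirm",
}

def dispatch(gesture_label):
    behavior = GESTURE_TO_BEHAVIOR.get(gesture_label, "idle")  # unknown gestures -> idle
    print(f"gesture={gesture_label!r} -> behavior={behavior!r}")
    return behavior

# Simulated stream of recognized gestures.
for g in ["open_palm", "fist", "wave", "point_left"]:
    dispatch(g)
```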

Conclusions

This paper proposes a novel quality model to cluster human activities for a contactless operative video reading system. The proposed framework involves three modules: Kinect-based multimodal visual/acoustic feature extraction to represent human gestures; deep learning to encode the above features; and a probabilistic transferal model for clustering the various human activities. The main goal of our method is to improve the performance of the contactless operative video reading system under different

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (14)

  • AhmadM. et al.

    Variable silhouette energy image representations for recognizing human actions

    Image Vis. Comput.

    (2010)
  • ZhangZ. et al.

    Recognizing human action and identity based on affine-SIFT

  • WangL. et al.

    Action recognition with trajectory-pooled deep convolutional descriptors

  • GanC. et al.

    Devnet: A deep event network for multimedia event detection and evidence recounting

  • AgarwalA. et al.

    Recovering 3D human pose from monocular images

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2006)
  • KolevK. et al.

    Fast joint estimation of silhouettes and dense 3D geometry from multiple images

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • N.R.Howe

    Silhouette lookup for monocular 3D pose tracking

    Image Vis. Comput.

    (2007)
There are more references available in the full text version of this article.

Wei Yang was born in Linzhou, Henan, P.R. China, in 1988. He received his Master's degree from Chengdu Sport University. He now works at Fenyang College of Shanxi Medical University. His research interests include physical education and training, and public physical education services.

E-mail: [email protected]

Jian Wang was born in Fenyang, Shanxi, P.R. China, in 1988. He received his Master's degree in computer science from Northwestern Polytechnical University, where he is now a doctoral candidate. He works at Fenyang College of Shanxi Medical University. His research interests include digital image processing and artificial intelligence.

E-mail: [email protected]

Jinlong Shi was born in Binzhou, Shandong, P.R. China, in 1989. He received his Master's degree from Chengdu Sport University, P.R. China. He now works in the Department of Physical Education, Northwest University. His research interests include physical education and training, and sports geography.

E-mail: [email protected]
