Video quality evaluation toward complicated sport activities for clustering analysis
Introduction
It is generally acknowledged that human activities can express emotions and language. Intelligently understanding the quality of various human activities has many applications. For example, the contactless operative video reading system (COVRS), a special device that captures hand-motion images to recognize movements, is a significant application in the medical domain. Its quality score can guide performers of hand movements toward professional and skillful execution. Besides, for some sport activities like figure skating, the skating rink is usually huge, so it is difficult for a trainer to inspect the action details of each skater. We therefore expect an intelligent system that can automatically assess the quality of human movement and accordingly give the trainer feedback for adjusting the activity.
In the literature, there are two directions of human activity recognition. The first is based directly on images/videos. It involves three main steps: action component discovery, shallow/deep feature extraction, and feature classification. Among them, action component discovery attempts to localize the human body parts that influence the activity attributes. Shallow/deep feature extraction engineers a feature for each action component and subsequently combines them to represent each human activity. Such a representation is finally classified into different categories. The second direction employs Kinect to represent the human skeleton, which characterizes informative features for understanding each human activity. Distinguished from conventional visual appearances, the Kinect-based human skeleton is free from noisy backgrounds and partial occlusion. Moreover, it can be robustly predicted by a hardware-friendly model. Many Kinect-based human activity understanding frameworks have been proposed, and some have even been commercialized.
However, it is still difficult to directly use the Kinect-based skeleton to assess the aesthetic quality of human activities, for the following reasons:
- (1)
Each Kinect-based human skeleton contains multiple human body parts that interact with each other. How to design an end-to-end deep neural network to represent such a skeleton remains unsolved. Previous deep architectures can only represent the entire image or multiple randomly cropped image patches; deeply representing these inter-correlated human parts might be a challenge. Moreover, for some human activities like piano playing, characterizing only the relatively coarse human skeleton is far from satisfactory; we also need fine-grained hand-gesture information. However, hand gestures cannot be captured by Kinect. It is necessary to capture this feature channel and seamlessly integrate it into the deep representation of each human activity;
- (2)
Each human activity video (such as piano playing and artistic gymnastics) is typically associated with audio. The audio information also contributes to evaluating the aesthetic quality of human activities. Typically, each audio clip can be described by a series of acoustic features, such as MFCC and LSF. In practice, optimally fusing these heterogeneous features from both visual and acoustic sources is a real difficulty. Potential challenges include how to exploit the intrinsic inter-correlations among different features, and how to determine the weights of the different feature channels;
- (3)
Aesthetically quantifying human activities is a subjective task, which requires the experiences of massive-scale users. However, it is difficult to integrate the aesthetic quality opinions of multiple users with different occupations, backgrounds, races, and education levels. Ideally, we want to model the aesthetic quality experiences of multiple professional trainers, but achieving such a goal remains unsolved. Furthermore, toward an extensible framework, we expect a system that can flexibly encode auxiliary feature channels to satisfy different applications. Building such an extensible system is a tough challenge.
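To make the fusion difficulty in challenge (2) concrete, the sketch below shows one simple way to combine heterogeneous feature views (e.g., skeleton, hand gesture, MFCC) with per-channel weights. This is a generic illustration under assumed feature dimensions and weights, not the paper's actual multi-view learning algorithm:

```python
import numpy as np

def fuse_views(views, weights):
    """Concatenate L2-normalized feature views, each scaled by its channel weight.

    views   -- list of 1-D feature vectors (e.g. skeleton, hand gesture, MFCC)
    weights -- non-negative channel weights, renormalized to sum to 1
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # weights live on the probability simplex
    fused = []
    for v, wi in zip(views, w):
        v = np.asarray(v, dtype=float)
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm                 # per-view L2 normalization
        fused.append(wi * v)
    return np.concatenate(fused)

# Hypothetical dimensions: skeleton (6-D), hand gesture (4-D), MFCC (3-D)
skeleton = np.random.rand(6)
gesture  = np.random.rand(4)
mfcc     = np.random.rand(3)
fused = fuse_views([skeleton, gesture, mfcc], weights=[0.5, 0.3, 0.2])
```

Normalizing each view first keeps a high-dimensional channel from dominating the fused vector; in the paper's framework the weights would be tuned dynamically rather than fixed by hand.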
To handle the aforementioned problems, we propose a novel multi-view learning, shallow/deep features-based human activity quality model. Our model can automatically rank test images based on their quality cues. It captures the coarse global human skeleton by making use of Kinect, while fine-grained activity details, i.e., different hand gestures, are also integrated. The environmental audio associated with each activity video is likewise incorporated. An overview of our proposed method is presented in Fig. 1. More specifically, given massive-scale human activity images/videos, we first employ Kinect to extract the skeleton structure from each image/video. Meanwhile, we extract the hand gesture of each performer as well as the environmental audio. Subsequently, we design two aggregation deep neural networks to represent the human skeleton and the hand gesture, respectively, and multiple acoustic features are extracted from the environmental audio. Afterward, we propose a multi-view feature learning algorithm to seamlessly integrate these shallow/deep features, wherein the weight of each feature channel is dynamically tuned. Based on the fused feature, a probabilistic transferal model is developed to optimally encode the quality experiences of human activities toward the contactless operative video reading system (COVRS). Extensive experimental results on our collected human activity quality data set demonstrate the competitiveness of our method. Moreover, our method is validated as highly flexible in supporting auxiliary feature channels.
Our method can be applied to the contactless operative video reading system to evaluate the quality of human movements and hand gestures, and to improve the safety and reliability of non-contact surgery. In summary, the key contributions of this work are threefold: (1) a novel computational quality model to quantify human activity aesthetics, whereas previous quality models mainly deal with image/video visual attractiveness; (2) a novel deep aggregation network to integrate deep features from multiple activity components to represent each image/video; and (3) a new multi-view feature learning algorithm to seamlessly and collaboratively combine multiple heterogeneous features, wherein the weights of the features can be adjusted dynamically.
Related work
Our work is closely related to two research topics in image/video analytics and machine learning: information quality assessment and data quality models. In the following, we briefly review each of the two topics.
Our proposed approach
As mentioned above, our proposed human activity recognition framework involves three key components: multimodal feature extraction (Kinect-based human skeleton, hand gesture recognition, and acoustic features), deep feature representation, and a probabilistic quality model. In the following, we introduce each module in detail.
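As an illustration of what a Kinect-based skeleton descriptor might encode, one common choice is the angle at each joint computed from 3-D joint coordinates. This is a generic sketch, not the paper's exact representation; the joint names and coordinates below are hypothetical:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (radians) at joint b, formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against tiny floating-point overshoot outside [-1, 1]
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical 3-D coordinates (meters) for three arm joints
shoulder = [0.0, 1.4, 0.0]
elbow    = [0.0, 1.1, 0.1]
wrist    = [0.0, 0.9, 0.4]
angle = joint_angle(shoulder, elbow, wrist)
```

A vector of such angles over all tracked joints, stacked across frames, gives a pose descriptor that is invariant to the performer's position in the room, which is one reason skeleton features resist noisy backgrounds.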
Experimental results and analysis
In this section, we conduct experiments on a mobile robot equipped with a Kinect sensor that captures real-time human hand gestures. We define the relationship between hand gestures and robot behaviors. Then, we design an application of hand gesture recognition in a sign language system.
Conclusions
This paper proposes a novel quality model to cluster human activities toward the contactless operative video reading system. The proposed framework involves three modules: Kinect-based multimodal visual/acoustic feature extraction to represent human gesture; deep learning to encode the above features; and a probabilistic transferal model for clustering the various human activities. The main goal of our method is to improve the performance of the contactless operative video reading system under different
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Wei Yang was born in Linzhou, Henan, P.R. China, in 1988. He received the Master's degree from Chengdu Sport University. He now works at Fenyang College of Shanxi Medical University. His research interests include physical education and training, and public physical education services.
E-mail: [email protected]
Jian Wang was born in Fenyang, Shanxi, P.R. China, in 1988. He received the Master's degree in computer science from Northwestern Polytechnical University, where he is now pursuing a doctorate. He works at Fenyang College of Shanxi Medical University. His research interests include digital image processing and artificial intelligence.
E-mail: [email protected]
Jinlong Shi was born in Binzhou, Shandong, P.R. China, in 1989. He received the Master's degree from Chengdu Sport University, P.R. China. He now works in the Department of Physical Education, Northwest University. His research interests include physical education and training, and sports geography.
E-mail: [email protected]