Elsevier

Pattern Recognition

Volume 48, Issue 4, April 2015, Pages 1302-1314

Conditional distance based matching for one-shot gesture recognition

https://doi.org/10.1016/j.patcog.2014.10.026

Highlights

  • We propose a new distance measure, called conditional distance, between two gesture sequences when we have only one or a few samples per gesture class.

  • Conditional distance is the distance between query and model gesture sequences in the presence of a third (anchor) gesture sequence.

  • We propose a speedup strategy for computing conditional distances by pre-selecting the anchor.

  • We also propose a conditional distance based simultaneous gesture segmentation and recognition algorithm called conditional level building.

  • We show results above 82% on a multiple-subject dataset spanning 179 classes.

Abstract

The problem of matching gestures when there is only one or a few samples per class is considered in this paper. The proposed approach shows that much better results are achieved by considering the distance between the patterns of frame-wise distances of two gesture sequences with a third (anchor) sequence from the modelbase. This measure is called conditional distance, and the distance patterns are referred to as “warp vectors”. If the warp vectors are similar, then so are the sequences; if not, they are dissimilar. At the algorithmic core are two dynamic time warping processes: one to compute the warp vectors with the anchor sequence, and the other to compare these warp vectors. To reduce the complexity, a speedup strategy is proposed that pre-selects “good” anchor sequences. Conditional distance is used for both individual and sentence-level gesture matching, on both single and multiple subject datasets. Experiments show improved performance, above 82%, spanning 179 classes.

Introduction

Analyzing and recognizing human gestures is important for human-computer interaction. The large number of human gesture categories, such as sign language, traffic signals, and everyday actions, along with subtle cultural variations within gesture classes, makes gesture recognition an interesting and challenging problem. In most naturally occurring scenarios, gestures are connected in a continuously varying stream, without any obvious break between individual gestures [21]. Identifying each of these individual gestures gives a good representation for ultimately translating visual communication into speech or other forms of interaction. Such labeling tasks have many challenges.

Labeling these continuous gesture streams, or queries, involves matching temporally segmented individual gestures to a modelbase. If the modelbase has many samples per class, statistics of that particular class can be learnt. Having only one or a few samples deters class-statistic learning approaches to classification, as the full range of variation is not covered. One of the key components of a matching algorithm, apart from feature extraction, is the gesture-to-gesture distance. These distances should define, in a concrete way, what it means for data points in such a class space to be “near to” or “far away from”1 each other. One commonly used approach is to take pair-wise distances (using a distance function) between the query and all available models and discern which are close (classified as the same) or far away (classified as not the same). A commonly used distance function is dynamic time warping with a 1-Nearest-Neighbor approach for classification [12].
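The baseline pairing of dynamic time warping with 1-Nearest-Neighbor classification can be sketched as follows (a minimal illustration, not the paper's implementation; the function names and the frame-feature representation are assumptions):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two feature sequences.

    a, b: 2-D arrays of shape (n_frames, n_features)."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame l2 distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def nn_classify(query, modelbase):
    """1-NN: label of the model sequence closest to the query under DTW.

    modelbase: list of (label, sequence) pairs."""
    return min(modelbase, key=lambda kv: dtw_distance(query, kv[1]))[0]
```

With one sample per class, each modelbase entry is a class, and the query takes the label of its nearest model.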

In our work, a matching algorithm based on a level building approach is proposed. This algorithm is built on a one-shot learning framework (a single sample per class). The proposed algorithm (a) handles both isolated and continuous gesture queries; (b) eliminates the need for temporal segmentation; and (c) works with a single sample per class. A new distance function is defined and serves as the core of our algorithm. Each gesture sequence is viewed as a curve, and each curve as a data point in a space formed by all the gesture classes. The pair-wise distance pattern between two gesture sequences, conditioned on a third (anchor) sequence, is considered. These distance pattern vectors are called “warp vectors”, and the resulting measure is called the “conditional distance”. At the algorithmic core are two dynamic time warping processes: one to compute the warp vectors with the anchor sequence, and the other to compare these warp vectors. Related measures have been proposed earlier, for example, the Mahalanobis distance, where intra-class variation, i.e., the variation between instances, accounts for the scaling that the distance measure needs. Our work instead exploits the variation between the classes themselves, as the modelbase contains a single sample per class.
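The two-stage process described above might be sketched as follows (an illustrative reading of the method; the exact warp-vector definition, path constraints, and normalization in the paper may differ):

```python
import numpy as np

def dtw_path(a, b):
    """DTW with backtracking; returns the optimal alignment path as (i, j) pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_vector(seq, anchor):
    """Frame-wise distances of seq to the anchor along the optimal DTW alignment."""
    return np.array([[np.linalg.norm(seq[i] - anchor[j])]
                     for i, j in dtw_path(seq, anchor)])

def conditional_distance(query, model, anchor):
    """Distance between the warp vectors of query and model w.r.t. the anchor."""
    wq, wm = warp_vector(query, anchor), warp_vector(model, anchor)
    # second DTW pass compares the two 1-D warp vectors
    return sum(abs(wq[i, 0] - wm[j, 0]) for i, j in dtw_path(wq, wm))
```

Two identical sequences produce identical warp vectors against any anchor, so their conditional distance is zero; dissimilar sequences trace different distance patterns against the anchor and score higher.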

When the modelbase is large (and the number of classes is also large), the disadvantage of such a distance is its computational cost, so the anchor gesture (or class) must be pre-selected. A speedup strategy based on pre-selection of anchor gestures from the modelbase is proposed: the proposed distance is computed between every pair of gestures in the modelbase, and for each such distance an anchor gesture is determined. The majority anchor gesture is selected, and distances between the query and the models are then computed only with this chosen anchor.
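One way to read this pre-selection strategy is sketched below (the names and the per-model majority vote are assumptions; `cond_dist` stands in for the conditional distance, passed as a callable so any implementation can be plugged in):

```python
from collections import Counter

def majority_anchors(modelbase, cond_dist):
    """For each model X_i, pick the anchor that wins the most pair-wise votes.

    modelbase: list of sequences; cond_dist(x, y, anchor) -> float.
    For every other model X_j, the candidate anchor minimizing the
    conditional distance between X_i and X_j gets a vote; the most
    frequently chosen anchor index is kept for X_i."""
    n = len(modelbase)
    anchors = {}
    for i in range(n):
        votes = Counter()
        for j in range(n):
            if j == i:
                continue
            candidates = [k for k in range(n) if k not in (i, j)]
            best = min(candidates,
                       key=lambda k: cond_dist(modelbase[i], modelbase[j], modelbase[k]))
            votes[best] += 1
        anchors[i] = votes.most_common(1)[0][0]
    return anchors
```

At query time only the pre-selected anchor is used, replacing a minimization over all N candidate anchors with a single conditional-distance evaluation per model.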

Conditional distance gives the distance between two isolated gestures. To label multiple connected gestures, a simultaneous segmentation and recognition matching algorithm called level building is used, implemented via dynamic programming. The core of this algorithm depends on a distance function that compares two gesture sequences; this distance function is replaced by conditional distance. Hence, this version of level building is called conditional level building (cLB).
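A generic dynamic-programming level building pass, with the gesture-to-gesture distance left pluggable (so conditional distance could be supplied), might look like this (a sketch only; the level costs, penalties, and transition rules of the paper's cLB may differ):

```python
def level_building(query, modelbase, dist, max_levels, min_len=2):
    """Jointly segment and label a connected query with modelbase gestures.

    query: a sequence of frames; modelbase: list of (label, sequence);
    dist(segment, model): any sequence distance (cLB uses conditional distance).
    best[l][t] = minimal cost of explaining query[:t] with l gestures."""
    T = len(query)
    INF = float("inf")
    best = [[INF] * (T + 1) for _ in range(max_levels + 1)]
    back = [[None] * (T + 1) for _ in range(max_levels + 1)]
    best[0][0] = 0.0
    for l in range(1, max_levels + 1):
        for t in range(min_len, T + 1):
            for s in range(0, t - min_len + 1):
                if best[l - 1][s] == INF:
                    continue
                for label, model in modelbase:
                    c = best[l - 1][s] + dist(query[s:t], model)
                    if c < best[l][t]:
                        best[l][t] = c
                        back[l][t] = (s, label)
    # choose the level with the lowest total cost at t = T, then backtrack
    l = min(range(1, max_levels + 1), key=lambda k: best[k][T])
    labels, t = [], T
    while l > 0:
        s, label = back[l][t]
        labels.append(label)
        t, l = s, l - 1
    return labels[::-1]
```

Each level adds one gesture; backtracking recovers both the segment boundaries and the label sequence in a single pass, avoiding a separate temporal segmentation step.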

An earlier version of this work was presented in [19]. In this paper, a more detailed treatment, analysis, and results for conditional distance are provided. The main differences are as follows:

  • 1.

    This paper shows that conditional distance improves performance over the baseline distances in two gesture modelbase contexts – single category-single subject and multiple category-multiple subjects – and shows that conditional distance satisfies the metric properties in practice.

  • 2.

    An anchor pre-selection strategy to speed up computation of the proposed distance is proposed. Anchor behavior and selection with and without the proposed strategy are analyzed.

  • 3.

    A conditional distance version of level building algorithm for recognizing connected gestures is also proposed.

  • 4.

    Results are shown on the gesture challenge datasets [1] (Fig. 1) and compared with the state of the art on those datasets.

Conventions: A time sequence or a gesture sequence is an array of images taken at certain times. The sampling rate is the same as the regular video sampling rate. A gesture sequence has a length n and is indexed from 1 to n. The l2 distance between feature vectors x and y is ||x − y||_2, and it satisfies the triangle inequality ||x − z||_2 ≤ ||x − y||_2 + ||y − z||_2. A summary of frequently used conventions is provided in Table 1.

Section snippets

Related work

Any gesture recognition task, whether the query is a series of connected gestures or a single gesture, involves comparing the incoming query against a training set of gestures. A collection of all the instances of all the classes available for training is referred to as a modelbase. A modelbase can have many instances per gesture class, or it might have just one instance per class. If there are many instances, then recognition can be based on learning statistics of

Conditional distance: distance between two gestures

Conditional distance is the distance between two gesture sequences computed using a third (anchor) sequence. Our motivation for conditional distance comes from other classification domains. One example [32] is work on semantic comparisons for search-engine queries. Given a ranked list for a query, documents that are clicked on can be assumed to be semantically closer to each other than to documents the user decided not to click on (e.g., A_click is closer to B_click than A_click is to C

Modelbase based pre-selection of anchors

The disadvantage of computing the anchor as shown in Eq. (3) is the computational cost. To reduce the number of comparisons, the following steps are proposed to speed up the computation of conditional distance:

  • 1.

    Given a modelbase M = {X_1, X_2, …, X_N}, our goal is to find which of these model sequences qualifies as a majority anchor X_j for a particular model X_i. As the modelbase is in a one-shot framework, the models themselves are tested as query sequences.

  • 2.

    Conditional distance is computed using Eq.

Model to single gesture: temporal segmentation

Given a query sequence with multiple connected gestures, one strategy is to perform temporal segmentation first and then use the proposed distance to compare each temporally segmented query segment against all model sequences in the modelbase. Temporal segmentation is performed using a covariance matrix; the strategy is described below.

A new joint query sequence is formed, representing a 2-channel image: the first channel is the intensity and the second channel is the depth
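As a rough stand-in for the covariance-based criterion (the paper's exact statistic is not reproduced here), low frame-to-frame activity can mark candidate gesture boundaries, since connected gestures are typically separated by near-stationary rest poses:

```python
import numpy as np

def segment_by_activity(frames, threshold, min_gap=3):
    """Split a connected sequence at low-activity frames (rest poses).

    frames: (T, d) feature array, e.g. flattened 2-channel
    intensity+depth features per frame. Activity is measured as the
    frame-to-frame feature change; this is an illustrative stand-in,
    not the paper's covariance-matrix criterion."""
    activity = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    cuts, last = [], -min_gap
    for t, a in enumerate(activity, start=1):
        if a < threshold and t - last >= min_gap:
            cuts.append(t)   # boundary before frame t
            last = t
    # turn cut points into (start, end) segments, dropping degenerate ones
    bounds = [0] + cuts + [len(frames)]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e - s > 1]
```

The `min_gap` guard prevents a long pause from producing a run of adjacent boundaries.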

Model to series of connected gestures: conditional level building (cLB)

Temporal segmentation has the drawback of increased computational cost, and matching requires very precise segmentation. Instead, a simultaneous segmentation and recognition matching algorithm called level building is used; level building has been used to label multiple connected gestures [27], [28], [43]. The proposed version is based on conditional distances and is hence referred to as conditional level building (cLB). Our cLB algorithm varies from the

Discussion on conditional distance: metric properties

Conditional distance is defined as the distance between two warp vectors. The elements of a warp vector represent the distances between pairs of aligned frames. As there are two warp vectors, conditional distance can be seen as the aligned distance between two pairs of frames. Under conditional distance, two gestures are similar when the distance is small. The alignment is achieved by DTW, and hence the metric properties of conditional distance follow from the properties of DTW. Generalized

Dataset

All the results in this paper are based on gesture sequences extracted from the ChaLearn Gesture Challenge dataset [1]. Two modelbase versions are used as two separate datasets: single category-single subject and multiple category-multiple subjects. The first dataset follows the same batch-wise categorization of gestures, with a single subject associated with a single category. There are three such datasets, each consisting of 1800 sequences and having 8 to 15 model sequences based on

Conclusion

In this paper, a new distance measure based on conditional distance and warp vectors is proposed. As both the models and the query are conditioned on an anchor sequence, the distance measure takes into account how a particular model varies from every other model in the modelbase. The improvement in performance shows that the vector of distances should not be ignored, and also that the proposed distance measure is a metric in practice. As the proposed approach captures how far a

Conflict of interest

None.

Acknowledgement

The authors would like to acknowledge the use of the services provided by Research Computing at the University of South Florida. This research was supported in part by NSF grant 1217676.

Ravikiran Krishnan received the B.Tech degree in Information Science from Vishweshwaraiah Technological University in 2007. He received the M.S. degree in Computer Science from the University of South Florida (USF), Tampa, and is currently pursuing his Ph.D. degree at USF. His research interests include sign language and gesture recognition, machine learning, and audiovisual analysis. He is a member of the IEEE Computer Society.

References (43)

  • G. Al-Naymat et al.

    SparseDTW: a novel approach to speed up dynamic time warping

    Proceedings of the Eighth Australasian Data Mining Conference

    (2009)
  • P. Besl et al.

    A method for registration of 3-d shapes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1992)
  • F. Casacuberta et al.

    On the metric properties of dynamic time warping

    IEEE Trans. Acoust. Speech Signal Process.

    (1987)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: International Conference on Computer...
  • O. Danielsson, B. Rasolzadeh, S. Carlsson, Gated Classifiers: Boosting Under High Intra-Class Variation, 2011, pp....
  • D. Wu, F. Zhu, L. Shao, One shot learning gesture recognition from RGBD images, in: International Conference on Computer...
  • A. Elgammal, V. Shet, Y. Yacoob, L. Davis, Gesture recognition using a probabilistic framework for pose matching, in:...
  • A. Elgammal, V. Shet, Y. Yacoob, L.S. Davis, Learning Dynamics for Exemplar-Based Gesture Recognition, 2003, pp....
  • L. Fei-Fei et al.

    One-shot learning of object categories

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2006)
  • F. Florez, J. Garcia, J. Garcia, A. Hernandez, Hand Gesture Recognition Following the Dynamics of a Topology-Preserving...
  • I. Guyon, V. Athitsos, P. Jangyodsuk, H.J. Escalante, B. Hamner, Results and Analysis of the Chalearn Gesture Challenge...

    Dr. Sudeep Sarkar is Professor of Computer Science and Engineering and Associate Vice President for Research & Innovation at the University of South Florida (USF) in Tampa, where he started his professional career in 1993. He received his B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Kanpur, and his M.S. and Ph.D. degrees in Electrical Engineering from The Ohio State University.

    Dr. Sarkar’s research interests are in computer vision and pattern recognition, i.e. designing of computer algorithms to extract and to infer information from images and video. In particular, he is interested in perceptual organization of video and audio signals, establishing identity using biometrics, automated gesture and sign language recognition, and human gait analysis. In his administrative capacity, he started and leads the faculty external awards, honors, and prizes initiative to bring prestige and recognition to the university and to honor deserving scholars. He also coordinates and writes trans-disciplinary, university-wide, grants and contract opportunities related to innovation and technology transfer.

    He is the recipient of the National Science Foundation CAREER award, the USF Teaching Incentive Program Award for Undergraduate Teaching Excellence, the Outstanding Undergraduate Teaching Award, and the Ashford Distinguished Scholar Award. He served on the editorial boards for the IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Analysis & Applications Journal, Pattern Recognition journal, IET Computer Vision, and the IEEE Transactions on Systems, Man, and Cybernetics, Part-B.

    He is currently the Co Editor-in-Chief for the Pattern Recognition Letters and also the associate editor for IEEE Transactions on Pattern Analysis and Machine Intelligence.

    He is Fellow of AAAS, Fellow of IEEE, Fellow of IAPR, an IEEE-CS Distinguished Visitor Program Speaker (2010–2012), and the charter member of the National Academy of Inventors.
