Conditional distance based matching for one-shot gesture recognition
Introduction
Analyzing and recognizing human gestures is important for human computer interaction. The large number of human gesture categories, such as sign language, traffic signals and everyday actions, together with subtle cultural variations within gesture classes, makes gesture recognition an interesting and challenging problem. In most naturally occurring scenarios, gestures are connected together in a continuously varying stream, without any obvious break between individual gestures [21]. Identifying each of these individual gestures gives a good representation for ultimately translating visual communication into speech or other forms of interaction. Such labeling tasks present many challenges.
Labeling these continuous gesture streams, or queries, involves matching temporally segmented individual gestures to a modelbase. If the modelbase has many samples per class, the statistics of that class can be learned. Having only one or a few samples defeats class-statistic learning approaches to classification, as the full range of variation is not covered. One of the key components of a matching algorithm, apart from feature extraction, is the gesture-to-gesture distance. These distances should define, in a concrete way, what it means for data points in such a class space to be "near to" or "far away from" each other. One commonly used approach is to take pair-wise distances (using a distance function) between the query and all available models and discern which are close (classified as same) or far away (classified as not same). A commonly used distance function is dynamic time warping combined with a 1-nearest-neighbor classifier [12].
In our work, a matching algorithm based on a level building approach is proposed. This algorithm operates in a one-shot learning framework (a single sample per class). The proposed algorithm (a) handles both isolated and continuous gesture queries; (b) eliminates the need for explicit temporal segmentation; and (c) works in single-sample-per-class scenarios. A new distance function is defined, and it serves as the center of our algorithm. Each gesture sequence is viewed as a curve, and each curve as a data point in a space formed by all the gesture classes. The pair-wise distance pattern between two gesture sequences, conditioned on a third (anchor) sequence, is considered. These distance-pattern vectors are called "warp vectors", and the resulting process is called the "conditional distance". At the algorithmic core are two dynamic time warping processes: one computes the warp vectors against the anchor sequence, and the other compares these warp vectors. Related measures have been proposed earlier, for example the Mahalanobis distance, where intra-class variation, i.e., the variation between instances, accounts for the scaling that the distance measure needs. Since the modelbase is framed as a single sample per class, our work instead exploits the variation between the classes themselves.
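The two-pass structure described above can be sketched as follows. This is a minimal illustration only; the symmetric DTW step pattern, the per-frame Euclidean feature distance, and the function names are our own assumptions, not the paper's exact formulation:

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping between two sequences.
    Returns the accumulated cost and the warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(a[i - 1]) - np.atleast_1d(b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the warping path from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

def warp_vector(seq, anchor):
    """First DTW pass: frame-wise distances along the alignment of seq to the anchor."""
    _, path = dtw(seq, anchor)
    return np.array([np.linalg.norm(np.atleast_1d(seq[i]) - np.atleast_1d(anchor[j]))
                     for i, j in path])

def conditional_distance(query, model, anchor):
    """Second DTW pass: compare the two warp vectors, each conditioned on the anchor."""
    d, _ = dtw(warp_vector(query, anchor), warp_vector(model, anchor))
    return d
```

By construction the measure is symmetric in its first two arguments, and comparing a sequence with itself under any anchor yields zero, consistent with the metric behavior discussed later.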
When the modelbase is large (and the number of classes is also large), the disadvantage of such a distance is its computational cost; the anchor gesture (or class) therefore needs to be pre-selected. A speedup strategy based on pre-selecting anchor gestures from the modelbase is proposed. The proposed distance is computed between every pair of gestures in the modelbase, and for each such distance an anchor gesture is determined. The majority anchor gesture is selected, and distances between the query and the models are computed only with this chosen anchor.
Conditional distance gives the distance between two isolated gestures. In order to label multiple connected gestures, a simultaneous segmentation and recognition matching algorithm, the level building algorithm, is used, implemented with dynamic programming. The core of this algorithm depends on a distance function that compares two gesture sequences; this distance function is replaced by the conditional distance. Hence, this version of level building is called conditional level building (cLB).
An earlier version of this work was proposed in [19]. This paper provides a more detailed treatment, analysis, and results for the conditional distance. The main differences are as follows:
- 1.
This paper shows that the conditional distance improves performance over baseline distances in two gesture modelbase contexts (single category-single subject and multiple category-multiple subjects), and that the conditional distance satisfies metric properties in practice.
- 2.
An anchor pre-selection strategy is proposed to speed up computation of the proposed distance. Anchor behavior and selection are analyzed with and without the proposed strategy.
- 3.
A conditional-distance version of the level building algorithm for recognizing connected gestures is also proposed.
- 4.
Results are shown on the gesture challenge datasets [1] (Fig. 1) and compared with the state of the art on those datasets.
Conventions: A time sequence or gesture sequence is an array of images taken at certain times; the sampling rate is the same as the regular video sampling rate. A gesture sequence of length n is indexed from 1 to n. The l2 distance between feature vectors x and y is d(x, y) = ||x - y||, and it satisfies the triangle inequality d(x, z) <= d(x, y) + d(y, z). A summary of frequently used conventions is provided in Table 1.
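The l2 distance and its triangle inequality from the conventions above can be spot-checked numerically (an illustrative sketch; the helper name l2 is ours):

```python
import numpy as np

def l2(x, y):
    """Euclidean (l2) distance between feature vectors x and y."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

# spot-check the triangle inequality d(x, z) <= d(x, y) + d(y, z)
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 8))
assert l2(x, z) <= l2(x, y) + l2(y, z)
```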
Related work
Any gesture recognition task, whether the query is a series of gestures connected into a single stream or a single isolated gesture, involves comparing an incoming query against a training set of gestures. The collection of all instances of all classes available for training is referred to as a modelbase. Modelbases can have many instances per gesture class, or they might have just one instance per class. If there are many instances, then recognition can be based on learning the statistics of each class.
Conditional distance: distance between two gestures
Conditional distance is the concept of finding the distance between two gesture sequences using a third (anchor) sequence. Our motivation for the conditional distance comes from other classification domains. One example [32] is the work on semantic comparisons for search-engine queries: given a ranked list for a query, documents that are clicked on can be assumed to be semantically closer than documents that the user decided not to click on.
Modelbase based pre-selection of anchors
The disadvantage of computing the anchor as shown in Eq. (3) is the computational cost. In order to reduce the number of comparisons, the following steps are proposed to speed up the computation of the conditional distance:
- 1.
Given a modelbase M, our goal is to find which of the model sequences qualify as a majority anchor for a particular model. As the modelbase is in a one-shot framework, the models themselves are tested as query sequences.
- 2.
Conditional distance is computed using Eq.
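The majority-anchor pre-selection steps above can be sketched as follows. This is a schematic under our own assumptions: cond_dist stands in for the conditional distance of Eq. (3), the voting granularity (per model, excluding the pair being compared) is illustrative, and the tie-breaking is arbitrary:

```python
from collections import Counter

def majority_anchor(modelbase, cond_dist):
    """For each model, replay the remaining models as queries (one-shot:
    models double as queries), record the anchor minimizing the conditional
    distance for each pair, and keep the majority-voted anchor."""
    chosen = {}
    for i, model in enumerate(modelbase):
        votes = Counter()
        for j, query in enumerate(modelbase):
            if j == i:
                continue
            # candidate anchors: every other model in the modelbase
            k_best = min((k for k in range(len(modelbase)) if k not in (i, j)),
                         key=lambda k: cond_dist(query, model, modelbase[k]))
            votes[k_best] += 1
        # at query time, distances to model i use only this one anchor
        chosen[i] = votes.most_common(1)[0][0]
    return chosen
```

With N models, this one-time pass replaces the per-query search over all anchors by a single lookup, which is the source of the claimed speedup.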
Model to single gesture: temporal segmentation
Given a query sequence with multiple connected gestures, one strategy is to perform temporal segmentation first and then use the proposed distance to compare each temporally segmented query against all model sequences in the modelbase. Temporal segmentation is performed using a covariance matrix, and the strategy is described below.
A new joint query sequence is proposed in which each frame is a 2-channel image: the first channel refers to the intensity and the second channel refers to the depth
Model to series of connected gestures: conditional level building (cLB)
Temporal segmentation has the drawbacks of increased computational cost and of requiring very precise segmentation for matching. Instead, a simultaneous segmentation and recognition matching algorithm, the level building algorithm, is used; level building has been used to label multiple connected gestures [27], [28], [43]. The proposed version is based on conditional distances and is hence referred to as conditional level building (cLB). Our cLB algorithm varies from the
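A generic dynamic-programming level building recursion of the kind used here can be sketched as follows. This is a schematic, not the paper's exact recursion: seg_dist is a pluggable sequence distance (plugging in the conditional distance gives the cLB variant), and the cost bookkeeping and level selection are illustrative assumptions:

```python
def level_building(query, models, seg_dist, max_levels):
    """Jointly segment a connected query and label each segment with a model.

    best[l][t] holds (cost, model index, segment start) for explaining
    query[:t] with exactly l gestures.
    """
    T = len(query)
    INF = float("inf")
    best = [[(INF, None, None) for _ in range(T + 1)] for _ in range(max_levels + 1)]
    best[0][0] = (0.0, None, None)
    for l in range(1, max_levels + 1):
        for t in range(1, T + 1):
            for s in range(t):                      # candidate segment query[s:t]
                prev_cost = best[l - 1][s][0]
                if prev_cost == INF:
                    continue
                for k, m in enumerate(models):
                    c = prev_cost + seg_dist(m, query[s:t])
                    if c < best[l][t][0]:
                        best[l][t] = (c, k, s)
    # choose the level whose total cost at the end of the query is lowest
    l_best = min(range(1, max_levels + 1), key=lambda l: best[l][T][0])
    labels, t = [], T
    for l in range(l_best, 0, -1):                  # backtrack the segment labels
        _, k, s = best[l][t]
        labels.append(k)
        t = s
    return labels[::-1]
```

The recursion simultaneously decides where each gesture ends and which model explains it, so no separate temporal segmentation pass is needed.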
Discussion on conditional distance: metric properties
The conditional distance is defined as the distance between two warp vectors, whose elements represent distances between aligned frames. As there are two warp vectors, the conditional distance can be seen as an aligned distance between two pairs of frames; two gestures are similar when this distance is small. The alignment is achieved by DTW, and hence we consider the metric properties of the conditional distance to follow those of DTW. Generalized
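Since the metric properties are argued to follow those of DTW, one can empirically probe how often plain DTW violates the triangle inequality on random sequences. The sketch below uses a minimal scalar DTW and synthetic data of our own choosing, purely as an illustration of this kind of "metric in practice" check:

```python
import numpy as np

def dtw(a, b):
    """Accumulated DTW cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(a[i - 1] - b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# count triangle-inequality violations over random triples of sequences
rng = np.random.default_rng(0)
trials = 200
violations = 0
for _ in range(trials):
    x, y, z = (rng.normal(size=rng.integers(5, 12)) for _ in range(3))
    if dtw(x, z) > dtw(x, y) + dtw(y, z) + 1e-9:
        violations += 1
print(f"triangle-inequality violations: {violations}/{trials}")
```

DTW is known not to be a metric in general, so the interesting quantity is the violation rate on realistic data rather than a formal guarantee.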
Dataset
All the results in this paper are based on gesture sequences extracted from the ChaLearn Gesture Challenge dataset [1]. Two modelbase versions are used as two separate datasets, representing single category-single subject and multiple category-multiple subjects. The first dataset follows the same batch-wise categorization of gestures, with a single subject associated with a single category. There are three such datasets, each consisting of 1800 sequences, with 8 to 15 model sequences based on
Conclusion
In this paper, a new distance measure based on conditional distances and warp vectors is proposed. As both the models and the query are conditioned on an anchor sequence, the distance measure takes into account how a particular model varies from every other model in the modelbase. The improvement in performance shows that the vector of distances should not be ignored, and that the proposed distance measure is a metric in practice. As the proposed approach captures how far a
Conflict of interest
None.
Acknowledgement
The authors would like to acknowledge the use of the services provided by Research Computing at the University of South Florida. This research was supported in part by NSF grant 1217676.
Ravikiran Krishnan received the B.Tech. degree in Information Science from Visvesvaraya Technological University in 2007 and the M.S. degree in Computer Science from the University of South Florida (USF), Tampa. He is currently pursuing his Ph.D. degree at USF. His research interests include sign language and gesture recognition, machine learning, and audiovisual analysis. He is a member of the IEEE Computer Society.
References (43)
- Nerf c-means: non-Euclidean relational fuzzy clustering, Pattern Recognit. (1994)
- Relational duals of the c-means clustering algorithms, Pattern Recognit. (1989)
- Towards subject independent continuous sign language recognition: a segment and merge approach, Pattern Recognit. (2014)
- Faster retrieval with a two-pass dynamic-time-warping lower bound, Pattern Recognit. (2009)
- Model-based segmentation and recognition of dynamic gestures in continuous video streams, Pattern Recognit. (2011)
- A template matching approach of one-shot-learning gesture recognition, Pattern Recognit. Lett. (2013)
- Hand gesture recognition based on dynamic Bayesian network framework, Pattern Recognit. (2010)
- On the verification of triangle inequality by dynamic time-warping dissimilarity measures, Speech Commun. (1988)
- A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. (2011)
- ChaLearn gesture dataset (CGD2011), ChaLearn, California
- SparseDTW: a novel approach to speed up dynamic time warping, Proceedings of the Eighth Australasian Data Mining Conference
- A method for registration of 3-D shapes, IEEE Trans. Pattern Anal. Mach. Intell.
- On the metric properties of dynamic time warping, IEEE Trans. Acoust. Speech Signal Process.
- One-shot learning of object categories, IEEE Trans. Pattern Anal. Mach. Intell.
Dr. Sudeep Sarkar is Professor of Computer Science and Engineering and Associate Vice President for Research & Innovation at the University of South Florida (USF) in Tampa, where he started his professional career in 1993. He received his B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Kanpur, and his M.S. and Ph.D. degrees in Electrical Engineering from The Ohio State University.
Dr. Sarkar’s research interests are in computer vision and pattern recognition, i.e., designing computer algorithms to extract and infer information from images and video. In particular, he is interested in perceptual organization of video and audio signals, establishing identity using biometrics, automated gesture and sign language recognition, and human gait analysis. In his administrative capacity, he started and leads the faculty external awards, honors, and prizes initiative to bring prestige and recognition to the university and to honor deserving scholars. He also coordinates and writes trans-disciplinary, university-wide grant and contract opportunities related to innovation and technology transfer.
He is the recipient of the National Science Foundation CAREER award, the USF Teaching Incentive Program Award for Undergraduate Teaching Excellence, the Outstanding Undergraduate Teaching Award, and the Ashford Distinguished Scholar Award. He served on the editorial boards for the IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Analysis & Applications Journal, Pattern Recognition journal, IET Computer Vision, and the IEEE Transactions on Systems, Man, and Cybernetics, Part-B.
He is currently Co-Editor-in-Chief of Pattern Recognition Letters and an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence.
He is a Fellow of AAAS, IEEE, and IAPR, an IEEE-CS Distinguished Visitor Program Speaker (2010–2012), and a charter member of the National Academy of Inventors.