Elsevier

Pattern Recognition

Volume 48, Issue 4, April 2015, Pages 1302-1314

Conditional distance based matching for one-shot gesture recognition

https://doi.org/10.1016/j.patcog.2014.10.026

Highlights

  • We propose a new distance measure, called conditional distance, between two gesture sequences when we have only one or a few samples per gesture class.

  • Conditional distance is the distance between query and model gesture sequences in the presence of a third (anchor) gesture sequence.

  • We propose a speedup strategy for computing conditional distances by pre-selecting the anchor.

  • We also propose a conditional distance based simultaneous gesture segmentation and recognition algorithm called conditional level building.

  • We show results above 82% on a multiple-subject dataset spanning 179 classes.

Abstract

The problem of matching gestures when there is only one or a few samples per class is considered in this paper. The proposed approach shows that much better results are achieved by considering the distance between the patterns of frame-wise distances of two gesture sequences with a third (anchor) sequence from the modelbase. This measure is called conditional distance, and the distance patterns are referred to as “warp vectors”. If the warp vectors are similar, then so are the sequences; if not, they are dissimilar. At the algorithmic core are two dynamic time warping processes: one to compute the warp vectors with the anchor sequence, and the other to compare these warp vectors. To reduce the complexity, a speedup strategy is proposed that pre-selects “good” anchor sequences. Conditional distance is used for both individual and sentence-level gesture matching, on both single and multiple subject datasets. Experiments show improved performance, above 82%, spanning 179 classes.

Introduction

Analyzing and recognizing human gestures is important for human-computer interaction. The large number of human gesture categories, such as sign language, traffic signals, and everyday actions, along with subtle cultural variations within gesture classes, makes gesture recognition an interesting and challenging problem. In most naturally occurring scenarios, gestures are connected in a continuously varying stream, without any obvious break between individual gestures [21]. Identifying each of these individual gestures gives a good representation for ultimately translating visual communication into speech or other forms of interaction. Such labeling tasks have many challenges.

Labeling these continuous gesture streams, or queries, involves matching temporally segmented individual gestures to a modelbase. If the modelbase has many samples per class, statistics of that particular class can be learnt. Having only one or a few samples deters class-statistic learning approaches to classification, as the full range of variation is not covered. One of the key components of a matching algorithm, apart from feature extraction, is the gesture-to-gesture distance. These distances should define, in a concrete way, what it means for data points in such a class space to be “near to” or “far away from”1 each other. One commonly used approach is to take pair-wise distances (using a distance function) between the query and all available models and discern which are close (classified as the same) or far away (classified as not the same). A commonly used distance function is dynamic time warping with a 1-Nearest-Neighbor approach for classification [12].
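The baseline pairing of dynamic time warping with 1-Nearest-Neighbor classification can be sketched as follows (a minimal illustration, not the paper's implementation; the function names and the frame-feature representation are assumptions):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two feature sequences.

    a, b: 2-D arrays of shape (n_frames, n_features)."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame l2 distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def nn_classify(query, modelbase):
    """1-NN: label of the model sequence closest to the query under DTW.

    modelbase: list of (label, sequence) pairs."""
    return min(modelbase, key=lambda kv: dtw_distance(query, kv[1]))[0]
```

With one sample per class, each modelbase entry is a class, and the query takes the label of its nearest model.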

In our work, a matching algorithm based on a level building approach is proposed. This algorithm is built on a one-shot learning framework (a single sample per class). The proposed algorithm (a) handles both isolated and continuous gesture queries; (b) eliminates the need for temporal segmentation; and (c) works with a single sample per class. A new distance function is defined and serves as the core of our algorithm. Each gesture sequence is viewed as a curve, and each curve as a data point in a space formed by all the gesture classes. The pair-wise distance pattern between two gesture sequences, conditioned on a third (anchor) sequence, is considered. These distance pattern vectors are called “warp vectors”, and the resulting measure is called the “conditional distance”. At the algorithmic core are two dynamic time warping processes: one to compute the warp vectors with the anchor sequence, and the other to compare these warp vectors. Related measures have been proposed earlier, for example, the Mahalanobis distance, where intra-class variation, i.e., the variation between instances, accounts for the scaling that the distance measure needs. Our work instead exploits the variation between the classes themselves, as the modelbase contains a single sample per class.
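The two-stage process described above might be sketched as follows (an illustrative reading of the method; the exact warp-vector definition, path constraints, and normalization in the paper may differ):

```python
import numpy as np

def dtw_path(a, b):
    """DTW with backtracking; returns the optimal alignment path as (i, j) pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_vector(seq, anchor):
    """Frame-wise distances of seq to the anchor along the optimal DTW alignment."""
    return np.array([[np.linalg.norm(seq[i] - anchor[j])]
                     for i, j in dtw_path(seq, anchor)])

def conditional_distance(query, model, anchor):
    """Distance between the warp vectors of query and model w.r.t. the anchor."""
    wq, wm = warp_vector(query, anchor), warp_vector(model, anchor)
    # second DTW pass compares the two 1-D warp vectors
    return sum(abs(wq[i, 0] - wm[j, 0]) for i, j in dtw_path(wq, wm))
```

Two identical sequences produce identical warp vectors against any anchor, so their conditional distance is zero; dissimilar sequences trace different distance patterns against the anchor and score higher.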

When the modelbase is large (and the number of classes is also large), the disadvantage of such a distance is its computational cost, so the anchor gesture (or class) must be pre-selected. A speedup strategy based on pre-selection of anchor gestures from the modelbase is proposed: the proposed distance is computed between every pair of gestures in the modelbase, and for each such distance an anchor gesture is determined. The majority anchor gesture is selected, and distances between the query and the models are then computed only with this chosen anchor.
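One way to read this pre-selection strategy is sketched below (the names and the per-model majority vote are assumptions; `cond_dist` stands in for the conditional distance, passed as a callable so any implementation can be plugged in):

```python
from collections import Counter

def majority_anchors(modelbase, cond_dist):
    """For each model X_i, pick the anchor that wins the most pair-wise votes.

    modelbase: list of sequences; cond_dist(x, y, anchor) -> float.
    For every other model X_j, the candidate anchor minimizing the
    conditional distance between X_i and X_j gets a vote; the most
    frequently chosen anchor index is kept for X_i."""
    n = len(modelbase)
    anchors = {}
    for i in range(n):
        votes = Counter()
        for j in range(n):
            if j == i:
                continue
            candidates = [k for k in range(n) if k not in (i, j)]
            best = min(candidates,
                       key=lambda k: cond_dist(modelbase[i], modelbase[j], modelbase[k]))
            votes[best] += 1
        anchors[i] = votes.most_common(1)[0][0]
    return anchors
```

At query time only the pre-selected anchor is used, replacing a minimization over all N candidate anchors with a single conditional-distance evaluation per model.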

Conditional distance gives the distance between two isolated gestures. To label multiple connected gestures, a simultaneous segmentation and recognition matching algorithm called level building is used, implemented via dynamic programming. The core of this algorithm depends on a distance function that compares two gesture sequences; this distance function is replaced by conditional distance. Hence, this version of level building is called conditional level building (cLB).
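A generic dynamic-programming level building pass, with the gesture-to-gesture distance left pluggable (so conditional distance could be supplied), might look like this (a sketch only; the level costs, penalties, and transition rules of the paper's cLB may differ):

```python
def level_building(query, modelbase, dist, max_levels, min_len=2):
    """Jointly segment and label a connected query with modelbase gestures.

    query: a sequence of frames; modelbase: list of (label, sequence);
    dist(segment, model): any sequence distance (cLB uses conditional distance).
    best[l][t] = minimal cost of explaining query[:t] with l gestures."""
    T = len(query)
    INF = float("inf")
    best = [[INF] * (T + 1) for _ in range(max_levels + 1)]
    back = [[None] * (T + 1) for _ in range(max_levels + 1)]
    best[0][0] = 0.0
    for l in range(1, max_levels + 1):
        for t in range(min_len, T + 1):
            for s in range(0, t - min_len + 1):
                if best[l - 1][s] == INF:
                    continue
                for label, model in modelbase:
                    c = best[l - 1][s] + dist(query[s:t], model)
                    if c < best[l][t]:
                        best[l][t] = c
                        back[l][t] = (s, label)
    # choose the level with the lowest total cost at t = T, then backtrack
    l = min(range(1, max_levels + 1), key=lambda k: best[k][T])
    labels, t = [], T
    while l > 0:
        s, label = back[l][t]
        labels.append(label)
        t, l = s, l - 1
    return labels[::-1]
```

Each level adds one gesture; backtracking recovers both the segment boundaries and the label sequence in a single pass, avoiding a separate temporal segmentation step.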

An earlier version of this work was presented in [19]. In this paper, a more detailed treatment, analysis, and results for conditional distance are provided. The main differences are as follows:

  • 1.

    This paper shows that conditional distance improves performance over the baseline distances in two gesture modelbase contexts – single category-single subject and multiple category-multiple subjects – and shows that conditional distance satisfies the metric properties in practice.

  • 2.

    An anchor pre-selection strategy to speed up computation of the proposed distance is proposed. Anchor behavior and selection with and without the proposed strategy are analyzed.

  • 3.

    A conditional distance version of level building algorithm for recognizing connected gestures is also proposed.

  • 4.

    Results are shown on the gesture challenge datasets [1] (Fig. 1) and compared with the state of the art on those datasets.

Conventions: A time sequence or a gesture sequence is an array of images taken at certain times. The sampling rate is the same as the regular video sampling rate. A gesture sequence has a length n and is indexed from 1 to n. The l2 distance between feature vectors x and y is ||x − y||_2, and it satisfies the triangle inequality ||x − z||_2 ≤ ||x − y||_2 + ||y − z||_2. A summary of frequently used conventions is provided in Table 1.

Section snippets

Related work

Any gesture recognition task, whether the query is a series of connected gestures or a single gesture, involves comparing the incoming query against a training set of gestures. A collection of all the instances of all the classes available for training is referred to as a modelbase. A modelbase can have many instances per gesture class, or it might have just one instance per class. If there are many instances, then recognition can be based on learning statistics of

Conditional distance: distance between two gestures

Conditional distance is the distance between two gesture sequences computed using a third (anchor) sequence. Our motivation for conditional distance comes from other classification domains. One example [32] is work on semantic comparisons for search-engine queries. Given a ranked list for a query, documents that are clicked on can be assumed to be semantically closer to each other than to documents the user decided not to click on (e.g., A_click is closer to B_click than A_click is to C

Modelbase based pre-selection of anchors

The disadvantage of computing the anchor as shown in Eq. (3) is the computational cost. To reduce the number of comparisons, the following steps are proposed to speed up the computation of conditional distance:

  • 1.

    Given a modelbase M = {X_1, X_2, …, X_N}, our goal is to find which of these model sequences qualifies as a majority anchor X_j for a particular model X_i. As the modelbase is in a one-shot framework, the models themselves are tested as query sequences.

  • 2.

    Conditional distance is computed using Eq.

Model to single gesture: temporal segmentation

Given a query sequence with multiple connected gestures, one strategy is to perform temporal segmentation first and then use the proposed distance to compare each temporally segmented query segment against all model sequences in the modelbase. Temporal segmentation is performed using a covariance matrix; the strategy is described below.

A new joint query sequence is formed, representing a 2-channel image: the first channel is the intensity and the second channel is the depth
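As a rough stand-in for the covariance-based criterion (the paper's exact statistic is not reproduced here), low frame-to-frame activity can mark candidate gesture boundaries, since connected gestures are typically separated by near-stationary rest poses:

```python
import numpy as np

def segment_by_activity(frames, threshold, min_gap=3):
    """Split a connected sequence at low-activity frames (rest poses).

    frames: (T, d) feature array, e.g. flattened 2-channel
    intensity+depth features per frame. Activity is measured as the
    frame-to-frame feature change; this is an illustrative stand-in,
    not the paper's covariance-matrix criterion."""
    activity = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    cuts, last = [], -min_gap
    for t, a in enumerate(activity, start=1):
        if a < threshold and t - last >= min_gap:
            cuts.append(t)   # boundary before frame t
            last = t
    # turn cut points into (start, end) segments, dropping degenerate ones
    bounds = [0] + cuts + [len(frames)]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e - s > 1]
```

The `min_gap` guard prevents a long pause from producing a run of adjacent boundaries.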

Model to series of connected gestures: conditional level building (cLB)

Temporal segmentation has the drawback of increased computational cost, and matching requires very precise segmentation. Instead, a simultaneous segmentation and recognition matching algorithm called level building is used; level building has been used to label multiple connected gestures [27], [28], [43]. The proposed version is based on conditional distances and is hence referred to as conditional level building (cLB). Our cLB algorithm varies from the

Discussion on conditional distance: metric properties

Conditional distance is defined as the distance between two warp vectors. The elements of a warp vector represent the distances between pairs of aligned frames. As there are two warp vectors, conditional distance can be seen as the aligned distance between two pairs of frames. Under conditional distance, two gestures are similar when the distance is small. The alignment is achieved by DTW, and hence the metric properties of conditional distance follow from the properties of DTW. Generalized

Dataset

All the results in this paper are based on gesture sequences extracted from the ChaLearn Gesture Challenge dataset [1]. Two modelbase versions are used as two separate datasets: single category-single subject and multiple category-multiple subjects. The first dataset follows the same batch-wise categorization of gestures, with a single subject associated with a single category. There are three such datasets, each consisting of 1800 sequences and having 8 to 15 model sequences based on

Conclusion

In this paper, a new distance measure based on conditional distance and warp vectors is proposed. As both the models and the query are conditioned on an anchor sequence, the distance measure takes into account how a particular model varies from every other model in the modelbase. The improvement in performance shows that the vector of distances should not be ignored, and also that the proposed distance measure is a metric in practice. As the proposed approach captures how far a

Conflict of interest

None.

Acknowledgement

The authors would like to acknowledge the use of the services provided by Research Computing at the University of South Florida. This research was supported in part by NSF grant 1217676.

Ravikiran Krishnan received the B.Tech degree in Information Science from Vishweshwaraiah Technological University in 2007. He received the M.S. degree in Computer Science from the University of South Florida (USF), Tampa, and is currently pursuing his Ph.D. degree at USF. His research interests include sign language and gesture recognition, machine learning, and audiovisual analysis. He is a member of the IEEE Computer Society.

References (43)

  • G. Al-Naymat et al.

    SparseDTW: a novel approach to speed up dynamic time warping

    Proceedings of the Eighth Australasian Data Mining Conference

    (2009)
  • P. Besl et al.

    A method for registration of 3-d shapes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1992)
  • F. Casacuberta et al.

    On the metric properties of dynamic time warping

    IEEE Trans. Acoust. Speech Signal Process.

    (1987)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: International Conference on Computer...
  • O. Danielsson, B. Rasolzadeh, S. Carlsson, Gated Classifiers: Boosting Under High Intra-Class Variation, 2011, pp....
  • D. Wu, F. Zhu, L. Shao, One shot learning gesture recognition from RGBD images, in: International Conference on Computer...
  • A. Elgammal, V. Shet, Y. Yacoob, L. Davis, Gesture recognition using a probabilistic framework for pose matching, in:...
  • A. Elgammal, V. Shet, Y. Yacoob, L.S. Davis, Learning Dynamics for Exemplar-Based Gesture Recognition, 2003, pp....
  • L. Fei-Fei et al.

    One-shot learning of object categories

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2006)
  • F. Florez, J. Garcia, J. Garcia, A. Hernandez, Hand Gesture Recognition Following the Dynamics of a Topology-Preserving...
  • I. Guyon, V. Athitsos, P. Jangyodsuk, H.J. Escalante, B. Hamner, Results and Analysis of the Chalearn Gesture Challenge...

    Dr. Sudeep Sarkar is Professor of Computer Science and Engineering and Associate Vice President for Research & Innovation at the University of South Florida (USF) in Tampa, where he started his professional career in 1993. He received his B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Kanpur, and his M.S. and Ph.D. degrees in Electrical Engineering from The Ohio State University.

    Dr. Sarkar’s research interests are in computer vision and pattern recognition, i.e. designing of computer algorithms to extract and to infer information from images and video. In particular, he is interested in perceptual organization of video and audio signals, establishing identity using biometrics, automated gesture and sign language recognition, and human gait analysis. In his administrative capacity, he started and leads the faculty external awards, honors, and prizes initiative to bring prestige and recognition to the university and to honor deserving scholars. He also coordinates and writes trans-disciplinary, university-wide, grants and contract opportunities related to innovation and technology transfer.

    He is the recipient of the National Science Foundation CAREER award, the USF Teaching Incentive Program Award for Undergraduate Teaching Excellence, the Outstanding Undergraduate Teaching Award, and the Ashford Distinguished Scholar Award. He served on the editorial boards for the IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Analysis & Applications Journal, Pattern Recognition journal, IET Computer Vision, and the IEEE Transactions on Systems, Man, and Cybernetics, Part-B.

    He is currently the Co Editor-in-Chief for the Pattern Recognition Letters and also the associate editor for IEEE Transactions on Pattern Analysis and Machine Intelligence.

    He is Fellow of AAAS, Fellow of IEEE, Fellow of IAPR, an IEEE-CS Distinguished Visitor Program Speaker (2010–2012), and the charter member of the National Academy of Inventors.
