The impact of motion dimensionality and bit cardinality on the design of 3D gesture recognizers

https://doi.org/10.1016/j.ijhcs.2012.11.005

Abstract

The interactive demands of the upcoming ubiquitous computing era have set researchers and practitioners toward prototyping new gesture-sensing devices and gadgets. At the same time, the practical needs of developing for such miniaturized prototypes, which sometimes have very low processing power and memory resources, leave practitioners in need of fast gesture recognizers that use little memory. However, the available work on motion gesture classifiers has mainly focused on delivering high recognition performance, with less discussion of execution speed or required memory. This work investigates the performance of today's commonly used 3D motion gesture recognizers under the effect of different gesture dimensionality and bit cardinality representations. Specifically, we show that few sampling points and low bit depths are sufficient for most motion gesture metrics to attain their peak recognition performance in the context of the popular Nearest-Neighbor classification approach. As a practical consequence, we report recognizers that are 16x faster and work with 32x less memory while delivering the same high levels of recognition performance. We present recognition results for a large gesture corpus consisting of nearly 20,000 gesture samples. In addition, a toolkit is provided to assist practitioners in optimizing their gesture recognizers in order to increase classification speed and reduce memory consumption for their designs. At a deeper level, our findings suggest that the precision of the human motor control system articulating 3D gestures is needlessly surpassed by the precision of today's motion sensing technology, which unfortunately bears a direct connection with the sensors' cost. We hope this work will encourage practitioners to consider improving the performance of their prototypes by careful analysis of motion gesture representation rather than by throwing more processing power and more memory into the design.

Highlights

  • We investigate gesture dimensionality for 3D gesture recognizers.
  • We study the impact of gesture bit depth on classification performance.
  • We provide guidelines and a toolkit to assist practitioners in their gesture designs.

Introduction

The recent availability of low-cost motion sensing technology embedded in mobile devices (Lane et al., 2010) has led to a wide proliferation of systems and applications employing gesture commands (Li, 2009, Liu et al., 2009, Ni and Baudisch, 2009, Ruiz and Li, 2011, Zhai et al., 2009). Nowadays, user-interface practitioners and designers have at their disposal a wide range of devices able to sense motion: mobile phones (Hinckley et al., 2000, Murao et al., 2011, Rekimoto, 1996), game controllers (Hoffman et al., 2010, Lee, 2008), and even wrist watches (Kim et al., 2007). Practitioners also benefit from a large selection of machine learning algorithms for recognizing gestures. These include Nearest-Neighbor (NN) classifiers that work with various gesture metrics (Anthony and Wobbrock, 2010, Kratz and Rohs, 2010, Kratz and Rohs, 2011, Li, 2010, Vatavu et al., 2012a, Wobbrock et al., 2007) but also more elaborate approaches such as Hidden Markov Models (HMMs) (Schlömer et al., 2008), Support Vector Machines (SVMs) (Wu et al., 2009), and Adaptive Boosting (Hoffman et al., 2010).

When considering the practical needs for developing and using such gestural interfaces, the NN classification approach stands out among its peer techniques for reasons such as ease of implementation for practitioners and ease of customization for users. The technique is simple for a practitioner to understand, implement, and debug, even one not particularly interested in mastering all the complex details of more elaborate machine learning procedures. For such reasons, a $-family of gesture classifiers ($1, $N, $P) has been proposed in the human–computer interaction community to assist designers and practitioners in implementing gesture recognition on new platforms (Anthony and Wobbrock, 2012, Li, 2010, Vatavu et al., 2012a, Wobbrock et al., 2007). As for the advantages for users, new commands can be easily added to the gesture set without the need to retrain or change the inner structure of the recognizer, as would be the case for learning new state transition probabilities for HMMs (Schlömer et al., 2008), support vectors for SVMs (Wu et al., 2009), or weights for neural networks (Bailador et al., 2007).

The Nearest-Neighbor technique has been successfully used to classify gestures with nearly 99% accuracy while employing the Euclidean distance (Kratz and Rohs, 2010, Wobbrock et al., 2007), angular cosine similarity (Anthony and Wobbrock, 2012, Kratz and Rohs, 2011, Li, 2010), dynamic time warping (Liu et al., 2009, Wobbrock et al., 2007), and minimum-cost point cloud alignments (Vatavu et al., 2012a). However, besides recognition rate, the performance of a classifier is also judged by its execution speed and memory requirements. In the NN approach, both execution time and required memory depend directly on the representation adopted for gestures in terms of the number of sampling points (gesture dimensionality) and the precision of the measurement process (gesture bit cardinality). These factors become critical as sensing gradually disappears into the environment through miniaturization (Ni and Baudisch, 2009), forcing designers to optimize execution time and minimize memory consumption for devices with sometimes extremely limited resources.
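To make the dependence on gesture representation concrete, the following is a minimal Python sketch of Nearest-Neighbor classification with three of the metrics named above. It is an illustration under stated assumptions, not the implementation evaluated in this work: gestures are assumed to be lists of (x, y, z) tuples, the Euclidean and Cosine metrics assume both gestures have the same number of points, and the function names (`euclidean`, `cosine_distance`, `dtw`, `classify`) are hypothetical. Preprocessing steps used by published recognizers (resampling, scaling, rotation handling) are left out.

```python
import math

def euclidean(a, b):
    """Sum of point-to-point Euclidean distances (equal-length gestures assumed)."""
    return sum(math.dist(p, q) for p, q in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity of the gestures flattened into 3n-dimensional vectors."""
    u = [c for p in a for c in p]
    v = [c for p in b for c in p]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm > 0 else 1.0

def dtw(a, b):
    """Unconstrained dynamic time warping cost, using a full (n+1) x (m+1) table."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(candidate, templates, metric=euclidean):
    """Return the label of the stored template closest to the candidate."""
    best_label, best_dist = None, float("inf")
    for label, points in templates:
        d = metric(candidate, points)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

Every stored template holds n points with three channels each, and every classification visits all templates, so both memory and running time scale with the number of sampling points; the DTW table additionally grows with the product of the two gesture lengths.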

To discuss just one such example, the eZ430-Chronos from Texas Instruments is a wrist watch that can capture accelerated motion with its embedded 3-axis accelerometer, store data in its 32 KB of flash memory, and process it with a 20 MHz 16-bit microcontroller. However, in order to store a gesture set such as the one from Hoffman et al. (2010) with enough training samples to ensure robust recognition, a minimum of 94 KB would be needed, which is nearly three times the memory of the device! Therefore, even in the age of practically unlimited amounts of memory that get cheaper by the day and 1 GHz processing available on mobile devices, the particular attention to data representation manifested since the early days of computing (Agerwala, 1976, Das and Nayak, 1990) is still relevant.

Data dimensionality and bit cardinality are also closely related to other design decisions that practitioners need to make, especially when implementing functions in dedicated hardware. Implementing functions in hardware (classifiers included) is sometimes the last remaining option for designers to speed up their code. For example, Sart et al. (2010) argue that software optimization ideas for dynamic time warping are close to exhaustion and therefore any new enhancement would come from moving computations to dedicated hardware such as GPUs (Graphics Processing Units) and FPGAs (Field-Programmable Gate Arrays). Designers have already started to consider such options for processing human motion, for example the FPGA data glove design of Park et al. (2008). Also, besides keeping memory consumption low, practitioners of dedicated hardware are interested in the bit depth of their architectures in order to reduce consumed power (Mallik et al., 2006), minimize circuit area (Lee et al., 2005), and reduce latency in their designs (Zhang et al., 2010).

Despite such important connections between gesture representation and system performance, there is no study investigating the performance of 3D gesture classifiers under various sampling rates (gesture dimensionality) and bit depths (gesture bit cardinality). We argue that such a study would be useful in assisting practitioners in optimizing their specific designs. In the absence of such information, prototypers have been experimenting with different options for their designs, with the result that very different gesture representations are being reported, which may confuse a newcomer to the field (see Table 1 illustrating a few examples). Although an important topic with practical implications, the fundamental problem of finding the intrinsic representation of motion data has been only marginally addressed by researchers. In this line of work, the Protractor gesture classifier (Li, 2010) used a reduced dimensionality to optimize the original $1 recognizer (Wobbrock et al., 2007). Recognition experiments reported in Vatavu (2011) showed that low data dimensionality can still deliver high recognition accuracy but for 2D motions only. Working on time series, Bagnall et al. (2006), Rakthanmanon et al. (2011), and Xi et al. (2006) showed that data mining algorithms can benefit from reduced bit cardinality in representing their data, and Hu et al. (2011) employed the Minimum Description Length (MDL) framework to investigate the natural intrinsic representation of time series in terms of approximation model, dimensionality, and alphabet cardinality. Building on such previous works, Rakthanmanon et al. (2012) exploited lower bounding and early abandoning techniques (Keogh et al., 2009) to search trillions of data points quickly and accurately. Vatavu (2012) explored the bit depth of point-based gesture representations and found that 2D motions can be represented using lower bit cardinalities without affecting recognition rate considerably. However, the question of the true dimensionality and bit cardinality of 3D gesture data is still unanswered despite its important implications for practitioners prototyping the way toward the interactive gadgets of ubiquitous computing.
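The early abandoning idea mentioned above can be sketched in a few lines of Python. This is a generic illustration of the technique, not code from any of the cited works: the squared Euclidean distance is accumulated point by point, and the computation stops as soon as the running total can no longer beat the best match found so far. The function names and the (label, points) template layout are assumptions made for the example.

```python
def squared_distance_early_abandon(a, b, best_so_far):
    """Squared Euclidean distance between two gestures, abandoned once it
    reaches best_so_far (returns infinity in that case)."""
    total = 0.0
    for p, q in zip(a, b):
        total += sum((x - y) ** 2 for x, y in zip(p, q))
        if total >= best_so_far:
            return float("inf")  # cannot be the nearest neighbor, stop early
    return total

def nearest_template(candidate, templates):
    """Return (label, squared distance) of the closest template."""
    best_label, best = None, float("inf")
    for label, points in templates:
        d = squared_distance_early_abandon(candidate, points, best)
        if d < best:
            best_label, best = label, d
    return best_label, best
```

Because squared distance preserves the ordering of Euclidean distances, the nearest neighbor is unchanged; only the amount of work per rejected template shrinks.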

This work is the first investigation of the impact that gesture dimensionality and bit cardinality have on the performance of 3D gesture recognizers. We provide empirical evidence that few sampling points and low bit depths are enough for most gesture metrics to attain peak recognition performance with the Nearest-Neighbor classification technique. We do so by computing recognition rates on a large gesture corpus (nearly 20,000 samples) for both user-dependent and user-independent training scenarios. Specifically, we found eight sampling points more than sufficient for the Euclidean and Cosine metrics to deliver high recognition performance, while a linear relationship was detected between gesture dimensionality and the size of the gesture set for dynamic time warping. Also, only 3–5 bits per x, y, z gesture channel were found to provide sufficient representation resolution for the tested metrics to deliver their highest level of recognition accuracy. In turn, the impact on execution time and memory requirements is considerable. We report 16x faster recognizers needing 32x less memory while still delivering high recognition performance.
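As a rough illustration of what reducing dimensionality and bit cardinality means in practice, the Python sketch below resamples a gesture to n points and quantizes each coordinate to a given number of bits. It is a sketch under assumptions rather than the procedure used in the experiments: resampling is done uniformly along arc length (resampling uniformly in time is another option, not shown), quantization is uniform over a caller-supplied range [lo, hi], and the function names are hypothetical.

```python
import math

def resample(points, n):
    """Resample a 3D path into n points equally spaced along its arc length."""
    if n < 2 or len(points) < 2:
        raise ValueError("need at least two input points and n >= 2")
    cum = [0.0]  # cumulative arc length at each input point
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]
    if total == 0:
        return [tuple(points[0])] * n  # degenerate path: all points coincide
    out, j = [], 0
    for k in range(n):
        target = total * k / (n - 1)
        while j < len(points) - 2 and cum[j + 1] < target:
            j += 1  # advance to the segment containing the target length
        seg = cum[j + 1] - cum[j]
        t = 0.0 if seg == 0 else min(max((target - cum[j]) / seg, 0.0), 1.0)
        out.append(tuple(a + t * (b - a) for a, b in zip(points[j], points[j + 1])))
    return out

def quantize(points, bits, lo, hi):
    """Map each coordinate from [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    span = hi - lo
    return [
        tuple(min(levels, max(0, round((c - lo) / span * levels))) for c in p)
        for p in points
    ]
```

For example, `quantize(resample(gesture, 8), 4, lo, hi)` would yield eight points with 4-bit codes per channel; the values 8 and 4 here are only one choice within the ranges reported above.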

In addition to our empirical findings, we propose a mathematical model for explaining the effect of gesture dimensionality and bit cardinality on recognizers' performance. A toolkit is also provided to assist practitioners in optimizing the sampling rate and bit depths of their 3D gesture designs. We validate our toolkit on gesture sets that are publicly available in the community.

We believe the contributions of this work will impact the community of practitioners of gesture-based interfaces with the following implications:

  • (1) Inform performance-oriented design: Low memory requirements are sometimes inevitable (e.g., the motion-sensing wrist watch example) and practitioners need to reduce the memory consumption of their designs. Besides memory reduction, small dimensionalities and low bit depths can be exploited by several hardware architectures such as FPGAs in order to increase parallelism and reduce circuit area and power consumption (Kinsman and Nicolici, 2010, Lee et al., 2005, Mallik et al., 2006).

  • (2) Inform cost-oriented design: Practitioners can make an informed decision about the precision of sensing they actually need for their application (e.g., deciding whether to include an 8-bit instead of a 12-bit or 16-bit precision accelerometer in the design can help reduce the total cost). For example, practitioners developing with Phidgets (Greenberg and Fitchett, 2001) may have to decide whether they need 16 bits of precision for sensing accelerated motion or just 12 bits at half the price.

  • (3) Foster exploration of new software architectures for gestures: New software architectures have started to emerge for gestures, providing practitioners with web services that deliver gesture recognition (Kohlsdorf et al., 2011, Van Seghbroeck et al., 2010, Vatavu et al., 2012b) or gesture suggestions from large databases collected by exploiting the “wisdom of the crowd” (Ouyang and Li, 2012). Such architectures, emerging from the practice of Service-Oriented Computing, reduce programming load and encourage code reusability across new platforms. However, in order for such designs to become practical, the network bandwidth for transferring gesture data needs to be minimized. Therefore, careful analysis of gesture representation in terms of dimensionality and bit depth is needed.

  • (4) Promote conscientious design: An important conclusion of this work is that throwing more resolution at a problem won't necessarily improve accuracy. Instead, employing more resolution than needed may have negative effects such as higher memory and power consumption and, in some cases, even lower recognition performance (Sima and Dougherty, 2008). We hope this work will encourage practitioners to consider improving the performance of their prototypes by careful analysis of gesture representation rather than attempting to do so by throwing more processing power and more memory into the design.

Section snippets

Gesture preliminaries

This section introduces the main concepts used throughout the article such as gesture dimensionality and bit cardinality and briefly discusses the metrics employed for the recognition experiments. We understand by gesture a set of 3D points ordered by their acquisition time, $\{p_i = (x_i, y_i, z_i) \in \mathbb{R}^3 \mid i = 1, \ldots, n\}$, where $n$ is the size of the set and $x_i$, $y_i$, $z_i$ are coordinates in 3D, which can be position or acceleration values. For this study, we employ gesture metrics that work directly on gestures represented as

The impact of gesture dimensionality on recognition rate

We start our investigation with an experiment designed to understand how recognition accuracy is affected by the dimensionality of input gesture motion. The effect of dimensionality on recognition rate is analyzed under two scenarios: user-dependent and user-independent training. In the first scenario, gesture recognizers are trained and tested on data acquired from the same user. For the user-independent scenario, recognizers are tested with gestures captured from different users than those

The combined effect of gesture dimensionality and the size of the gesture set

The results obtained in the first experiment on a rather difficult set of gestures (Hoffman et al., 2010) are intriguing as they show that even low sampling rates are enough to obtain high recognition accuracies. However, in their practice, designers face various application requirements that call for different numbers of gestures. Except for extreme cases (Kristensson and Zhai, 2004, Zhai et al., 2009), a small number of gesture commands are likely to be proposed for

The impact of gesture bit cardinality on recognition rate

We continue our investigation by analyzing the effect of bit cardinality on recognition performance of the proposed gesture recognizers. The experiment design was similar to the first study including both user-dependent and user-independent training. We expect recognition rates to be higher for larger bit depths as gesture motions will be represented at finer resolutions. Similar to hypotheses H1–3 for gesture dimensionality, we propose verifying the following hypotheses on the impact of bit

The impact on execution time and system memory

Besides recognition rate, two more factors define the performance of a gesture recognizer: response time and the amount of memory required to deliver the classification result. We discuss in the following the impact of gesture dimensionality and bit cardinality on these two important factors.
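A back-of-the-envelope way to see how the two factors scale is sketched below in Python. The model is an assumption made for illustration, not the measurement procedure of this work: templates are taken to be tightly bit-packed with three channels per point, classification cost is counted as point-to-point distance evaluations (n per template for the Euclidean and Cosine metrics, n × n for unconstrained DTW), and preprocessing costs are ignored. The function names and the example figures are hypothetical.

```python
def template_memory_bytes(n_templates, n_points, bits, channels=3):
    """Bytes needed to store all templates when coordinates are bit-packed."""
    total_bits = n_templates * n_points * channels * bits
    return (total_bits + 7) // 8  # round up to whole bytes

def distance_evaluations(n_templates, n_points, metric="euclidean"):
    """Point-to-point distance evaluations for one Nearest-Neighbor query."""
    per_template = n_points * n_points if metric == "dtw" else n_points
    return n_templates * per_template

# Hypothetical comparison: 100 templates at 64 points x 12 bits
# versus the same templates at 8 points x 4 bits.
before = template_memory_bytes(100, 64, 12)  # 28800 bytes
after = template_memory_bytes(100, 8, 4)     # 1200 bytes
```

In this made-up configuration the storage drops by a factor of 24 and the Euclidean query cost by a factor of 8; the actual factors depend on the baseline representation a design starts from.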

Modeling the impact of gesture dimensionality and bit cardinality on recognition accuracy

The results of the gesture dimensionality and bit cardinality experiments are intriguing as they show that high recognition performance can be obtained even for low sampling rates and small bit depths. In turn, such small data dimensionality means faster execution times and lower memory requirements for recognizers. However, while experiments showed that only a few sampling points represented on a few bits are sufficient to deliver high recognition performance, a deeper look into what

Gesture analysis tool

Experimental results showed that few sampling points and low bit depths can attain the same high level of recognition accuracy as much finer resolutions. For some metrics, empirical results also showed that dimensionality is related to the number of distinct gestures included in the gesture set. Specifically, we found an upper bound of eight points to be more than enough for the Euclidean and Cosine metrics, while a linear dependence relating sampling rate and gesture set size was derived for DTW

Conclusion

This work investigated the impact of gesture dimensionality and bit cardinality on the performance of 3D gesture recognizers. In the recent context of empowering practitioners with easy-to-implement gesture recognizers (such as $1, $N, and $P) to encourage experimentation with gesture-based interfaces for new platforms and environments, we showed that careful analysis of gesture representation is extremely important when designing under constrained resources. We hope the results of this study

References (73)

  • Sima, C., et al., 2008. The peaking phenomenon in the presence of feature-selection. Pattern Recognition Letters.
  • Agerwala, T., 1976. Microprogram optimization: a survey. IEEE Transactions on Computers.
  • Anthony, L., Wobbrock, J.O., 2010. A lightweight multistroke recognizer for user interface prototypes. In: Proceedings...
  • Anthony, L., Wobbrock, J.O., 2012. $N-Protractor: a fast and accurate multistroke recognizer. In: Proceedings of...
  • Ashbrook, D., Starner, T., 2010. Magic: a motion gesture design tool. In: Proceedings of the 28th International...
  • Bagnall, A., et al., 2006. A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery.
  • Bailador, G., Roggen, D., Tröster, G., Triviño, G., 2007. Real time gesture recognition using continuous time recurrent...
  • Bérard, F., Wang, G., Cooperstock, J.R., 2011. On the limits of the human motor control precision: the search for a...
  • Bragdon, A., Zeleznik, R., Williamson, B., Miller, T., LaViola, Jr., J.J., 2009. Gesturebar: improving the...
  • Cao, X., Zhai, S., 2007. Modeling human performance of pen stroke gestures. In: Proceedings of the SIGCHI Conference on...
  • Chen, M., AlRegib, G., Juang, B.-H., 2012. 6DMG: a new 6D motion gesture database. In: Proceedings of the 3rd...
  • Cohen, J., 1992. A power primer. Psychological Bulletin.
  • Corey, P., Hammond, T., 2008. GLADDER: combining gesture and geometric sketch recognition. In: Proceedings of the 23rd...
  • Das, S.R., Nayak, A.R., 1990. A survey on bit dimension optimization strategies of microprograms. In: Proceedings of...
  • Dubuisson, M., Jain, A., 1994. A modified Hausdorff distance for object matching. In: Proceedings of the 12th IAPR...
  • Friedman, J.H., et al., 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software.
  • Friedman, M., 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association.
  • Ghahramani, S., 2000. Fundamentals of Probability.
  • Greenberg, S., Fitchett, C., 2001. Phidgets: easy development of physical interfaces through physical widgets. In:...
  • Hinckley, K., Pierce, J., Sinclair, M., Horvitz, E., 2000. Sensing techniques for mobile interaction. In: Proceedings...
  • Hoffman, M., Varcholik, P., LaViola, J.J., 2010. Breaking the status quo: improving 3D gesture recognition with...
  • Hu, B., Rakthanmanon, T., Hao, Y., Evans, S., Lonardi, S., Keogh, E., 2011. Discovering the intrinsic cardinality and...
  • Itakura, F., 1975. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.
  • Kara, L.B., Stahovich, T.F., 2004. Hierarchical parsing and recognition of hand-sketched diagrams. In: Proceedings of...
  • Keogh, E., 2002. Exact indexing of dynamic time warping. In: Proceedings of the 28th International Conference on Very...
  • Keogh, E., et al., 2009. Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures. The VLDB Journal.
  • Kim, J., He, J., Lyons, K., Starner, T., 2007. The gesture watch: a wireless contact-free gesture based wrist...
  • Kinsman, A.B., et al., 2010. Bit-width allocation for hardware accelerators for scientific computing using SAT-modulo theory. Transactions on Computer-Aided Design of Integrated Circuits and Systems.
  • Kohlsdorf, D., Starner, T., Ashbrook, D., 2011. Magic 2.0: a web tool for false positive prediction and prevention for...
  • Kratz, S., Rohs, M., 2010. The $3 recognizer: simple 3D gesture recognition on mobile devices. In: Proceedings of the...
  • Kratz, S., Rohs, M., 2011. Protractor 3D: a closed-form solution to rotation-invariant 3D gestures. In: Proceedings of...
  • Kristensson, P.-O., Zhai, S., 2004. SHARK2: a large vocabulary shorthand writing system for pen-based computers. In:...
  • Lane, N.D., et al., 2010. A survey of mobile phone sensing. Communications Magazine.
  • Lee, D.-U., Gaffar, A.A., Mencer, O., Luk, W., 2005. Minibit: bit-width optimization via affine arithmetic. In:...
  • Lee, J.C., 2008. Hacking the Nintendo Wii Remote. IEEE Pervasive Computing.
  • Li, Y., 2009. Beyond pinch and flick: enriching mobile gesture interaction. Computer.