Abstract
Realizing a hands-free control system based on the recognition of mid-air gestures still involves a number of serious problems. It has not been clarified how gesture commands should be interpreted, because the stroke phase can be understood as static as well as dynamic, and depending on which interpretation is used, the gesture itself has to be executed in a different manner. This question was examined with video sequences of different interpretations and an online questionnaire. The results, together with the remaining open problems, led to a first solution for a mobile, hands-free controlled transportation system (for picking, lifting and transporting small boxes) in the logistics domain.
1 Introduction
Which kind of input device may be used in industrial environments? A lot of devices exist, like touchscreens, buttons, keyboards or remote controls, depending on the task the user has to fulfill. But are all these different devices necessary? During communication people (mostly) do not touch each other, yet they interact, of course. The main type of this interaction between people is spoken words. But as Austin [1] already wrote: “The third division of the external port or oratory, or of delivery, is gesture.” Human beings use gestures to enhance their speech.
In industrial environments, certain difficulties exist in using speech for communication between worker and machine. Technical speech recognition needs a signal that is (nearly) free of noise. German law limits noise levels to 85 dB(A) in production areas [2]. Taking these limitations into account, for proper speech recognition a microphone would have to be located directly in front of the mouth, which contradicts most usual working conditions.
2 Propaedeutics
Preferring interaction without additional handheld or body-mounted devices, an alternative method to interact with a machine is discussed here: the usage of mid-air gestures instead of voice commands. McNeill [3] “classif[ies] gesture movements into four major categories: iconic, metaphoric, deictic (pointing) and beat gestures”. In the technical approach we propose, only iconic and pointing gestures are considered.
To define a gesture which is not combined with speech, the technical system requires a specific hint in order to classify a certain action as a gesture. Comparing mid-air gestures with speech recognition, Wigdor [4] identifies this “live mic problem” as similar in both interaction techniques.
Kendon [5] divided gestures into three parts called preparation (1), stroke (2) and recovery (3). For using a gesture in a Human-Machine Interface (HMI), parts one and two are necessary. In most cases the preparation part has to act as a wake-up event (see “live mic problem”). This preparation must be identical even if the command and the following stroke differ. Referring to the “live mic problem”, in [6] three phases were defined: “(1) registration…, (2) continuation […] and (3) termination”. Alternatively, Pavlovic [7] defined that during “the preparation phase […] the hand […moves] from some resting position” to the starting point. The stroke is the interaction command.
Although plenty of 3D recognition systems like the Microsoft® Kinect™ exist, along with many publicly available libraries, none of them provides special algorithms to detect whether a body movement is part of a preparation phase entering the interaction area (see Fig. 1) or only a spontaneous movement. So it is necessary to newly define a wake-up event during the preparation phase.
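As an illustration only, the following Python sketch shows how such a wake-up event could be approximated once 3D joint positions are available. The coordinate convention, the 0.35 m threshold and the function name are assumptions, not part of the system described here.

```python
import numpy as np

def entered_interaction_area(torso, hand, min_forward=0.35):
    """Heuristic wake-up event: the hand counts as inside the interaction area
    once it is raised above the torso joint and reaches far enough in front of
    the body. Axes (y up, z away from the camera) and the 0.35 m threshold are
    assumptions, not taken from the paper."""
    torso, hand = np.asarray(torso, float), np.asarray(hand, float)
    raised = hand[1] > torso[1]                    # hand above torso height
    forward = (torso[2] - hand[2]) > min_forward   # hand clearly in front of the body
    return bool(raised and forward)

# Hypothetical joint coordinates in metres (camera frame)
print(entered_interaction_area(torso=(0.0, 1.0, 2.5), hand=(0.2, 1.3, 2.0)))  # True -> treat as wake-up
```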
McNeill [3] identified that iconic gestures are performed in the center region, while deictic gestures are performed in more peripheral areas.
Our vision-based technical approach uses the OpenNI (NITE) libraries to identify the “joints” shown in Fig. 2. These joints can be recognized easily in peripheral areas, but in the center area the vision-based approach leads to an erroneous capture of many points (see Fig. 3, wrong identification of both arms). It is obvious that a pointing gesture towards the camera system cannot be identified with this system setup.
Fig. 2. Joints used for identification by OpenNI (NITE) (modified, based on [10])
Pavlovic [7] mentioned that since “human gestures are a dynamic process, it is important to consider the temporal characteristics of gestures”. He discussed the temporal aspect in order to characterize the “preparation and retraction […] by rapid change in position of the hand while the stroke […] exhibits relatively slower hand motion.”
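A minimal sketch of how this temporal characterization could be exploited, assuming a tracked sequence of hand positions with timestamps; the 0.8 m/s speed threshold is an arbitrary placeholder, not a value from the paper.

```python
import numpy as np

def label_frames(hand_positions, timestamps, speed_threshold=0.8):
    """Label the interval between consecutive frames as 'transport'
    (fast preparation/retraction movement) or 'stroke' (slower motion),
    following the temporal characterization cited from [7]."""
    positions = np.asarray(hand_positions, dtype=float)
    times = np.asarray(timestamps, dtype=float)
    speeds = np.linalg.norm(np.diff(positions, axis=0), axis=1) / np.diff(times)
    return ['transport' if s > speed_threshold else 'stroke' for s in speeds]

# Hypothetical 30 Hz hand trajectory: fast rise, then a nearly still stroke
trajectory = [(0.0, 1.0, 2.0), (0.0, 1.1, 1.9), (0.0, 1.2, 1.8), (0.0, 1.21, 1.79)]
print(label_frames(trajectory, [0.0, 0.033, 0.066, 0.099]))  # ['transport', 'transport', 'stroke']
```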
The discussion of the temporal aspect is also interesting for the definition of the stroke phase. The question is whether the stroke phase of a deictic (pointing) gesture is more static, like a body posture, while the stroke phase of an iconic gesture (in the case of an action) is more dynamic. For an iconic gesture referring to a concrete event or object this question can be answered without discussion.
The model of McNeill [3] even uses five phases for one gesture phrase (see Fig. 4). He added a pre-stroke hold (1a) and a post-stroke hold (2a). “The pre-stroke hold (optional) is the position and hand posture […] held […] until the stroke begins.” This hold is necessary to synchronize speech and gesture, but is it also necessary for the HMI use of a gesture?
If the stroke phase (2) is defined as dynamic, then it is necessary to consider the temporal aspect as well as the spatial one (the necessity to first enter the interaction area). Under these conditions it may be necessary to have a pre-stroke hold, a post-stroke hold or sometimes both.
3 Method
The general question of using gestures in human computer interaction has to be divided into at least two single aspects:
- The human users’ comprehension and application of gestures (what influences user acceptance).
- How could such an intuitive gesture be recognized by a technical system?
To analyze the first aspect, we developed and applied an online questionnaire.
The main questions were the following:
- Will a user accept commanding a technical system, e.g. a personal assisting robot, with gestures?
- Is there a direct relationship between intended commands and preferred gestures? For example, if the user wants the system to approach him, we predict that he will use a dynamic stroke phase, like beckoning.
- Will a pre- or post-stroke hold, or both, be necessary if the stroke phase is dynamic?
To answer these questions, a set of video sequences was recorded. To limit the number of possible commands, six commands were selected and rated as static or dynamic. The three predicted dynamic commands were “come over”, “lift up” and “turn around”, and the three predicted static commands were “register”, “stop” and “select a (certain) object”. For each of these commands two dynamic and two static gestures were presented. According to the definition of Pavlovic [7], each of the gestures starts and ends outside the interaction area, and all video sequences start with the rapid move into the interaction area (preparation phase).
With the definitions mentioned above it still cannot be answered clearly when (at which time) and/or where (at which position) the stroke phase starts. Answering this question leads to the necessity of an explicit description of gestures, which is finally usable in the identification algorithms of technical systems.
An example to explain this approach is the pointing gesture, which could be defined as static (a posture) with a shoulder angle of more than 20° and an elbow angle of more than 160°, reached during the stroke phase (2) and held for at least 5 s (post-stroke hold (2a), see Fig. 5). During the recovery phase the user can relax his arm and return to the resting position; the gesture recognition is already finished.
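The following Python sketch illustrates how this static definition could be checked against tracked joints. Only the 20°, 160° and 5 s criteria are taken from the description above; the angle computation and the way the shoulder angle is approximated are assumptions.

```python
import numpy as np

def angle_deg(a, b, c):
    """Angle at joint b (degrees) between the segments b->a and b->c."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

class StaticPointingDetector:
    """Checks the static pointing posture described above: shoulder angle
    above 20 deg, elbow angle above 160 deg, held for at least 5 s."""
    def __init__(self, hold_time=5.0):
        self.hold_time = hold_time
        self.hold_start = None

    def update(self, t, torso, shoulder, elbow, hand):
        # Shoulder angle approximated as the angle between the upper arm and the
        # torso direction; this is an assumption about how the 20 deg criterion is measured.
        shoulder_angle = angle_deg(torso, shoulder, elbow)
        elbow_angle = angle_deg(shoulder, elbow, hand)      # ~180 deg = arm fully stretched
        if shoulder_angle > 20.0 and elbow_angle > 160.0:
            if self.hold_start is None:
                self.hold_start = t                         # posture reached, start hold timer
            return (t - self.hold_start) >= self.hold_time  # True once the post-stroke hold is long enough
        self.hold_start = None                              # posture left, reset timer
        return False
```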
To fulfil the requirement of the online survey to offer two variants of each command, a second video presenting a dynamic stroke phase was necessary. This pointing gesture could also be described as a purely dynamic stroke with the following procedure: move the hand up to the shoulder and direct it from the shoulder towards the chosen object. After the arm reaches its fully stretched position, the interaction area has to be left directly (without post-stroke hold) towards the resting position with the hands on the hips (see Fig. 5).
If this dynamic definition of gestures is to be used for recognition in technical systems, it causes some additional problems, which can be explained with the pointing gesture described above. As shown in Fig. 6, it is generally difficult to recognize the current hand-arm position during a pointing gesture. For a dynamic gesture the hand has to be observed the whole time, and the algorithm has to recognize the movement of the hand joint online. Because the aim is to detect that one certain object was pointed at, a resulting pointing vector (which meets the point where the desired object is placed) has to be calculated. With the static gesture approach it is much easier to calculate this vector, simply by following the line from the shoulder joint over the hand joint to an object after the post-stroke hold has been performed.
In the dynamic gesture case the vector changes all the time, and additionally the elbow angle has to be monitored. Only if the elbow angle becomes larger than 160° and the arm afterwards leaves the calculated vector is the desired gesture confirmed to have been performed. But the thresholds and criteria for this moment of leaving the calculated vector still have to be defined.
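For illustration, a possible way to derive the pointed-at object from the shoulder-hand line (the static-gesture approach above) is sketched below; the object list, the 0.3 m tolerance and the camera frame are assumptions.

```python
import numpy as np

def pointing_target(shoulder, hand, objects, max_offset=0.3):
    """Return the object closest to the shoulder->hand ray, i.e. the candidate
    whose perpendicular distance to the pointing vector is smallest and below
    max_offset. Object positions and the 0.3 m tolerance are assumed values."""
    shoulder = np.asarray(shoulder, float)
    direction = np.asarray(hand, float) - shoulder
    direction /= np.linalg.norm(direction)
    best, best_offset = None, max_offset
    for name, position in objects.items():
        offset = np.asarray(position, float) - shoulder
        along = float(np.dot(offset, direction))
        if along <= 0.0:                          # object lies behind the user, ignore
            continue
        perpendicular = np.linalg.norm(offset - along * direction)
        if perpendicular < best_offset:
            best, best_offset = name, perpendicular
    return best

# Hypothetical scene: two boxes, the arm points roughly at box_A
boxes = {'box_A': (1.2, 1.0, 2.0), 'box_B': (-1.0, 1.0, 2.0)}
print(pointing_target(shoulder=(0.0, 1.4, 0.0), hand=(0.3, 1.3, 0.5), objects=boxes))  # box_A
```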
The online questionnaire was created to investigate only the first aspect. The usability definition uses the parameters effectiveness, efficiency and satisfaction to measure how well a user achieves specific goals in particular environments [11]. To evaluate effectiveness and efficiency, a first prototype of the technical gesture recognition system has to be available. Once it is available, we propose to measure fatigue, naturalness, gesture duration and accuracy [12] as indicators for the quality of a gesture. This method can only be used if the person under analysis can be observed optically. So at this early stage of the study, and with an online questionnaire, only satisfaction could be analyzed. The dimensions of satisfaction are acceptance and intuitiveness of the gestures shown. As intuitiveness is an abstract term, it is not as easy to evaluate as acceptance. With the online questionnaire, acceptance was evaluated by the following questions:
- Can I imagine interacting with a mid-air gesture?
- Do I feel really strange when I’m interacting with mid-air gestures?
- Do I think that interacting with gestures is complicated?
- Do I have fun when I’m interacting with gestures?
Another research question was how the intuitiveness of gestures can be identified and what the intuitiveness of a gesture generally means.
As indicators to answer these questions, three criteria were used: awareness level, authenticity and comprehensibility. The participants had to rate the gestures on a four-point scale (it fits, it rather fits, it rather does not fit, it does not fit). The following questions had to be answered:
- Do I know the gesture from my daily conversation?
- Do I find the gesture to be authentic?
- If I cannot talk, would I use this gesture?
- Do I feel strange/uncomfortable when I have to perform this gesture?
- Can I understand why this gesture should be used for this command?
- Are the steps to perform the gesture easy?
To compare dynamic and static stroke phases of a gesture with the users’ understanding of the type of command (dynamic or static), the following items were used:
- Rate the gestures on a four-point scale (static, rather static, rather dynamic, dynamic) according to your understanding of the command.
- Rank the gestures presented according to your understanding of the command.
4 Results
The survey was completed by 90 participants, mostly students from Technische Universität Ilmenau; 73 % of them were between 21 and 30 years old.
Figure 7 shows the main results of the acceptance test. About 70 % of the participants would accept performing a mid-air gesture to interact with a technical system in their workspace.
For the test of intuitiveness it was first necessary to review the chosen items. As Table 1 shows, the relations between all items are significant at the 5 % level, but only between the four items “gesture is known”, “gesture is authentic”, “if speaking is impossible, using this gesture instead” and “understanding gesture in connection with command” is the interrelation significantly positive at the 5 % level. To describe intuitiveness with only one consolidated value, the average rating of the six items is used. To raise the weight of the items with the best interrelation and highest correlation, the items “gesture is authentic” and “understanding gesture in connection with command” were weighted double.
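The consolidation could look like the following sketch; the item names, the assumed scale direction (1 = it fits, 4 = it does not fit) and the example ratings are illustrative, only the double weighting of the two items follows the text.

```python
def intuitiveness_score(ratings):
    """Consolidated intuitiveness value: weighted mean of the six item ratings,
    with 'gesture is authentic' and 'understanding gesture in connection with
    command' counted twice, as described above."""
    weights = {
        'known': 1, 'authentic': 2, 'use_if_not_speaking': 1,
        'feels_strange': 1, 'understanding': 2, 'easy_to_perform': 1,
    }
    total = sum(ratings[item] * weight for item, weight in weights.items())
    return total / sum(weights.values())

# Hypothetical ratings for one gesture on the assumed 1..4 scale
example = {'known': 2, 'authentic': 1, 'use_if_not_speaking': 2,
           'feels_strange': 3, 'understanding': 1, 'easy_to_perform': 2}
print(intuitiveness_score(example))
```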
The participants had to give an explicit ranking of all gestures for one command. The corresponding values were compared. The results shown in Table 2 demonstrate that the predicted understanding of the type of a gesture mostly matches the intuitive understanding of a gesture.
But as shown for the command “lift up”, it is sometimes difficult (for the participants and probably for later potential users) to define what makes a gesture static or dynamic. As an example, the intuitiveness value of three gestures for the command “lift up” was equal to 2.0. These gestures were ranked first for this command by 24 % to 37 % of the participants. That means that half of the participants understood this command as static (but with two different gestures) and another third as dynamic. Additionally, the second dynamic gesture was not accepted as intuitive (intuitiveness value of 2.8) and was classified as not suitable for the command by 50 % of the participants (Fig. 8).
This part of the research shows that it is probably impossible to find one definitive type of gesture for all commands, which implies that it is better to provide two or more implementations. Thus later users are able to choose for themselves which one fits best, and the system can be designed as a self-learning system for these gesture implementations.
5 Summary and Outlook
As shown by the example of only one command (“lift up”), it is not easy to predefine exactly whether a gesture will be accepted by the user in a particular dynamic or static version. As demonstrated with the pointing gesture in Chapter 3, it is not even easy to define what a dynamic or a static gesture is. Since McNeill defines that the preparation phase is used to enter the interaction area, every gesture contains a dynamic part, and so the difference between a static and a dynamic gesture can only be defined or determined by the time spent in the pre- or post-stroke hold.
The two described definitions of gestures (cf. Chapter 3) show how many parameters have to be calculated during gesture recognition. But in most setups more than two gestures are needed for interaction with a technical system. Additional algorithms to distinguish the different commands are necessary to handle this complexity.
The next step in the development of robust mid-air gestures for human-machine interaction is to define more parameters to describe a gesture (and its phases). For humans, the best way to describe a gesture is a real presenter showing how the gesture is performed; only the second best way is a well performed and recorded video sequence showing all the desired details.
For recognition by a technical system, all the general parameters like joints, angles or timing aspects need to be identified and described. These parameters must fit for a set of gestures. Currently a first test system has been created with only one pointing gesture.
In the current state of that system, a combination of a pointing gesture for the estimated position (see Fig. 9a) and an image-based touch confirmation for the fine position (see Fig. 9b) is used. Due to the multiple problems of robust recognition, this combination is necessary to obtain a safe and proper system for use in industrial environments. Using this recognition algorithm, a personal assisting system for lifting and carrying small boxes (a project called KLARA) will execute the lifting and carrying process. For this system the described principles of execution and detection of mid-air gestures are used. Currently only the static pointing gesture described in Chapter 3 is used to select a certain box. This solution will also give physically restricted persons the ability to handle these boxes, because the system can pick boxes from the floor or from above the user’s head using only the pointing gesture.
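A minimal sketch of this two-stage selection logic, under the assumption that the pointing estimate and the touch confirmation are delivered by separate components; all class and function names are illustrative and not part of the KLARA system.

```python
from dataclasses import dataclass

@dataclass
class BoxCandidate:
    name: str
    position: tuple          # coarse 3D position from the pointing estimate

class TwoStageSelector:
    """Stage 1: the static pointing gesture yields a coarse candidate box.
    Stage 2: the user confirms the exact box on an image-based touch display;
    only then is the lifting/carrying task released."""
    def __init__(self):
        self.candidate = None

    def on_pointing_result(self, box: BoxCandidate):
        self.candidate = box                       # remember the coarse selection

    def on_touch_confirmation(self, touched_name: str):
        if self.candidate and self.candidate.name == touched_name:
            return f"start lifting task for {touched_name}"
        return "confirmation does not match pointing result, ask again"

selector = TwoStageSelector()
selector.on_pointing_result(BoxCandidate('box_A', (1.2, 0.0, 2.0)))
print(selector.on_touch_confirmation('box_A'))
```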
References
Austin, G.: Chironomia; or a Treatise on Rhetorical Delivery. Printed for T. Cadell & W. Davies in the Strand; by Bulmer, Cleveland-Row, St. James’s, London (1806)
LärmVibrationsArbSchV: Verordnung zum Schutz der Beschäftigten vor Gefährdungen durch Lärm und Vibrationen (Lärm- und Vibrations-Arbeitsschutzverordnung – LärmVibrationsArbSchV), 06 March 2007. http://bundesrecht.juris.de/bundesrecht/l_rmvibrationsarbschv/gesamt.pdf. Accessed January 2015
McNeill, D.: Hand and Mind. What Gestures Reveal About Thought. University of Chicago Press, Chicago (1992)
Wigdor, D., Wixon, D.: Brave NUI World: Designing Natural User Interfaces for Touch and Gesture. Morgan Kaufmann/Elsevier, Burlington (2011)
Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004)
Walter, R., Bailly, G., Müller, J.: StrikeAPose: revealing mid-air gestures on public displays. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2013), pp. 841–850. ACM, New York (2013)
Pavlovic, V.I., Sharma, R., Huang, T.S.: Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 677–695 (1997)
MICROSOFT: Human Interface Guidelines V1.8 (HIG). https://msdn.microsoft.com/en-us/library/jj663791.aspx and http://go.microsoft.com/fwlink/?LinkID=247735. Accessed February 2015
Deutsche Gesetzliche Unfallversicherung (DGUV) (ed.): Ergonomische Maschinengestaltung von Werkzeugmaschinen der Metallbearbeitung. Berlin (2010). http://publikationen.dguv.de/dguv/pdf/10002/i-5048-1.pdf
thearmagamer. Wooden dummy (rigged). Blend Swap, LLC (2014). http://www.blendswap.com/blends/view/72452. Accessed January 2015
DIN EN ISO 9241-11, Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten Teil 11: Anforderungen an die Gebrauchstauglichkeit – Leitsätze, Beuth Verlag GmbH, Berlin (1999)
Barclay, K., Wei, D., Lutteroth, C., Sheehan, R.: A quantitative quality model for gesture based user interfaces. In: Proceedings of the 23rd Australian Computer-Human Interaction Conference (OzCHI 2011), pp. 31–39. ACM, New York (2011). http://doi.acm.org/10.1145/2071536.2071540