The usage of IMUs in HCI has been explored for gestural input; the most common approach is to place a single IMU on the gesturing finger [28, 29, 30, 84, 107]. However, very little is known about the relationship between the precise position of IMU(s) and their effect on classification performance. To understand the multitude of factors affecting overall classification performance, we systematically investigate several perspectives: the quantity of IMUs, the variation between different finger segments, alternative IMU placement locations that simultaneously achieve high recognition and usability, and, lastly, the feasibility of a user-independent recognition model. An in-depth understanding would not only enable taking full advantage of the IMU sensing capabilities and fine-tuning IMU placement to achieve the maximum performance for a given set of gestures, but also uncover hidden patterns to identify optimal designs of gesture sensing devices.
This section first describes our classification pipeline and a series of empirical analyses, which offers new insights into the design of sparse IMU layouts for hand microgesture recognition.
4.1 Feature Extraction and Classifier Selection
Aiming to understand the underlying factors affecting the recognition rate depending on IMU location, we started by creating a classification pipeline. Given that our search space comprises as many as 393 K layouts, we designed the gesture detection pipeline around two essential requirements: scalability and rapid train-test time.
Feature Extraction. From a given trial and for each of the 9 axes of an IMU, we extract six statistical features: maximum, mean, median, minimum, standard deviation, and variance. In total, the number of features from all 17 IMUs \(\times\) 9 axes \(\times\) 6 features amounts to 918. To compile this list of features, we drew inspiration from the automatic feature extraction library TsFresh [15], which has shown promising results in prior work on gesture and activity recognition [27, 45, 57]. Because we use multiple sensors and want to keep the computational load low, we used the minimum configuration of the library's functionalities. To further minimize the effect of different trial lengths, we removed the sum and length features. Due to the lower sampling rate of our 17-IMU setup as compared to single-sensor approaches [53], we did not extract features from the frequency domain. However, we note that our released dataset will allow the research community to feed more TsFresh features into a neural network [45], take advantage of a single feature, such as derivatives, as input to a neural network [84], or perform further feature engineering for non-neural-network or neural-network classifiers to improve the recognition rate at the optimal location. Later in this section, we show the correlation between our selected features and a different set of features from related work to demonstrate the correlation in the ranking of layouts.
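To make the feature set concrete, the following minimal sketch (not the authors' released code; the array shape and variable names are our assumptions for illustration) computes the six statistics per axis with NumPy:

```python
import numpy as np

def extract_features(trial: np.ndarray) -> np.ndarray:
    """trial: array of shape (n_samples, 17 IMUs * 9 axes) for one gesture trial.
    Returns a flat vector of 17 * 9 * 6 = 918 features."""
    feats = [
        trial.max(axis=0),
        trial.mean(axis=0),
        np.median(trial, axis=0),
        trial.min(axis=0),
        trial.std(axis=0),
        trial.var(axis=0),
    ]
    # Concatenate the per-axis statistics into one feature vector.
    return np.concatenate(feats)

# Example: a 120-sample trial covering all 17 IMUs (9 axes each).
x = extract_features(np.random.randn(120, 17 * 9))
assert x.shape == (918,)
```

Because every feature is a per-axis summary statistic, the vector length scales linearly with the number of IMUs, which keeps training fast across the combinatorial sweep.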
Method. We selected 10 random participants as the training set and the remaining two as the test set (80:20 split) and created grasp-independent models, i.e., the class labels do not include any grasp information. We also performed a leave-one-person-out analysis in Section 4.5. For our multi-class classification, we used 19 classes: (3 fingers \(\times\) 6 gestures) + 1 Static hold. Different IMU layouts may contain different numbers of IMUs (from 1–17); therefore, to compare different state-of-the-art classifiers and estimate the classification time required for the full combinatorial classification, we evaluated 100 randomly selected layouts for each IMU count from 1–17, totaling 1,435 layouts. Note that for counts of 1, 16, and 17, the total number of possible layouts is lower than 100.
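This sampling step can be expressed compactly; in the sketch below, the IMU label scheme is our shorthand for the 15 finger segments (Thumb, Index, Middle, Ring, Pinky \(\times\) dist/midd/prox) plus Handback and Forearm:

```python
import random
from itertools import combinations

IMUS = [f"{f}-{s}" for f in "TIMRP" for s in ("dist", "midd", "prox")] \
       + ["Handback", "Forearm"]

random.seed(0)
sampled = []
for count in range(1, 18):
    layouts = list(combinations(IMUS, count))
    # Counts 1, 16, and 17 have fewer than 100 possible layouts.
    sampled += random.sample(layouts, min(100, len(layouts)))
print(len(sampled))  # 1,435 layouts
```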
Classifier Selection. We fed our extracted features into multiple commonly used classifiers to evaluate their recognition rate and training time. Specifically, we used scikit-learn's implementations of Support Vector Classification (SVC), Logistic Regression (LR), k-nearest neighbors (KNN), and Random Forest (RF) with max_depth = 30, and a PyTorch implementation of a Neural Network (NN) with 4 fully connected layers of decreasing hidden layer size (n = 1,024, 512, 256, ReLU activation) and a final softmax-activated classification layer. Only the NN models were trained on a GPU machine; all others were trained on a 40-core CPU machine. Apart from the parameters noted above, we used the default parameters for all classifiers and performed classification on a trial-by-trial basis. As a performance metric, we used the macro average of the F1 score because it considers both precision and recall.
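For concreteness, a minimal sketch of the Random Forest path through this pipeline (with synthetic stand-in data in place of our extracted features) could look as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Synthetic stand-ins for the 918-feature trials and 19 gesture classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 918)), rng.integers(0, 19, 400)
X_test, y_test = rng.normal(size=(100, 918)), rng.integers(0, 19, 100)

clf = RandomForestClassifier(max_depth=30, random_state=0)
clf.fit(X_train, y_train)  # one such model is trained per IMU layout
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```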
Results. As shown in Figure 5, the F1 score and training time largely depend on the choice of classifier. Since we wanted to use the same classifier for multiple settings in the following analyses, as well as for the later-described computational design tool (see Section 6), we opted for Random Forest. This classifier achieves an average F1 score close to the highest one obtained by the Neural Network while requiring a lower training time. Furthermore, RF models can be easily computed on a consumer-grade CPU machine. In line with findings from prior work [101], our results show that the Random Forest classifier outperforms KNN.
As shown above, our released dataset allows for generating results with various classification techniques. Through our analysis, we found that, while different models may yield different accuracy levels, the performance ordering of individual layouts is very similar. Specifically, to understand our results' dependence on a particular classifier, we used the F1 scores of all layouts with sensor count = 1 from the top-performing classifiers, namely KNN, Ridge, RF, and NN. We then sorted the results alphabetically by IMU label. Using a pairwise Spearman correlation (as used by Guzdial et al. [32] for comparing ranked lists), we obtained correlations of 0.919, 0.975, and 0.919 with p < 0.001 for RF vs. KNN, NN, and Ridge, respectively.
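The rank-agreement computation itself is a one-liner with SciPy; the score lists below are hypothetical placeholders, not our actual results:

```python
from scipy.stats import spearmanr

# Hypothetical per-layout F1 scores from two classifiers, with both lists
# sorted identically (alphabetically by IMU label).
f1_rf  = [0.62, 0.55, 0.48, 0.41, 0.37, 0.30, 0.20]
f1_knn = [0.59, 0.54, 0.49, 0.40, 0.35, 0.31, 0.22]
rho, p_value = spearmanr(f1_rf, f1_knn)
print(rho, p_value)  # rho near 1.0 means the layout ranking is preserved
```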
In addition, we conducted a similar analysis to understand the change in the ranking of IMUs for different sets of features. We selected five features (maximum, minimum, mean, skewness, and kurtosis) used in the existing literature on IMU sensing [28] and trained 17 models with RF. Subsequently, similar to the analysis comparing different classifiers, we calculated the Spearman correlation on the F1 scores of the alphabetically sorted IMU list from both feature sets. Our results show a high correlation of 0.995 with p < 0.001 between the layout rankings produced by the two different sets of features, indicating that while selecting other features may result in a different F1 score, the order of IMUs remains very similar.
4.2 Identifying Sparse Layouts for a Given IMU Count
The large count of IMUs offers the possibility of creating vast layout combinations. However, not every count and layout may produce similar recognition performance. Therefore, an important aspect that we examined was identifying the best-performing sparse layout for a given number of IMUs. This analysis provides three major insights: Firstly, it allows us to understand how the recognition performance varies with the number of IMUs. Secondly, it gives insights into the interval in which F1 scores fall for any given number of IMUs. Lastly, the results inform the optimal IMU placement location with a fixed budget of sensors [10]. Of note, we use the term IMU Count to refer to any given number of IMUs from 1–17.
Method. To explore the full combinatorial space, we trained models with all possible layouts from 1 to 17 IMUs on our initial train-test split as described in Section 4.1. Moreover, to systematically understand the variation in performance for both types of microgestures, we performed this analysis for three conditions: Freehand, Grasping, and Both Combined. This totals 3 \(\times\) (\(2^{17} - 1\)) = 393,213 models. For each model, we performed multi-class classification with 19 classes: (3 fingers \(\times\) 6 gestures) + 1 Static hold. Note, the Grasping and Both Combined conditions utilized grasp-independent models; therefore, we did not encode grasp information in the class labels. In Section 4.6, we compare our results with grasp-dependent models.
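Conceptually, the sweep enumerates every non-empty IMU subset per condition. The sketch below shows only the structure; train_and_eval is a hypothetical stub standing in for the Random Forest pipeline from Section 4.1:

```python
from itertools import combinations

IMUS = [f"{f}-{s}" for f in "TIMRP" for s in ("dist", "midd", "prox")] \
       + ["Handback", "Forearm"]

def train_and_eval(condition, layout):
    """Hypothetical stub: select the layout's feature columns, fit the
    Random Forest pipeline, and return its macro F1 score."""
    return 0.0

results = {}
for condition in ("Freehand", "Grasping", "Both Combined"):
    for layout in (l for n in range(1, 18) for l in combinations(IMUS, n)):
        results[(condition, layout)] = train_and_eval(condition, layout)

print(len(results))  # 3 * (2**17 - 1) = 393,213 models
```

Because each layout is trained independently, the sweep parallelizes trivially across CPU cores, which is what makes the full 393 K-model evaluation tractable.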
Results. Figure 6 plots the F1 score on the test set for each of the 393 K models trained in all three conditions (Freehand, Grasping, Both Combined), organized by the count of IMUs present in the model. We now discuss each condition in turn:
(1) Freehand microgestures: The results provide a complete overview of the large performance differences that depend on the IMU count and, for a given IMU count, on the specific location of the IMUs comprised in a model. As shown in Figure 6(a), the highest F1 score for count = 1 is 0.62 (M-midd). Adding a second IMU increases the F1 score to 0.84 (T-midd, M-dist); the F1 score further increases to 0.90 (T-midd, I-prox, M-midd) and 0.93 (T-midd, I-prox, M-dist, R-prox) with 3 and 4 IMUs, respectively. On the contrary, the lowest F1 score for count = 1 was 0.2 (Forearm), and for count = 2 it was 0.19 (R-prox, Forearm). Amongst all models, the maximum F1 score of 0.97 (T-prox, I-dist, I-prox, M-dist, M-midd, R-midd, P-midd, Forearm) is achieved with count = 8. It should also be noted that an F1 score of 0.90 can be achieved with as few as 3 IMUs; thereafter, adding more IMUs yields a maximum increase of only 4%. The F1 score drops to 0.89 when all 17 IMUs are included. To further investigate this drop, we trained 100 classifiers with random states from 0–99 for count = 17. We changed only the seed values for this investigation, while the classifiers for all other analyses were trained with a constant seed value and default parameters to allow reproducible results. Out of the 100 models, 4 achieved a maximum F1 score of 0.96, which is close to the maximum F1 score of 0.97 achieved by some other, higher counts. Overall, 93 out of 100 models achieved an F1 score greater than or equal to 0.90, and only 7 models had an F1 score between 0.88 (lowest) and 0.89. This explains the drop we observed at count = 17.
(2) Grasping microgestures: Here, our classification setting is more challenging than for Freehand microgestures due to the inclusion of all 12 grasp variations. This results in a slight drop in overall performance (see Figure 6(b)). For count = 1, the highest F1 score was 0.54 (I-midd). Adding an additional IMU (count = 2) gradually increased the performance to 0.72 (I-prox, M-midd), for count = 3 to 0.88 (T-dist, I-prox, M-prox), and for count = 4 to 0.90 (T-dist, I-midd, I-prox, M-prox). Similar to Freehand, the IMU located on the forearm achieved the lowest F1 score of 0.17 for count = 1. Across all models, the maximum F1 score of 0.93 (T-dist, I-dist, I-prox, M-dist, M-prox, Handback) is first achieved at count = 6. Note, the general pattern of variation in the maximum and minimum F1 scores is similar to the Freehand condition, and an F1 score of 0.90 can be observed with a small number of IMUs (count = 4). Afterward, the maximum increment in F1 score is only 3%.
(3) Both Combined microgestures: As shown in Figure 6(c), we observed a similar overall trend when gestures in Freehand and all grasp variations were classified together. The maximum performance achieved with one IMU was 0.53 (I-midd). Adding more IMUs resulted in an increase of the F1 score to 0.74 (I-prox, M-midd), 0.88 (T-dist, I-prox, M-prox), and 0.89 (T-dist, I-prox, M-midd, M-prox) for IMU count = 2, 3, and 4, respectively. Conversely, the minimum F1 scores for counts = 1, 2, 3, and 4 are 0.18 (Forearm), 0.23 (P-dist, P-midd), 0.26 (P-dist, P-prox, Forearm), and 0.28 (P-dist, P-midd, P-prox, Forearm), respectively. The min-max difference of the F1 score within each IMU count shows a similar pattern as in the other two conditions. Across all counts, the maximum F1 score of 0.92 (T-dist, T-midd, I-dist, I-prox, M-dist, M-midd, M-prox, R-dist) is first achieved with count = 8. At count = 5, an F1 score of 0.91 is obtained, and only a 1% increase is seen with more IMUs.
4.2.1 Relevance of each IMU.
Multiple layouts may achieve a performance close to the top-most layout in each count, as shown in Figure 6. To better understand which locations on the hand and fingers are more likely to contribute to top-scoring layouts, we analyzed the top 5% best-scoring layouts (marked in green in Figure 6). Specifically, we introduce an Occurrence Score metric that quantifies the occurrences of each IMU in the top 5% layouts (see Eq. 1). Here, a higher score of an IMU indicates its frequent presence in the top layouts. For a set \(I\) of possible IMUs, the Occurrence Score of an IMU \(i \in I\) is

\[ \mathit{OccurrenceScore}(i) = \frac{1}{17} \sum_{c=1}^{17} \frac{|\{L \in T_c : i \in L\}|}{|T_c|}, \qquad (1) \]

where \(T_c\) denotes the set of top 5% layouts for IMU count \(c\), i.e., we calculate the mean of an individual IMU's occurrence over all IMU counts. It is important to note that this is not the overall occurrence in the total space of 393 K models but rather how frequently the IMU occurs in the top layouts.
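A small sketch of this computation, under our reading of Eq. (1) (the per-count normalization by the number of top layouts is our assumption), is:

```python
import numpy as np

def occurrence_score(imu, top_layouts):
    """top_layouts maps an IMU count to the list of its top-5% layouts
    (each layout a tuple of IMU labels)."""
    fractions = [
        sum(imu in layout for layout in layouts) / len(layouts)
        for layouts in top_layouts.values()
    ]
    # Average the per-count fractions over all IMU counts.
    return float(np.mean(fractions))

# Toy example with two IMU counts and two top layouts each:
top = {1: [("M-midd",), ("I-midd",)],
       2: [("T-midd", "M-dist"), ("I-prox", "M-midd")]}
print(occurrence_score("M-midd", top))  # (1/2 + 1/2) / 2 = 0.5
```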
Results. We examined the Occurrence Score of each IMU, as shown in Figure 7, and derived patterns that guide our further analysis. Since the gestures were performed by the Thumb, Index, and Middle fingers, the IMUs from these three fingers appear more often in the top 5% layouts in all three conditions (Freehand, Grasping, and Both Combined). Interestingly, the Occurrence Score varies greatly across different segments of the same finger. The comparison between the Freehand and Grasping conditions revealed considerable differences: First, we observe that an IMU placed on the tip of the Thumb (T-dist) has a high Occurrence Score of 0.67 for Grasping microgestures, whereas it is only 0.33 for Freehand microgestures. We assume this is related to the nature of gestures performed on the palm in the Freehand condition, wherein the Thumb stretches out over a larger distance and bends less than during Grasping microgestures. In a typical grasp, the Thumb supports the object; hence, the distance to reach the surface for performing a Grasping microgesture is relatively smaller. Second, for all fingers except the Thumb, Grasping microgestures tend to favor IMU placement on the proximal segment over the fingertip. In contrast, Freehand microgestures show a clear tendency to favor placement on the fingertip for the Index and Middle fingers. Below, we investigate the effect of IMU position on classification performance in more detail.
Implications. For all three conditions, we noticed that a higher IMU count does not necessarily translate to higher recognition performance. F1 scores close to the optimum can already be achieved with a fairly small number of IMUs (3 to 6). We observed a large variation in performance depending on where a given number of IMUs is placed on the hand and fingers, which also depends on the microgesture condition, as shown in Figure 7. These findings highlight the importance of creating a layout by choosing the right number of IMUs and the right combination of fingers and finger segments for the desired set of grasps and microgestures to achieve optimal recognition accuracy.
4.3 Performance of IMU Placement at Segment Level
Having identified that the choice of finger segments for IMU placement can be crucial for obtaining high recognition performance, we now aim at investigating the influence of finger segments on recognition performance more systematically. This also informs the design of minimal form-factor devices that place IMUs only at the optimal segment.
Method. We used our initial 80:20 train-test split of the participants' data and evaluated single-IMU layouts under multiple settings. To reduce any effects caused by different grasp variations, we created grasp-dependent models. Moreover, for a clear understanding of individual fingers and their respective gestures, we performed finger-wise classification, i.e., at most six gestures and one static hold class per finger. Overall, we trained 17 single-IMU layouts \(\times\) [(1 Freehand \(\times\) 3 gesturing fingers) + (9 grasp variations \(\times\) 3 gesturing fingers) + (3 grasp variations \(\times\) 1 gesturing finger)] = 561 models. For the analysis in this section, we focus on the IMUs on gesturing fingers and on three representative grasp variations that have been identified in prior work to each represent a cluster of Grasping microgestures [83]. The detailed results, including IMUs on non-gesturing fingers and all 12 grasp variations, will be released with our dataset.
Results. As illustrated by Figures 8 and 9, the F1 score varies greatly across different segments for Freehand as well as Grasping microgestures. In particular, in some cases the F1 score for a gesture may even rise from 0.0 to 1.0 depending on which segment of the same finger the IMU is placed on. In the following, we highlight this effect for Freehand as well as Grasping microgestures.
(1) Freehand: The kinematics of each finger vary, and the motion required for each gesture is also different. As a result, the F1 score can differ greatly across segments (see Figure 8). We observed that the optimal segment differs between fingers. In particular, for Thumb gestures, the middle segment (midd) achieved an average F1 score of 0.93, whereas the other two segments, i.e., distal (dist) and proximal (prox), have relatively lower scores of 0.72 and 0.60, respectively. The optimal segment for Index gestures is different: here, the prox segment has an average F1 score of 0.91, while the performance on the other two segments is considerably lower, with 0.78 (I-midd) and 0.76 (I-dist). While for the Middle gestures all segments achieved a similar F1 score of 0.60–0.65, the segment choice is still prominent for individual gestures, where the performance may differ by 20–40% for Adduction, Abduction, and Circumduction. In contrast, the performance difference across segments is lower for the Tap gesture (10–13%). Surprisingly, due to the hand bio-mechanics, the IMU on the Handback can detect Thumb Flexion and Tap with F1 scores of 0.82 and 0.70, respectively. This finding can be beneficial for detecting finger gestures in settings where a user might not want to wear any sensor on the finger (e.g., while working in a kitchen or car workshop). We investigate this aspect of recognizing gestures from a non-gesturing finger in more detail in the next section.
(2) Grasping: Our results reveal a strong influence of segment choice for Grasping microgestures (see Figure 9). Similar to the Freehand condition, we observed a large difference in F1 scores across different segments of the same finger. Furthermore, it is noteworthy that the pattern of the optimal segment differs across grasp variations. This relates to the distinctive finger postures in different grasps, affecting how a finger moves while performing the gesture. In particular, for the Thumb and Index gestures on Cylindrical-S and Spherical-S, the dist segment appeared as the optimal segment in both grasp variations. However, for the Middle finger gestures, the optimal segment differs across all three grasp variations (Cylindrical-S: dist, Lateral-S: midd, Spherical-S: prox). Moreover, the Index and Middle gestures on Spherical-S have relatively lower variance across segments, which could be explained by the bigger real estate that affords comparatively larger movements than the other two grasp variations. In general, the substantial difference in recognition performance at the segment level is due to the intricacies of the grasp variation, finger, and gesture.
Implications. Depending on the grasp, finger, and type of movement during the gesture, the single-IMU performance varies greatly across segments. This formally validates our initial findings from the full combinatorial classification results: the choice of finger segment for IMU placement can have a very strong influence on classification performance. However, since these classification results differ based on the subset of grasps and chosen gesture classes, a one-size-fits-all design solution will likely not lead to the best results. Hence, we propose a computational design tool in Section 6, which provides layout recommendations based on user-defined parameters.
4.4 Placing IMU on a Non-gesturing Finger
Finger co-activation is a widely known phenomenon in bio-mechanics [78]. Our goal is to leverage finger co-activation and investigate whether the micro-movements induced in neighboring fingers are sufficient for detecting gestures from a non-gesturing finger. This would be beneficial in situations where placing an IMU on the gesturing finger would hinder the primary activity, e.g., an IMU on the Index finger may get in the way while using a knife. In such scenarios, placing the IMU on an alternative location capable of detecting gestures from a neighboring finger would be more desirable.
Method. To investigate the possibility of detecting gestures with any single finger, we used our initial 80:20 train-test split and trained five models for each of the three gesturing fingers; each model comprised a total of three IMUs placed on every segment of the respective finger. For a detailed analysis, we performed grasp-dependent and finger-wise classification. This gives a total of 5 fingers w/ IMUs \(\times\) 3 gesturing fingers = 15 models for Freehand. We trained another 150 models [(5 fingers w/ IMUs \(\times\) 9 grasp variations \(\times\) 3 gesturing fingers) + (5 fingers w/ IMUs \(\times\) 3 grasp variations \(\times\) 1 gesturing finger)]. In each multi-class model, we included all six gestures of an individual finger and the static class, totaling up to seven classes.
Results. Figures 10 and 11 show the F1 scores on the test set for Freehand and Grasping when models are trained with IMUs on different fingers. These results indicate the feasibility of detecting gestures from IMUs on a non-gesturing finger:
(1) Freehand: We observed the effect of finger co-activation and the feasibility of detecting gestures from IMUs on a non-gesturing finger for all three gesturing fingers (see Figure 10). Unsurprisingly, placing an IMU on the gesturing finger results in a higher F1 score in most cases. However, it is important to note that, depending on the finger and gesture, the IMUs on a non-gesturing finger can even yield a higher F1 score than when placed on the gesturing finger. This is particularly visible for gestures performed by the Middle finger. This observation is in line with findings from prior work that have reported the middle finger to induce higher involuntary movement in adjacent fingers [78, 86]. For Middle Circumduction, for instance, the F1 score on a non-gesturing finger (Thumb) increases by 34% (from 0.67 to 1.00) compared to placing an IMU on the gesturing finger (Middle). This can be explained by the involuntary Thumb movement caused while performing the Middle Circumduction on the palm. Also, Index Adduction achieved a 5% higher F1 score through placing IMUs on a non-gesturing finger (Middle) than on the gesturing finger. Even though the Thumb has the least tendency amongst all fingers to induce movements in the neighboring fingers, placing an IMU on a non-gesturing finger (Middle or Ring) produces an F1 score similar to that on the gesturing finger (Thumb) for Flexion, Extension, and Circumduction. These promising results of placing an IMU on the non-gesturing fingers show the feasibility of detecting gestures beyond the conventional placement strategies.
(2) Grasping: As mentioned in prior work, fingers in contact with the object receive support, thereby reducing the effect of co-activation [82]. Thus, all Thumb and Index gestures on Cylindrical-S (Knife) achieved the highest performance when the IMUs are placed on the gesturing finger. In spite of that, we observed that the non-gesturing finger can detect Thumb and Index gestures with a drop of only 15–20% from the F1 score obtained by an IMU on the gesturing finger. While this reduction is considerable, it may be acceptable for some gestures in settings that do not allow for augmenting the gesturing finger with IMUs. Depending on the grasp type and gesture, the IMUs on a non-gesturing finger may even achieve higher performance than on the gesturing finger; e.g., on Spherical-S (Pestle), Thumb Extension and Circumduction achieved higher F1 scores of 0.83 and 0.95, respectively, through IMUs on the non-gesturing finger (Index), whereas the IMUs placed on the gesturing finger (Thumb) achieved comparatively lower scores of 0.67 and 0.87. On Cylindrical and Spherical grasps, all fingers are in close contact with the object, but not all grasp types have the same contact fingers. For example, while holding Lateral-S (Spoon), the Ring and Pinky fingers are suspended in the air, which causes involuntary movement in the adjacent non-gesturing fingers. As a result, the gesturing (Middle) and non-gesturing (Pinky) finger IMUs achieve a similar F1 score for Middle Abduction, and the latter can also detect Middle Flexion with an F1 score of 0.80 (0.15 lower than the IMUs on the gesturing finger). Additionally, we observed the possibility of detecting gestures with non-gesturing fingers that are in contact with the object. With so many different factors affecting the performance, it is challenging for a designer to intuitively place the sensor at an alternative location.
Implications. When the hands are busy, instrumenting the gesturing fingers might not be possible in all cases. For example, while writing, instrumenting the fingers involved in gripping the pen might hinder the primary activity. In such scenarios, placing an IMU on neighboring fingers can be effective. Our findings show that placing IMUs on a non-gesturing finger may enable gesture detection at a comparable or even higher performance rate.
4.5 Generalizability of Layouts across Participants
Next, we aim at understanding the extent of inter-personal differences in recognition performance. This is a crucial question because there can be inter-personal variations in the way the microgestures are performed. If there were a large difference in classification results across participants, the design tool that we describe later in Section 6 would need to account for it while suggesting a sparse layout.
Method. A comprehensive leave-one-person-out (LOPO) evaluation with 12 participants \(\times\) 393,213 layouts = 4,718,556 models would take approximately 25 days of computation time on our 40-core machine. To circumvent this problem, we first identified the best layout according to the F1 score for each IMU count on our 80:20 participant split, using the combinatorial results obtained in the Both Combined condition (Freehand + Grasping). Subsequently, we used these best layouts and trained 204 models (12 participants \(\times\) 17 best layouts, one per IMU count) for the LOPO evaluation.
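A minimal sketch of such a LOPO loop, assuming synthetic stand-in data and using scikit-learn's LeaveOneGroupOut to handle participant-wise splitting, is:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 918))        # synthetic stand-in features
y = rng.integers(0, 19, 240)           # 19 gesture classes
groups = np.repeat(np.arange(12), 20)  # 12 participants, 20 trials each

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = RandomForestClassifier(max_depth=30, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]),
                           average="macro"))
print(np.mean(scores))  # average LOPO F1 across the 12 held-out participants
```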
Results. Figure 12 depicts the results of the LOPO evaluation. We observe that the difference in F1 score between our randomly selected 80:20 train-test split and any LOPO model is about \(\pm\)6%. It is worth noting that most participants achieved higher performance than our randomly chosen test participants.
Implications. Despite the inter-personal variations in how the gestures are performed, our recognition pipeline scales well and achieves high recognition performance with user-independent models. We observed only little variation in F1 scores across participants, which demonstrates that the model predictions generalize to data from new users.
4.6 Grasp-dependent vs. Grasp-independent Models
In our combinatorial analysis, we trained grasp-independent classifiers by combining all grasp variations. Here, we aim at investigating whether these initial results can be further improved when a subset of grasps is selected. This would be relevant for application cases that comprise selected activities with a known set of grasps, or for systems that can identify the current grasp, e.g., by using activity recognition.
Method. We classified all 12 grasp variations separately (grasp-dependent models) using our initial 80:20 split of participants' data with 19 classes [(3 fingers \(\times\) 6 gestures) + 1 static hold]. To save on computation time, we performed the full combinatorial evaluation of grasp-dependent models up to IMU count = 5. This yields 12 grasp variations \(\times \sum_{r=1}^{5} \binom{17}{r}\) layouts = 112,812 models.
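The model count follows directly from the binomial coefficients, as this quick check illustrates:

```python
from math import comb

# All layouts with 1 to 5 IMUs out of 17 candidate positions.
layouts_up_to_5 = sum(comb(17, r) for r in range(1, 6))  # 9,401 layouts
print(12 * layouts_up_to_5)  # 12 grasp variations -> 112,812 models
```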
Results. For 9 out of 12 grasp variations, the F1 score increased when the model was trained on a specific activity (see Figure 13). Grasps like Lateral-S (Spoon), Tip-S (Needle), and Lateral-L (Paper) showed an improvement in recognition of 20–30% compared to the grasp-independent model. In contrast, grasps like Cylindrical-S (Knife) and Tip-L (Pen) did not show any improvement, which can be attributed to the objects' geometry. Specifically, in such grasp variations, the fingers are tightly packed, hindering finger movement while performing gestures.
Implications. The performance tends to improve if the model is trained for a specific grasp variation. Therefore, when a subset of grasp variations is chosen that maps to a specific context, our results from the combinatorial analysis can further improve. This feature of selecting grasps is also integrated into our later-presented design tool for finding a sparse layout.