1 Introduction

Human-computer interaction (HCI) platforms are in great demand in today’s world. In contrast to conventional human-machine interfacing tools, gestural air-writing now serves as an important alternative for natural and effortless HCI. It allows users to interact intuitively with computing devices by writing freely in an unrestricted and comfortable way [1].

However, gestural air-writing differs from conventional pen-based writing or 2D handwriting in that the stroke flow is continuous: there are no intermediate pauses between consecutive characters or between the different segments of a character [2]. For example, an in-air handwritten Assamese character is generally completed in a single continuous stroke, and intermediate repositioning movements connect the adjacent strokes of the character. The connecting links that occur between adjacent characters, as well as within individual characters, are termed ligatures or movement epentheses. The presence of these ligatures makes writing event detection and segmentation challenging. Moreover, these irrelevant connecting motions are diverse and vary widely with the user and the speed of articulation [1]. Therefore, the primary objective of this work is to implement a forward gestural character spotting and segmentation system that is user-independent and that can form an important constituent of effective HCI.

In-air handwritten gesture spotting and segmentation is a popular and active research topic, and several techniques have been formulated by combining different algorithms. Chen et al. [3], Amma et al. [4] and Schick et al. [5] have utilized HMMs for modeling separate characters, after which word recognition is performed by concatenating character HMMs with a repositioning HMM that describes the translocation occurring between individual characters. Although the HMM is a well-established method for modeling continuous characters, model training becomes extensive as the number of character patterns increases, because an individual HMM is constructed per label. Other studies have considered alignment-based approaches for gesture spotting and recognition. For example, Frovola and Berman [6] and Jin et al. [7] have used most probable longest common sub-sequence (MPLCS) and dynamic time warping (DTW) approaches to segment potential character segments out of gesture streams by measuring the similarity between the extracted features of a hand gesture and a predefined template in the database. Alignment-based methods become computationally complex when the range of patterns increases, as they require warping the temporal series against every template sequence in the database. Instead of HMMs or temporal-alignment-based methods for gestural sequence modeling, the proposed work achieves the same task by employing a distinctive feature set that determines the terminal points of character segments, extracts them and models the variations in their shapes, thus avoiding the need for extensive training or matching. The survey also shows that many systems have adopted preambles such as manual gestures, physical buttons and other explicit signals for marking the beginning and terminating points of character fragments in a temporal character stream. Murata and Shin [8] and Ayachi et al. [9] have utilized gestural commands, while Amma et al. [4] have used manual keys for temporal character segmentation. In contrast, our proposed method does not require explicit preambles to specify the writing events in a continuous character sequence.

More specifically, the main contribution of this paper is an efficient two-stage framework for gestural character spotting and segmentation. In the first stage, a window-based scheme is adopted to observe the fluctuations of a kinematic feature set and thereby spot the start and end points of the different character segments inside a character pattern. In the second stage, spatiotemporal and statistical heuristics are applied to categorize the character segments obtained between terminal points as valid or ligature patterns.

The rest of the paper is organized as follows. In Sect. 2, the proposed methodology of air-written gesture spotting and segmentation is presented with a detailed description of all the individual processes involved. Section 3 discusses the results obtained through experimental evaluation on a large dataset with dynamic variations. Finally, Sect. 4 concludes the paper and highlights some future scope of the present work.

2 Proposed System

The overall schematic block diagram of the proposed gestural air-writing detection and segmentation framework for spotting and identifying the valid strokes from a character sequence is shown in Fig. 1. The working of the first two modules of the proposed system, i.e. hand segmentation and hand tracking, is elaborated in [10].

Fig. 1. Schematic block diagram of proposed air-writing spotting and segmentation system.

This work mainly concentrates on the air-writing spotting and segmentation tasks. The detailed working methodology of these processes is described as follows.

2.1 Gestural Air-Writing Spotting and Segmentation

The process of air-writing detection and segmentation requires recognizing the relevant stroke segments in a continuous character pattern. For automatic spotting and segmentation of air-writing, we have combined certain distinctive spatiotemporal features derived from hand tracking signals, and have employed a window-based approach for modeling the statistical variability in character patterns. The detailed functioning of the air-writing spotting and segmentation modules is described in the following sections.

Air-Writing Spotting.

The workflow of the proposed air-writing detection module is shown in Fig. 2. First, each sample of the hand trajectory is represented by its positional coordinates \( p_{i} = (x_{i}, y_{i}) \). Then, a set of motion features is derived and a sliding window is superimposed over this motion data to look for the presence of a writing event. Compared with idling events, writing events are generally characterized by acute changes in writing direction and a high average velocity. So, a window (of length L) is positioned at each designated frame by sliding it with a shift width (w), and the spotting algorithm observes the desired properties within these windows to determine the boundary points (frames) of all the writing segments inside a character pattern. Here, we have empirically selected a window length of 30 frames (2 s) and a shift width of 2 frames (0.13 s).

Fig. 2. Block diagram of proposed air-writing spotting module

On the selected sliding windows, the following motion features are estimated:

  • Average velocity

    $$ \overline{v} = \frac{1}{L}\sum\limits_{i = 1}^{L} {\left| {v_{i} } \right|} $$
    (1)
  • Average angular change in orientation

    $$ \overline{\Delta \theta } = \sum\limits_{i = 1}^{L} {\Delta \theta_{i} } $$
    (2)

where L indicates the number of samples inside the window, and \( v_{i} \) and \( \Delta \theta_{i} \) denote the velocity and the angular change in orientation for sample i. These features are then normalized to reduce the effect of variations in writing style and speed. The normalized features are given by

$$ V^{'} = \frac{{\overline{v} }}{{v_{\hbox{max} } }} $$
(3)
$$ \Delta \theta^{'} = \frac{{\overline{\Delta \theta } }}{{\Delta \theta_{\hbox{max} } }} $$
(4)

where \( v_{\hbox{max} } \) and \( \Delta \theta_{\hbox{max} } \) denote the maximum velocity and the maximum angular change obtained within an observation window.
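The window features in Eqs. (1) to (4) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of pixel-per-second speeds, and the wrapping of heading changes to [0, π] are assumptions, and Eq. (2) is followed as printed (a plain sum over the window).

```python
import math

def sliding_windows(samples, length=30, shift=2):
    """Yield successive windows of `length` samples, advanced by `shift`."""
    for start in range(0, len(samples) - length + 1, shift):
        yield samples[start:start + length]

def window_features(points, fps=15.0):
    """Return the normalized pair (V', dtheta') for one window of (x, y) points."""
    if len(points) < 2:
        raise ValueError("a window needs at least two trajectory samples")
    # Displacement between consecutive tracked positions.
    steps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(points, points[1:])]
    # Speed magnitude |v_i| (assumed unit: pixels per second).
    speeds = [math.hypot(dx, dy) * fps for dx, dy in steps]
    # Absolute heading change between consecutive steps, wrapped to [0, pi].
    headings = [math.atan2(dy, dx) for dx, dy in steps]
    dthetas = []
    for h0, h1 in zip(headings, headings[1:]):
        d = abs(h1 - h0)
        dthetas.append(min(d, 2.0 * math.pi - d))

    v_bar = sum(speeds) / len(speeds)   # Eq. (1), mean over available speeds
    dtheta_bar = sum(dthetas)           # Eq. (2) as printed: a plain sum
    v_max = max(speeds)
    dtheta_max = max(dthetas, default=0.0)
    v_norm = v_bar / v_max if v_max > 0 else 0.0                      # Eq. (3)
    dtheta_norm = dtheta_bar / dtheta_max if dtheta_max > 0 else 0.0  # Eq. (4)
    return v_norm, dtheta_norm
```

With the defaults, `sliding_windows` reproduces the 30-frame window and 2-frame shift reported above; each yielded window can be passed directly to `window_features`.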

In practice, it is observed that these feature values tend to increase while a character segment is being air-written, whereas towards the end of a writing segment they decrease and reach a minimum. However, since there is a large number of character patterns with wide variations in writing speed and shape, global threshold values for V′ and Δθ′ would not be effective for demarcating the start and end points of all the character patterns in the database. So, the computed features (V′ and Δθ′) for a sample are compared with adaptively computed thresholds (TV and Tθ respectively) determined using the K-Means clustering algorithm. A trajectory sample is designated as a start point (S) or an end point (E) of a writing segment depending on whether the kinematic parameters (V′ and Δθ′) are greater than or less than their respective adaptive thresholds (TV and Tθ). The resulting end points signify non-writing events and are thus eliminated from the character stream. Finally, the handwriting (HW) segments obtained between each pair of boundary points are separated out for further processing.
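The thresholding rule just described might be sketched as below. The text does not state how the two comparisons are combined, so requiring both features to exceed their thresholds is an assumption of this sketch, as are the function name and the half-open (S, E) index convention; fixed thresholds are passed in for simplicity.

```python
def spot_segments(features, t_v, t_theta):
    """Return half-open (S, E) index pairs of writing runs.

    `features` is a sequence of (V', dtheta') pairs, one per sample.
    A sample counts as writing when BOTH features exceed their thresholds
    (assumed combination rule). E is the first non-writing sample after S,
    so the segment itself spans samples S .. E-1.
    """
    segments = []
    start = None
    for i, (v, dtheta) in enumerate(features):
        writing = v > t_v and dtheta > t_theta
        if writing and start is None:
            start = i                      # start point S
        elif not writing and start is not None:
            segments.append((start, i))    # end point E, excluded from segment
            start = None
    if start is not None:
        # Stream ended while still writing: close the run at the stream end.
        segments.append((start, len(features)))
    return segments
```

For example, a stream whose second, third and fifth samples exceed both thresholds yields two HW segments, with the low-valued samples in between treated as non-writing events.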

Algorithm for determination of adaptive thresholds (TV and Tθ) using K-Means clustering approach:

  1. Choose a window Win having a length of 6 samples, and consider the feature values V′i and Δθ′i (i = 1, …, 6) inside this window.

  2. Take two clusters CL1 and CL2 and initially assume their cluster centers (C1 and C2) to be the 2nd and 4th feature points (i.e. V′2, Δθ′2 and V′4, Δθ′4).

  3. Calculate the Euclidean distance of all the feature points (V′i and Δθ′i) within Win from the cluster centers (C1 and C2).

  4. Assign each of the feature points to the cluster CL1 or CL2 with the nearest cluster center.

  5. Update the cluster centers (C1 and C2) by taking the mean of all the feature points within the respective clusters (CL1 and CL2).

  6. Repeat steps 3, 4 and 5 until the cluster centers C1 and C2 do not change.

  7. Compute the adaptive thresholds (TV and Tθ) by taking the mean of the converged cluster centers obtained in step 6.
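The seven steps above can be sketched in a few lines. This is an illustrative version under stated assumptions: the function names, the iteration cap `max_iter`, the tie-breaking towards CL1, and the handling of an empty cluster (its center is kept unchanged) are not specified in the text.

```python
import math

def _mean_point(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def adaptive_thresholds(features, max_iter=100):
    """Two-cluster K-Means on six (V', dtheta') pairs; returns (T_V, T_theta)."""
    pts = [tuple(p) for p in features]
    # Step 2: initial centers are the 2nd and 4th feature points.
    c1, c2 = pts[1], pts[3]
    for _ in range(max_iter):
        # Steps 3-4: assign each point to the nearest center (ties go to CL1).
        cl1 = [p for p in pts if math.dist(p, c1) <= math.dist(p, c2)]
        cl2 = [p for p in pts if math.dist(p, c1) > math.dist(p, c2)]
        # Step 5: update centers; an empty cluster keeps its old center.
        new_c1 = _mean_point(cl1) if cl1 else c1
        new_c2 = _mean_point(cl2) if cl2 else c2
        # Step 6: stop once the centers no longer change.
        if new_c1 == c1 and new_c2 == c2:
            break
        c1, c2 = new_c1, new_c2
    # Step 7: thresholds are the mean of the two converged centers.
    return (c1[0] + c2[0]) / 2.0, (c1[1] + c2[1]) / 2.0
```

With six samples split into a low group near 0.2 and a high group near 0.9, the thresholds land midway between the two groups, which is what lets them adapt to each writer's speed.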

Air-Writing Segmentation.

After successful extraction of HW segments, the next task is to classify them into valid and ligature patterns so that the ligatures can be suppressed and the valid patterns can be passed on for recognition of the overall character pattern. The flow diagram of the gestural air-writing segmentation module is shown in Fig. 3.

Fig. 3. Block diagram of proposed air-writing segmentation module

In our work, we are concerned with modeling the variations that occur in the shape of Assamese characters. To accomplish this task, a composite feature set is proposed comprising kurtosis, entropy and the normalized length of the HW segment. These features are described as follows.

  • Kurtosis (K): The kurtosis of a distribution is given by the ratio of the fourth moment (m4) to the square of the second moment (m2) [11]

    $$ K(x) = \frac{{m_{4} }}{{m_{2}^{2} }} = \frac{{m_{4} }}{{\left( {\sigma^{2} } \right)^{2} }} = \frac{1}{N}\sum {\left( {\frac{x - \mu }{\sigma }} \right)^{4} } = E(z^{4} ) $$
    (5)

where z = (x − µ)/σ is the standardized value, x represents the observation values, N is the number of observations, µ is the mean, σ is the standard deviation and E() is the expectation operator. The K value for a normal distribution is 3, and the K value for any other distribution reflects how its observations deviate from a normal distribution. Equation (5) has a simpler interpretation, given as [11]

$$ K(x) = var(z^{2} ) + 1 $$
(6)

where var() is the variance. Accordingly, kurtosis is interpreted as the extent to which observations are dispersed away from the shoulders of a distribution (z² = 1, i.e. z = ± 1).
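As a reading aid, the step from Eq. (5) to Eq. (6) can be written out. Because z is standardized, E(z) = 0 and E(z²) = 1, so

$$ var(z^{2} ) = E(z^{4} ) - \left[ {E(z^{2} )} \right]^{2} = K(x) - 1 $$

and rearranging gives K(x) = var(z²) + 1.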

In this study, we take the scaled inverse kurtosis (K′) as the first feature for character segmentation. It is given as

$$ K'(x) = \frac{c}{K(x)} $$
(7)

where c is a scaling constant empirically taken as 6, and K(x) is the kurtosis determined using Eq. (6). This implies that, if the observation values x of a character segment are more dispersed from the shoulders, then K < 3 and hence the K′ value will be larger than for segments exhibiting less dispersion, which have K ≥ 3.

  • Entropy (H): Entropy gives a measure of the uncertainty of a probability distribution and is given by [12]

    $$ H = - \sum\limits_{i} {P_{i} } \log P_{i} $$
    (8)

    where Pi denotes the probability of observation values.

  • Normalized segment length (L′): The normalized length of a HW segment is given by

    $$ L^{'} = \frac{\left\| d \right\|}{h} $$
    (9)

    where d is the total distance traversed by the HW segment and h is the height of a minimum-area bounding rectangle [13] encompassing the segment.
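The three features in Eqs. (5) to (9) can be sketched as follows. Several details here are assumptions made only for illustration: the probabilities P_i are estimated from a fixed-width histogram with a hypothetical `bins` parameter, the natural logarithm is used, and the bounding-rectangle height h is supplied by the caller rather than computed.

```python
import math

def kurtosis(values):
    """Kurtosis K = m4 / m2^2 using population moments, as in Eq. (5)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant values")
    return m4 / (m2 * m2)

def scaled_inverse_kurtosis(values, c=6.0):
    """Scaled inverse kurtosis K' = c / K, as in Eq. (7) with c = 6."""
    return c / kurtosis(values)

def entropy(values, bins=10):
    """Shannon entropy (Eq. (8)) of a histogram estimate of the distribution."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a single occupied bin carries no uncertainty
    counts = [0] * bins
    for x in values:
        # Map x to a bin index; the maximum value falls in the last bin.
        idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((k / n) * math.log(k / n) for k in counts if k > 0)

def normalized_length(points, h):
    """Normalized length L' = d / h (Eq. (9)); h is the rectangle height."""
    d = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return d / h
```

A two-valued symmetric sample such as `[-1, 1, -1, 1]` has the smallest possible kurtosis (K = 1), so its K′ reaches the scaling constant c = 6; values concentrated near the mean with rare outliers push K above 3 and K′ below 2.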

In this procedure, we first approximate every “n” consecutive points of the HW segment with a least-squares (LS) fitted ellipse [13] and compute its angular eccentricity values, which describe the variations in shape of the HW segments. The angular eccentricity of a point P(x, y) on an ellipse with a semi-major axis of length a and a semi-minor axis of length b is given by

$$ \alpha = \tan^{ - 1} \left( {\frac{ay}{bx}} \right) $$
(10)
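As a quick check of Eq. (10), consider a point expressed in the ellipse's own axes (centered at the origin, major axis along x), with the parametric angle t introduced here only for illustration:

$$ x = a\cos t,\quad y = b\sin t\quad \Rightarrow \quad \frac{ay}{bx} = \frac{ab\sin t}{ab\cos t} = \tan t $$

so α coincides with the parametric angle t, up to the branch of the inverse tangent. Under this reading, the α profile tracks how the fitted position moves around successive ellipses along the segment.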

From the α profile of a HW segment, the scaled inverse kurtosis value (K′(α)) and the entropy of its probability distribution (H(α)) are calculated using Eqs. (7) and (8). Simultaneously, the normalized length of the HW segment (L′) is computed using Eq. (9). In gestural air-writing of Assamese characters, the valid character segments have more variation in α than the ligatures, which are generally simple and have less variability. So, according to the concept of kurtosis, the valid HW segments will have a larger K′ value than the ligature segments. Likewise, the valid HW segments will have a higher amount of uncertainty in their α values, i.e. more information content, and hence a higher entropy value than ligature segments. Further, the ligature segments, being simple in nature, will inherently have a shorter normalized length than valid segments. We consider a combination of three features (K′, H and L′) for character segmentation because, in cases where a single feature fails to capture the distinguishing traits of valid and ligature patterns, the scores from the remaining features aid in correct character segmentation. Thus, the final score of a HW segment is given by the weighted sum of scaled inverse kurtosis, entropy and normalized length.

$$ score = \omega_{1} K^{'} + \omega_{2} H + \omega_{3} L^{'} $$
(11)

Here, the weights ω1, ω2 and ω3 are taken uniformly as 1.

Thus, the final scores of all the HW segments inside a character pattern are computed using Eq. (11), and the decision threshold (TD) is determined by taking the average of the two smallest scores. Suppose C = {C1, C2, C3, …, Cn} is a character pattern consisting of n HW segments; then a character segment is classified as a valid (‘1’) or ligature (‘0’) pattern using the following decision equation

$$ D(C_{i} ) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{score}}(C_{i} ) > T_{D} } \hfill \\ {0,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(12)

In the case of characters consisting of only one HW segment, i.e. without any connecting links, only the kurtosis value (K) computed using Eq. (6) is taken as the discriminating feature for classifying it as a valid or ligature pattern. If the K value is less than 3, the character segment has wide variations and is designated as a valid pattern; otherwise it is classified as a ligature pattern.
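The score fusion of Eq. (11) and the decision rule of Eq. (12) might be sketched as below. The function names are hypothetical, and the feature triples in the usage note are invented values for illustration, not results from the paper.

```python
def fused_scores(features, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of (K', H, L') per segment, as in Eq. (11)."""
    w1, w2, w3 = weights
    return [w1 * k + w2 * h + w3 * l for k, h, l in features]

def classify_segments(features, weights=(1.0, 1.0, 1.0)):
    """Label each HW segment as valid (1) or ligature (0), as in Eq. (12).

    The decision threshold T_D is the average of the two smallest scores,
    so at least two segments are required.
    """
    scores = fused_scores(features, weights)
    if len(scores) < 2:
        raise ValueError("use classify_single_segment for one-segment patterns")
    t_d = sum(sorted(scores)[:2]) / 2.0
    return [1 if s > t_d else 0 for s in scores]

def classify_single_segment(k):
    """Single-segment rule: valid (1) if kurtosis K < 3, else ligature (0)."""
    return 1 if k < 3.0 else 0
```

For instance, `classify_segments([(2.5, 1.8, 3.0), (1.0, 0.5, 1.2), (2.4, 1.7, 2.8)])` gives scores of about 7.3, 2.7 and 6.9, a threshold of about 4.8, and labels `[1, 0, 1]`. Because the threshold averages the two smallest scores, the lowest-scoring segment always falls at or below it and is labeled a ligature.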

3 Experimental Results and Discussions

We evaluate the performance of our proposed air-writing spotting and segmentation model on an Assamese character dataset. The experimental database consists of 46 Assamese characters, comprising 11 vowels, 10 numerals and the first 25 consonants. Each character is written 20 times by each of 2 subjects, thus forming an overall mixed dataset of 1840 patterns. The database therefore encompasses dynamic variations in the trajectory shape and duration of the character patterns. The video corpus of gestural characters is generated using a webcam with a frame rate of 15 frames/s and a resolution of 640 × 360. The following sections describe the results derived from our gestural air-writing detection and segmentation modules.

3.1 Air-Writing Spotting Results

In this section, we depict the profile distributions of the feature set (V′ and Δθ′) employed for gesture spotting and the corresponding writing event detection. Figure 4(a) and (b) show the distributions of normalized velocity (V′) and relative angular change in orientation (Δθ′) used for gesture spotting of a continuous character “ ”. It is observed that the feature values increase during the commencement and subsequent writing of a HW segment, while towards the end of the writing event these values decline and attain a minimum.

Fig. 4. Variation of (a) V′ and (b) Δθ′ profiles for continuous character “ ”

Figure 5 illustrates a few instances of the air-writing path of the character pattern “ ” along with the start and end points (S and E) of writing segments obtained by employing the proposed feature set and window-based analysis. At the end of gesture spotting, we obtain three HW segments for the pattern “ ”, which have to be classified as valid or ligature in the next stage.

Fig. 5. Hand tracking output at a few instances of continuous character pattern “ ” depicting the spotted start and end points (S and E)

3.2 Air-Writing Segmentation Results

Figure 6 shows the angular eccentricity (α) profile which outlines the variations in shape of all the three HW segments of the character pattern “ ”.

Fig. 6. Variation in angular eccentricity (α) values for continuous character “ ”

Table 1 lists the feature values (K′, H and L′) for all three HW segments, their corresponding scores and the final decision. From Fig. 6 and Table 1, it is seen that the valid segments (1st and 3rd) have more variation in α values than the ligature segment, and hence they have a higher scaled inverse kurtosis value. Similarly, there is more uncertainty in the α values of valid segments, and hence they have more entropy than ligatures. Further, it is observed that valid HW segments have a greater normalized length than ligature patterns.

Table 1. Character segment classification results for character pattern “ ”

3.3 Performance Evaluation

We evaluate the performance of our proposed system by calculating the overall Segment Error Rate (SER). For a character pattern Ci, the SER is given by

$$ SER = \frac{E}{{N_{C} }} = \frac{S + D}{{N_{C} }} $$
(13)

where \( N_{C} \) is the total number of HW segments in the character pattern and E is the total number of segment errors in the character pattern, given by the sum of substitution (S) and deletion (D) errors [14]. Therefore, the overall SER is given by

$$ {\text{Overall SER}} = \frac{{\sum\limits_{i = 1}^{N} {SER_{i} } }}{N} \times 100\% $$
(14)

where N indicates the total number of character patterns in the database.
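Eqs. (13) and (14) can be sketched as a small helper. The function names and the `(S, D, N_C)` tuple layout are assumptions, and the counts in the usage note are made-up numbers for illustration rather than results from the paper.

```python
def segment_error_rate(substitutions, deletions, n_segments):
    """Per-pattern SER = (S + D) / N_C, as in Eq. (13)."""
    return (substitutions + deletions) / n_segments

def overall_ser(per_pattern):
    """Overall SER in percent, averaged over patterns as in Eq. (14).

    `per_pattern` is a list of (S, D, N_C) tuples, one per character pattern.
    """
    rates = [segment_error_rate(s, d, n) for s, d, n in per_pattern]
    return 100.0 * sum(rates) / len(rates)
```

For example, `overall_ser([(0, 0, 3), (1, 0, 2), (0, 0, 4), (0, 1, 4)])` averages the per-pattern rates 0, 0.5, 0 and 0.25 to give 18.75%. Note that Eq. (14) averages per-pattern rates, so a pattern with few segments weighs as much as one with many.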

The composite SER of our proposed system with and without score fusion is presented in Table 2.

Table 2. Overall SER results for the proposed system

Comparing the results obtained using score fusion with those obtained from individual features shows a considerable improvement in performance when the features are combined. This is because some characters contain valid segments with very little variation; if only individual features are considered, the system fails to resolve such ambiguities and falsely interprets these segments as ligatures. Aggregating the three features into a single score therefore produces a lower SER than classification based on single features.

4 Conclusion

In this paper, we have formulated a vision-based forward air-writing detection and segmentation mechanism for determining the principal character segments from continuous character patterns. For gesture spotting and character segmentation, we have implemented a sliding-window-based approach and applied a mixed feature set integrating spatiotemporal and statistical parameters. Experimental evaluation on continuous Assamese character patterns reveals that our proposed technique offers an overall segment error rate of around 1.3%, thereby demonstrating its effectiveness in extracting the legitimate portions from a character sequence. In this study, the efficiency of the spotting and segmentation paradigm has been validated for the vowels, numerals and a few consonants of the Assamese script. In future, the remaining consonants and a few words will be taken into consideration, and the database will be extended with more samples from different participants. A further challenging task for future work is the recognition of certain Assamese characters that have similar trajectory patterns. In real-world scenarios, the proposed system, in conjunction with a recognition module, can function as a complementary modality for applications such as communicative aids for hearing-impaired people, smart media controllers and game controllers.