Abstract
Automatic sign language recognition (SLR) is an active area of research, as it is meant to serve as a substitute for sign language interpreters. In this paper, we present the design of a continuous SLR system that can extract the meaningful signs from a sign stream and subsequently recognize them. Here, we use the height of the hand trajectory as a salient feature for separating the meaningful signs from the movement epenthesis patterns. Further, we incorporate a unique set of spatial and temporal features for efficient recognition of the signs encapsulated within the continuous sequence. The implementation of an efficient hand segmentation and hand tracking technique makes our system robust to complex backgrounds as well as backgrounds with multiple signers. Experiments have established that our proposed system can identify signs from a continuous sign stream with a 92.8% spotting rate.
1 Introduction
Sign language is a natural mode of communication used by deaf people for easy interaction in daily life. The need for sign language recognition (SLR) systems has been increasing in recent times, as they have become a key ingredient in communication between the hearing impaired and hearing people. Coarticulation in sign language is a vital aspect that makes the task of SLR a challenging one. In simple terms, coarticulation is the phenomenon by which one sign blends into the next in a signed expression. Signs appear significantly different when they occur in a sentence compared to when they appear in isolation [12]. These differences are especially apparent at the beginning and at the end of a sign, and can vary considerably under different sentence contexts. This complicates the recognition of signs embedded in a continuous stream.
Movement epenthesis (ME) is a special attribute of coarticulation where a transitional movement occurs between two signs [14] and is observed in continuous hand gesture recognition. In sign language, ME may occur in global motion (where the entire hand moves) as well as in local motion (where only fingers move), during transition from one sign to the next [9]. In this paper, we deal with the modeling of ME in global motion. Several works have addressed ME as part of SLR systems. In Ref. [14], Yang et al. have proposed a parallel approach for simultaneous segmentation and matching of signs in continuous sign sentences involving ME, using a dynamic time warping-based approach. However, the drawback of their proposed system is that the signs and the MEs have to be matched against all the sentences in their database to obtain the correct recognized sign. This increases the computational complexity of the system, and the system is limited to a minimal set of sign sentences. Kelly et al. [8] have reported a hidden Markov model (HMM)-based gesture recognition system that can categorize a given gesture sequence as one of the pretrained gestures or as ME by calculating the log-likelihood of an observation sequence and comparing it with a threshold. However, the limitation of their system is that it requires explicit modeling of ME segments, which, in turn, restricts it to a confined vocabulary, as it is capable of recognizing only eight different signs and 100 different types of MEs. A non-uniform rational B-spline-based interpolation function has been used by Chuang et al. [6] for identifying ME, where a combination of distance, smoothness, and image distortion costs is used for determining each cut point pair. These points signify the start and end of each sign. However, this method of ME detection requires a predefined database consisting of a hand trajectory, sign language, and eigenhand database.
A conditional random field (CRF)-based adaptive threshold model was proposed by Yang et al. [15] for classification of meaningful signs and non-sign patterns. The threshold model was constructed by incorporating an additional label for non-sign patterns using the weights of the state and transition feature functions of the original CRF. Due to this feature, non-sign patterns (or MEs) are not required for training their system. They have used two motion-based and four location-based features for recognition. However, their system provides a recognition rate of about 87% for spotting signs from continuous sequences, which is lower than that of our proposed system, which delivers a recognition rate of around 93%. This improvement is due to the inclusion of a unique set of both spatial and temporal features for recognizing the extracted signs.
In this paper, we present the design of a continuous SLR system that can extract the meaningful signs and subsequently recognize them. In comparison to Refs. [6, 8, 14], our proposed system does not require any explicit modeling of ME segments, and it is not confined to a specific set of sign sentences. We use the height of the hand trajectory as a feature for characterizing the ME phase that occurs in a signed utterance. Our proposed continuous SLR system is designed for spotting signs embedded in a continuous sign sentence using a two-step approach. First, the height of the hand trajectory is used as a key element for segmenting out the meaningful sign frames. Second, a distinctive feature set (comprising two spatial features and two temporal features) is used for recognizing the segmented signs. An additional asset of our proposed system is that it responds effectively to various background conditions, such as complex backgrounds, daylight and dimlight conditions, and backgrounds with multiple signers. This is due to the contour processing part of the hand segmentation module, which plays a crucial role in efficient segmentation of signs under the above background situations. A CRF is trained extensively with a set of data that includes specific samples recorded under complex background, daylight and dimlight conditions, background with multiple signers, etc. Experimental results show that the system is robust and provides consistent performance under the conditions identified.
2 Proposed System
The overall block diagram of the proposed continuous SLR system for recognizing signs embedded in a continuous sign stream is shown in Figure 1. Each of the steps involved is described in detail below.
2.1 Hand Segmentation
The first step of hand segmentation involves capturing input frames using a webcam and detecting the face. Next, the face region is masked out using a Haar classifier [3]. This is followed by skin color segmentation [10], with associated morphological closing and opening operations, to segment out the hand region, which is our region of interest. However, this step yields a noisy output if the background comprises cluttered objects or multiple signers. To combat such situations, a contour processing stage is incorporated. The flowchart of the contour processing stage is shown in Figure 2, and its detailed working is described in Ref. [5].
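The skin color classification step can be sketched as follows. This is a minimal illustration assuming fixed Cr/Cb thresholds in the YCrCb color space; the threshold values below are common heuristics, not necessarily those of Ref. [10], and the Haar-based face masking and morphological cleanup steps are omitted.

```python
import numpy as np

def skin_mask(rgb):
    """Return a boolean skin mask from an RGB image array.

    Pixels are converted to the Cr/Cb chrominance plane and classified
    with the widely used heuristic ranges Cr in [133, 173] and
    Cb in [77, 127] (assumed values, not the paper's exact thresholds).
    """
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    # Standard RGB -> YCrCb chrominance conversion (BT.601 coefficients)
    cr = 128 + 0.5 * r - 0.4187 * g - 0.0813 * b
    cb = 128 - 0.1687 * r - 0.3313 * g + 0.5 * b
    return (cr >= 133) & (cr <= 173) & (cb >= 77) & (cb <= 127)
```

In a full pipeline, this mask would be refined with morphological closing and opening before the contour processing stage.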
2.2 Hand Tracking
After successful hand segmentation, the next step is to find the hand trajectory traced while performing the signed utterance. The flowchart of the hand tracking stage for both one-handed and two-handed signs is shown in Figure 3. In this step, the centroid of the contour(s) obtained at the output of the contour processing stage is first computed using simple geometric moments [11]. In the case of one-handed signs, the centroid of the largest contour in the current frame is determined and then connected to the centroid of the largest contour in the previous frame. In the case of two-handed signs, the main principle used for finding the trajectories of both hands separately is that the distance between the centroids of the same hand will always be less than that between different hands. According to this principle, the contours for which this comparative distance is smaller are connected.
Algorithm of hand tracking for two-handed signs [4]:
Let, prevC1 be the centroid of the first largest contour in the previous frame and currC1 be the centroid of the first largest contour in the current frame.
Similarly, let prevC2 be the centroid of the second largest contour in the previous frame and currC2 be the centroid of the second largest contour in the current frame.
Further, let d1 be the distance between prevC1 and currC1,
d2 be the distance between prevC2 and currC2,
d3 be the distance between prevC1 and currC2, and
d4 be the distance between prevC2 and currC1.
Then, the proposed algorithm of hand tracking can be summarized as follows:
Step 1: Compute d1, d2, d3, and d4
Step 2: if (d1<d3 and d2<d4) then
currC1 and currC2 are unchanged
else
currC1 and currC2 are swapped
Step 3: Connect currC1 to prevC1 and currC2 to prevC2
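The hand tracking steps above can be sketched directly in Python. The function name and the tuple-based centroid representation are illustrative choices, not the paper's implementation:

```python
import math

def track_two_hands(prev_c1, prev_c2, curr_c1, curr_c2):
    """Assign current-frame centroids to hand tracks by nearest distance.

    Returns the current centroids ordered so that the first continues
    prev_c1's track and the second continues prev_c2's track.
    """
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    d1 = dist(prev_c1, curr_c1)  # same-index candidate distances
    d2 = dist(prev_c2, curr_c2)
    d3 = dist(prev_c1, curr_c2)  # cross-index candidate distances
    d4 = dist(prev_c2, curr_c1)
    if d1 < d3 and d2 < d4:
        return curr_c1, curr_c2  # ordering unchanged
    return curr_c2, curr_c1      # contour order flipped between frames: swap
```

Each returned centroid is then connected to its predecessor to extend the corresponding trajectory.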
2.3 ME Detection
The proposed ME detection module for detecting the ME frames in a continuous sign sequence is shown in Figure 4. In the proposed model, the height of the hand trajectory (H) is used as a feature for describing the ME phase. For extracting this feature, a selected number of points (say p) of the hand trajectory (obtained at the output of the hand tracking stage) is approximated by a minimum-area bounding rectangle, as shown in Figure 5. The height of this rectangle (H) serves to define the ME phase. This relies on the assumption that the speed of the hand is very low at the commencement and end of a sign. Thus, during this period, the p points come closer to each other, and as such, the height of the minimum-area bounding rectangle (H) decreases. Hence, this phase can be characterized as the ME phase, and subsequently the frames corresponding to this phase can be rejected from the input sign sequence. Here, we have defined Hcode as a feature for symbolizing the ME frames:

Hcode = small, if H < T1; medium, if T1 ≤ H < T2; large, if H ≥ T2    (1)

where T1 and T2 are empirically selected thresholds for the height of the minimum-area bounding rectangle. Here, the experimental values of T1 and T2 are taken to be 18 and 60, respectively, and the number of points p is taken to be 5.
Thus, the frames for which Hcode = small are marked as ME frames and are consequently discarded from the input sign sequence. Hence, the system detects ME satisfactorily when the transition from one sign to the next is slower than the motion while performing a sign.
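The ME filtering rule can be sketched as follows, assuming a three-level quantization of H with the empirical thresholds T1 = 18 and T2 = 60; the function names are illustrative:

```python
def hcode(H, T1=18, T2=60):
    """Quantize the bounding-rectangle height H into the Hcode symbol.

    Default thresholds follow the paper's empirical values (18 and 60).
    """
    if H < T1:
        return "small"
    elif H < T2:
        return "medium"
    return "large"

def drop_me_frames(frames, heights, T1=18, T2=60):
    """Discard frames whose Hcode is 'small' (movement epenthesis)."""
    return [f for f, h in zip(frames, heights) if hcode(h, T1, T2) != "small"]
```

The surviving frames form the valid sign segments passed on to feature extraction.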
2.4 Feature Extraction
After segmenting out the valid sign frames from the input sign sequence using the ME detection module, the next step is to extract salient features for representing the valid sign segments; these features subsequently play a crucial role in the successful recognition of the segmented signs.
While static hand gestures are modeled in terms of hand configuration and palm orientation, dynamic hand gestures require hand trajectories and orientation in addition to these [1]. So, we have proposed a set of spatial and temporal features for achieving this objective.
Spatial features: From the hand contour obtained from the contour processing stage, a pairwise geometric histogram (PGH) is constructed to describe the configuration of the hand and fingers. A PGH is an extension of the chain code histogram, with the important distinction that its discriminating ability is higher.
The PGH is a powerful shape descriptor that is applied to polygonal shapes. It can also be applied to irregular shapes, if the shape is first approximated with a polygon [7]. Suppose a polygon is considered to be defined by the edge points of an irregular shape; then, the successive edge points define the line (or edge) segments of the polygon. The PGH is constructed as shown in Figure 6. Each of the edges of the polygon is successively chosen to be the “base edge.” Then, each of the other edges is considered relative to that base edge and three values are computed: dmin (smallest distance between the two edges), dmax (largest distance between the two edges), and θ (the angle between them). The PGH is a two-dimensional histogram whose dimensions are the angle and the distance. In particular, for every edge pair, there is a bin corresponding to (dmin, θ) and a bin corresponding to (dmax, θ). For each such pair of edges, those two bins are incremented in addition to all bins for intermediate values of d (i.e. values between dmin and dmax) [2].
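A simplified version of the PGH construction described above can be sketched as follows. The binning granularity and the use of endpoint-to-base distances as proxies for dmin/dmax are assumptions made for illustration; the exact formulation is given in Refs. [2, 7].

```python
import numpy as np

def pairwise_geometric_histogram(poly, n_angle_bins=8, n_dist_bins=8, d_max=None):
    """Build a simplified pairwise geometric histogram for a polygon.

    Each edge is taken in turn as the base edge; for every other edge,
    the angle between them and the min/max distances of its endpoints
    from the base edge line are computed, and all distance bins between
    d_min and d_max are incremented at the corresponding angle bin.
    """
    pts = np.asarray(poly, float)
    edges = [(pts[i], pts[(i + 1) % len(pts)]) for i in range(len(pts))]
    if d_max is None:
        d_max = np.ptp(pts, axis=0).max() or 1.0  # normalization scale
    hist = np.zeros((n_angle_bins, n_dist_bins))
    for i, (a0, a1) in enumerate(edges):
        base = a1 - a0
        nrm = np.array([-base[1], base[0]]) / np.linalg.norm(base)  # unit normal
        for j, (b0, b1) in enumerate(edges):
            if i == j:
                continue
            other = b1 - b0
            cosang = np.dot(base, other) / (np.linalg.norm(base) * np.linalg.norm(other))
            theta = np.arccos(np.clip(abs(cosang), 0.0, 1.0))  # angle in [0, pi/2]
            k = min(int(theta / (np.pi / 2) * n_angle_bins), n_angle_bins - 1)
            d0 = abs(np.dot(b0 - a0, nrm))  # endpoint distances from base line
            d1 = abs(np.dot(b1 - a0, nrm))
            lo = min(int(min(d0, d1) / d_max * n_dist_bins), n_dist_bins - 1)
            hi = min(int(max(d0, d1) / d_max * n_dist_bins), n_dist_bins - 1)
            hist[k, lo:hi + 1] += 1  # all bins between d_min and d_max
    return hist
```

For an irregular hand contour, the shape would first be approximated by a polygon, as noted above, before the histogram is accumulated.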
From the PGH obtained from the segmented hand contours, the minimum and maximum values are extracted and taken as spatial features.
Temporal features: From the hand trajectory of the valid sign segment, we take say p number of points at a time, and approximate it with a minimum-area bounding rectangle. The height (H) and orientation of the rectangle with respect to the vertical (θ) (as shown in Figure 7) constitutes our proposed temporal feature set.
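The temporal features (H, θ) can be approximated as follows. This sketch uses a PCA-based oriented bounding box rather than a true minimum-area rectangle (which would typically be computed with rotating calipers, e.g. OpenCV's minAreaRect), so the values may differ slightly from the paper's.

```python
import numpy as np

def oriented_box_features(points):
    """Approximate (H, theta) for a set of 2-D trajectory points.

    H is the extent of the points along the principal axis of a
    PCA-aligned bounding box (an approximation of the minimum-area
    rectangle height), and theta is the angle of that axis with
    respect to the vertical, in degrees.
    """
    pts = np.asarray(points, float)
    centered = pts - pts.mean(axis=0)
    # Principal axes come from the eigenvectors of the covariance matrix.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    proj = centered @ vecs                     # coordinates in the principal frame
    extents = proj.max(axis=0) - proj.min(axis=0)
    major = vecs[:, np.argmax(extents)]        # direction of largest extent
    H = extents.max()
    theta = np.degrees(np.arctan2(abs(major[0]), abs(major[1])))  # vs. vertical
    return H, theta
```

Applied to successive windows of p trajectory points, this yields the (H, θ) sequence used alongside the spatial features.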
2.5 Recognition Using a CRF Classifier
In our proposed system, we have used a CRF classifier for the purpose of recognition. CRF is advantageous in comparison to HMM because it does not make strong independence assumptions about the observations and can be trained with fewer samples than HMM [13].
It is a statistical classifier based on conditional probability, used for segmenting and labeling sequential data. CRFs use a single exponential distribution to model all labels of the given observations. In CRFs, the probability of a label sequence Y, given an observation sequence X, is found using a normalized product of potential functions. The conditional probability is given by [15]

P(Y|X) = (1/Zθ(X)) exp( Σi=1..n [ Σv λv tv(Yi − 1, Yi, X, i) + Σm μm sm(Yi, X, i) ] )    (2)

In Eq. (2), tv(Yi − 1, Yi, X, i) is a transition feature function of observation sequence X at positions i and i − 1; it indicates whether a feature value is observed between two states. sm(Yi, X, i) is a state feature function of the observation sequence at position i; it indicates whether a feature value is observed at a particular label. Yi − 1 and Yi are the labels of observation sequence X at positions i − 1 and i, n is the length of the observation sequence, and λv and μm are the weights of the transition and state feature functions, respectively. Zθ(X) is the normalization factor, which ensures that the probabilities of all possible label sequences sum to one.
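The CRF conditional probability P(Y|X) can be checked numerically on a toy linear chain by brute-force enumeration of all label sequences; all weights below are made-up illustrative numbers, not trained values.

```python
import itertools
import math

def crf_log_potential(y, x, trans_w, state_w):
    """Unnormalized log-score of label sequence y for observations x.

    state_w[label][obs] plays the role of the weighted state feature
    functions; trans_w[prev_label][label] plays the role of the weighted
    transition feature functions (toy tables, assumed for illustration).
    """
    s = sum(state_w[y[i]][x[i]] for i in range(len(x)))
    s += sum(trans_w[y[i - 1]][y[i]] for i in range(1, len(x)))
    return s

def crf_prob(y, x, trans_w, state_w, labels):
    """P(Y|X) = exp(score) / Z(X), with Z computed by enumerating
    every possible label sequence (feasible only for tiny chains)."""
    z = sum(math.exp(crf_log_potential(yp, x, trans_w, state_w))
            for yp in itertools.product(labels, repeat=len(x)))
    return math.exp(crf_log_potential(y, x, trans_w, state_w)) / z
```

In practice, Z(X) is computed with the forward algorithm rather than enumeration; the brute-force version simply makes the normalization explicit.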
The accuracy of the proposed system model is calculated by finding the sign spotting/recognition rate (RR) using

RR = (C/N) × 100%    (3)

where C is the number of correct spottings and N is the number of test signs [15].
3 Experimental Results and Discussion
The video corpus is generated by taking into account some dynamic hand gestures comprising different combinations of numerals ranging from 0 to 9. Two possible combinations are shown in Figure 8. The video sequences are captured by means of a webcam having a frame rate of 15 frames/s and resolution of 640×360.
The experimental results obtained at different stages of our proposed system are described below.
3.1 Hand Segmentation Results
The performance of the hand segmentation module was verified both qualitatively and quantitatively. Figure 9A and B show the outputs of hand segmentation considering a complex background with multiple signers for one-handed and two-handed inputs, respectively. The results obtained for daylight and dimlight conditions are shown in Figure 10A and B. These visual results confirm that our proposed hand segmentation scheme is robust to complex backgrounds, backgrounds with multiple signers, and daylight and dimlight conditions. This is mainly due to the incorporation of the contour processing stage in the hand segmentation module.
To assess the quantitative performance, the numbers of false positives (FP) and false negatives (FN) are considered as parameters. The number of FP indicates the approximate number of frames in which an incorrect contour is detected along with the desired contours, and the number of FN indicates the approximate number of frames in which a desired contour is not detected.
Table 1 shows the comparative results for hand segmentation in terms of number of FP and number of FN, taking into account four different background conditions viz. complex background, background with multiple gesturers, daylight condition, and dimlight condition.
| Approaches | Background | No. of FP | No. of FN |
|---|---|---|---|
| Skin color segmentation | Complex | 32 | 0 |
| | Multiple gesturers | 32 | 0 |
| | Daylight | 28 | 0 |
| | Dimlight | 15 | 2 |
| Frame differencing | Complex | 32 | 0 |
| | Multiple gesturers | 32 | 0 |
| | Daylight | 32 | 0 |
| | Dimlight | 32 | 0 |
| Skin color + frame differencing | Complex | 1 | 0 |
| | Multiple gesturers | 32 | 0 |
| | Daylight | 25 | 0 |
| | Dimlight | 11 | 2 |
| Proposed model | Complex | 0 | 0 |
| | Multiple gesturers | 0 | 0 |
| | Daylight | 1 | 0 |
| | Dimlight | 0 | 1 |
As the results show, the proposed model of hand segmentation provides the least number of FP and FN in comparison to the other three methods, and thereby proves to be more robust and effective with respect to the stated background conditions.
3.2 Hand Tracking Results
Figure 11A and B show the results of hand tracking. The results prove that our proposed method gives an accurate trajectory even in the presence of a complex background.
3.3 Variation of the Proposed Feature for Characterizing the ME Phase
The variation of the height of the minimum-area bounding rectangle at different instances for the continuous sign sequence “8–3” is shown in Figure 12. As seen from the figure, the height of the minimum-area bounding rectangle becomes very small during the transition from sign “8” to sign “3,” and hence this phase is defined to be the ME phase. The associated heights (Hcode) corresponding to sign and ME frames are also shown in the figure.
3.4 Recognition Results
The performance of our proposed continuous SLR system was tested on ten different sign sequences. The recognition results obtained using the CRF classifier (trained with isolated numerals from 0 to 9) are shown in Table 2. The results show that our proposed system offers a recognition rate of around 93%. The system can be tested with any possible combination of continuous sign sequences involving ME.
| Continuous signs | N | C | RR (%) |
|---|---|---|---|
| 8–3 | 7 | 7 | 100 |
| 2–5 | 7 | 6 | 85.714 |
| 9–6 | 7 | 5 | 71.428 |
| 0–4 | 7 | 7 | 100 |
| 1–2 | 7 | 7 | 100 |
| 5–6 | 7 | 7 | 100 |
| 9–7 | 7 | 7 | 100 |
| 3–1 | 7 | 6 | 85.714 |
| 6–0 | 7 | 7 | 100 |
| 4–8 | 7 | 6 | 85.714 |
| Overall recognition rate, RR (%) | | | 92.853 |
4 Conclusion
In this paper, we have devised a continuous SLR system for classifying signs present in a continuous sign sentence involving ME. ME detection is accomplished by employing the height of the hand trajectory as a feature. In addition to this, we have implemented a combination of spatial and temporal features for efficient recognition of the signs obtained after removing the ME frames from the input sign sequence. Further, the ability to handle different background conditions adds to the proficiency of our proposed system.
In future work, the system can be extended to detect ME in the case of double-handed signs.
Bibliography
[1] M. K. Bhuyan, D. Ghosh and P. K. Bora, Co-articulation detection in hand gestures, in: Proceedings of IEEE Region 10 Conference TENCON 2005, pp. 1–4, Melbourne, Qld., November 2005. doi: 10.1109/TENCON.2005.300947.
[2] G. Bradski and A. Kaehler, Learning OpenCV, 1st ed., O'Reilly Media, USA, 2008.
[3] Q. Chen, N. D. Georganas and E. M. Petriu, Hand gesture recognition using Haar-like features and a stochastic context-free grammar, IEEE Trans. Instrum. Meas. 57 (2008), 1562–1571. doi: 10.1109/TIM.2008.922070.
[4] A. Choudhury, A. K. Talukdar and K. K. Sarma, A conditional random field based Indian sign language recognition system under complex background, in: Proceedings of International Conference on Communication Systems and Network Technologies (CSNT), pp. 900–904, Bhopal, India, April 2014. doi: 10.1109/CSNT.2014.185.
[5] A. Choudhury, A. K. Talukdar and K. K. Sarma, A novel hand segmentation method for multiple-hand gesture recognition system under complex background, in: Proceedings of IEEE International Conference on Signal Processing and Integrated Networks (SPIN), pp. 136–140, Noida, Delhi-NCR, India, February 2014. doi: 10.1109/SPIN.2014.6776936.
[6] Z. J. Chuang, C. H. Wu and W. S. Chen, Movement epenthesis generation using NURBS-based spatial interpolation, IEEE Trans. Circuits Syst. Video Technol. 16 (2006), 1313–1323. doi: 10.1109/TCSVT.2006.883509.
[7] A. C. Evans, N. A. Thacker and J. E. W. Mayhew, Pairwise representations of shape, in: Proceedings of the 11th International Conference on Pattern Recognition (IAPR), vol. 1, pp. 133–136, The Hague, Netherlands, August 1992.
[8] D. Kelly, J. McDonald and C. Markham, Recognizing spatiotemporal gestures and movement epenthesis in sign language, in: Proceedings of the 13th International Conference on Machine Vision and Image Processing, pp. 145–150, Dublin, September 2009. doi: 10.1109/IMVIP.2009.33.
[9] E. Ormel, O. Crasborn and E. v. d. Kooij, Coarticulation of hand height in sign language of the Netherlands is affected by contact type, J. Phon. 41 (2013), 156–171. doi: 10.1016/j.wocn.2013.01.001.
[10] S. L. Phung, A. Bouzerdoum and D. Chai, Skin segmentation using color pixel classification: analysis and comparison, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005), 148–151. doi: 10.1109/TPAMI.2005.17.
[11] G. X. Ritter and J. N. Wilson, Handbook of Computer Vision Algorithms in Image Algebra, 2nd ed., CRC Press, Boca Raton, 2001.
[12] J. Segouat and A. Braffort, Toward modeling sign language coarticulation, Gesture Embodied Commun. Hum.-Comput. Interact. 5934 (2010), 325–336. doi: 10.1007/978-3-642-12553-9_29.
[13] R. Yang and S. Sarkar, Detecting coarticulation in sign language using conditional random fields, in: Proceedings of International Conference on Pattern Recognition (ICPR), vol. 2, pp. 108–112, Hong Kong, August 2006.
[14] R. Yang, S. Sarkar and B. Loeding, Handling movement epenthesis and hand segmentation ambiguities in continuous sign language recognition using nested dynamic programming, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010), 462–477. doi: 10.1109/TPAMI.2009.26.
[15] H. D. Yang, S. Sclaroff and S. W. Lee, Sign language spotting with a threshold model based on conditional random fields, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009), 1264–1277. doi: 10.1109/TPAMI.2008.172.
©2017 Walter de Gruyter GmbH, Berlin/Boston
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.