1 Introduction

Sign language is a widely used means of communication for people with hearing or speech impairments, but it is quite difficult to learn. If automatic translation of sign language could be realized, it would be of great value to both impaired and unimpaired people. Although interpretation from sign language to speech has been studied for many years, the technologies have not yet matured to the level at which they can be put into practical use. Specifically, methods that use a special sensor or device [1, 2] entail high introduction costs or require sensors to be attached to the body. The detection target is mainly limited to hand motions and hand shape, and finger motions are not included; for this reason, achieving highly accurate recognition is difficult, and the number of words that can be recognized with these methods is limited [3, 4]. Thus, the results obtainable with existing technologies are insufficient for developing a system that can be put into practical use. In addition, because the usage scenarios in previous experiments were necessarily limited, researchers have not even attempted to put these methods into practical use.

The authors have been investigating a method of sign language recognition using an optical camera and the colored gloves shown in Fig. 1 [5, 6]. Since optical cameras are built into smartphones, this configuration can be used anywhere, although the colored gloves are still required. With an optical camera and colored gloves, each finger can be distinguished by color, and the hand shape can therefore be detected correctly.

Fig. 1. Proposed colored gloves (Color figure online)

Automatic translation is our final goal, but it remains too difficult to realize with current technology. Therefore, the authors are now developing a kind of learning tool for sign language. Video data for sign language can be obtained from websites; Fig. 2 shows an image from an instructional video demonstrating a sign language motion. A learner memorizes the motion of each sign from such a video. However, it is quite difficult not only to memorize a motion but also to confirm that the memorized motion is correct, so a tool for checking the learned motion is essential. An example of the use of our current research technology is shown in Fig. 3. After memorizing a sign language motion, the learner performs the same motion in front of a web camera connected to a PC. If the PC recognizes the motion, a positive result is shown on the display. The learner uses this system as a review tool for sign language.

Fig. 2. Instructional video for sign language motions

Fig. 3. Application of current investigation

In this paper, we describe recognition schemes for sign language motions and present experimental results. The hidden Markov model (HMM), which is used for recognition in several fields such as speech recognition, was adopted in this investigation. Since this method can easily be adapted to a wide variety of feature elements of motions, excellent recognition performance can be expected.

2 Preparation of Learning Models for Recognition

The authors used DP matching [7] for sign language recognition in an earlier investigation [5, 6], in which hand motions were detected as the movement of the center of gravity of the colored region on the wrist. We decided to use the HMM recognition method here because this model can incorporate many kinds of feature elements extracted from the video images of sign language motions, which can be useful in the instructional process. Improved recognition can be expected when appropriate feature elements of sign language motions are used.

The instructional video data for sign language words included in Smart Deaf [8] are divided into 35 categories based on the usage area of each sign, with each category containing roughly 50 to 100 words. In this study, the recognition targets were selected from the words included in these categories. Demand for words relevant to medical and health issues is high, so we selected 25 words from this category (Table 1).

Table 1. Target sign language words

One of the most important tasks in a recognition investigation is composing the data set for the recognition experiment and its evaluation. It is especially important to have sufficient motion data for each sign language word in order to create the learning model for the HMM, and multiple motion data samples from multiple signers are required for the learning process. In the recognition process, the HMM is used to calculate the likelihood that the input motion data represents each possible sign language word, and the word with the highest likelihood is selected as the recognition result.
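The following is a minimal sketch of this train-and-select step. The paper itself uses HTK for HMM learning and likelihood calculation (Sect. 4); here the hmmlearn library is substituted purely for illustration, and the function names, data layout, and the choice of five Gaussian states are assumptions, not the authors' configuration.

```python
# Hypothetical sketch: one HMM per word, recognition by maximum likelihood.
# The paper uses HTK; hmmlearn is substituted here for illustration only.
import numpy as np
from hmmlearn import hmm

def train_word_models(training_data, n_states=5):
    """training_data: dict mapping word -> list of (n_frames, n_features)
    feature arrays. Returns one trained HMM per word."""
    models = {}
    for word, samples in training_data.items():
        X = np.vstack(samples)               # stack all samples frame-wise
        lengths = [len(s) for s in samples]  # per-sample frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)
        models[word] = m
    return models

def recognize(models, observation):
    """Rank words by log-likelihood of the observed feature sequence;
    the first entry is taken as the recognition result."""
    scores = {word: m.score(observation) for word, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)
```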

Clearly, it is quite important to gather correct motion data. Therefore, the authors asked for the cooperation of the person in charge of making the motion videos for Smart Deaf in composing the motion data set used in this investigation. Figure 4 shows the signer directing and checking the sign language motions of the experimenter. Motion data for the 25 words were recorded for learning and evaluation. The number of persons who provided motion data and the number of samples for each motion are indicated in Table 2. The total number of collected motion data samples was 2250 (60 × 25 + 30 × 25). The number of experimenters and the number of samples per word were determined in a pre-study as optimal for HMM learning and recognition performance. The conditions under which the motion data were captured are as follows.

Fig. 4. Scene of sign language instruction

Table 2. Collected motion data for learning and evaluation
(i) Camera image resolution of 800 × 600 pixels is selected.

(ii) Illumination is set at about 200 lx on both the camera side and the signer side.

(iii) The frame rate is 30 fps (frames per second), the maximum rate for a standard Web camera or smartphone.

(iv) The distance between the camera and the signer is one meter, as this distance is considered to match a real usage situation.

(v) The signer's clothes and the background wall are black, to facilitate detection of the colored regions of the colored gloves.

(vi) The height of the camera's field of view is set so that the signer's wrists are not detected when he/she lowers his/her arms, in order to make clear the beginning and the end of a sign language motion.

We are now attempting to compose sign language motion data as a corpus for analysis; these conditions will therefore also be applied the next time we collect motion data.

3 Feature Elements

DP matching was used for the recognition of hand motions in the previous investigation. However, hand speed and hand position are also important features of sign language motions. The following features were extracted from the motion data for each sign.

(i) Shape of hand motion

The sequence of positions, recorded as x and y coordinates, represents the shape of the hand motion. Since the size of the motion differs for each user, these data were normalized according to expression (1). Figure 5 shows the shape of a hand motion, i.e. the successive positions of the center of gravity of the wrist: the figure on the left shows the raw data before normalization, and the one on the right shows the result after normalization. In addition, linear interpolation was applied to frames in which no colored region was detected, for example because of occlusion. A short sketch of this processing is given after expression (1).

Fig. 5. Shape of hand motion before and after normalization

$$ \left. \begin{aligned} norm(x_{i}) &= \frac{x_{i} - \overline{x}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( (x_{i} - \overline{x})^{2} + (y_{i} - \overline{y})^{2} \right)}} \\ norm(y_{i}) &= \frac{y_{i} - \overline{y}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( (x_{i} - \overline{x})^{2} + (y_{i} - \overline{y})^{2} \right)}} \end{aligned} \right\} $$
(1)

Here,

i: ith frame of the motion data,

n: total number of frames of the sign language motion data.
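A minimal sketch of this normalization, assuming NumPy and a NaN convention for frames where color detection failed (the interpolation helper is an assumption, not the authors' exact scheme):

```python
# Sketch of expression (1): center the wrist trajectory on its mean and
# scale by the root-mean-square distance from the center, so that motions
# of different sizes become comparable.
import numpy as np

def fill_gaps(v):
    """Linear interpolation over frames with no color detection (NaN)."""
    v = np.asarray(v, float)
    idx = np.arange(len(v))
    ok = ~np.isnan(v)
    return np.interp(idx, idx[ok], v[ok])

def normalize_trajectory(x, y):
    """x, y: per-frame wrist center-of-gravity coordinates."""
    x, y = fill_gaps(x), fill_gaps(y)
    cx, cy = x - x.mean(), y - y.mean()        # subtract the mean position
    scale = np.sqrt(np.mean(cx**2 + cy**2))    # RMS radius, as in (1)
    return cx / scale, cy / scale
```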

(ii) Speed of hand motion

The speed of a motion sometimes carries meaning. The difference in hand position between successive video frames can be regarded as a measure of the speed of motion and is used as this feature element. Hand speed is defined by the following expression; an example of hand motion speed is shown in Fig. 6.

Fig. 6. Hand motion speed

$$ \left. \begin{aligned} dx_{i} &= x_{i} - x_{i-1} \\ dy_{i} &= y_{i} - y_{i-1} \end{aligned} \right\} $$
(2)

Here,

i: ith frame of the motion data.

(iii) Position of hand

The position of the hand also carries meaning, so it is included as a feature element, identified by its position in the image frame. Division by the frame size (x: 800, y: 600) is used for normalization, as represented by the following expression. An example of hand position is shown in Fig. 7; a sketch of the speed and position features is given after expression (3).

Fig. 7. Hand position

$$ \left. \begin{aligned} Px_{i} &= x_{i} / 800 \\ Py_{i} &= y_{i} / 600 \end{aligned} \right\} $$
(3)

Here,

i: ith frame of the motion data.
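A short sketch of the speed feature (2) and the position feature (3), assuming NumPy arrays of per-frame wrist coordinates; the frame size follows condition (i) in Sect. 2:

```python
# Sketch of expressions (2) and (3): frame-to-frame displacement as a
# speed measure, and position normalized by the 800 x 600 frame size.
import numpy as np

FRAME_W, FRAME_H = 800, 600  # camera resolution, condition (i) in Sect. 2

def speed_feature(x, y):
    """(dx_i, dy_i): displacement of the wrist between successive frames."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.diff(x), np.diff(y)

def position_feature(x, y):
    """(Px_i, Py_i): wrist position divided by the frame size."""
    return np.asarray(x, float) / FRAME_W, np.asarray(y, float) / FRAME_H
```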

(iv) Hand shape by distance

The shape of the hand itself, including finger positions and shapes, is also used as a feature of sign language words. The distance between the center of gravity of the wrist and the center of gravity of each finger-tip was used in the earlier investigation [6]. The distance is obtained by expression (4).

$$ d_{i} = \sqrt {(fx_{i} - w_{x} )^{2} + (fy_{i} - w_{y} )^{2} } $$
(4)

Here,

d_i: distance between the center of gravity of the wrist and the center of gravity of each fingertip,

i: each finger (1: thumb, 2: forefinger, 3: middle finger, 4: ring finger, 5: little finger),

(fx_i, fy_i): center of gravity of the colored region of each fingertip,

(w_x, w_y): center of gravity of the colored region of the wrist (Fig. 8).

Fig. 8. Hand shape by distance

(v) Hand shape by number of pixels

When a finger is hidden, the distance represented by expression (4) is undefined; in method (iv), the value 0 is assigned in that case. In this investigation, the number of visible pixels of each finger is also used as a feature element for comparison, with each value divided by its maximum for normalization. Either (iv) or (v) is used in the recognition experiment; a combined sketch of both is given below.
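The sketch assumes the colored regions have already been segmented; the data structures are illustrative only:

```python
# Sketch of features (iv) and (v): wrist-to-fingertip distances from
# expression (4), with 0 substituted for hidden fingers, and per-fingertip
# visible-pixel counts normalized by their maxima.
import numpy as np

def distance_feature(fingertips, wrist):
    """fingertips: five (fx, fy) centers of gravity, or None when a finger
    is hidden; wrist: (wx, wy) center of gravity of the wrist region."""
    wx, wy = wrist
    return [0.0 if f is None else float(np.hypot(f[0] - wx, f[1] - wy))
            for f in fingertips]

def pixel_feature(pixel_counts, max_counts):
    """pixel_counts: visible pixels of each fingertip color region in one
    frame; max_counts: per-finger maxima used for normalization."""
    return [c / m if m > 0 else 0.0 for c, m in zip(pixel_counts, max_counts)]
```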

4 Recognition Experiment Results

4.1 Recognition Experiment Result for Each Feature

All feature elements were used for the HMM learning process, and the HMM learning and likelihood calculations were carried out with HTK. The number of HMM states, the initial values, and the number of experimenters and samples used for HMM learning were determined in the pre-investigation. The data set shown in Table 2 was used for HMM learning and for the recognition performance evaluation experiment.

The recognition success percentage for each feature element is shown in Table 3, together with the dimensions of each feature element and the number of HMM states. The success ratios for the first through third ranks are given. The hand shape feature based on the number of pixels in each fingertip image produced the best result.

Table 3. Recognition success ratio by each feature element

4.2 Recognition Experiment Using Combined Feature

This experiment was carried out using HTK (hidden Markov model toolkit) [9]. Table 3 shows the recognition success percentage obtained with each individual feature. It is expected that recognition performance can be raised by combining features, and two methods were examined for this purpose. The first was to combine the three hand motion features (shape, speed, and position) into a single feature vector for HMM learning. The second was to sum the rankings of the recognition results obtained with each feature, where each ranking follows the order of likelihood, i.e. 1, 2, 3, ..., 25, with 25 being the total number of words to be recognized. In both methods, the ranking obtained from the hand shape feature was added when determining the total ranking, i.e. the recognition result. The results are shown in Tables 4 and 5; a sketch of the ranking-sum combination is given after Table 5.

Table 4. Recognition results by combining feature elements
Table 5. Recognition results by sum of ranking of each result
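A minimal sketch of the ranking-sum combination, assuming each feature's HMM has already produced a best-first word ranking:

```python
# Sketch of the second combination method: sum each word's rank (1..25)
# across the per-feature rankings; the smallest summed rank wins.
def combine_by_rank_sum(rankings):
    """rankings: list of word lists, each ordered best-first."""
    totals = {}
    for ranking in rankings:
        for rank, word in enumerate(ranking, start=1):
            totals[word] = totals.get(word, 0) + rank
    return sorted(totals, key=totals.get)  # best (smallest sum) first

# e.g. result = combine_by_rank_sum([motion_ranking, hand_shape_ranking])[0]
```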

Similar results were obtained with both methods. The success percentage rose to around 60%, and to over 80% when results up to the 3rd rank were included. It therefore seems possible to use this method as a review tool for confirming the correctness of a learner's sign language motions.

5 Enhancement of Recognition Performance

The recognition results shown in Tables 4 and 5 are not sufficient for practical use. This section describes the methods we used to enhance recognition performance.

5.1 Selection of Candidates

As the number of words to be recognized increases, it becomes more difficult to maintain recognition performance. The basic idea is to eliminate some of the recognition candidates before the HMM recognition process by applying suitable criteria. DP matching, which has been widely used in voice and motion recognition [10], is used as the criterion in this investigation. Experience in our past investigation [5] showed that when the top half of the recognition target words ranked by DP matching was selected, the correct word was included among the candidates. In other words, although DP matching alone cannot achieve the high recognition results obtainable with the HMM schemes, the probability that the correct result is retained is high if multiple candidates are taken. Therefore, the authors used the results of DP matching to select candidate recognition results before applying the HMM recognition scheme. DP matching calculates the distance between two vector sequences, such as the motion data shown in Fig. 9, which is obtained from the movement of the center of gravity of the colored region of the wrist; the figure shows an example of the motion representing the word "diabetes". These time-sequence motion data are used for DP matching: the more similar two motions are, the smaller the DP matching distance between them. A sketch of this distance computation is given after Fig. 9.

Fig. 9. Example of motion data for DP matching
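A minimal dynamic-programming distance between two wrist trajectories, in the spirit of DP matching; this is a generic sketch, not the exact matching conditions of [7]:

```python
# Sketch: DP (DTW-style) alignment distance between two (n, 2) coordinate
# sequences. Smaller values indicate more similar motions.
import numpy as np

def dp_distance(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.hypot(a[i - 1, 0] - b[j - 1, 0],
                            a[i - 1, 1] - b[j - 1, 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```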

The proposed recognition process sequence is shown in Fig. 10. Before initiating the recognition process using HMM, a threshold value is determined for each word to be recognized as a pre-process, and these values are used to select candidates for sign language recognition using HMM. The authors added a data set for this purpose (Table 6). Since the similarity of two motions is obtained as a distance, the distance for each word was used as the criterion for the motion data to be recognized. We obtained 1800 (60 × 30) distance values from the 60 samples for learning and the 30 samples for deciding the threshold, and the threshold values were determined from these distances. If the distance between the learning data and the input data exceeds the threshold for a word, that word is removed from the candidates for the correct recognition result, because there is no possibility that it will qualify on the basis of distance. Since the number of candidates for the recognition target can be reduced by this scheme, it enhances the recognition performance of the HMM method described in Sects. 3 and 4. A sketch of this selection step is given after Table 6.

Fig. 10. Performance enhancement by introduction of threshold

Table 6. Dataset for experiments for proposed method
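A sketch of the candidate-selection pre-process, reusing the dp_distance sketch above; deriving the threshold from the mean plus a multiple of the standard deviation follows Sect. 5.2, but the factor k and the data structures are assumptions:

```python
# Sketch: per-word thresholds from the distance statistics, then removal of
# any word whose best DP distance to the input exceeds its threshold.
import numpy as np

def word_thresholds(distance_samples, k=1.0):
    """distance_samples: dict word -> distances between learning samples and
    threshold-setting samples. k scales the standard deviation (assumed)."""
    return {w: float(np.mean(d) + k * np.std(d))
            for w, d in distance_samples.items()}

def select_candidates(input_motion, references, thresholds):
    """references: dict word -> list of learning trajectories. A word stays
    a candidate only if its smallest DP distance is within its threshold."""
    return [w for w, refs in references.items()
            if min(dp_distance(input_motion, r) for r in refs) <= thresholds[w]]
```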

5.2 Results and Evaluation

The results obtained with the proposed method are shown in Table 7. Two cases were used for evaluation: the combined feature ({shape, speed, position} together with hand shape by distance) and the individual features (shape, speed, position, and hand shape by number of pixels). Results without a threshold and with several threshold settings were compared in this experiment. The threshold values were determined from the average and the standard deviation of the distance values shown in the table. The experiment verified that recognition performance can be enhanced from 61.2% to 63.7% (from 81.6% to 85.6% when results up to the 3rd rank are included) and from 55.8% to 61.3% (from 81.1% to 84.5%) by setting appropriate threshold values.

Table 7. Recognition results by threshold

We confirmed the validity of this elimination method. While correct candidates must be retained, any word that has no possibility of being the correct answer should be removed by the proposed method. With 25 words to be recognized, the number of correct samples is 750 (= 1 × 25 × 30), and the number that should be eliminated is 18,000 (= 24 × 25 × 30). We checked the samples eliminated by the proposed method for each feature case; the results are shown in Table 8. The table on the left shows the candidates removed from among the incorrect candidates, and the table on the right shows the results for the correct samples. Almost half of the incorrect candidates, but only about 2% of the correct candidates, were eliminated, which demonstrates the validity of the proposed method.

Table 8. Evaluation of eliminated candidates

6 Conclusion

HMM learning and likelihood techniques were used to recognize sign language motions. The authors composed a sign language motion data set for model learning and performance evaluation under the supervision of a human signer. Feature elements that could be used for recognition were investigated and extracted from the motion data generated from video recordings of sign language. The success percentage of the two proposed methods was around 60% at the 1st rank, and over 80% for the 1st through 3rd ranks, for 25 words. In addition, selection of candidates for the correct answer based on the DP matching results was introduced to enhance the recognition percentage. The success percentage increased from 61.2% to 63.7% (from 81.6% to 85.6%) and from 55.8% to 61.3% (from 81.1% to 84.5%) when an appropriate threshold value was set.

It seems quite possible to use the proposed method as a learner's review tool. However, it is necessary to enhance performance by adding features, for example hand direction and the visibility of the palm. The final ranking could also be determined from a sum of reliabilities, i.e. a weighted summation of the recognition success percentage for each word. Rejection criteria for recognition results should be introduced as well, to enhance the reliability of the recognition methods. These investigations will be undertaken in future studies.