American sign language (ASL) recognition based on Hough transform and neural networks

https://doi.org/10.1016/j.eswa.2005.11.018

Abstract

The work presented in this paper aims to develop a system for the automatic translation of static gestures of alphabets and signs in American sign language. To do so, we use the Hough transform and a neural network that is trained to recognize signs. Our system does not rely on gloves or visual markings to achieve the recognition task. Instead, it deals with images of bare hands, which allows the user to interact with the system in a natural way. An image is processed and converted to a feature vector that is compared with the feature vectors of a training set of signs. The extracted features are not affected by rotation, scaling or translation of the gesture within the image, which makes the system more flexible.

The system was implemented and tested using a data set of 300 samples of hand sign images; 15 images for each sign. Experiments revealed that our system was able to recognize selected ASL signs with an accuracy of 92.3%.

Introduction

Sign language is the fundamental communication method for people with hearing impairments. For an ordinary person to communicate with deaf people, a translator is usually needed to translate sign language into natural language and vice versa (International Bibliography of Sign Language, 2005, International Journal of Language & Communication Disorders, 2005).

As a primary component of many sign languages, and of American Sign Language (ASL) in particular, hand gestures and finger-spelling play an important role in deaf education and communication. Therefore, sign language can be considered a collection of gestures, movements, postures, and facial expressions corresponding to letters and words in natural languages.

A gesture is defined as a dynamic movement, such as waving hi, hello or good-bye. Simple gestures are made in two ways (Sturman and Zeltzer, 1994, Watson, 1993). The first way involves a simple or complex posture and a change in the position or orientation of the hand, such as making a pinching posture and changing the hand’s position. The second way entails moving the fingers in some way with no change in the position and orientation of the hand, for example, moving the index and middle fingers back and forth to urge someone to move closer. A complex gesture is one that includes finger, wrist or hand movement (i.e. changes in both position and orientation). There are two types of gesture interaction: communicative gestures work as a symbolic language (which is our focus in this research), while manipulative gestures provide multi-dimensional control. Moreover, we can divide gestures into static gestures (hand postures) and dynamic gestures (Cutler and Turk, 1998, Hong et al., 2000). Indeed, the motion of the hand conveys as much meaning as its posture does.

A static sign is determined by a certain configuration of the hand, while a dynamic gesture is a moving gesture determined by a sequence of hand movements and configurations. Dynamic gestures are sometimes accompanied with body and facial expressions.

The aim of sign language recognition is to provide an easy, efficient and accurate mechanism to transform sign language into text or speech. With the help of computerized digital image processing (Gonzalez, Woods, & Eddins, 2004) and neural network techniques (Haykin, 1999), a system that can recognize the flow of alphabet signs can recognize and interpret ASL words and phrases. A gesture recognition system has four main components: gesture modeling, gesture analysis, gesture recognition and gesture-based application systems.
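As a simple illustration of how a recognized flow of alphabet signs could be assembled into finger-spelled words, the following sketch (hypothetical, and not part of the system described in this paper; the function name, the frame stream and the blank marker are assumptions) collapses consecutive per-frame letter predictions and splits words at pauses:

    from itertools import groupby

    def letters_to_words(frame_predictions, blank="-"):
        """Collapse a per-frame stream of recognized letters into words.

        Consecutive duplicate predictions are merged (the same posture is
        visible in many frames), and the blank marker, emitted when no hand
        sign is detected, acts as a word separator.
        """
        collapsed = [letter for letter, _ in groupby(frame_predictions)]
        words, current = [], []
        for letter in collapsed:
            if letter == blank:
                if current:
                    words.append("".join(current))
                    current = []
            else:
                current.append(letter)
        if current:
            words.append("".join(current))
        return words

    # Example: a noisy frame stream spelling "HI", a pause, then "MOM".
    print(letters_to_words(list("HHHIII") + ["-"] + list("MMOOMM")))  # ['HI', 'MOM']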

American Sign Language (ASL) (International Bibliography of Sign Language, 2005, National Institute on Deafness and Other Communication Disorders, 2005) is a complete language that employs signs made with the hands and other gestures, including facial expressions and postures of the body. ASL also has its own grammar, which differs from that of other languages such as English and Swedish. ASL consists of approximately 6000 gestures of common words, with finger spelling used to communicate unclear words or proper nouns. Finger spelling uses one hand and 26 gestures to communicate the 26 letters of the alphabet. The 26 alphabet signs of ASL are shown in Fig. 1.

Attempts to automatically recognize sign language began to appear in the literature in the 1990s. Research on hand gestures can be classified into two categories. The first category relies on electromechanical devices that are used to measure the different gesture parameters, such as the hand’s position, angle, and the location of the fingertips. Systems that use such devices are usually called glove-based systems; for example, the work of Grimes (1983) at AT&T Bell Labs produced the “Digital Data Entry Glove”. The major problem with such systems is that they force the signer to wear cumbersome and inconvenient devices, so the way the user interacts with the system becomes complicated and less natural.

The second category exploits machine vision and image processing techniques to create visual-based hand gesture recognition systems. Visual-based gesture recognition systems are further divided into two categories. The first relies on specially designed gloves with visual markers, called “visual-based gesture with glove-markers (VBGwGM)”, that help in determining hand postures (Dorner and Hagen, 1994, Fels and Hinton, 1993, Starner, 1995). A summary of selected research efforts is listed in Table 1.

However, gloves and markers do not provide the naturalness required in human–computer interaction systems. Moreover, if colored gloves are used, the processing complexity increases.

As an alternative, the second kind of visual-based hand gesture recognition system can be called “pure visual-based gesture (PVBG)” (i.e. visual-based gesture without glove-markers). This type tries to achieve the ultimate convenience and naturalness by using images of bare hands to recognize gestures.

Among many factors, five are particularly important for the successful development of a vision-based solution for collecting hand posture and gesture data (Ong and Ranganath, 2005, Starner, 1995, Sturman and Zeltzer, 1994, Watson, 1993):

  • The placement and number of cameras used.

  • The visibility of the object (hand) to the camera for simpler extraction of hand data/features.

  • The extraction of features from the stream or streams of raw image data.

  • The ability of the recognition algorithms to operate on the extracted features.

  • The efficiency and effectiveness of the selected algorithms in providing maximum accuracy and robustness.

A number of recognition techniques are available, and in some cases they can be applied to both types of vision-based solution (i.e. VBGwGM and PVBG). In general, these recognition techniques fall into three broad categories:

  • A. Feature extraction, statistics, and models. This technique can be classified into six sub-categories:

    • 1. Template matching (e.g. research work of Darrell and Pentland, 1993, Newby, 1993, Sturman, 1992, Watson, 1993, Zimmerman et al., 1987).

    • 2. Feature extraction and analysis (e.g. research work of Rubine, 1991, Sturman, 1992, Wexelblat, 1994, Wexelblat, 1995).

    • 3. Active shape models (“smart snakes”) (e.g. research work of Heap & Samaria, 1995).

    • 4. Principal component analysis (e.g. research work of Birk et al., 1997, Martin and James, 1997, Takahashi and Kishino, 1991).

    • 5. Linear fingertip models (e.g. research work of Davis and Shah, 1993, Rangarajan and Shah, 1991).

    • 6. Causal analysis (e.g. research work of Brand & Irfan, 1995).

  • B. Learning algorithms. This technique can be classified into three sub-categories:

    • 1. Neural networks (e.g. research work of Banarse, 1993, Fels, 1994, Fukushima, 1989, Murakami and Taguchi, 1991).

    • 2. Hidden Markov models (e.g. research work of Charniak, 1993, Liang and Ouhyoung, 1998, Nam and Wohn, 1996, Starner, 1995).

    • 3. Instance-based learning (e.g. research work of Kadous, 1995; also see Aha, Dennis, & Marc, 1991).

  • C. Miscellaneous techniques. This technique can be classified into three sub-categories:

    • 1. The linguistic approach (e.g. research work of Hand, Sexton, & Mullan, 1994).

    • 2. Appearance-based motion analysis (e.g. research work of Davis & Shah, 1993).

    • 3. Spatio-temporal vector analysis (e.g. research work of Quek, 1994).

Regardless of the approach used (i.e. VBGwGM or PVBG), many researchers have tried to introduce hand gestures into the human–computer interaction field. Charayaphan and Marble (1992) investigated the use of image processing to understand American Sign Language (ASL); their system could correctly recognize 27 of the 31 ASL symbols. Fels and Hinton (1993) developed a system using a VPL DataGlove Mark II with a Polhemus tracker as input devices; in their system, a neural network was employed to classify hand gestures. Another neural-network system, developed by Banarse (1993), was vision-based and recognized hand postures using a neocognitron network, a neural network based on the spatial recognition system of the visual cortex. Heap and Samaria (1995) extended the active shape models, or “smart snakes”, technique to recognize hand postures and gestures using computer vision; in their system, they apply an active shape model and a point distribution model to track a human hand. Starner and Pentland (1995) used a view-based approach with a single camera to extract two-dimensional features as input to HMMs; the recognition rate was 91% on sentences built from a 40-sign vocabulary. Kadous (1996) demonstrated a system based on Power Gloves that recognized a set of 95 isolated Auslan signs with 80% accuracy, with an emphasis on computationally inexpensive methods. Grobel and Assan (1996) used HMMs to recognize isolated signs with 91.3% accuracy out of a 262-sign vocabulary; they extracted the features from video recordings of signers wearing colored gloves. Vogler and Metaxas (1997) used computer vision methods to extract the three-dimensional parameters of a signer’s arm motions and coupled them with HMMs to recognize continuous American sign language sentences with a vocabulary of 53 signs; they modeled context-dependent HMMs to alleviate the effects of movement epenthesis, and an accuracy of 89.9% was observed. Yoshinori, Kang-Hyun, Nobutaka, and Yoshiaki (1998) used colored gloves and showed that solid-colored gloves allow faster extraction of hand features than bare hands. Liang and Ouhyoung (1998) used HMMs for continuous recognition of Taiwan sign language with a vocabulary of between 71 and 250 signs, with a DataGlove as the input device; however, their system required that the signer perform gestures slowly so that word boundaries could be detected. Yang and Ahuja (1999) investigated dynamic gesture recognition, using skin color detection and affine transforms of the skin regions in motion to detect the motion trajectory of ASL signs; using a time-delay neural network, they recognized 40 ASL gestures with a success rate of around 96%, although their technique potentially has a high computational cost when false skin regions are detected. A local feature extraction technique was employed to detect hand shapes for sign language recognition by Imagawa, Matsuo, Taniguchi, Arita, and Igi (2000); they used an appearance-based eigen method to detect hand shapes and, using a clustering technique, generated clusters of hand shapes in an eigenspace, achieving around 93% recognition of 160 words. Bowden and Sarhadi (2002) developed a non-linear model of shape and motion for tracking finger-spelt American sign language; their approach is based on one-state transitions of the English language, which are projected into shape space for tracking and model prediction using an HMM-like approach.
Symeonidis (2000) used orientation histograms to recognize static hand gestures, specifically a subset of American Sign Language (ASL). The pattern recognition system used a transform that converts an image into a feature vector, which is then compared with the feature vectors of a training set of gestures; the system was implemented with a perceptron network. The main problem with this technique is the degree of differentiation that can be achieved, which depends on the images but also on the algorithm, and it may be improved using other image processing techniques such as edge detection. For further information and current topics on this issue, an excellent recent survey can be found in Ong and Ranganath (2005).
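The orientation-histogram idea can be made concrete with a short, illustrative sketch (the bin count, threshold and normalization below are assumptions, not taken from Symeonidis’ work): gradient orientations are accumulated into a fixed number of bins, weighted by gradient magnitude, which yields a translation-insensitive descriptor of the hand shape.

    import numpy as np
    from scipy.ndimage import sobel

    def orientation_histogram(gray_image, n_bins=19, magnitude_threshold=0.1):
        """Illustrative orientation-histogram feature vector for a grayscale image."""
        image = gray_image.astype(float)
        gx = sobel(image, axis=1)          # horizontal gradient
        gy = sobel(image, axis=0)          # vertical gradient
        magnitude = np.hypot(gx, gy)
        orientation = np.arctan2(gy, gx)   # angles in [-pi, pi]

        # Ignore weak gradients (mostly background) before building the histogram.
        mask = magnitude > magnitude_threshold * magnitude.max()
        hist, _ = np.histogram(orientation[mask], bins=n_bins,
                               range=(-np.pi, np.pi), weights=magnitude[mask])
        return hist / (hist.sum() + 1e-9)  # normalize to unit sum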


System overview

Our system is designed to visually recognize all static signs of American Sign Language (ASL): all signs of the ASL alphabet, the single-digit numbers used in ASL (e.g. 3, 5, 7), and a sample of words (e.g. love, meet, more), using bare hands. The users/signers are not required to wear any gloves or to use any devices to interact with the system. However, different signers vary in hand shape, body size, signing habits and so on, which brings about more difficulties in recognition.

Experimental results and analysis

In this section, we evaluate the performance of our recognition system by testing its ability to classify signs for both the training and testing data sets. The effect of the number of inputs to the neural network is considered. In addition, we discuss problems in the recognition of some signs due to the similarities between them.
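The effect of the number of network inputs can be studied with a simple experiment loop. The sketch below is a hedged illustration only, not the paper’s actual experimental code: the feature matrix, the candidate input sizes and the network configuration are placeholders.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    def evaluate_input_sizes(features, labels, input_sizes=(10, 20, 30)):
        """Report test accuracy of a feed-forward network for several feature-vector lengths."""
        results = {}
        for n_inputs in input_sizes:
            X = features[:, :n_inputs]     # keep only the first n_inputs features
            X_train, X_test, y_train, y_test = train_test_split(
                X, labels, test_size=0.3, stratify=labels, random_state=0)
            net = MLPClassifier(hidden_layer_sizes=(40,), max_iter=2000, random_state=0)
            net.fit(X_train, y_train)
            results[n_inputs] = net.score(X_test, y_test)
        return results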

Conclusions and future work

In this project, we developed a system for recognizing a subset of American sign language. The system has two phases: the feature extraction phase and the classification phase. The work was accomplished by training on a set of input data (feature vectors). Without the need for any gloves, an image of the sign is taken by a camera. After processing, the feature extraction phase depends on the Hough transformation, which is tolerant to gaps in feature boundary descriptions and it is
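To make the Hough-based feature extraction concrete, here is a minimal sketch under stated assumptions (the edge detector, the number of angle bins and the normalization are illustrative choices, not the authors’ exact implementation): the Hough accumulator of the hand’s edge map is summed over the distance axis, giving an orientation profile that tolerates gaps in the boundary and is normalized to reduce the effect of scale.

    import numpy as np
    from skimage.feature import canny
    from skimage.transform import hough_line

    def hough_feature_vector(gray_image, n_bins=20):
        """Fixed-length Hough-based orientation profile of a grayscale hand image."""
        edges = canny(gray_image)                         # edge map of the hand boundary
        theta = np.linspace(-np.pi / 2, np.pi / 2, n_bins, endpoint=False)
        accumulator, _, _ = hough_line(edges, theta=theta)
        profile = accumulator.sum(axis=0).astype(float)   # Hough votes per angle bin
        total = profile.sum()
        return profile / total if total > 0 else profile  # normalize so scale matters less

A vector of this kind could then be fed to the neural network classifier in the second phase.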

References (57)

  • Cutler, R., & Turk, M. (1998). View-based interpretation of real-time optical flow for gesture recognition. IEEE...
  • Darrell, T., & Pentland, A. (1993). Recognition of space–time gestures using a distributed representation. MIT Media...
  • Davis, J., & Shah, M. (1993). Gesture recognition. Technical Report, Department of Computer Science, University of...
  • Dorner, B., et al. (1994). Towards an American sign language interface. Artificial Intelligence Review.
  • Duda, R. O., et al. (1972). Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM.
  • Fels, S. (1994). Glove-TalkII: mapping hand gestures to speech using neural networks—an approach to building adaptive...
  • Fels, S., et al. (1993). GloveTalk: a neural network interface between a DataGlove and a speech synthesizer. IEEE Transactions on Neural Networks.
  • Gonzalez, R. C., et al. (2004). Digital image processing using MATLAB.
  • Grimes, G. (1983). Digital data entry glove interface device. Patent 4,414,537, AT&T Bell...
  • Grobel, K., & Assan, M. (1996). Isolated sign language recognition using hidden Markov models. In Proceedings of the...
  • Hand, C., Sexton, I., & Mullan, M. (1994). A linguistic approach to the recognition of hand gestures. In Proceedings of...
  • Haykin, S. (1999). Neural networks: a comprehensive foundation.
  • Heap, A. J., & Samaria, F. (1995). Real-time hand tracking and gesture recognition using smart snakes. In Proceedings...
  • Hong, P., et al. (2000). Gesture modeling and recognition using finite state machines. IEEE international conference on...
  • Hough, P. (1962). Method and means for recognizing complex patterns. US Patent...
  • Imagawa, K., Matsuo, H., Taniguchi, R., Arita, D., & Igi, S. (2000). Recognition of local features for camera-based...
  • International Bibliography of Sign Language (2005)....
  • International Journal of Language & Communication Disorders (2005). Available from...