Speech Communication, Volume 41, Issues 2–3, October 2003, Pages 349–367

The voicing feature for stop consonants: recognition experiments with continuously spoken alphabets

https://doi.org/10.1016/S0167-6393(02)00151-6

Abstract

We consider the possibility of incorporating phonetic features into a statistically based speech recognizer. We develop a two pass strategy for recognition with a hidden Markov model based first pass followed by a second pass that performs an alternative analysis using class-specific features.

For the voiced/voiceless distinction on stops for an alphabet recognition task, we show that a perceptually and linguistically motivated acoustic feature exists: the voice onset time (VOT). We perform acoustic–phonetic analyses demonstrating that this feature provides superior separability to the traditional spectral features. Further, the VOT can be automatically extracted from the speech signal. We describe several such algorithms that can be incorporated into our two pass recognition strategy to reduce error rates by as much as 53% over a baseline HMM recognition system.

Introduction

There is little doubt that currently the most successful paradigm for speech recognition is a statistical approach typically using variants of a hidden Markov model (HMM) framework (Rabiner and Juang, 1993). While this approach has led to significant advances, many problems still remain. In this paper we investigate the possibility of using linguistically motivated features to correct some of the errors of current HMM based recognizers.

The notion of distinctive features (Jakobson et al., 1952) has long been regarded as a possible basis for automatic speech recognition. Unfortunately, few systems based on such principles have truly been implemented. Additionally, speech recognition research in this tradition has typically been conducted with hand-crafted rule-based approaches with relatively little statistical content to smooth over the inherent variability of the speech signal. At the same time, work in the mainstream statistical (primarily HMM based) approaches typically uses a spectral sequence as features and ignores the possibility of linguistically motivated features. In our view, maximal benefits will emerge from a healthy union of statistical learning techniques with such feature systems. Our overall goal is to move towards such a feature based system. To demonstrate the feasibility of such a feature based approach, one must show that, at least for one particular feature, a viable implementation exists. Specifically, one needs to ask the following questions: What are the acoustic correlates of a particular distinctive feature? Do such acoustic correlates provide better separability than traditional spectral features (or transformations thereof, such as cepstra)? Can such correlates be reliably extracted in an automatic speech recognition system? This paper provides some answers to these questions on a limited task, i.e., alphabet recognition. As a starting point we examine the distinctive feature [voice] for stop consonants. As we shall see from an error analysis later, several of the errors in alphabet recognition occur due to a misclassification of this feature. To place our results in an appropriate context, it is worthwhile to emphasize some aspects of the work presented in this paper.

1. This paper should be viewed as a demonstration that, at least for one particular case, i.e., the voiced/unvoiced distinction for stop consonants in spoken letters, a linguistically and perceptually motivated acoustic feature exists, can be automatically extracted, and can be used for recognition with performance superior to state-of-the-art HMM systems. Few such demonstrations exist. For example, the experiments conducted by Fanty and Cole (1990), Hasegawa-Johnson (1996) and Djezzar and Haton (1995) suggest that linguistic features provide reasonable performance, but they have not been compared to state-of-the-art HMM based systems. (The promising results in (Bitar and Espy-Wilson, 1996) are perhaps one notable exception; see the section on prior work for further discussion of the issues and results of feature-based recognition.) At a time when many researchers are pessimistic about the future of acoustic–phonetic approaches, it is important to stress some of the positive results: the promising results on the voicing feature described in this paper suggest that it is worthwhile to investigate further the kinds of ideas discussed in (Stevens, 1995; Zue, 1985), where accounts of acoustic correlates of other phonetic distinctions have been presented.

2. We propose a two pass strategy for recognition. While the general idea of two pass strategies has been employed before in a number of different contexts, the details differ from system to system. In our case, we use a standard HMM based system as a first pass to obtain an initial, tentative segmentation and classification of the speech signal. In the second pass, we employ a different analysis system that uses class-specific, heterogeneous acoustic–phonetic features to alter the segmentation and classification in a completely automatic manner. Since perceptual cues for recognition are presumably distributed in a non-uniform manner in the time-frequency plane, the second analysis system allows us to explore such alternative cues that have either perceptual or acoustic–phonetic status, thereby engaging the traditions of research in speech perception and acoustic–phonetics. The second pass recognizer is also statistically based: it builds probabilistic models on the new heterogeneous features. (A schematic sketch of this two pass flow is given after this list.)

It is also worthwhile to reflect on the heterogeneous nature of the second pass system. In this paper, we suggest that the effective classification of stops requires locating the burst and the onset of voicing. This requires, as we shall see, analysis with a 5 ms window moved every 1 ms to obtain sufficient temporal resolution. Such temporal resolution is smeared out when a standard cepstral front-end with a 30 ms window is moved every 10 ms. The fine-grained analysis we perform here is relevant only for stops and need not be performed for other sounds. A single recognition paradigm based on HMMs is constrained to use a fixed representation for all sound classes. Thus, while performance on stops using HMM based systems alone might potentially be increased by moving to a smaller analysis window, the performance on all other sounds would be adversely affected. One of the points we wish to stress in this paper is the need to move to heterogeneous, class-specific measurements and away from recognition paradigms that are based on a single (“one size fits all”) representation of the speech signal. The two pass approach explored here is only one way of achieving such a goal.

3. We perform an analysis of the errors on a restricted task of recognizing continuously spoken alphabets using a state-of-the-art HMM system. We focus in particular on errors related to stops in spoken alphabets (“P”, “T”, “B”, “D”, “K”) and their confusions with each other along the voicing dimension, as well as their confusions with vocalic alphabets (“A” and “E”). These are highly confusable sounds and require one to make fine phonetic distinctions that humans seem to make significantly better than current recognition systems do. We propose that the voice onset time (VOT), an acoustically distinct and perceptually real quantity, can be used as a criterion for discriminating voiced from unvoiced stops (in pre-stressed, syllable initial position). This is a primarily temporal cue that is poorly modeled by current recognition systems. Furthermore, we demonstrate in an acoustic study that the separability of voiced stops from unvoiced stops is greater in this temporal space than in spectral spaces. In our two pass approach, whenever the first pass HMM system classifies a segment as a stop, we invoke the second pass, automatically extract an estimate of the VOT, and reclassify. Most statistical classifiers that depend on spectral distinguishability of the sound classes perform poorly on tasks such as the one considered in this paper, where the acoustic correlate seems to be primarily a temporal one.

4. We describe several automatic VOT estimation algorithms in some detail and compare their performance against each other and to baselines of standard and durational HMM models. The most effective of the VOT estimation algorithms is able to reduce some crucial confusions along the voicing dimension by 53%. In a three way classification into the categories of voiced stops, unvoiced stops, and vowels, the algorithm reduces confusions by 35%.
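To make the two pass flow of points 2 and 3 concrete, the sketch below shows how a first pass recognizer and a VOT-based second pass could be composed. This is a minimal illustration in Python, not the system evaluated in this paper: all function names are hypothetical, and the fixed 25 ms short-lag/long-lag boundary stands in for the statistical decision rules actually used.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]        # (start_sec, end_sec, letter)

STOP_LETTERS = {"P", "T", "K", "B", "D"}  # stop-bearing spoken letters

# Minimal-pair flips along the voicing dimension; "K" has no voiced
# counterpart among the letters considered, so it is left unchanged.
FLIP = {"P": "B", "B": "P", "T": "D", "D": "T"}

def reclassify_by_vot(letter: str, vot_ms: float,
                      boundary_ms: float = 25.0) -> str:
    """Flip the first pass voicing decision when the measured VOT
    disagrees with it. The 25 ms boundary is an illustrative constant
    only, not the paper's learned decision rule."""
    hypothesis_voiced = letter in {"B", "D"}
    measured_voiced = vot_ms < boundary_ms
    if letter in FLIP and hypothesis_voiced != measured_voiced:
        return FLIP[letter]
    return letter

def two_pass_recognize(signal, fs: int,
                       first_pass: Callable[..., List[Segment]],
                       estimate_vot: Callable[..., float]) -> List[Segment]:
    """Pass 1: HMM segmentation/classification of the whole utterance.
    Pass 2: class-specific re-analysis, invoked only for stop segments."""
    corrected = []
    for start, end, letter in first_pass(signal, fs):
        if letter in STOP_LETTERS:
            vot_ms = estimate_vot(signal, fs, start, end)
            letter = reclassify_by_vot(letter, vot_ms)
        corrected.append((start, end, letter))
    return corrected
```

The design property worth noting is that the expensive fine-grained analysis runs only on segments the first pass hypothesizes as stops; all other segments pass through untouched.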

The general idea of using linguistic features has been around for a while; in fact, the early DARPA effort attempted to use it in a mostly rule-based framework. Some other examples of feature based approaches to other sound classes include Meng et al. (1991, vowels), Eide et al. (1993, broad classes), Espy-Wilson (1994, semivowels), and Bitar and Espy-Wilson (1996, broad classes). Of these, Bitar and Espy-Wilson is the only one to show significant improvement over existing schemes. Both Eide et al. and Bitar and Espy-Wilson force the acoustic–phonetic features into a frame-based HMM framework. An attempt to incorporate notions of phonological features has also been made by Deng (see Deng, 1997 for an overview). He has emphasized the issue of structuring HMMs to capture phonological interactions and constraints. However, no particular attention has been paid to the acoustic correlates of these features.

Stops have attracted considerable attention as a testbed for recognition paradigms. Lori Lamel worked in a spectrogram reading tradition with elaborate rules for the classification of stops; the acoustic features, however, were for the most part not derived automatically from the signal. Fanty and Cole (1990), Djezzar and Haton (1995), and Hasegawa-Johnson (1996) have considered acoustic–phonetic features for stops, though their results have never been compared to HMM based systems in a systematic way.

It has been recognized in the past, through the work of Lisker (1975), Klatt (1975), and others, that the durational cue of VOT provides good separation and is psychophysically real. This work, however, did not address how such a measure can be automatically extracted from the signal, nor whether it provides superior separability to standard spectral measures. In the HMM tradition, VOT has not been considered as far as we know. In the acoustic–phonetic tradition, some recognition results using the VOT exist, mostly obtained from hand-segmented speech; no account exists of how they compare with the performance of current HMM systems.

Section snippets

Error analyses on alphabet recognition

To ground our investigations in an ASR problem, we consider a continuously spoken alphabet recognition task. This consists of recognizing spelled letters of New Jersey town names (≈1200 town names) continuously spoken by 100 different speakers. Each speaker produced 50 utterances, resulting in 5000 utterances in all, collected over a telephone channel. The conjunction of a continuous ASR task and degradations due to the telephone channel makes this database a particularly challenging one to…

Recognition experiments and strategy

Having demonstrated that the VOT provides better separability than the usual spectral representations (or transformations thereof, such as cepstra), we turn to the important question that remains: can one reliably extract it from the signal in an automatic manner and use it for superior recognition performance?

We describe below one possible way in which this can be done, by using an automatic VOT estimation algorithm as a second pass to correct errors made by a baseline first pass system that uses an HMM based recognizer. In…
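As a rough illustration of what such an estimator might look like, here is a minimal sketch under stated assumptions; it is not one of the algorithms evaluated in this paper. It locates the burst as the largest rise in short-time log-energy computed with a 5 ms window every 1 ms (the fine-grained resolution motivated in the introduction), and the voicing onset as the first subsequent point where a longer window shows strong periodicity at plausible pitch lags.

```python
import numpy as np

def estimate_vot(x: np.ndarray, fs: int) -> float:
    """Crude VOT estimate (burst onset to voicing onset) in ms.

    A sketch only, not the paper's estimation algorithms: the burst is
    the largest jump in 5 ms short-time log-energy (1 ms hop); voicing
    onset is the first later point where a 25 ms window shows strong
    periodicity in the 80-300 Hz pitch-lag range.
    """
    x = np.asarray(x, dtype=float)
    hop = max(1, int(0.001 * fs))           # 1 ms hop
    win_e = int(0.005 * fs)                 # 5 ms energy window
    win_v = int(0.025 * fs)                 # 25 ms voicing window

    # Burst: frame index with the largest rise in log-energy.
    n_e = 1 + (len(x) - win_e) // hop
    log_e = np.array([
        10.0 * np.log10(np.sum(x[i * hop: i * hop + win_e] ** 2) + 1e-10)
        for i in range(n_e)
    ])
    burst = int(np.argmax(np.diff(log_e))) + 1

    # Voicing onset: first frame after the burst whose normalized
    # autocorrelation has a strong peak at plausible pitch lags.
    lag_lo, lag_hi = int(fs / 300), int(fs / 80)
    n_v = 1 + (len(x) - win_v) // hop
    for i in range(burst + 1, n_v):
        frame = x[i * hop: i * hop + win_v]
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[win_v - 1:]
        if ac[0] > 0 and ac[lag_lo:lag_hi].max() / ac[0] > 0.5:
            return (i - burst) * hop * 1000.0 / fs
    return float("nan")                     # no voicing onset found
```

In the two pass setting, the burst search would of course be restricted to the segment hypothesized as a stop by the first pass; the resulting estimate then drives the reclassification along the voicing dimension.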

Conclusions and future directions

This is a first step towards incorporating distinctive features as an error correcting device to discriminate between confusable pairs in a statistical recognizer. From an examination of the voicing feature for stops, we conclude that the VOT, a temporal cue for discriminating between voiced and unvoiced stops in syllable initial position, provides superior separability to spectral cues. Furthermore, it can be extracted automatically from the signal and improves current recognition scores…

References

  • Eimas, P.D., et al., 1973. Selective adaptation of linguistic feature detectors. Cognitive Psychology.
  • Abramson, A.S., et al. Discriminability along the voicing continuum: cross-language tests.
  • Bitar, N., et al. A knowledge-based signal representation for speech recognition.
  • Deng, L. A dynamic feature based approach to speech modeling and recognition.
  • Djezzar, L., et al., 1995. Exploiting acoustic–phonetic knowledge and neural networks for stop recognition. Eurospeech.
  • Eide, E., et al. A linguistic feature representation of the speech waveform.
  • Espy-Wilson, C., 1994. A feature based approach to speech recognition. Journal of the Acoustical Society of America.
  • Fanty, M., et al., 1990. Speaker-Independent English Alphabet Recognition: Experiments with the E-set.
  • Glass, J., et al., 1996. A Probabilistic Framework for Feature-Based Speech Recognition.
  • Hasegawa-Johnson, M., 1996. Formant and burst spectral measurements with quantitative error models for speech sound...
  • Jakobson, R., et al., 1952. Preliminaries to Speech Analysis: The Distinctive Features and their Correlates.
  • Klatt, D.H., 1975. Voice onset time, frication and aspiration in word-initial consonant clusters. Journal of Speech and Hearing Research.
  • Kuhl, P.K., et al., 1977. Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli. Journal of the Acoustical Society of America.
  • Lee, C.-H., et al. A study on task independent subword selection and modeling for speech recognition.
