Continuous speech recognition with sparse coding

https://doi.org/10.1016/j.csl.2008.06.002

Abstract

Sparse coding is an efficient way of coding information. In a sparse code most of the code elements are zero; very few are active. Sparse codes are intended to correspond to the spike trains with which biological neurons communicate. In this article, we show how sparse codes can be used to perform continuous speech recognition. We use the TIDIGITS dataset to illustrate the process. First, a waveform is transformed into a spectrogram, and a sparse code for the spectrogram is found by means of a linear generative model. The spike train is classified by making use of a spike train model and dynamic programming. It is computationally expensive to find a sparse code. We use an iterative subset selection algorithm with quadratic programming for this process. This algorithm finds a sparse code in reasonable time if the input is limited to a fairly coarse spectral resolution. At this resolution, our system achieves a word error rate of 19%, whereas a system based on Hidden Markov Models achieves a word error rate of 15% at the same resolution.

Introduction

The brain needs to form an internal representation of the outside world in order to interact with the world, and does so by representing or coding information in the activities of neurons. A neuron can be viewed as a binary element; it is either silent or it fires a spike. The activity pattern of several neurons over time is called a spatio-temporal pattern or a spike train. The brain uses these patterns to code stimuli.

What are the properties of the neural code? In a binary temporal system such as the brain there are two extremes of representing data: at one end there is dense coding, where many neurons are very active in coding a stimulus. With dense coding, a small number of neurons is sufficient to code a large set of stimuli. At the other extreme there is local coding, where very few neurons are active. The extreme case, in which just one neuron is active per stimulus (1-of-N coding), would require a large number of neurons to represent all the different stimuli. The brain adopts a compromise between dense coding and local coding (Földiák and Young, 1995, Vinje and Gallant, 2000), which is called sparse coding.
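As a loose numerical illustration of these three regimes (not drawn from the article), the toy Python snippet below contrasts a dense, a sparse and a local (1-of-N) binary code for a single stimulus across a small hypothetical population of neurons.

```python
import numpy as np

# Toy illustration of the three coding regimes for a hypothetical
# population of 8 binary neurons coding a single stimulus.
dense = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # most neurons active
sparse = np.array([0, 1, 0, 0, 0, 0, 1, 0])  # only a few neurons active
local = np.array([0, 0, 0, 1, 0, 0, 0, 0])   # exactly one neuron active (1-of-N)

for name, code in [("dense", dense), ("sparse", sparse), ("local", local)]:
    # The fraction of active neurons is what distinguishes the regimes.
    print(f"{name:>6}: fraction active = {code.mean():.2f}")
```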

Several studies have used sparse coding to create a feature set for sound or speech recognition tasks. The studies follow the same general approach: first a sound signal is encoded into a spike train; then the recognition task is performed by decoding the spike train. Cho and Choi (2005) perform sound classification with spikes. They classify a sound as belonging to one of ten classes, which include male speech, footsteps and flute sounds. Näger et al. (2002) show that transitions between vowels can be classified by learning the delays between spikes. Kwon and Lee (2004) use independent component analysis (ICA) to extract features from speech in order to do phoneme recognition. ICA is an algorithm that provides a sparse representation of a signal. Some studies illustrate isolated digit recognition (Loiselle et al., 2005, Mercier and Séguier, 2002, Verstraeten et al., 2005) while Holmberg et al. (2005) demonstrate isolated letter recognition. All of these studies consider only isolated samples.

In this article, we show how continuous speech recognition can be performed with sparse coding. We use a temporal linear generative model (TLGM) (Olshausen, 2002, Smith and Lewicki, 2005) to encode a speech signal into a spike train, although other encoding techniques could also be used. The TLGM is an extension of the linear generative model that has successfully been used to explain neural phenomena in the visual cortex (Olshausen and Field, 1997) and in the auditory cortex (Lewicki, 2002). The spike train is decoded by using the spike train model proposed by Oram et al. (1999). This study is an initial investigation into how continuous speech recognition can be done with sparse coding, and into the types of problems one encounters in the process. We would like to keep the recognition process simple, while still achieving reasonable results.

The recognition process we propose has three computational steps. Firstly, the raw waveform is transformed into a more natural representation, such as a spectrogram. Secondly, the modified representation is transformed into a sparse code. Finally, words are recognized by finding predetermined patterns in the sparse code. The following sections will cover each of these computational steps in turn.
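The pipeline can be summarized in the following Python sketch. The three helper functions are placeholders for the procedures described in the sections below, not an implementation of the authors' actual code.

```python
def compute_spectrogram(waveform):
    """Step 1 (placeholder): raw waveform -> coarse spectrogram."""
    raise NotImplementedError  # sketched in 'Transforming the raw waveform'

def sparse_encode(spectrogram, dictionary):
    """Step 2 (placeholder): spectrogram -> sparse code (spike train)."""
    raise NotImplementedError  # sketched in 'Sparse codes'

def decode_spike_train(spike_train, models):
    """Step 3 (placeholder): spike train -> words via dynamic programming."""
    raise NotImplementedError  # sketched in 'Classification of the spike train'

def recognize(waveform, dictionary, models):
    """End-to-end pipeline mirroring the three computational steps above."""
    spectrogram = compute_spectrogram(waveform)
    spike_train = sparse_encode(spectrogram, dictionary)
    return decode_spike_train(spike_train, models)
```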

Section snippets

Transforming the raw waveform

We use the TIDIGITS (Leonard and Doddington, 1993) dataset of continuous spoken digits by male and female speakers. The utterances are of variable length, consisting of a variable number of random digits. There are 11 different spoken digits, one for each number from “one” to “nine”, a “zero” and an “oh”. We choose this dataset because it is a rather simple set: it has a limited vocabulary and simple language model (since the probability that a certain word follows any other word is …
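As an illustration of this first step, the sketch below computes a coarse spectrogram with SciPy. The 20 ms segment length matches the next section; the number of spectral channels and the bin-pooling scheme are assumptions made for this example, not the article's exact front end.

```python
import numpy as np
from scipy import signal

def coarse_spectrogram(waveform, fs, n_channels=16, frame_ms=20):
    """Waveform -> coarse log-magnitude spectrogram (channels x segments)."""
    nperseg = int(fs * frame_ms / 1000)          # 20 ms non-overlapping segments
    _, _, S = signal.spectrogram(waveform, fs=fs, nperseg=nperseg,
                                 noverlap=0, mode='magnitude')
    # Pool the linear-frequency bins into a small number of channels to
    # obtain a fairly coarse spectral resolution.
    edges = np.linspace(0, S.shape[0], n_channels + 1, dtype=int)
    coarse = np.stack([S[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
    return np.log(coarse + 1e-10)
```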

Sparse codes

We use a TLGM to encode a spectrogram into a spike train. The reconstructed signal (in our case a spectrogram) $\hat{x}$ is a convolution of the code $a$ with the dictionary $\Phi$. An element at time $t$ in channel $c$ of the reconstructed spectrogram is found with

$$\hat{x}_{c,t} = \sum_{d=1}^{N_d} \sum_{\tau=1}^{N_t} \Phi^{\{d\}}_{c,\Delta t}\, a_{d,\tau}.$$

For convenience, we use the integer index $t$ to correspond to a segment in the spectrogram. The index $t$ is related to actual time by multiplying $t$ by 20 ms. $N_t$ is the number of segments in the spectrogram that is being …
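A minimal NumPy sketch of this reconstruction is given below. It assumes $\Delta t = t - \tau$, i.e. each non-zero coefficient places a scaled copy of its dictionary element starting at its spike time; this is what the convolution implies but is not spelled out in this excerpt, and the array shapes are illustrative.

```python
import numpy as np

def reconstruct(code, dictionary):
    """Convolutional reconstruction x_hat of the TLGM.

    code:       (N_d, N_t) sparse code a, mostly zero
    dictionary: (N_d, N_c, L) kernels Phi^{d} over N_c channels and L time lags
    Assumes Delta_t = t - tau: each non-zero a_{d,tau} adds a scaled copy of
    dictionary element d starting at segment tau.
    """
    n_d, n_t = code.shape
    _, n_c, L = dictionary.shape
    x_hat = np.zeros((n_c, n_t + L - 1))
    for d, tau in zip(*np.nonzero(code)):   # only the few active coefficients
        x_hat[:, tau:tau + L] += code[d, tau] * dictionary[d]
    return x_hat[:, :n_t]                   # crop to the spectrogram length
```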

Results related to sparse codes

To further speed up the training process we make use of two heuristics. Firstly, we start with a bigger value of λ than the one we would like to use; the bigger λ, the fewer non-zero elements are in the code and the more quickly it is computed. Secondly, we train the dictionary for the first 50 iterations only on a fifth of the samples in the dataset, since the dictionary initially does not have any structure that corresponds to the dataset – it will learn the basic structure even if the …
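Schematically, a training loop with these two heuristics could look as follows. The sparse coder and dictionary update are passed in as placeholders, and the specific λ schedule and iteration counts are illustrative assumptions rather than the values used in the article.

```python
def train_dictionary(samples, dictionary, sparse_encode, update_dictionary,
                     lam_start=1.0, lam_final=0.2, n_iterations=200):
    """Schematic dictionary training loop with the two speed-up heuristics."""
    for it in range(n_iterations):
        # Heuristic 1: start with a larger lambda than the target value and
        # anneal it down; larger lambda -> sparser codes -> faster coding step.
        lam = max(lam_final, lam_start * 0.95 ** it)
        # Heuristic 2: for the first 50 iterations, use only a fifth of the
        # samples; the unstructured initial dictionary learns its basic
        # structure even from this subset.
        subset = samples[::5] if it < 50 else samples
        codes = [sparse_encode(x, dictionary, lam) for x in subset]
        dictionary = update_dictionary(dictionary, subset, codes)
    return dictionary
```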

Classification of the spike train

This section addresses the problem of decoding a spike train, i.e., how to infer or classify the spoken words encoded by the spike train.

Training

In the training process the parameters of the models are adapted to fit the data. The models that need training are: the spike train models, with each model having three mixture models (see Table 1); the $p_{\Delta t}$ associated with each model; and $P(m)$. We use expectation maximization (EM) to train the models. First, the expectation step finds the most likely sequence of models $\bar{m}$ and segment sizes $\overline{\Delta t}$ for the entire dataset; then the maximization step adapts the model parameters to increase the …
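The EM loop can be sketched as follows. Here `best_segmentation` and `reestimate` are hypothetical placeholders for the dynamic-programming alignment and the parameter updates described in the text; they are not functions from the article.

```python
def train_em(dataset, models, best_segmentation, reestimate, n_iterations=10):
    """Schematic EM loop for the spike train models.

    best_segmentation: placeholder returning the most likely model sequence
        m_bar and segment sizes delta_t_bar for one utterance (E-step).
    reestimate: placeholder updating the mixture parameters, the duration
        distributions p_delta_t and the priors P(m) (M-step).
    """
    for _ in range(n_iterations):   # convergence is reported in < 10 iterations
        alignments = []
        for spike_train in dataset:
            m_bar, delta_t_bar = best_segmentation(spike_train, models)
            alignments.append((spike_train, m_bar, delta_t_bar))
        models = reestimate(models, alignments)
    return models
```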

Results related to spike train classification

In the EM training process the spike train models converged to a stable solution in fewer than ten iterations. The performance on the test set (1000 utterances) is a word error rate (WER) of 19%. Out of the 3222 words in the test set, there are 65 deletions, 178 insertions and 368 replacements. Table 2 shows the confusion matrix of the replacements. The majority of confusions are predictable from phonetic similarities – for example, “one” and “nine” are often confused because of their common …
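For reference, these counts reproduce the reported figure under the standard word error rate definition (substitutions S, deletions D and insertions I over N reference words):

$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{368 + 65 + 178}{3222} = \frac{611}{3222} \approx 0.19$$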

Conclusion

We have shown how continuous speech recognition of a small dataset can be done with sparse coding and spike train classification. The results are promising: for the input features used here, the performance of speech recognition based on sparse coding is comparable to HMM-based speech recognition. These results are, however, not competitive with those achievable using state-of-the-art features; incorporating such features into our system would not have been feasible for computational reasons.

Acknowledgements

It is a pleasure to thank two anonymous reviewers for several useful insights on presentation, related work and conceptual matters.

References (42)

  • T.H. Cormen et al., Introduction to Algorithms (2001).
  • R.E. Fan et al., Working set selection using second order information for training support vector machines, Journal of Machine Learning Research (2005).
  • P. Földiák et al., Sparse coding in the primate cortex (1995).
  • J.S. Garofolo et al., TIMIT Acoustic-Phonetic Continuous Speech Corpus (1993).
  • I. Gat et al., Spotting neural spike patterns using an adversary background model, Neural Computation (2001).
  • H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America (1990).
  • H. Hermansky et al., RASTA processing of speech, IEEE Transactions on Speech and Audio Processing (1994).
  • M. Holmberg et al., Automatic speech recognition with neural spike trains, In: ... (2005).
  • A.W.F. Huggins, Temporally segmented speech, Perception and Psychophysics (1975).
  • B.H. Juang et al., Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory (1986).
  • R.E. Kass et al., A spike-train probability model, Neural Computation (2001).