Continuous speech recognition with sparse coding
Introduction
In order to interact with the outside world, the brain must form an internal representation of it, and does so by coding information in the activities of neurons. A neuron can be viewed as a binary element: it is either silent or it fires a spike. The activity pattern of several neurons over time is called a spatio-temporal pattern or a spike train. The brain uses these patterns to code stimuli.
What are the properties of the neural code? In a binary temporal system such as the brain there are two extremes of representing data. At one extreme there is dense coding, where many neurons are highly active in coding a stimulus; with dense coding, a small number of neurons suffices to code a large set of stimuli. At the other extreme there is local coding, where very few neurons are active; the limiting case, where just one neuron is active per stimulus (1-of-N coding), requires a large number of neurons to represent all the different stimuli. The brain adopts a compromise between dense coding and local coding (Földiák and Young, 1995; Vinje and Gallant, 2000) which is called sparse coding.
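The capacity difference between these coding schemes can be made concrete with a small counting exercise (our illustration, not from the article): N binary neurons can in principle distinguish 2^N stimuli under dense coding, only N stimuli under 1-of-N local coding, and C(N, k) stimuli when exactly k neurons are active, which is why sparse coding is a workable compromise.

```python
from math import comb

def dense_capacity(n_neurons: int) -> int:
    """Number of distinct stimuli a dense binary code can represent."""
    return 2 ** n_neurons

def local_capacity(n_neurons: int) -> int:
    """Number of distinct stimuli a 1-of-N (local) code can represent."""
    return n_neurons

def sparse_capacity(n_neurons: int, k_active: int) -> int:
    """Number of codewords with exactly k_active neurons firing."""
    return comb(n_neurons, k_active)

print(dense_capacity(10))      # 1024
print(local_capacity(10))      # 10
print(sparse_capacity(10, 2))  # 45
```

Even with only 2 of 10 neurons active, the sparse code distinguishes far more stimuli than the local code, while keeping activity (and hence metabolic cost) low.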
Several studies have used sparse coding to create a feature set for sound or speech recognition tasks. These studies follow the same general approach: first a sound signal is encoded into a spike train, then the recognition task is performed by decoding the spike train. Cho and Choi (2005) perform sound classification with spikes, classifying a sound as belonging to one of ten classes, including male speech, footsteps and flute sounds. Näger et al. (2002) show that transitions between vowels can be classified by learning the delays between spikes. Kwon and Lee (2004) use independent component analysis (ICA), an algorithm that provides a sparse representation of a signal, to extract features from speech in order to do phoneme recognition. Some studies illustrate isolated digit recognition (Loiselle et al., 2005; Mercier and Séguier, 2002; Verstraeten et al., 2005), while Holmberg et al. (2005) demonstrate isolated letter recognition. All of these studies consider only isolated samples.
In this article, we show how continuous speech recognition can be performed with sparse coding. We use a temporal linear generative model (TLGM) (Olshausen, 2002; Smith and Lewicki, 2005) to encode a speech signal into a spike train, although other encoding techniques could also be used. The TLGM is an extension of the linear generative model that has successfully been used to explain neural phenomena in the visual cortex (Olshausen and Field, 1997) and in the auditory cortex (Lewicki, 2002). The spike train is decoded using the spike train model proposed by Oram et al. (1999). This study is an initial investigation into how continuous speech recognition can be done with sparse coding and into the types of problems one encounters in the process. We aim to keep the recognition process simple while still achieving reasonable results.
The recognition process we propose has three computational steps. Firstly, the raw waveform is transformed into a more natural representation, such as a spectrogram. Secondly, the modified representation is transformed into a sparse code. Finally, words are recognized by finding predetermined patterns in the sparse code. The following sections will cover each of these computational steps in turn.
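The three steps can be sketched end to end as a toy pipeline. The function names, the assumed 8 kHz sampling rate, and the crude thresholding "sparse code" are stand-ins of ours, not the article's actual algorithms:

```python
import numpy as np

def to_spectrogram(waveform: np.ndarray, seg_len: int = 160) -> np.ndarray:
    """Step 1: magnitude spectra of non-overlapping segments
    (160 samples = 20 ms at an assumed 8 kHz sampling rate)."""
    n_seg = len(waveform) // seg_len
    frames = waveform[:n_seg * seg_len].reshape(n_seg, seg_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def encode_sparse(spec: np.ndarray, threshold: float) -> np.ndarray:
    """Step 2 (stand-in): zero all but the largest coefficients,
    giving a crude sparse code of the spectrogram."""
    code = spec.copy()
    code[code < threshold] = 0.0
    return code

def find_patterns(code: np.ndarray, template: np.ndarray) -> bool:
    """Step 3 (stand-in): detect a predetermined pattern by
    correlating the code with a stored template."""
    return float((code * template).sum()) > 0.0

rng = np.random.default_rng(0)
wav = rng.standard_normal(8000)                      # 1 s of noise
spec = to_spectrogram(wav)                           # shape (50, 81)
code = encode_sparse(spec, np.percentile(spec, 90))  # ~10% of entries survive
print(find_patterns(code, np.ones_like(code)))       # True
```

The real system replaces step 2 with the TLGM encoding and step 3 with the spike train classifier described in the following sections.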
Transforming the raw waveform
We use the TIDIGITS (Leonard and Doddington, 1993) dataset of continuous spoken digits by male and female speakers. The utterances are of variable length, consisting of a variable number of random digits. There are eleven different spoken digits: the numbers "one" through "nine", plus "zero" and "oh". We choose this dataset because it is a rather simple set: it has a limited vocabulary and a simple language model (since the probability that a certain word follows any other word is
Sparse codes
We use a TLGM to encode a spectrogram into a spike train. The reconstructed signal (in our case a spectrogram) is a convolution of the code s with the dictionary φ. An element at time t in channel c of the reconstructed spectrogram is found with

\hat{x}_c(t) = \sum_{m} \sum_{\tau} s_m(\tau)\, \phi_{m,c}(t - \tau),

where m indexes the dictionary elements. For convenience, we use the integer index t to correspond to a segment in the spectrogram. The index t is related to actual time by multiplying t by 20 ms. is the number of segments in the spectrogram that is being
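The reconstruction step of such a temporal linear generative model can be sketched directly as a sum of convolutions (a minimal sketch under assumed array shapes, not the authors' implementation): a dictionary `phi` of shape `(n_atoms, n_channels, atom_len)` and a sparse code `s` of shape `(n_atoms, n_segments)` yield the reconstructed spectrogram.

```python
import numpy as np

def reconstruct(s: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Return x_hat[c, t] = sum_m sum_tau s[m, tau] * phi[m, c, t - tau]."""
    n_atoms, n_channels, atom_len = phi.shape
    n_seg = s.shape[1]
    x_hat = np.zeros((n_channels, n_seg + atom_len - 1))
    for m in range(n_atoms):
        for c in range(n_channels):
            # convolve atom m's channel c with that atom's spike code
            x_hat[c] += np.convolve(s[m], phi[m, c])
    return x_hat

# A single "spike" in the code reproduces the corresponding dictionary
# atom, scaled and shifted -- the defining property of this model.
phi = np.zeros((2, 3, 4))
phi[0] = np.arange(12, dtype=float).reshape(3, 4)
s = np.zeros((2, 5))
s[0, 1] = 2.0                    # one spike of amplitude 2 at t = 1
x = reconstruct(s, phi)          # atom 0 appears, doubled, from t = 1
```

Because each non-zero code element corresponds to one placed atom, sparsity of `s` directly translates into a spike-train-like representation.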
Results related to sparse codes
To further speed up the training process we make use of two heuristics. Firstly, we start with a bigger value of the sparsity penalty than the one we would like to use; the bigger the penalty, the fewer non-zero elements there are in the code and the more quickly it is computed. Secondly, we train the dictionary for the first 50 iterations on only a fifth of the samples in the dataset, since the dictionary initially does not have any structure that corresponds to the dataset – it will learn the basic structure even if the
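The two heuristics can be written as a training-loop skeleton. Everything here is our assumption for illustration: the linear annealing schedule, the variable names, and the omitted sparse-encoding and dictionary-update steps.

```python
import numpy as np

def train(data: np.ndarray, n_iter: int = 100,
          lam_start: float = 1.0, lam_final: float = 0.2):
    """Sketch of dictionary training with the two speed-up heuristics."""
    rng = np.random.default_rng(0)
    dictionary = rng.standard_normal((8, data.shape[1]))
    lam_hist, batch_sizes = [], []
    for i in range(n_iter):
        # Heuristic 1: begin with a larger sparsity penalty and anneal it
        # down to the target value -- a larger penalty yields fewer
        # non-zero code elements, so early iterations are cheaper.
        lam = lam_start + (lam_final - lam_start) * i / (n_iter - 1)
        # Heuristic 2: use only a fifth of the samples for the first 50
        # iterations, while the dictionary is still unstructured.
        batch = data[: len(data) // 5] if i < 50 else data
        lam_hist.append(lam)
        batch_sizes.append(len(batch))
        # ... sparse-encode `batch` with penalty `lam`, then update
        # `dictionary` accordingly (omitted in this sketch) ...
    return dictionary, lam_hist, batch_sizes
```

Both tricks only affect speed, not the fixed point of training: once the penalty reaches its target and the full dataset is in use, the updates are the standard ones.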
Classification of the spike train
This section addresses the problem of decoding a spike train, i.e., how to infer or classify the spoken words encoded by the spike train.
Training
In the training process the parameters of the models are adapted to fit the data. The models that need training are the spike train models, each having three mixture models (see Table 1), together with the further parameters associated with each model. We use expectation maximization (EM) to train the models. First the expectation step finds the most likely sequence of models and segment sizes for the entire dataset, then the maximization step adapts the model parameters to increase the
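The alternation between the two steps can be illustrated with a deliberately simplified analogue (our toy, not the article's spike train models): hard-assignment EM for two one-dimensional Gaussian "models" with unit variance, where the expectation step picks the most likely model for each sample and the maximization step refits each model's parameters on its assigned data.

```python
import numpy as np

def hard_em(x: np.ndarray, n_iter: int = 10) -> np.ndarray:
    """Toy hard-assignment EM: two unit-variance Gaussian models."""
    mu = np.array([x.min(), x.max()])   # crude initialisation
    for _ in range(n_iter):
        # E-step: most likely model for every sample (hard assignment)
        assign = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
        # M-step: refit each model's mean on the samples assigned to it
        for k in range(2):
            if np.any(assign == k):
                mu[k] = x[assign == k].mean()
    return mu

x = np.concatenate([np.full(50, -2.0), np.full(50, 3.0)])
print(hard_em(x))   # ≈ [-2., 3.]
```

The article's expectation step is the structured counterpart of the `argmin` line: instead of assigning single samples to Gaussians, it assigns stretches of the spike train to word models and segment sizes.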
Results related to spike train classification
In the EM training process the spike train models converged to a stable solution in fewer than ten iterations. The performance on the test set (1000 utterances) is a word error rate (WER) of 19%. Out of the 3222 words in the test set, there are 65 deletions, 178 insertions and 368 replacements. Table 2 shows the confusion matrix of the replacements. The majority of confusions are predictable from phonetic similarities – for example, “one” and “nine” are often confused because of their common
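The reported error rate follows directly from the listed error counts, since WER is the sum of deletions, insertions and substitutions divided by the number of reference words:

```python
# Checking the reported figures from the test set.
deletions, insertions, substitutions = 65, 178, 368
n_ref_words = 3222

wer = (deletions + insertions + substitutions) / n_ref_words
print(f"{wer:.1%}")   # 19.0%
```

Note that insertions count against the score even though they add words rather than remove them, which is why WER can in principle exceed 100%.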
Conclusion
We have shown how continuous speech recognition of a small dataset can be done with sparse coding and spike train classification. The results are promising: for the input features used here, the performance of speech recognition based on sparse coding is comparable to HMM-based speech recognition. These results are, however, not competitive with those achievable using state-of-the-art features; incorporating such features into our system would not have been feasible for computational reasons.
Acknowledgements
It is a pleasure to thank two anonymous reviewers for several useful insights on presentation, related work and conceptual matters.
References (42)
- Detection of spike patterns using pattern filtering, with applications to sleep replay, Neurocomputing (2003)
- Cho and Choi, Nonnegative features of spectro-temporal sounds for classification, Pattern Recognition Letters (2005)
- Hermansky, Should recognizers have ears?, Speech Communication (1998)
- Kwon and Lee, Phoneme recognition using ICA-based feature extraction and transformation, Signal Processing (2004)
- Näger et al., Speech recognition with spiking neurons and dynamic synapses: a model motivated by the human auditory pathway, Neurocomputing (2002)
- Olshausen and Field, Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vision Research (1997)
- Verstraeten et al., Isolated word recognition with the liquid state machine: a case study, Information Processing Letters (2005)
- How do humans process and recognize speech?, IEEE Transactions on Speech and Audio Processing (1994)
- Sparse and shift-invariant representation of music, IEEE Transactions on Audio, Speech, and Language Processing (2006)
- Cambridge University, Engineering Department, Hidden Markov Model Toolkit version 3.4 (2006)...
- Cormen et al., Introduction to Algorithms
- Working set selection using second order information for training support vector machines, Journal of Machine Learning Research
- Földiák and Young, Sparse coding in the primate cortex (1995)
- TIMIT Acoustic-Phonetic Continuous Speech Corpus
- Spotting neural spike patterns using an adversary background model, Neural Computation
- Hermansky, Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America
- Rasta processing of speech, IEEE Transactions on Speech and Audio Processing
- Temporally segmented speech, Perception and Psychophysics
- Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory
- Oram et al., A spike-train probability model, Neural Computation (1999)