Elsevier

Neurocomputing

Volume 74, Issue 8, 15 March 2011, Pages 1191-1202

Neuromorphic detection of speech dynamics

https://doi.org/10.1016/j.neucom.2010.07.023

Abstract

Speech and voice technologies are undergoing a profound revision as new paradigms are sought to overcome specific problems that classical approaches cannot completely solve. Neuromorphic Speech Processing is an emerging area in which research seeks to understand how the Human Auditory System naturally processes speech, in order to capture the basic mechanisms by which difficult tasks are solved efficiently. The present paper takes a further step in the approach of mimicking basic neural speech processing with simple neuromorphic units. Building on previous work, it shows how formant dynamics (and hence consonantal features) can be detected using a general neuromorphic unit that mimics the functionality of certain neurons found in the upper auditory pathways. Using these simple building blocks, a General Speech Processing Architecture can be synthesized as a layered structure. Results from different simulation stages are provided, together with a discussion of implementation details. Conclusions and future work describe the functionality to be covered in the next research steps.

Introduction

Neuromorphic Speech Processing is an emerging field which has attracted the attention of many researchers looking for new paradigms that help to better understand the brain processes underlying speech perception, comprehension and production [10], [21]. This study can also be extended to cognitive audio (voice and sound processing by humans in general) where aspects such as emotion or speaker recognition are concerned, or in scene analysis [20], [21], [30]. The present paper extends previous work on Neuromorphic Speech Processing [8] using a layered architecture of artificial Neuron-like Units derived from the functionality of the main types of neurons [14] found in the auditory pathways from the cochlea to the primary and secondary auditory cortex [9]. In those early stages the typology of a General Neuromorphic Computing Unit (GNCU) was defined using well-known paradigms from mask Image Processing [13]. The referred previous work [9] also showed how one of these Mask Units can be adapted to model different processes, such as Lateral Inhibition, to enhance Formant Detection, and how, using different masks, the GNCU can be configured to detect formant dynamics (ascending or descending resonance patterns appearing in certain speech sounds). The present work shows how, based on this GNCU, a general layered architecture can be defined for labelling phonemes from formant positions and dynamics, advancing one step further towards a fully Bio-inspired Speech Processing Architecture. The paper is organized as follows: a brief description of formants and formant dynamics is given in Section 2. In Section 3 the different units found in the Auditory Pathway are described according to their functionality, the structure of the GNCU is shown to mimic the different units of interest for Speech Processing, and a Neuromorphic Speech Processing Architecture based on these units is presented.
Section 4 introduces plausible neural circuits implementing specific functions and comments on results from simulations. Conclusions and future work are presented in Section 5.
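The mask-based GNCU idea can be illustrated with a minimal numerical sketch (an illustrative assumption, not the paper's actual implementation): a frequency-by-time activity map is correlated with small 2-D masks, a Mexican-hat profile along frequency standing in for lateral inhibition, and mirrored diagonal masks standing in for ascending- and descending-formant detectors. All function names, mask sizes and values below are hypothetical.

```python
import numpy as np

def ricker_mask(width, sigma):
    # 1-D Mexican-hat (Ricker) profile: excitatory centre, inhibitory
    # flanks -- the classic lateral-inhibition weighting along frequency.
    t = np.arange(width) - width // 2
    return (1.0 - (t / sigma) ** 2) * np.exp(-0.5 * (t / sigma) ** 2)

def apply_mask(spectrogram, mask):
    # Correlate a (frequency x time) activity map with a 2-D mask,
    # zero-padded so the output keeps the input's shape.
    F, T = spectrogram.shape
    mf, mt = mask.shape
    padded = np.pad(spectrogram, ((mf // 2, mf // 2), (mt // 2, mt // 2)))
    out = np.zeros((F, T))
    for f in range(F):
        for t in range(T):
            out[f, t] = np.sum(padded[f:f + mf, t:t + mt] * mask)
    return out

# Toy spectrogram: one ascending formant (frequency index rises with time).
F, T = 32, 32
spec = np.zeros((F, T))
spec[np.arange(T), np.arange(T)] = 1.0

asc_mask = np.eye(5)               # tuned to ascending trajectories
desc_mask = np.fliplr(np.eye(5))   # tuned to descending trajectories

asc_resp = apply_mask(spec, asc_mask).max()    # 5.0: full diagonal overlap
desc_resp = apply_mask(spec, desc_mask).max()  # 1.0: single crossing point
```

The ascending-tuned mask responds five times more strongly to the rising track than its mirrored counterpart, which is the kind of orientation selectivity the FM-sensitive units discussed in the text rely on.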

Section snippets

Perceiving the dynamic nature of speech

Speech can be defined as the result of a complex interaction of the sound produced either by the vocal folds (the pseudo-periodic vibration found in voiced speech) or by the turbulent flow of air through constrictions along the vocal tract (the broad-band aperiodic noise-like signal found in unvoiced speech). The articulation capabilities of the vocal and nasal tracts reduce or enhance the frequency contents of the resulting sound, which is perceived by the human auditory system as a flowing stream of
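This source-filter view can be sketched in a few lines, as a hedged toy model rather than anything from the paper itself: a pseudo-periodic pulse train (voiced source) or white noise (unvoiced source) is shaped by a two-pole resonator standing in for a single vocal-tract formant. The sampling rate, formant frequency and bandwidth are illustrative choices.

```python
import numpy as np

def resonator(x, f0, bw, fs):
    # Two-pole digital resonator: one vocal-tract formant at centre
    # frequency f0 (Hz) with bandwidth bw (Hz), sampling rate fs.
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * f0 / fs
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        # y[-1], y[-2] read the still-zero tail of y, i.e. zero initial state
        y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
    return y

fs, N = 8000, 800
voiced_src = np.zeros(N)
voiced_src[::80] = 1.0                                      # 100 Hz glottal pulse train
unvoiced_src = np.random.default_rng(0).standard_normal(N)  # turbulent noise source

voiced = resonator(voiced_src, 700.0, 100.0, fs)     # formant near 700 Hz
unvoiced = resonator(unvoiced_src, 700.0, 100.0, fs)

# The filtered pulse train concentrates its energy at the harmonic
# closest to the formant, so the strongest FFT bin lands near 700 Hz.
peak_hz = np.argmax(np.abs(np.fft.rfft(voiced))) * fs / N
```

The same resonator applied to the noise source yields the formant-coloured but aperiodic spectrum characteristic of unvoiced speech.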

Neuromorphic computing for speech processing

The structure responsible for speech perception is the auditory system, described in Fig. 2 as a chain of sub-systems comprising the peripheral auditory system (outer, middle and inner ear) and the higher auditory centres. The most important organ of the peripheral auditory system is the cochlea (inner ear), which carries out the separation in frequency and time of the different components of sound and their transduction from mechanical to neural activity [1]. Electrical impulses
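The cochlear frequency-time separation and mechanical-to-neural transduction described here can be caricatured as follows, purely as an assumption-laden sketch and not the model used in the paper: each channel is a band-pass resonance (a basilar-membrane place) followed by half-wave rectification (a crude hair-cell stand-in), and the most active channel marks the tonotopic place of a pure tone.

```python
import numpy as np

def cochlear_channel(x, f0, fs, bw=120.0):
    # One channel: band-pass resonance (basilar-membrane place tuning)
    # followed by half-wave rectification (hair-cell transduction sketch).
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * f0 / fs
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    y = np.zeros(len(x))
    for n in range(2, len(x)):
        y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
    return np.maximum(y, 0.0)   # firing rates cannot go negative

fs = 8000
t = np.arange(1600) / fs
tone = np.sin(2.0 * np.pi * 500.0 * t)          # a 500 Hz pure tone

centres = np.geomspace(200.0, 3000.0, 16)       # log-spaced tonotopic map
profile = [cochlear_channel(tone, f, fs).mean() for f in centres]
best_place = centres[int(np.argmax(profile))]   # most active channel
```

The activity profile across channels peaks at the channel whose centre frequency is closest to 500 Hz, a minimal analogue of the place code the higher centres read out.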

Simulating FM Units

From what has been presented, a clear consequence may be derived: formant structure plays a major role in the vowel and consonantal structure of speech, so formant detection, tracking and grouping into semantic units must play a crucial role in speech understanding. The simulation of these functionalities by simple neural-like units is therefore of great importance for neuromorphic speech processing. In what follows some of the capabilities of these structures will be shown, with emphasis on the
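As a toy illustration of formant detection and tracking by simple units (a hypothetical sketch under our own assumptions, not the authors' method), one can pick the peak channel frame by frame on a noisy frequency-time map, with a continuity constraint playing the role of temporal integration:

```python
import numpy as np

# Toy frequency-time activity map with one rising formant track.
F, T = 64, 40
rng = np.random.default_rng(1)
spec = 0.1 * rng.random((F, T))                  # weak background activity
true_track = np.linspace(10, 50, T).astype(int)  # rising formant channel index
spec[true_track, np.arange(T)] += 1.0            # strong track on top

# Formant tracking as frame-by-frame peak picking with a continuity
# constraint: each new estimate must lie within +/-3 channels of the last.
track = [int(np.argmax(spec[:, 0]))]
for t in range(1, T):
    lo = max(0, track[-1] - 3)
    hi = min(F, track[-1] + 4)
    track.append(lo + int(np.argmax(spec[lo:hi, t])))
```

On this map the constrained peak picker recovers the rising trajectory exactly; the continuity window is what keeps isolated background fluctuations from capturing the track.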

Discussion and conclusions

The present paper has shown that formant-based speech processing may be carried out by well-known bio-inspired computing units. Special emphasis has been placed on the description of the biophysical mechanisms credited with being responsible for formant dynamics detection, as related to the perception of certain consonantal sounds.

A special effort has been devoted to the definition of a plausible neuromorphic or bio-inspired architecture composed of multiple modules of a

Acknowledgements

This work is being funded by grants TEC2006-12887-C02-01/02 and TEC-2009-14123-C04-03 from Plan Nacional de I+D+i, Ministry of Education and Science, by grant CCG06-UPM/TIC-0028 from CAM/UPM, and by project HESPERIA (http://www.proyecto-hesperia.org) from the Programme CENIT, Centro para el Desarrollo Tecnológico Industrial, Ministry of Industry, Spain.


References (30)

  • S. Shamma

On the role of space and time in auditory processing

    Trends in Cognitive Sciences

    (2001)
  • S. Shamma

    Physiological foundations of temporal integration in the perception of speech

    Journal of Phonetics

    (2003)
  • J.B. Allen

    Nonlinear cochlear signal processing and masking in speech perception

  • J.I. Arellano et al.

    Ultrastructure of dendritic spines: correlation between synaptic and spine morphologies

    Frontiers in Neuroscience

    (2007)
  • Available from...
  • Available from...
  • Available from...
  • J.R. Deller et al.

    Discrete-Time Processing of Speech Signals

    (1993)
  • D.B. Geissler et al.

    Time-critical integration of formants for perception of communication calls in mice

Proceedings of the National Academy of Sciences

    (2002)
  • P. Gómez et al.

    Architecture for cognitive audio

    Lecture Notes on Computer Science

    (2007)
  • P. Gómez et al.

    Time–frequency representations in speech perception

    Neurocomputing

    (2009)
  • S. Greenberg et al.

    Auditory processing of speech

  • S. Greenberg et al.

    Speech processing in the auditory system: an overview

  • D.O. Hebb

    The Organization of Behavior

    (1949)
  • B. Jähne

    Digital Image Processing

    (2005)
Cited by (10)

    • Monitoring amyotrophic lateral sclerosis by biomechanical modeling of speech production

      2015, Neurocomputing
      Citation Excerpt :

      This scheme would be responsible (under the convenient simplifications) of the vertical movement of the jaw–tongue system (Δym, Δyg). The present work is based in formant-like pattern detection on LPC spectrograms [14] produced from the speech signal using a Phonation Model Inversion which separates the vocal tract transfer function from the glottal source excitation [15]. This technique, initially designed to produce reliable estimates of the glottal source, leaves a highly robust estimate of the vocal tract transfer function as a side result.

    • Simulating the phonological auditory cortex from vowel representation spaces to categories

      2013, Neurocomputing
      Citation Excerpt :

      This effect is emulated in the NSPA by the action of a frequency-domain mask WLI(r) implementing a type of Laplacian or Mexican Hat function along the frequency axis (m) which enhances formant activity and reduces the activity of neighbor channels in the input fiber spectrogram given by XCF(m) producing a set of formant coding activity fibers XLI(m). The details of implementation and performance of this process known as Lateral Inhibition Formant Profiling are already described in Ref. [8] and as such will not be further extended here. μs and σs being the means and standard deviations and Ω the frequency resolution.

    • Sigma-Lognormal Modeling of Speech

      2021, Cognitive Computation
    • Characterization of speech from amyotrophic lateral sclerosis by neuromorphic processing

      2013, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    • On computational working memory for speech analysis

      2011, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

    Pedro Gómez-Vilda was born in Burgo de Osma, Spain in 1952. He received the M.Sc. degree in Communications Engineering in 1978 and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain, in 1983. He is Professor in the Computer Science and Engineering Department, at Universidad Politécnica de Madrid since 1988. His current research interests are biomedical signal processing, speaker identification, cognitive speech recognition, and genomic signal processing. Dr. Gómez Vilda is a member of the IEEE, ISCA and EURASIP.

    J. Manuel Ferrández Vicente was born in Elche, Spain in 1969. He received the M.Sc. degree in Computer Science in 1995, and the Ph.D. degree in 1998, all of them from the Universidad Politécnica de Madrid, Spain. He is currently Associate Professor at the Department of Electronics, Computer Technology and Projects at the Universidad Politécnica de Cartagena and Head of the Electronic Design and Signal Processing Research Group at the same University. His research interests include bioinspired processing, neuromorphic engineering and cognitive speech recognition.

    Dr. Victoria Rodellar-Biarge was born in Huesca, Spain. She received the M.Sc. and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain. She is Associate Professor in the Computer Science and Engineering Department, at Universidad Politécnica de Madrid. Her current research interests are biomedical and genomic signal processing and reconfigurable logic designs for DSP. Dr. Rodellar-Biarge is a member of the IEEE.

Agustín Álvarez-Marquina was born in Madrid, Spain in 1969. He received the M.Sc. degree in Computer Science in 1994 and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain, in 1999. He has been Associate Professor in the Computer Science and Engineering Department, at Universidad Politécnica de Madrid since 2000. His current research interests are speech recognition, speaker identification and architectures for digital signal processing.

Luis Miguel Mazaira Fernández was born in Madrid, Spain in 1978. He received the M.Sc. degree in Computer Engineering in 2003, and the Certificate of Advanced Studies (DEA) in 2005 from the Universidad Politécnica de Madrid. He has been Assistant Professor in the Computer Science and Engineering Department, at Universidad Politécnica de Madrid since 2005 and is currently pursuing his Ph.D. degree with the GIAPSI research group. His current research interests are biomedical signal processing, speaker identification, cognitive speech recognition, and pattern recognition.

    Rafael Martínez-Olalla was born in Madrid, Spain in 1969. He received the M.Sc. degree in Communications Engineering in 1995 and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain, in 2002. He is Associate Professor in the Computer Science and Engineering Department, at Universidad Politécnica de Madrid since 2007. His current research interests are biomedical signal processing, speaker identification, and genomic signal processing.

Cristina Muñoz-Mulas was born in Madrid, Spain in 1982. She received the M.Sc. degree in Computer Science in 2006. She has been a Ph.D. student in the Computer Science and Engineering Department at Universidad Politécnica de Madrid since 2007. Her research topic is speaker identification and gender discrimination by speech signal processing.
