A phonetically neutral model of the low-level audio-visual interaction

https://doi.org/10.1016/j.specom.2004.10.003

Abstract

The improvement of detectability of visible speech cues found by Grant and Seitz [2000. The use of visible speech cues for improving auditory detection of spoken sentences. JASA 108, 1197–1208] has been related to the degree of correlation between acoustic envelopes and visible movements. This suggests that audio and visual signals could interact early during the audio-visual perceptual process on the basis of audio envelope cues. On the other hand, acoustic-visual correlations were previously reported by Yehia et al. [1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23–43]. Taking into account these two main facts, the problem of extraction of the redundant audio-visual components is revisited: the video parametrization of natural images and three types of audio parameters are tested together, leading to new and realistic applications in video synthesis and audio-visual speech enhancement. Consistent with Grant and Seitz’s prediction, the 4-subband envelope energy features are found to be optimal for encoding the redundant components available for the enhancement task. The proposed computational model of audio-visual interaction is based on the product, in the audio pathway, between the time-aligned audio envelopes and video-predicted envelopes. This interaction scheme is shown to be phonetically neutral, so that it will not bias phonetic identification. The low-level stage which is described is compatible with a late integration process, which may be used as a potential front-end for speech recognition applications.

Introduction

The perception of speech is greatly improved in the presence of visual information, i.e., the mouth movements of the talking face, and a gain of intelligibility of about 10–15 dB is classically reported. In a seminal paper, Summerfield (1987) analyzed the origin of this gain and the potential roles of the visual cues. First, these cues can provide complementary information about the place of articulation, which is often degraded in the audio signal but is the easiest feature to lip-read. Much of the literature has focused on this property, and the other possible factors have attracted little interest. Using a paradigm of speech detection in loud noise, Grant and Seitz (2000) assessed evidence for another mechanism evoked by Summerfield (1987), one based on the temporal coherence between lip movements and speech envelope cues. In the audio-visual (AV) condition, a release from masking of about 1.6 dB was found relative to the audio-only (AO) condition. These results were confirmed by Kim and Davis (2001) with a larger dataset. This facilitation of the detection of speech segments near threshold was attributed to the linear correlation between the mouth aperture and the energy envelope of speech, both overall and decomposed into subbands. A better correlation was found in the 2nd and 3rd formant regions, consistent with speechreaders’ ability to extract the place of articulation. Although this facilitation presumably arises at an early level of processing, the role of temporal coherence was not considered separately from that of phonetic complementarity.
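To make this kind of measure concrete, the sketch below computes the Pearson correlation between a mouth-aperture track and the audio energy envelope in a few frequency bands. It is only a minimal illustration: the band edges, frame rate, and filter order are our assumptions, not the values used by Grant and Seitz (2000).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_envelope(x, fs, lo, hi, frame_hz=50):
    """RMS energy envelope of x band-passed between lo and hi (Hz),
    computed on non-overlapping frames at frame_hz frames per second."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    y = filtfilt(b, a, x)
    hop = int(fs / frame_hz)
    n = len(y) // hop
    frames = y[: n * hop].reshape(n, hop)
    return np.sqrt((frames ** 2).mean(axis=1))

def band_correlations(x, fs, mouth_aperture,
                      bands=((100, 800), (800, 2200), (2200, 6500))):
    """Pearson correlation between a mouth-aperture track (one value per
    video frame) and the audio energy envelope in each band."""
    corrs = []
    for lo, hi in bands:
        env = band_envelope(x, fs, lo, hi)
        n = min(len(env), len(mouth_aperture))
        corrs.append(np.corrcoef(env[:n], mouth_aperture[:n])[0, 1])
    return corrs
```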

The functional specificity of this enhancement process was revealed by an articulatory-feature detection paradigm proposed by Schwartz et al. (2002) (taken up again after the failure of Barker et al. (1998) to find an effect with the {/d/, /g/} contrast). It shows that the near-threshold transmission of the voicing feature is facilitated in the AV relative to the AO condition. Hence, phonetic audio-visual complementarity can be ruled out as an explanation, because the voicing cue is completely absent from the visual information. Remarkably, in this experiment an identification gain was directly measured despite the near-threshold presentation, and the origin of this gain was identified. The level of the interaction appears to be pre-phonetic: in a detection–identification pathway, an audibility gain, corresponding to a detection improvement at the feature level, leads to an intelligibility gain due to better phonetic identification.

These experiments (Grant and Seitz, 2000; Kim and Davis, 2001; Schwartz et al., 2002) firmly establish that an audio-visual interaction operates early, before phonetic integration. Since they are basic detection tasks, however, they reveal little about the details of the process, which must be specified in order to design a computational model. As guidance, in their recent review drawing many parallels between functional and neurophysiological data, Bernstein et al. (2004) distinguish AV interactions that result from information processing (i.e., integration) from those that simply modulate activity levels in the nervous system. Strictly speaking, if we consider that detection is inherent in the process itself, the AV interaction participates in information processing. There is, however, an alternative position in which detection is the task (i.e., the observable) rather than a function of the interaction. The goal of this paper is to propose a type of interaction which is modulatory. The main property to establish is then what we propose to call phonetic neutrality, that is, the ability to enhance portions of the stimulus without biasing its phonetic content.

The mechanism that we propose to explore in the present paper is based on the exploitation of speech envelope cues. As shown by early experiments (Erber, 1972), the speech envelope cues carried by the overall RMS energy are barely intelligible in isolation, but they are complementary to lip-reading cues. When the spectral reduction is not complete, owing to a subband decomposition, speech intelligibility increases remarkably with just four subband envelopes modulating white noise (Shannon et al., 1995). In this case, audio-visual speech complementarity operates effectively, leading to almost perfect intelligibility, because the place of articulation is not well transmitted by the AO signal (Berthommier, 2001), whereas voicing and manner of articulation are well represented. This is consistent with the blurring of formant structures (i.e., peaks and trajectories) in spectrally reduced speech (SRS). However, some place-of-articulation cues that are easy to identify in the acoustic signal remain present, such as the bursts of the plosives /g/ and /k/. The choice of this representation for modeling a modulatory AV interaction was initiated in practice by convergent previous works (Barker and Berthommier, 1999; Berthommier, 2001). In the current framework, however, it is motivated (1) by the finding of correlations between the mouth aperture and the energy envelope in subbands (Grant and Seitz, 2000), (2) by the coarse spectral and amplitude-modulation characteristics of the filtered envelopes, which are compatible with low-level processing, and (3) by the relative phonetic neutrality of speech envelope cues, due to the spectral reduction.
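As an illustration of the SRS representation itself, the following sketch generates four-band noise-vocoded speech in the spirit of Shannon et al. (1995): the signal is split into four bands, each band's amplitude envelope modulates band-limited white noise, and the modulated bands are summed. The band edges and the 30 Hz envelope smoothing are our assumptions, not parameters taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def spectrally_reduced_speech(x, fs, edges=(100, 800, 1500, 2500, 6000)):
    """Noise-vocoded speech: four band envelopes modulating band-limited white noise."""
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))                # amplitude envelope of the band
        bl, al = butter(2, 30 / (fs / 2))          # smooth the envelope (30 Hz, assumed)
        env = filtfilt(bl, al, env)
        noise = filtfilt(b, a, rng.standard_normal(len(x)))  # band-limited noise carrier
        noise /= np.sqrt(np.mean(noise ** 2)) + 1e-12
        out += env * noise
    return out
```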

The last question to address in order to complete the foundation of a computational model concerns the type of AV transformation involved. This was pointed out by Grant and Seitz (2000): “Exactly how much information about the temporal and spectral envelope can be gleaned via speechreading is not clear, although a recent study by Yehia et al. (1998) suggests that 70%–80% of the variance in the rms amplitude can be recovered by nonlinear transformations of facial motion”. The Yehia et al. (1998) study and further confirmations (Barker and Berthommier, 1999; Jiang et al., 2002) reported a significant association between acoustic features (Line Spectral Pairs plus overall RMS energy) and the positions of facial markers. This association can be captured with linear transformations after frame-by-frame training with unlabeled audio-visual data. Thus, it is possible to predict some marker-position information from the audio signal and vice versa.

Section snippets

Possible links between a sound and an image

Various bi-directional links between sounds and images can be exploited. In preliminary works (Berthommier, 2003a, 2003b), we explored audio-visual linear associations for various parameter types, avoiding the use of facial markers. The feasibility of two main applications, video synthesis and speech enhancement, was tested. We summarize the essential aspects of this earlier work below and refer the reader to the original papers for further technical details.

Method

The linear regression transformation matrix T_{xy} from audio data X to video data Y is estimated from the AV synchronous data of the training section of the database (about 20,000 frames, 400 s):

T_{xy} = (Y - \mu_y)(X - \mu_x)^T \left[ (X - \mu_x)(X - \mu_x)^T \right]^{-1}, \qquad Y = T_{xy}(X - \mu_x) + \mu_y

The first coefficient Y(1), carrying the global luminance, is not predicted; instead, the mean of Y(1) calculated over the training section is substituted. In a second stage, the prediction of the 288 DCT coefficients per frame is performed at 50 frames per second.
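A minimal numpy sketch of the closed-form estimation above, with the source parameters X and target parameters Y stored one frame per column; the small ridge term and all variable names are our additions for numerical stability and readability, not part of the paper's method.

```python
import numpy as np

def fit_linear_map(X, Y, ridge=1e-6):
    """Least-squares estimate of T such that Y - mu_y ~= T (X - mu_x).
    X: (d_x, n_frames) source parameters, Y: (d_y, n_frames) target parameters."""
    mu_x = X.mean(axis=1, keepdims=True)
    mu_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y
    # T = Yc Xc^T (Xc Xc^T)^{-1}; the ridge term is our addition for stability.
    T = Yc @ Xc.T @ np.linalg.inv(Xc @ Xc.T + ridge * np.eye(X.shape[0]))
    return T, mu_x, mu_y

def predict(T, mu_x, mu_y, X_new):
    """Predicted target parameters for new source frames."""
    return T @ (X_new - mu_x) + mu_y
```

For the video-synthesis case described above, X would hold the audio parameters and Y the 288 DCT coefficients predicted at 50 frames per second, with Y(1) subsequently replaced by its training-section mean.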

Method

Similarly to video synthesis, the linear transformation matrix T_{yx} from video data Y to audio data X is estimated from the synchronous audio and video frames of the training section of the database:

T_{yx} = (X - \mu_x)(Y - \mu_y)^T \left[ (Y - \mu_y)(Y - \mu_y)^T \right]^{-1}, \qquad X = T_{yx}(Y - \mu_y) + \mu_x

The audio frame duration is 40 ms, half-overlapping, and the three types of predicted parameters for each frame are Sb4 (nbp = 4), LSP (nbp = 24 + 1), and DCT (nbp = 16). In all cases, these coefficients are temporally filtered with a 4th-order low-pass Butterworth filter.
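For the video-to-audio direction, the sketch below follows the same closed form, predicting for instance the four Sb4 envelopes from the video parameters and smoothing them along the frame axis with a 4th-order low-pass Butterworth filter; the cut-off frequency is our assumption, since the excerpt does not give it.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def predict_audio_from_video(Y_train, X_train, Y_test, frame_rate=50.0, cutoff_hz=8.0):
    """Estimate T_yx on the training frames, predict the audio parameters
    (e.g. the four Sb4 envelopes) for the test video frames, then smooth them
    with a 4th-order low-pass Butterworth filter along the frame axis.
    The cut-off frequency is an assumption, not a value from the paper."""
    mu_y = Y_train.mean(axis=1, keepdims=True)
    mu_x = X_train.mean(axis=1, keepdims=True)
    Yc, Xc = Y_train - mu_y, X_train - mu_x
    T_yx = Xc @ Yc.T @ np.linalg.inv(Yc @ Yc.T + 1e-6 * np.eye(Y_train.shape[0]))
    X_pred = T_yx @ (Y_test - mu_y) + mu_x
    b, a = butter(4, cutoff_hz / (frame_rate / 2))     # 50 frames/s from 40 ms half-overlapping frames
    return filtfilt(b, a, X_pred, axis=1)
```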

Motivation

The prediction of the clean speech envelopes by Sb4 is quite good (see Berthommier, 2003b), and the predicted SRS is partly intelligible. To assess the complete neutrality of the modulation, it is necessary to show that the predicted envelope does not bias the audio signal in the temporal domain (as the LSP method does in the spectral domain). This is not taken into account by the RA index, which is essentially a spectral distance. As mentioned in the introduction, the SRS carries residual place-of-articulation cues, such as plosive bursts.
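A minimal sketch of the multiplicative interaction summarized in the abstract: each subband of the (possibly noisy) audio signal is weighted by its time-aligned, video-predicted envelope. The band edges, the per-band peak normalization, and the simple sample-and-hold upsampling are our assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulate_by_predicted_envelopes(audio, fs, env_pred, frame_rate=50.0,
                                    edges=(100, 800, 1500, 2500, 6000)):
    """Re-weight each subband of the audio signal by its video-predicted
    envelope (env_pred: one row per band, one value per frame)."""
    hop = int(fs / frame_rate)
    out = np.zeros_like(audio, dtype=float)
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, audio)
        # sample-and-hold upsampling of the frame-rate envelope, then peak-normalize
        gain = np.repeat(np.clip(env_pred[k], 0.0, None), hop)[: len(band)]
        gain = gain / (gain.max() + 1e-12)
        gain = np.pad(gain, (0, len(band) - len(gain)))
        out += band * gain
    return out
```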

Conclusion

Following this model, one role of the low-level interaction is to reinforce the amplitude modulation of the speech segments without distorting the phonetic cues, spectral or temporal. This could explain a speech detection improvement at the threshold level and, at the supra-threshold level, an intelligibility gain due to the visual cues. For applications, the property of phonetic neutrality allows the model to be used as an enhancement front-end for an audio speech recognition process, or an audio-visual one.

Acknowledgments

This work is part of the CTI-STIC project “Etude psychoacoustique et modélisation computationnelle des mécanismes de décodage acoustico-phonétiques à partir de la parole dégradée spectralement et temporellement” (psychoacoustic study and computational modeling of acoustic-phonetic decoding mechanisms from spectrally and temporally degraded speech). I thank L. Rebut, M. Heckmann and C. Savariaux for building the audio-visual database, and J.-L. Schwartz, K. Grant and P. Welby for many corrections and suggestions.

References (19)

  • Yehia, H., et al., 1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23–43.
  • Barker, J.P., Berthommier, F., Schwartz, J.-L., 1998. Is primitive AV coherence an aid to segment the scene? In: Proc. ...
  • Barker, J.P., Berthommier, F., 1999. Estimation of speech acoustics from visual speech features: a comparison of linear ...
  • Bernstein, L.E., et al., 2004. Audiovisual speech binding: convergence or association.
  • Berthommier, F., 2001. Audio-visual recognition of spectrally reduced speech. In: Proc. AVSP’01, Aalborg, pp. ...
  • Berthommier, F., 2003a. Direct synthesis of video from speech sounds for new telecommunication applications. In: Proc. ...
  • Berthommier, F., 2003b. Audiovisual speech enhancement based on the association between speech envelope and video ...
  • Bregman, A.S., 1990. Auditory Scene Analysis.
  • Erber, N.P., 1972. Speech-envelope cues as an acoustical aid to lip-reading for profoundly deaf children. JASA.
