A phonetically neutral model of the low-level audio-visual interaction

https://doi.org/10.1016/j.specom.2004.10.003

Abstract

The improvement of detectability of visible speech cues found by Grant and Seitz [2000. The use of visible speech cues for improving auditory detection of spoken sentences. JASA 108, 1197–1208] has been related to the degree of correlation between acoustic envelopes and visible movements. This suggests that audio and visual signals could interact early during the audio-visual perceptual process on the basis of audio envelope cues. On the other hand, acoustic-visual correlations were previously reported by Yehia et al. [1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23–43]. Taking into account these two main facts, the problem of extraction of the redundant audio-visual components is revisited: the video parametrization of natural images and three types of audio parameters are tested together, leading to new and realistic applications in video synthesis and audio-visual speech enhancement. Consistent with Grant and Seitz’s prediction, the 4-subband envelope energy features are found to be optimal for encoding the redundant components available for the enhancement task. The proposed computational model of audio-visual interaction is based on the product, in the audio pathway, between the time-aligned audio envelopes and video-predicted envelopes. This interaction scheme is shown to be phonetically neutral, so that it will not bias phonetic identification. The low-level stage which is described is compatible with a late integration process, which may be used as a potential front-end for speech recognition applications.

Introduction

The perception of speech is greatly improved in the presence of visual information, i.e., the mouth movements of the talking face, and a gain of intelligibility of about 10–15 dB is classically reported. In a seminal paper, Summerfield (1987) analyzed the origin of this gain and the potential roles of the visual cues. First, these cues can provide complementary information about the place of articulation, which is often degraded in the audio signal but is the easiest feature to lip-read. Much of the literature has focused on this property, and the other possible factors have attracted little interest. Using a paradigm of speech detection in loud noise, Grant and Seitz (2000) assessed evidence for another mechanism evoked by Summerfield (1987), one based on the temporal coherence between lip movements and speech envelope cues. In the audio-visual (AV) condition, a release from masking of about 1.6 dB was found relative to the audio-only (AO) condition. These results were confirmed by Kim and Davis (2001) with a larger dataset. This facilitation of the detection of speech segments near threshold was attributed to the linear correlation between the mouth aperture and the energy envelope of speech, both overall and decomposed into subbands. A better correlation was found in the 2nd and 3rd formant regions, consistent with speechreaders’ ability to extract the place of articulation. Although this facilitation presumably arises at an early level of processing, the role of temporal coherence was not considered separately from that of phonetic complementarity.
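To make this kind of measure concrete, the sketch below computes the Pearson correlation between a mouth-aperture track and the audio energy envelope in a few frequency bands. It is only a minimal illustration: the band edges, frame rate, and filter order are our assumptions, not the values used by Grant and Seitz (2000).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_envelope(x, fs, lo, hi, frame_hz=50):
    """RMS energy envelope of x band-passed between lo and hi (Hz),
    computed on non-overlapping frames at frame_hz frames per second."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    y = filtfilt(b, a, x)
    hop = int(fs / frame_hz)
    n = len(y) // hop
    frames = y[: n * hop].reshape(n, hop)
    return np.sqrt((frames ** 2).mean(axis=1))

def band_correlations(x, fs, mouth_aperture,
                      bands=((100, 800), (800, 2200), (2200, 6500))):
    """Pearson correlation between a mouth-aperture track (one value per
    video frame) and the audio energy envelope in each band."""
    corrs = []
    for lo, hi in bands:
        env = band_envelope(x, fs, lo, hi)
        n = min(len(env), len(mouth_aperture))
        corrs.append(np.corrcoef(env[:n], mouth_aperture[:n])[0, 1])
    return corrs
```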

The functional specificity of this enhancement process was revealed by an articulatory-feature detection paradigm proposed by Schwartz et al. (2002) (taken up again after the failure of Barker et al. (1998) to find an effect with the {/d/, /g/} contrast). It shows that the near-threshold transmission of the voicing feature is facilitated in the AV relative to the AO condition. Hence, phonetic audio-visual complementarity can be ruled out as an explanation, because the voicing cue is completely absent from the visual information. Remarkably, in this experiment an identification gain was directly measured despite the near-threshold presentation, and the origin of this gain was identified. The level of the interaction appears to be pre-phonetic: in a detection–identification pathway, an audibility gain, corresponding to a detection improvement at the feature level, leads to an intelligibility gain due to better phonetic identification.

These experiments (Grant and Seitz, 2000; Kim and Davis, 2001; Schwartz et al., 2002) firmly establish that an audio-visual interaction operates early, before phonetic integration. Since they are basic detection tasks, however, they reveal little about the details of the process, which must be specified in order to design a computational model. As guidance, in their recent review drawing many parallels between functional and neurophysiological data, Bernstein et al. (2004) distinguish AV interactions that result from information processing (i.e., integration) from those that simply modulate activity levels in the nervous system. Strictly speaking, if we consider that detection is inherent in the process itself, the AV interaction participates in information processing. There is, however, an alternative position in which detection is the task (i.e., the observable) rather than a function of the interaction. The goal of this paper is to propose a type of interaction which is modulatory. The main property to establish is then what we propose to call phonetic neutrality, that is, the ability to enhance portions of the stimulus without biasing its phonetic content.

The mechanism that we propose to explore in the present paper is based on the exploitation of speech envelope cues. As shown by early experiments (Erber, 1972), the speech envelope cues carried by the overall RMS energy are barely intelligible in isolation, but they are complementary to lip-reading cues. When the spectral reduction is not complete, owing to a subband decomposition, speech intelligibility increases remarkably with just four subband envelopes modulating white noise (Shannon et al., 1995). In this case, audio-visual speech complementarity operates effectively, leading to almost perfect intelligibility, because the place of articulation is not well transmitted by the AO signal (Berthommier, 2001), whereas voicing and manner of articulation are well represented. This is consistent with the blurring of formant structures (i.e., peaks and trajectories) in spectrally reduced speech (SRS). However, some place-of-articulation cues that are easy to identify in the acoustic signal remain present, such as the bursts of the plosives /g/ and /k/. The choice of this representation for modeling a modulatory AV interaction was initiated in practice by convergent previous works (Barker and Berthommier, 1999; Berthommier, 2001). In the current framework, however, it is motivated (1) by the finding of correlations between the mouth aperture and the energy envelope in subbands (Grant and Seitz, 2000), (2) by the coarse spectral and amplitude-modulation characteristics of the filtered envelopes, which are compatible with low-level processing, and (3) by the relative phonetic neutrality of speech envelope cues, due to the spectral reduction.
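As an illustration of the SRS representation itself, the following sketch generates four-band noise-vocoded speech in the spirit of Shannon et al. (1995): the signal is split into four bands, each band's amplitude envelope modulates band-limited white noise, and the modulated bands are summed. The band edges and the 30 Hz envelope smoothing are our assumptions, not parameters taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def spectrally_reduced_speech(x, fs, edges=(100, 800, 1500, 2500, 6000)):
    """Noise-vocoded speech: four band envelopes modulating band-limited white noise."""
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))                # amplitude envelope of the band
        bl, al = butter(2, 30 / (fs / 2))          # smooth the envelope (30 Hz, assumed)
        env = filtfilt(bl, al, env)
        noise = filtfilt(b, a, rng.standard_normal(len(x)))  # band-limited noise carrier
        noise /= np.sqrt(np.mean(noise ** 2)) + 1e-12
        out += env * noise
    return out
```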

The last question to address in order to complete the foundation of a computational model concerns the type of AV transformation involved. This was pointed out by Grant and Seitz (2000): “Exactly how much information about the temporal and spectral envelope can be gleaned via speechreading is not clear, although a recent study by Yehia et al. (1998) suggests that 70%–80% of the variance in the rms amplitude can be recovered by nonlinear transformations of facial motion”. The Yehia et al. (1998) study and further confirmations (Barker and Berthommier, 1999; Jiang et al., 2002) reported a significant association between acoustic features (Line Spectral Pairs plus overall RMS energy) and the positions of facial markers. This association can be captured with linear transformations after frame-by-frame training with unlabeled audio-visual data. Thus, it is possible to predict some marker-position information from the audio signal and vice versa.

Section snippets

Possible links between a sound and an image

Various bi-directional links between sounds and images can be exploited. In preliminary works (Berthommier, 2003a, 2003b), we explored audio-visual linear associations for various parameter types, avoiding the use of facial markers. The feasibility of two main applications, video synthesis and speech enhancement, was tested. We summarize the essential aspects of this earlier work below and refer the reader to the original papers for further technical details.

Method

The linear regression transformation matrix T_{xy} from audio data X to video data Y is estimated from the AV synchronous data of the training section of the database (about 20,000 frames, 400 s):

T_{xy} = (Y - \mu_y)(X - \mu_x)^T \left[ (X - \mu_x)(X - \mu_x)^T \right]^{-1}, \qquad Y = T_{xy}(X - \mu_x) + \mu_y

The first coefficient Y(1), carrying the global luminance, is not predicted; instead, the mean of Y(1) calculated over the training section is substituted. In a second stage, the prediction of the 288 DCT coefficients per frame is performed at 50 frames per second.
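A minimal numpy sketch of the closed-form estimation above, with the source parameters X and target parameters Y stored one frame per column; the small ridge term and all variable names are our additions for numerical stability and readability, not part of the paper's method.

```python
import numpy as np

def fit_linear_map(X, Y, ridge=1e-6):
    """Least-squares estimate of T such that Y - mu_y ~= T (X - mu_x).
    X: (d_x, n_frames) source parameters, Y: (d_y, n_frames) target parameters."""
    mu_x = X.mean(axis=1, keepdims=True)
    mu_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y
    # T = Yc Xc^T (Xc Xc^T)^{-1}; the ridge term is our addition for stability.
    T = Yc @ Xc.T @ np.linalg.inv(Xc @ Xc.T + ridge * np.eye(X.shape[0]))
    return T, mu_x, mu_y

def predict(T, mu_x, mu_y, X_new):
    """Predicted target parameters for new source frames."""
    return T @ (X_new - mu_x) + mu_y
```

For the video-synthesis case described above, X would hold the audio parameters and Y the 288 DCT coefficients predicted at 50 frames per second, with Y(1) subsequently replaced by its training-section mean.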

Method

Similarly to video synthesis, the linear transformation matrix T_{yx} from video data Y to audio data X is estimated from the synchronous audio and video frames of the training section of the database:

T_{yx} = (X - \mu_x)(Y - \mu_y)^T \left[ (Y - \mu_y)(Y - \mu_y)^T \right]^{-1}, \qquad X = T_{yx}(Y - \mu_y) + \mu_x

The audio frame duration is 40 ms, half-overlapping, and the three types of predicted parameters for each frame are Sb4 (nbp = 4), LSP (nbp = 24 + 1), and DCT (nbp = 16). In all cases, these coefficients are temporally filtered with a 4th-order low-pass Butterworth filter.
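For the video-to-audio direction, the sketch below follows the same closed form, predicting for instance the four Sb4 envelopes from the video parameters and smoothing them along the frame axis with a 4th-order low-pass Butterworth filter; the cut-off frequency is our assumption, since the excerpt does not give it.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def predict_audio_from_video(Y_train, X_train, Y_test, frame_rate=50.0, cutoff_hz=8.0):
    """Estimate T_yx on the training frames, predict the audio parameters
    (e.g. the four Sb4 envelopes) for the test video frames, then smooth them
    with a 4th-order low-pass Butterworth filter along the frame axis.
    The cut-off frequency is an assumption, not a value from the paper."""
    mu_y = Y_train.mean(axis=1, keepdims=True)
    mu_x = X_train.mean(axis=1, keepdims=True)
    Yc, Xc = Y_train - mu_y, X_train - mu_x
    T_yx = Xc @ Yc.T @ np.linalg.inv(Yc @ Yc.T + 1e-6 * np.eye(Y_train.shape[0]))
    X_pred = T_yx @ (Y_test - mu_y) + mu_x
    b, a = butter(4, cutoff_hz / (frame_rate / 2))     # 50 frames/s from 40 ms half-overlapping frames
    return filtfilt(b, a, X_pred, axis=1)
```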

Motivation

The prediction of the clean speech envelopes by Sb4 is quite good (see Berthommier, 2003b), and the predicted SRS is partly intelligible. To assess the complete neutrality of the modulation, it is necessary to show that the predicted envelope does not bias the audio signal in the temporal domain (as the LSP method does in the spectral domain). This is not taken into account by the RA index, which is essentially a spectral distance. As mentioned in the introduction, the SRS carries residual place-of-articulation cues, such as plosive bursts.
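A minimal sketch of the multiplicative interaction summarized in the abstract: each subband of the (possibly noisy) audio signal is weighted by its time-aligned, video-predicted envelope. The band edges, the per-band peak normalization, and the simple sample-and-hold upsampling are our assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulate_by_predicted_envelopes(audio, fs, env_pred, frame_rate=50.0,
                                    edges=(100, 800, 1500, 2500, 6000)):
    """Re-weight each subband of the audio signal by its video-predicted
    envelope (env_pred: one row per band, one value per frame)."""
    hop = int(fs / frame_rate)
    out = np.zeros_like(audio, dtype=float)
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, audio)
        # sample-and-hold upsampling of the frame-rate envelope, then peak-normalize
        gain = np.repeat(np.clip(env_pred[k], 0.0, None), hop)[: len(band)]
        gain = gain / (gain.max() + 1e-12)
        gain = np.pad(gain, (0, len(band) - len(gain)))
        out += band * gain
    return out
```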

Conclusion

Following this model, one role of the low-level interaction is to reinforce the amplitude modulation of the speech segments without distorting the phonetic cues, spectral or temporal. This could explain a speech detection improvement at the threshold level and, at the supra-threshold level, an intelligibility gain due to the visual cues. For applications, the property of phonetic neutrality allows the model to be used as an enhancement front-end for an audio speech recognition process, or an audio-visual one.

Acknowledgments

This work is part of the CTI-STIC project “Etude psychoacoustique et modélisation computationnelle des mécanismes de décodage acoustico-phonétiques à partir de la parole dégradée spectralement et temporellement” (psychoacoustic study and computational modeling of acoustic-phonetic decoding mechanisms from spectrally and temporally degraded speech). I thank L. Rebut, M. Heckmann and C. Savariaux for building the audio-visual database, and J.-L. Schwartz, K. Grant and P. Welby for many corrections and suggestions.

References (19)

  • Yehia, H., et al., 1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23–43.
  • Barker, J.P., Berthommier, F., Schwartz, J.-L., 1998. Is primitive AV coherence an aid to segment the scene? In: Proc. ...
  • Barker, J.P., Berthommier, F., 1999. Estimation of speech acoustics from visual speech features: a comparison of linear ...
  • Bernstein, L.E., et al., 2004. Audiovisual speech binding: convergence or association.
  • Berthommier, F., 2001. Audio-visual recognition of spectrally reduced speech. In: Proc. AVSP’01, Aalborg, pp. ...
  • Berthommier, F., 2003a. Direct synthesis of video from speech sounds for new telecommunication applications. In: Proc. ...
  • Berthommier, F., 2003b. Audiovisual speech enhancement based on the association between speech envelope and video ...
  • Bregman, A.S., 1990. Auditory Scene Analysis.
  • Erber, N.P., 1972. Speech-envelope cues as an acoustical aid to lip-reading for profoundly deaf children. JASA.
