Parallel linear dynamic models can mimic the McGurk effect in clinical populations

Journal of Computational Neuroscience

Abstract

One of the most common examples of audiovisual speech integration is the McGurk effect: an auditory syllable /ba/ dubbed over incongruent lip movements producing “ga” typically causes listeners to hear “da”. This report hypothesizes reasons why certain clinical populations and listeners who are hard of hearing might be more susceptible to visual influence. Conversely, we also examine why other listeners appear less susceptible to the McGurk effect (i.e., they report hearing the auditory stimulus without being influenced by the visual signal). These explanations are accompanied by mechanistic accounts of integration phenomena, including visual inhibition of auditory information and slower rates of input accumulation. First, simulations of a linear dynamic parallel interactive model were instantiated using inhibition and facilitation to examine potential mechanisms underlying integration. In a second set of simulations, we systematically manipulated the inhibition parameter values to model data obtained from listeners with autism spectrum disorder. In summary, we argue that cross-modal inhibition parameter values explain individual variability in susceptibility to the McGurk effect. Nonetheless, different mechanisms should continue to be explored in an effort to better understand current data patterns in the audiovisual integration literature.

References

  • Altieri, N. (2016). Why we hear what we see: The temporal dynamics of audiovisual speech integration. In J. W. Houpt & L. M. Blaha (Eds.), Mathematical models of perception and cognition (Vols. 1–2).

  • Altieri, N., & Hudock, D. (2014). Variability in audiovisual speech integration skills assessed by combined capacity and accuracy measures. International Journal of Audiology, 53, 710–718.

  • Altieri, N., & Hudock, D. (2016). Normative data on audiovisual speech integration using sentence recognition and capacity measures. International Journal of Audiology, 55, 206–214.

  • Altieri, N., Pisoni, D. B., & Townsend, J. T. (2011). Some behavioral and neurobiological constraints on theories of audiovisual speech integration: a review and suggestions for new directions. Seeing and Perceiving, 24, 513–539.

  • Altieri, N., Lentz, J., Townsend, J.T., & Wenger, M.J. (2016). The McGurk effect: An investigation of attentional capacity employing response times. Attention, Perception, & Psychophysics. (Online First)

  • Bergeson, T. R., & Pisoni, D. B. (2004). Audiovisual speech perception in deaf adults and children following cochlear implantation. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processes (pp. 153–176). Cambridge, MA: The MIT Press.

  • Bishop, C. W., & Miller, L. M. (2011). Speech cues contribute to audiovisual spatial integration. PLoS ONE, 6(8), e24016.

  • Cienkowski, K. M., & Carney, A. E. (2002). Auditory–visual speech perception and aging. Ear and Hearing, 23, 439–449.

  • Dodd, B., McIntosh, B., Erdener, D., & Burnham, D. (2008). Perception of the auditory–visual illusion in speech perception by children with phonological disorders. Clinical Linguistics & Phonetics, 22(1), 69–82.

  • Dupont, S., Aubin, J., & Menard, L. (2005). A study of the McGurk effect in 4 and 5-year-old French Canadian children. ZAS Papers in Linguistics, 40, 1–17.

  • Eidels, A., Houpt, J., Altieri, N., Pei, L., & Townsend, J. T. (2011). Nice guys finish fast and bad guys finish last: a theory of interactive parallel processing. Journal of Mathematical Psychology, 55(2), 176–190.

  • French-St. George, M., & Stoker, R. G. (1988). Speechreading: an historical perspective. The Volta Review, 90(5), 17–31.

  • Johnson, S. A., Blaha, L. M., Houpt, J. W., & Townsend, J. T. (2010). Systems factorial technology provides new insights on global–local information processing in autism spectrum disorders. Journal of Mathematical Psychology, 54, 53–72.

  • Magnotti, J. F., & Beauchamp, M. S. (2015). The noisy encoding of disparity model of the McGurk effect. Psychonomic Bulletin & Review, 22, 701–709.

  • Mallick, D. B., Magnotti, J. F., & Beauchamp, M. S. (2015). Variability and stability in the McGurk effect: contributions of participants, stimuli, time, and response type. Psychonomic Bulletin & Review, 22, 1299–1307.

  • Massaro, D. W. (1987). Speech perception by ear and eye. In B. Dodd & R. Campbell (Eds.), Hearing by eye: the psychology of lip-reading (pp. 53–83). Hillsdale, NJ: Lawrence Erlbaum.

  • Massaro, D. W. (2004). From multisensory integration to talking heads and language learning. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processes (pp. 153–176). Cambridge, MA: The MIT Press.

  • McGurk, H., & MacDonald, J. W. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

  • Miller, J. (1982). Divided attention: evidence for coactivation with redundant signals. Cognitive Psychology, 14, 247–279.

  • Norrix, L. W., Plante, E., & Vance, R. (2006). Auditory–visual speech integration by adults with and without language-learning disabilities. Journal of Communication Disorders, 39(1), 22–36.

  • Rosenblum, L. D., Schmuckler, M. A., & Johnson, J. A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59(3), 347–357.

  • Sekiyama, K., Sochi, T., & Sakamoto, S. (2013). Enhanced audiovisual integration with aging in speech perception: a heightened McGurk effect in older adults. Frontiers in Psychology, 5(323).

  • Setti, A., Burke, K. E., Kenny, R., & Newell, F. N. (2013). Susceptibility to a multisensory speech illusion in older persons is driven by perceptual processes. Frontiers in Psychology, 5(323), 1–11.

  • Soto-Faraco, S., Navarra, J., & Alsius, A. (2004). Assessing automaticity in audiovisual speech integration: evidence from the speeded classification task. Cognition, 92, B13–B23.

  • Stevenson, R. A., Zemtsov, R. K., & Wallace, M. T. (2012). Individual differences in the multisensory temporal binding window predict susceptibility to audiovisual illusions. Journal of Experimental Psychology: Human Perception and Performance, 38, 1517–1529.

  • Stevenson, R. A., Siemann, J. K., Woynaroski, T. G., Schneider, B. C., Camarata, S. M., & Wallace, M. T. (2014). Arrested development of audiovisual speech perception in autism spectrum disorders. Journal of Autism and Developmental Disorders, 44(6), 1470–1477.

  • Strelnikov, K., Rouger, J., Lagleyre, S., Fraysse, J.-F., Demonet, O., & Barone, P. (2015). Increased audiovisual integration in cochlear-implanted deaf patients: independent components analysis of longitudinal positron emission tomography data. European Journal of Neuroscience, 41, 677–685.

  • Sumby, W., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.

  • Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), The psychology of lip-reading (pp. 3–50). Hillsdale, NJ: LEA.

  • Tiippana, K., Andersen, T. S., & Sams, M. (2004). Visual attention modulates audiovisual speech perception. European Journal of Cognitive Psychology, 16(3), 457–472.

  • Tiippana, K., Möttönen, R., & Schwartz, J.-L. (2015). Multisensory and sensorimotor interactions in speech perception. Frontiers in Psychology, 6, 1–3.

  • Townsend, J. T. & Nozawa, G. (1995). Spatio-temporal properties of elementary perception: An investigation of parallel, serial and coactive theories. Journal of Mathematical Psychology, 39, 321–360.

  • Townsend, J. T., & Wenger, M. J. (2004). A theory of interactive parallel processing: new capacity measures and predictions for a response time inequality series. Psychological Review, 111(4), 1003–1035.

  • van Wassenhove, V., Grant, K., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences, U.S.A., 102, 1181–1186.

  • Wallace, M. T., Carriere, B. N., Perrault, T. J., Vaughan, J. W., & Stein, B. E. (2006). The development of cortical multisensory neurons. The Journal of Neuroscience, 15, 11844–11849.

  • Werker, J. F., & Hensch, T. K. (2015). Critical periods in speech perception: new directions. Annual Review of Psychology, 66, 173–196.

  • White, T. P., Wigton, R. L., Joyce, D. W., Bobin, T., Ferragamo, C., Wasim, N., et al. (2014). Eluding the illusion? Schizophrenia, dopamine and the McGurk effect. Frontiers in Human Neuroscience, 9(565), 1–12.

  • Woynaroski, T. G., Kwakye, L. D., Foss-Feig, J. H., Stevenson, R. A., Stone, W. L., & Wallace, M. T. (2013). Multisensory speech perception in children with autism spectrum disorders. Journal of Autism and Developmental Disorders, 43, 2891–2902.

Acknowledgments

The project described was supported by Grant No. (NIGMS) 5U54GM104944-03. Portions of this report, including the basic model set-up, appeared in the author’s Doctoral Dissertation and in Altieri (2016). Finally, we thank Ryan A. Stevenson for his data set.

Author information

Correspondence to Nicholas Altieri.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Action Editor: Jonathan David Victor

Appendix A

1.1 Model specifications

The linear dynamic parallel interactive model and its parameters were derived from the systems utilized by Eidels et al. (2011), and previously by Townsend and Wenger (2004), to describe how interactions can lead to changes in capacity. This Appendix also appears in Altieri (2016). The model is specifically tailored to include three auditory and three visual inputs along with the internal representations of these phonemic categories. As an example, the vector x(t) represents the level of accumulation, or activation, for a given phonemic category (for example, x_b(t) = auditory /b/):

$$ \mathbf{x}(t) = \begin{bmatrix} x_b(t) \\ x_g(t) \\ \vdots \end{bmatrix} $$

Rows 3–6 of the vector represent the level of evidence accumulated for auditory /d/ and for visual /b/, /g/, and /d/, respectively. Next, the vector u denotes the level of input to each phonemic category in the auditory (first 3 rows) and visual (last 3 rows) modalities:

$$ \mathbf{u} = \begin{bmatrix} u_{/b/} \\ u_{/g/} \\ \vdots \end{bmatrix} $$

(One simplifying assumption here is that the phonemic inputs are specified by constants rather than functions of time; e.g., for t > 0: u_1(t) = u_1 and u_2(t) = u_2.)

A complete model with three auditory and three visual inputs, and with visual-to-auditory (V➔A) and auditory-to-visual (A➔V) connections, requires a 6 × 6 activation matrix:

$$ \mathbf{A} = \begin{bmatrix} a_{A/b/} & \cdots & a_{A/b/;V/d/} \\ \vdots & \ddots & \vdots \\ a_{A/d/;V/b/} & \cdots & a_{V/d/} \end{bmatrix} $$

(For simplicity, within-modality interactions are left out of this version of the model.) In this model, the diagonal elements represent the stabilizing parameters for each auditory and visual accumulator. For example, a_{A/b/} represents the stabilizing parameter for auditory /b/ (i.e., A/b/), and a_{V/d/}, found in row 6 and column 6, represents the stabilizing parameter for visual /d/. The cross-modal inhibition parameters, particularly the one for V/g/ ➔ A/b/, will be important in the following simulations because the lip motion used to produce a visually articulated /g/ is incompatible with the lip motion used to produce a /b/ sound; as a result, we can assign this parameter a negative value to include visual inhibition in the model.
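
To make the construction concrete, here is a minimal NumPy sketch of the state vector, input vector, and 6 × 6 feedback matrix. All numerical values (the inputs, the stabilizing parameters, and the strength of the V/g/ ➔ A/b/ inhibition) are illustrative placeholders rather than the values used in the reported simulations, and negative diagonal entries are assumed so that the deterministic system remains stable.

```python
import numpy as np

# Channel ordering assumed throughout: A/b/, A/g/, A/d/, V/b/, V/g/, V/d/
channels = ["A/b/", "A/g/", "A/d/", "V/b/", "V/g/", "V/d/"]

# Input vector u: constant drive to each phonemic channel.
# A McGurk trial presents auditory /b/ together with visual /g/.
u = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])

# Feedback matrix A: the diagonal holds the stabilizing parameters (assumed
# negative); off-diagonal entries carry cross-modal interactions, with
# within-modality interactions omitted as in the text.
A = np.diag([-1.0] * 6)

# Cross-modal inhibition of A/b/ by V/g/ (row = target channel, column = source).
A[channels.index("A/b/"), channels.index("V/g/")] = -0.3

# State vector x(t): accumulated activation, starting from zero.
x0 = np.zeros(6)
```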

Assuming separate sources of noise, setting the cross-modal interaction parameters (e.g., a_{12} and a_{21}) to 0 renders the model mathematically equivalent to an independent parallel model with unlimited capacity (see Miller 1982; Townsend and Nozawa 1995). Eidels et al. (2011) utilized parameter values that enforced stability in the system; stability was maintained by choosing the stabilizing parameters (e.g., a_{A/b/}) and the interaction parameters such that |a_{A/b/;V/d/}| < |a_{A/b/}|. (The model can be further simplified by assuming equivalent accumulation rates and cross-channel activations for all of the parameters of interest.) Refer to Townsend and Wenger (2004) for discussion of an early interaction matrix denoted by B (integration, or cross-modal interaction occurring at these early perceptual stages, will not be discussed here).
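
As a hedged sanity check (not a procedure described in the paper), these constraints can be verified numerically: each interaction magnitude should be smaller than the corresponding stabilizing parameter, the eigenvalues of A should have negative real parts, and zeroing the off-diagonal entries should recover the independent parallel model. This sketch assumes the A defined above.

```python
# Verify |interaction| < |stabilizing parameter| for every target channel (row).
diag_mag = np.abs(np.diag(A))
offdiag_mag = np.abs(A - np.diag(np.diag(A)))
assert np.all(offdiag_mag.max(axis=1) < diag_mag)

# A stable linear system requires eigenvalues of A with negative real parts.
assert np.all(np.linalg.eigvals(A).real < 0)

# With cross-modal interactions set to 0, the model reduces to an
# independent, unlimited-capacity parallel model.
A_independent = np.diag(np.diag(A))
```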

The differential equation (Eq. (1)) constitutes a deterministic model (i.e., Gaussian noise is not yet included) with three auditory and three visual phonemic inputs:

$$ \frac{d}{dt}\mathbf{x}(t) = \mathbf{A}\mathbf{x}(t) + \mathbf{u}(t) = \begin{bmatrix} a_{A/b/} & \cdots & a_{A/b/;V/d/} \\ \vdots & \ddots & \vdots \\ a_{A/d/;V/b/} & \cdots & a_{V/d/} \end{bmatrix} \cdot \begin{bmatrix} x_b(t) \\ x_g(t) \\ \vdots \end{bmatrix} + \begin{bmatrix} u_{/b/} \\ u_{/g/} \\ \vdots \end{bmatrix} $$
(1)

Importantly, for the deterministic system, there exists a closed-form solution to the differential equation describing the level of activation in each channel at time t. Eq. (2) shows an example interaction in the accumulation rates between auditory /b/ and visual /g/ within a hypothetical listener. The full model, showing a graphical view of the interactions, is specified in Fig. 1c. (A version of this equation was implemented in the simulations carried out by Eidels et al. (2011).)

$$ \begin{aligned} x_{A/b/}(t) &= \frac{u_{A/b/}+u_{V/g/}}{2\left(a_{A/b/}+a_{A/b/;V/g/}\right)}\left[\exp\left[\left(a_{A/b/}+a_{A/b/;V/g/}\right)t\right]-1\right] + \frac{u_{A/b/}-u_{V/g/}}{2\left(a_{V/g/}-a_{A/b/;V/g/}\right)}\left[\exp\left[\left(a_{A/b/}-a_{A/b/;V/g/}\right)t\right]-1\right] \\ x_{V/g/}(t) &= \frac{u_{V/g/}+u_{A/b/}}{2\left(a_{V/g/}+a_{A/b/;V/g/}\right)}\left[\exp\left[\left(a_{V/g/}+a_{A/b/;V/g/}\right)t\right]-1\right] + \frac{u_{V/g/}-u_{A/b/}}{2\left(a_{V/g/}-a_{A/b/;V/g/}\right)}\left[\exp\left[\left(a_{V/g/}-a_{A/b/;V/g/}\right)t\right]-1\right] \end{aligned} $$
(2)
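
For illustration, the closed-form expressions in Eq. (2) can be evaluated directly. The sketch below adopts the simplification noted above, equal stabilizing parameters a = a_{A/b/} = a_{V/g/}, and writes the cross-modal term as c = a_{A/b/;V/g/}; all numerical values are placeholders, not fitted parameters.

```python
import numpy as np

def closed_form_activation(t, u_Ab, u_Vg, a=-1.0, c=-0.3):
    """Deterministic activations x_{A/b/}(t) and x_{V/g/}(t) from Eq. (2),
    assuming equal stabilizing parameters (a) and cross-modal term (c)."""
    fast = (np.exp((a + c) * t) - 1.0) / (2.0 * (a + c))
    slow = (np.exp((a - c) * t) - 1.0) / (2.0 * (a - c))
    x_Ab = (u_Ab + u_Vg) * fast + (u_Ab - u_Vg) * slow
    x_Vg = (u_Vg + u_Ab) * fast + (u_Vg - u_Ab) * slow
    return x_Ab, x_Vg

# Activation trajectories over 0-3 time units for a McGurk-style input,
# with a negative c implementing visual inhibition of the auditory channel.
t = np.linspace(0.0, 3.0, 301)
x_Ab, x_Vg = closed_form_activation(t, u_Ab=1.0, u_Vg=0.8)
```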

The model was made stochastic by adding independent and identically distributed Gaussian white noise η(t) to each of the inputs. The stochastic version of the differential equation describing the model with cross-talk between channels is provided in Eq. (3). The interested reader is referred to Higham (2001), Øksendal (1985), and Smith (2000) for tutorials that include methods for working with and approximating stochastic differential equations.

$$ d\mathbf{x}(t) = \left[\mathbf{A}\mathbf{x}(t) + \mathbf{u}\right]dt + \sigma \mathbf{I}\, d\mathbf{B}(t) $$
(3)

In Eq. (3), A is the 6 × 6 feedback matrix, u is the 6 × 1 input vector, σI is the 6 × 6 matrix denoting the error term, and B(t) is the six-dimensional Brownian motion process.
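
Because the stochastic system is typically simulated rather than solved, a standard Euler–Maruyama scheme of the kind presented in the Higham (2001) tutorial can approximate sample paths of Eq. (3). This sketch reuses the illustrative A and u defined earlier; the step size, time horizon, and noise scale are assumptions chosen only for demonstration.

```python
import numpy as np

def simulate_paths(A, u, sigma=0.1, dt=1e-3, T=3.0, rng=None):
    """Euler-Maruyama approximation of dx = (A x + u) dt + sigma * I dB."""
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(round(T / dt))
    x = np.zeros((n_steps + 1, len(u)))
    for k in range(n_steps):
        drift = A @ x[k] + u
        x[k + 1] = x[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(len(u))
    return x

# One simulated trial of the six-channel McGurk configuration.
path = simulate_paths(A, u)
```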

1.2 Decision process

In the interactive model, perceptual evidence for different categories (words, syllables, phonemes, or visual visemes) accrues in separate channels. Activation (i.e., x(t)) is accumulated over a finite time period, although the model in its present form does not yet account for the decision process required for phoneme or “speech recognition”. To accomplish this, the model must be augmented in two key ways. First, a decision threshold is necessary in order to determine when enough phonemic evidence has accumulated in a specific channel (e.g., auditory /b/, auditory /d/, etc.) for a response. This can be accomplished by adding accumulation thresholds to the auditory and visual channels, represented by the parameter “γ”.

How is a decision made? A logical gate, in this case an OR gate, is imposed on the system. It stipulates that recognition occurs at the time point at which the first auditory phoneme, represented by x_i(t), reaches the threshold. While visual (viseme) activation is included in the model, the decision only involves auditory phonemes; hence, the race is really only among auditory phonemes. Logical gates thus represent the decisional aspects of the system in the sense that they determine whether processing finishes when one auditory phoneme reaches threshold (OR), or instead whether it must wait for multiple channels, such as each of the three auditory and/or visual phonemes in this model, to finish processing (AND). The termination rule determines the form of the cumulative distribution function (CDF) of RTs. The CDF for the OR rule is given by P(OR RT ≤ t) = P(T_A/b/ ≤ t OR T_A/g/ ≤ t OR T_A/d/ ≤ t) = P(x_/b/(t) > γ OR x_/g/(t) > γ OR x_/d/(t) > γ), where T_A/b/, T_A/g/, and T_A/d/ denote the random variables for RTs on the auditory channels. It is possible to slightly amend the model to allow decisions once an auditory or visual phoneme reaches threshold; in that case, if a viseme reached threshold first, the viewer would recognize the input by sight rather than audition.
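
A minimal sketch of the OR rule, under the same illustrative assumptions as the earlier snippets: simulate a trial, record the first time any auditory channel exceeds the threshold γ, and repeat across trials to build an empirical RT distribution. The threshold value and the channel ordering are placeholders carried over from those sketches.

```python
import numpy as np

def or_rule_rt(path, gamma=0.4, dt=1e-3, auditory_idx=(0, 1, 2)):
    """Return the first time any auditory channel's activation exceeds gamma
    (OR gate), or np.inf if no auditory channel reaches threshold."""
    crossed = np.any(path[:, list(auditory_idx)] > gamma, axis=1)
    hits = np.flatnonzero(crossed)
    return hits[0] * dt if hits.size else np.inf

# Empirical distribution of OR-rule RTs across simulated trials.
rts = np.array([or_rule_rt(simulate_paths(A, u)) for _ in range(200)])
finite_rts = rts[np.isfinite(rts)]  # trials on which an auditory channel finished
```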

About this article

Cite this article

Altieri, N., Yang, CT. Parallel linear dynamic models can mimic the McGurk effect in clinical populations. J Comput Neurosci 41, 143–155 (2016). https://doi.org/10.1007/s10827-016-0610-z
