Speech Communication, Volume 108, April 2019, Pages 53-64

Why listening in background noise is harder in a non-native language than in a native language: A review

https://doi.org/10.1016/j.specom.2019.03.001

Highlights

  • Non-native spoken-word recognition is harder than native spoken-word recognition.

  • The effect of background noise is similar in native and non-native listening.

  • Native and non-native listening have the same underlying cognitive architecture.

  • Background noise has an effect on all cognitive speech recognition processes.

  • Language exposure largely determines spoken-word recognition performance.

Abstract

There is ample evidence that recognising words in a non-native language is more difficult than in a native language, even for those with a high proficiency in the non-native language involved, and particularly in the presence of background noise. Why is this the case? To answer this question, this paper provides a systematic review of the literature on non-native spoken-word recognition in the presence of background noise, and posits an updated theory on the effect of background noise on native and non-native spoken-word recognition. The picture that arises is that although spoken-word recognition in the presence of background noise is harder in a non-native language than in one's native language, this difference is not due to a differential effect of background noise on native and non-native listening. Rather, it can be explained by differences in language exposure, which influences the uptake and use of phonetic and contextual information in the speech signal for spoken-word recognition.

Introduction

Successful speech recognition is a key factor for social integration and communication. At the same time, people who speak more than one language are thought to outnumber monolingual speakers. Most people who learned another language some years after acquiring their first language (referred to as non-native listeners), even those with a high proficiency in the non-native language involved, will have noticed that communication, and especially understanding the specific words that have been spoken, is more difficult in a non-native than in a native language, particularly in the presence of background noise (see for experimental evidence, e.g., Borghini and Hazan, 2018, Bradlow and Alexander, 2007, Mayo et al., 1997, Meador et al., 2000, Scharenborg et al., 2018a; note that even early or simultaneous bilinguals, i.e., people who learned two or more languages (nearly) simultaneously from an early age, have been found to suffer more from the presence of background noise than monolingual listeners; e.g., Mayo et al., 1997). The main reason for this problem seems obvious: imperfect knowledge of the language and the presence of background noise (together referred to as adverse listening conditions) interact strongly to the listener's disadvantage (e.g., Bradlow and Pisoni, 1999, García Lecumberri et al., 2010, Mayo et al., 1997).

Research on non-native listening in noise has, so far, mostly focussed on phoneme perception, showing that phoneme perception in noise is worse for non-native listeners than for native listeners (e.g., Broersma and Scharenborg, 2010, Cooke et al., 2010; see for a review, Garcia Lecumberri et al., 2006), an effect that is now fairly well understood (see Section 1.2). In recent years, an increasing number of studies have been published focussing on the effect of the presence of background noise on word recognition. The results of these studies again show a native advantage. The obvious question is: why is word recognition in the presence of background noise harder in a non-native language than in one's native language? Since different studies used different research methodologies, with different stimuli, tasks, noise levels, and types of noise, and tested different groups of participants with various language backgrounds and proficiency levels, and since a systematic comparison of these studies is lacking, this question is not easily answered.

The aim of this paper is to answer the above question. In order to do so, this paper first provides a review of the literature on non-native spoken-word recognition in the presence of background noise, in which the studies (see Appendix A for an overview of the papers on non-native spoken-word recognition in background noise discussed in this review; we will refer to these papers as the ‘sample’1) on non-native spoken-word recognition in background noise are for the first time systematically compared in order to understand the size of the native advantage (Section 2), the effect of different types of noise on spoken-word recognition (Section 3), the role of semantic and prosodic context (Section 4), and the role of individual differences in proficiency in the non-native language and cognitive abilities (Section 5). As such, this paper provides the first review of the literature on non-native spoken-word recognition in background noise. In the final section of this paper, the reviewed research is synthesised into an updated theoretical account of the effect of background noise on native and non-native listening (which thus abstracts away from the methodological differences between studies). In the remainder of this section we briefly summarise the processes underlying native and non-native spoken-word recognition and the effect of noise on these processes before turning to the question why word recognition in the presence of background noise is more difficult in a non-native language compared to a native language in Sections 2-5.

The spoken-word recognition process can be viewed as the search for the optimal mapping of the acoustic speech signal onto a word. Several (computational) models of (native) word recognition have been proposed, such as TRACE (McClelland and Elman, 1986), Shortlist (Norris, 1994), and PARSYN (Luce et al., 2000; see for reviews: McQueen, 2005, Weber and Scharenborg, 2012). Most influential models agree on the following.

As the auditory information unfolds over time, it is mapped onto stored representations of the words in the mental lexicon. This process is generally viewed as consisting of three underlying cognitive processes. First, all words that partly overlap with the input, irrespective of their onsets, are activated simultaneously (e.g., Allopenna et al., 1998, Gow and Gordon, 1995, Luce and Pisoni, 1998, Slowiaczek et al., 1987, Zwitserlood, 1989). This is referred to as the multiple activation process. As each language only has a limited set of phonemes from which all the words in that language are built (Maddieson, 1984), words are often highly similar (e.g., tall, ball, mall, call only differ in their first consonant), and shorter words are often embedded in longer words (e.g., sun, I, rye, rise, rises in sunrises). So upon hearing the word tall, all other words that resemble it, e.g., ball, mall, call, tell, toll, etc., will also be activated and compete for recognition (for a review, see McQueen, 2005). The number and nature of the words (the ‘neighbourhood’) that are activated have been shown to affect the speed and accuracy of word recognition: words with a dense and/or high-frequency neighbourhood tend to be processed more slowly and less accurately, thus requiring more cognitive effort (Luce and Pisoni, 1998; but see Vitevitch and Rodríguez, 2005, for results that challenge this canonical view). During the competition process, active candidates that fail to match the acoustic input and/or the semantic context are inhibited, leaving the optimal word candidate given the acoustic input and semantic context to be recognised (Marslen-Wilson, 1993). In the final step, the semantic information related to the selected word is integrated into the ongoing sentence, which is known as the integration process (Marslen-Wilson and Tyler, 1980), and the word is recognised.
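To make the multiple activation and competition processes concrete, the following minimal sketch illustrates how word candidates that overlap with the unfolding phoneme input are activated in parallel and how mismatching candidates drop out as more input arrives. The lexicon, phoneme transcriptions, overlap score, and activation threshold are invented purely for illustration; this is not an implementation of TRACE, Shortlist, or PARSYN.

```python
# Toy sketch (not TRACE, Shortlist, or PARSYN): word candidates that overlap with the
# unfolding phoneme input are activated in parallel, and mismatching candidates drop
# out of competition as more input arrives. Lexicon, transcriptions, scoring, and
# threshold are all invented for illustration.

LEXICON = {
    "tall": "t O l",
    "ball": "b O l",
    "call": "k O l",
    "tell": "t E l",
    "toll": "t o l",
    "sun":  "s V n",
}

def overlap_score(transcription: str, heard: list[str]) -> float:
    """Fraction of the phonemes heard so far that match the candidate position by position."""
    cand = transcription.split()
    matches = sum(1 for i, p in enumerate(heard) if i < len(cand) and cand[i] == p)
    return matches / len(heard)

def active_candidates(heard: list[str], threshold: float = 0.5) -> dict[str, float]:
    """All lexical candidates whose overlap with the input so far reaches the threshold."""
    scores = {word: overlap_score(trans, heard) for word, trans in LEXICON.items()}
    return {word: score for word, score in scores.items() if score >= threshold}

# The input /t O l/ ("tall") unfolds phoneme by phoneme: after /t/ several onset-matching
# words are active; as more phonemes arrive, the candidate set narrows.
for n in range(1, 4):
    heard = "t O l".split()[:n]
    print(heard, "->", active_candidates(heard))
```

In this toy version, after the first phoneme several onset-matching words are active, and words overlapping elsewhere in the input (regardless of onset) can also enter the candidate set; as more of the input arrives, the set narrows towards the best-matching word.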

Non-native spoken-word recognition happens in much the same way as native spoken-word recognition. However, languages differ in their phoneme inventories (e.g., Dutch does not have the /æ/ as in English bad, while English does not have the /y/ as in Dutch vuur, English translation: fire). Non-native listeners have to learn the non-native sound categories, which might consequently be less well specified or even absent. This leads to a decrease in the phonological match between the speech signal and the non-native listener's sound categories (compared to that during native listening), which has been shown to lead to a decrease in phoneme perception accuracy (see for an overview, Bohn and Munro, 2007). There is ample evidence showing that the misperception of speech sounds leads to an increase in activated words, due to an increase in words that partially match the (mis)recognised speech sounds (Broersma, 2012, Cutler et al., 2006, Pallier et al., 2001), not only from the non-native language but also from the native language (Spivey and Marian, 1999, Weber and Cutler, 2004). In an eye-tracking study, Weber and Cutler (2004), for instance, showed that Dutch non-native listeners of English, upon hearing the English word panda, would not only look at a picture of a panda but also at a picture of a pen, while English listeners would only look at the picture of the panda. The Dutch listeners confused English /æ/ with Dutch /ɛ/, which led to the spurious activation of pen. These spurious competitors are difficult to suppress, resulting in more competition in non-native than in native listening (Broersma and Cutler, 2008, Broersma and Cutler, 2011) and decreasing word recognition accuracy (Scharenborg et al., 2018a).

Most research on the effect of background noise on non-native spoken-word recognition uses additive noise, thus mostly leaving aside distortions of the acoustic signal due to reverberation or transmission channels (but see Rogers et al., 2006; and, e.g., Nabelek, 1988, Nabelek and Donahue, 1984 for the effect of reverberation on phoneme perception). An often-used distinction to describe the type of masking produced by additive noise is that between energetic masking and informational masking (Shinn-Cunningham, 2008). In the case of energetic masking, both the target speech and the competing noise contain energy in the same critical frequency bands at the same time (Brungart, 2001). Because of this, listeners cannot effectively identify and use the acoustic cues needed to identify sounds. Put differently, energetic masking occurs due to the direct interaction of the background noise with the speech signal outside the listener (Pollack, 1975). Informational masking is ‘noise’ that interferes with speech perception inside the listener (Lidestam et al., 2014, Pollack, 1975). Informational masking is an umbrella term for all types of interference after the effect of energetic masking has been taken into account (e.g., Cooke et al., 2008, García Lecumberri et al., 2010, Mattys et al., 2009). For example, imagine another person speaking in the background. The speech of that talker will mask the speech of the target talker you are attending to. This is energetic masking. At the same time, if the background talker speaks in a language you understand, the linguistic message in the background talker's speech will also interfere with recognition of the target talker's speech. Note, however, that an informational masker does not necessarily also provide energetic masking; for instance, carrying out a second task takes away cognitive resources from the speech recognition task and interferes with the intelligibility of the speech signal (Mattys et al., 2009). This interference from the second task is also considered an informational masker. Since this review is concerned with speech processing in background noise, we will not focus on this type of informational masking.

Reverberation is not a background noise in the same vein as an energetic or informational masker; rather, in the case of reverberation, the masking energy comes from the target speech itself. Sounds are reflected from surfaces, which results in a smeared signal (Garcia Lecumberri et al., 2010). Specifically, offsets of sounds are obscured, phoneme durations are prolonged, and bursts are smoothed. In the context of this review, reverberation is considered a background noise, and consequently studies on reverberation are included.
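A minimal sketch of this smearing is given below. The sampling rate, reverberation time, and impulse-response shape are assumed values chosen for illustration; reverberation is modelled here, as is commonly done, as convolution of a clean signal with a synthetic room impulse response, which spreads the signal's own energy forward in time and obscures offsets.

```python
# Minimal sketch (assumed sampling rate, reverberation time, and impulse-response shape):
# reverberation modelled as convolution of a clean signal with a synthetic, exponentially
# decaying room impulse response. Unlike an additive masker, the interfering energy is the
# target signal itself, smeared forward in time so that offsets are obscured.
import numpy as np

fs = 16000                                   # sampling rate in Hz (assumed)
t60 = 0.6                                    # reverberation time in seconds (assumed)
rir_len = int(fs * t60)

rng = np.random.default_rng(0)
decay = np.exp(-6.9 * np.arange(rir_len) / rir_len)    # ~60 dB energy decay over t60
rir = rng.standard_normal(rir_len) * decay              # noise-like, decaying impulse response
rir /= np.max(np.abs(rir))

# A toy "speech-like" signal: a short tone burst followed by silence (an abrupt offset).
t = np.arange(int(0.2 * fs)) / fs
clean = np.concatenate([np.sin(2 * np.pi * 440 * t), np.zeros(int(0.3 * fs))])

reverberant = np.convolve(clean, rir)[: len(clean)]

# Energy after the burst offset: zero for the clean signal, non-zero for the reverberant
# one, showing how reverberation fills in (and thereby obscures) the offset.
offset = int(0.25 * fs)
print("post-offset energy, clean:      ", np.sum(clean[offset:] ** 2))
print("post-offset energy, reverberant:", np.sum(reverberant[offset:] ** 2))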

In addition to the type of background noise, another important factor to consider is the level of the background noise. The noise level is expressed as the signal-to-noise ratio (SNR), a measure of the relative amplitude of the speech signal compared to the background noise: a positive number means that the speech signal is stronger than the background noise, and a negative number the reverse. At an SNR of 0 dB, both sound sources are equally loud. The severity of the masking effect, and thus the reduction in intelligibility of the speech signal, depends on the number and size of “glimpses” still available to the listener (Cooke, 2006). “Glimpses” are defined as those time-frequency regions where the energy of the speech exceeds the energy of the background noise by at least 3 dB.
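The sketch below illustrates these two quantities: mixing speech and noise at a chosen SNR, and computing the proportion of glimpses in the sense of Cooke (2006), i.e., time-frequency cells where the speech energy exceeds the noise energy by at least 3 dB. The function names, array shapes, and the random stand-ins for waveforms and power spectrograms are assumptions made for illustration, not part of any of the reviewed studies.

```python
# Minimal sketch (assumed function names, shapes, and parameters): mixing speech and
# noise at a target SNR, and counting "glimpses" in the sense of Cooke (2006), i.e.,
# time-frequency cells where the speech energy exceeds the noise energy by >= 3 dB.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so that the speech-to-noise energy ratio equals snr_db, then add it to the speech."""
    noise = noise[: len(speech)]                      # assumes the noise is at least as long as the speech
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def glimpse_proportion(speech_spec: np.ndarray, noise_spec: np.ndarray,
                       threshold_db: float = 3.0) -> float:
    """Proportion of time-frequency cells where the speech power exceeds the noise power
    by at least threshold_db; both inputs are power spectrograms of identical shape."""
    local_snr_db = 10 * np.log10((speech_spec + 1e-12) / (noise_spec + 1e-12))
    return float(np.mean(local_snr_db >= threshold_db))

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)                   # stand-in for a speech waveform
noise = rng.standard_normal(16000)                    # stand-in for a noise waveform
mixture = mix_at_snr(speech, noise, snr_db=-5.0)      # negative SNR: noise is stronger than the speech

# Random stand-ins for power spectrograms (frequency x time); in practice these would
# come from short-time Fourier transforms of the separate speech and noise signals.
speech_spec = rng.random((64, 100))
noise_spec = rng.random((64, 100))
print(glimpse_proportion(speech_spec, noise_spec))    # larger values = more of the speech remains audible
```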

The presence of background noise can obscure acoustic cues of the (target) speech, or acoustic cues from the background speech might ‘attach’ themselves to the target speech (Cooke, 2009). Both situations lead the listener to perceive incorrect acoustic cues, and subsequently the listener is likely to hear a different sound than was intended by the talker (Cooke, 2009, García Lecumberri et al., 2010). The presence of background noise thus decreases the phonological match between the speech signal and the listener's sound categories, which results in a decrease in phoneme identification accuracy in noise for both native and non-native listeners, but more so for the latter group (see for an overview, Garcia Lecumberri et al., 2010). The effect of this deteriorated phoneme perception on the spoken-word recognition process is, however, less clear. This is the question we aim to answer and the topic of this review.


The native advantage

Studies generally report little or no differences between word recognition scores for native listeners and high-proficiency non-native listeners in quiet (see the papers in Appendix A; non-native listeners with a lower proficiency do perform worse than native listeners; e.g., Cooke et al., 2008). The difference in word recognition performance between native and (high-proficiency) non-native listeners occurs primarily in the presence of background noise (see the papers in Appendix A). A majority

The effect of different types of noise on spoken-word recognition

The effect of background noise on spoken-word recognition is dependent on the type of noise that is present. Section 3.1 discusses research focussing on the effect of energetic and informational maskers on spoken-word recognition in background noise, while Section 3.2 discusses research focussing on the effect of reverberation.

The role of context in spoken-word recognition in background noise

Two types of contextual information have been found to play an important role in explaining the native advantage in spoken-word recognition in background noise. Section 4.1 discusses the role of semantic context, while Section 4.2 discusses the role of prosodic information.

The role of proficiency and cognitive abilities on spoken-word recognition in noise

A common observation is that some people have more difficulty listening in the presence of background noise than others. However, despite these individual differences, psycholinguistic theories of spoken-word recognition are often based on behavioural results that are averaged over multiple subjects, thus removing between-subject variation. There are, however, a few studies that investigate the role of individual differences in non-native speech recognition in background noise.

The effect of background noise on the cognitive processes underlying native and non-native listening

Most papers reviewed above aim to determine the size of the performance gap between native and non-native spoken-word recognition in background noise, and thus show that word-recognition in background noise is harder in a non-native language compared to one's native language. Here, we summarise and synthesise the reviewed research and bring together those and new findings from our lab into an updated theoretical account of the effect of background noise on native and non-native spoken-word

Concluding remarks

The picture that seems to arise from the current literature on non-native spoken-word recognition in background noise is that 1) indeed, spoken-word recognition in the presence of background noise is harder in a non-native language than in one's native language; 2) that, as suggested by both experimental and modelling studies, the difference is not due to a differential effect of background noise on native and non-native listening, but rather that the difference between native and non-native

Acknowledgements

This research was supported by a Vidi-grant from the Netherlands Organization for Scientific Research (NWO; grant number 276-89-003) awarded to Odette Scharenborg. Part of this work was carried out by the second author under the supervision of the first author while both authors were at the Centre for Language Studies at Radboud University.

References (94)

  • J.L. McClelland et al., The TRACE model of speech perception, Cogn. Psychol. (1986)
  • D. Norris, Shortlist: a connectionist model of continuous speech recognition, Cognition (1994)
  • T. Shimizu et al., Effect of background noise on perception of English speech for Japanese listeners, Auris Nasus Larynx (2002)
  • B.G. Shinn-Cunningham, Object-based auditory and visual attention, Trends Cogn. Sci. (2008)
  • K.J. Van Engen, Similarity and familiarity: second language sentence recognition in first- and second-language multi-talker babble, Speech Commun. (2010)
  • A. Weber et al., Lexical competition in non-native spoken-word recognition, J. Memory Lang. (2004)
  • P. Zwitserlood, The locus of the effects of sentential-semantic context in spoken-word processing, Cognition (1989)
  • E. Akker et al., Prosodic cues to semantic structure in native and non-native listening, Bilingualism (2003)
  • J. Aydelott et al., Effects of acoustic distortion and semantic context on lexical access, Lang. Cogn. Process. (2004)
  • J. Aydelott et al., Sentence comprehension in competing speech: dichotic sentence word priming reveals hemispheric differences in auditory semantic processing, Lang. Cogn. Process. (2012)
  • B. Banks et al., Cognitive predictors of perceptual adaptation to accented speech, J. Acoust. Soc. Am. (2015)
  • B.M. Ben-David et al., Effects of aging and noise on real-time spoken word recognition: evidence from eye movements, J. Speech Lang. Hear. Res. (2011)
  • E. Bialystok, Bilingualism in Development: Language, Literacy and Cognition (2001)
  • E. Bialystok et al., Bilingualism, aging, and cognitive control: evidence from the Simon task, Psychol. Aging (2004)
  • G. Borghini et al., Listening effort during sentence processing is increased for non-native listeners: a pupillometry study, Front. Neurosci. (2018)
  • A.R. Bradlow et al., Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners, J. Acoust. Soc. Am. (2007)
  • A.R. Bradlow et al., The clear speech effect for non-native listeners, J. Acoust. Soc. Am. (2002)
  • A.R. Bradlow et al., Recognition of spoken words by native and non-native listeners: talker-, listener- and item-related factors, J. Acoust. Soc. Am. (1999)
  • M. Broersma, Increased lexical activation and reduced competition in second-language listening, Lang. Cogn. Process. (2012)
  • M. Broersma et al., Competition dynamics of second-language listening, Q. J. Exp. Psychol. (2011)
  • S. Brouwer et al., The temporal dynamics of spoken word recognition in adverse listening conditions, J. Psycholinguist. Res. (2016)
  • S. Brouwer et al., Linguistic contributions to speech-on-speech masking for native and non-native listeners: language familiarity and semantic content, J. Acoust. Soc. Am. (2012)
  • S. Brouwer et al., Speech reductions change the dynamics of competition during spoken word recognition, Lang. Cogn. Process. (2012)
  • D.S. Brungart, Informational and energetic masking effects in the perception of two simultaneous talkers, J. Acoust. Soc. Am. (2001)
  • M. Cooke, A glimpsing model of speech perception in noise, J. Acoust. Soc. Am. (2006)
  • M. Cooke, Discovering consistent word confusions in noise
  • M. Cooke et al., An audio-visual corpus for speech perception and automatic speech recognition, J. Acoust. Soc. Am. (2006)
  • M. Cooke et al., The foreign language cocktail party problem: energetic and informational masking effects in non-native speech perception, J. Acoust. Soc. Am. (2008)
  • J. Coumans et al., Non-native word recognition in noise: the role of word-initial and word-final information
  • A. Cutler et al., Prosody in the comprehension of spoken language: a literature review, Lang. Speech (1997)
  • A. Farris-Trimble et al., The process of spoken word recognition in the face of signal degradation, J. Exp. Psychol. (2014)
  • I. FitzPatrick et al., Lexical competition in nonnative speech comprehension, J. Cogn. Neurosci. (2010)
  • M.G. García Lecumberri et al., Effect of masker type on native and non-native consonant perception in noise, J. Acoust. Soc. Am. (2006)
  • M.L.G. García Lecumberri et al., Non-native speech perception in adverse conditions: a review, Speech Commun. (2010)
  • N. Golestani et al., Native-language benefit for understanding speech-in-noise: the contribution of semantics, Bilingualism (2009)
  • D.W. Gow et al., Lexical and prelexical influences on word segmentation: evidence from priming, J. Exp. Psychol. (1995)