
1 Introduction

Users of modern-day electronics are witnessing an increasing ubiquity of connected devices with a wide variety of interface designs. While some of these devices require voice-based interfaces due to a lack of mechanical input controls or any visual display, others offer them as add-on features for specific operational requirements or simple convenience. The decreasing cost of both electronic components and processing power, paired with increased usability, has brought auditory interfaces into a wide variety of commercially available products, in turn leading more users to expect these interfaces. Voice-based interfaces allow users to interact in a hands-free manner, allowing and perhaps encouraging the user to engage in multiple activities. Presenting auxiliary information to the user over an auditory channel may let the user continue to perform additional tasks that require visual and motor attention without conscious awareness of the added cognitive demands. Although this compartmentalization of tasks is easily made in non-critical situations, under distraction, overload, or fatigue the operator's or others' safety may be jeopardized [1, 2]. Therefore, the burdens placed on cognition by interacting with synthetic speech-based interfaces must be assessed in order to minimize these costs.

A vital characteristic of human speech perception is its tolerance to the natural variability of voices produced by different speakers [3]. This tolerance allows individuals to flexibly engage in speech perception under varying external conditions that may degrade auditory quality or introduce distractions, negatively impacting speech perception. Despite the relatively resilient nature of speech perception, synthetic speech requires listeners to adapt to unnatural acoustic, phonetic, and prosodic properties [4, 5]. Behavioral and neuroimaging studies have suggested that, despite successful comprehension of sentence content, this adaptation can introduce a measurable performance penalty and a corresponding increase in cognitive load [6]. One explanation of this phenomenon is that the mechanisms underlying the perception of synthetic speech may differ from those underlying naturally produced speech [7, 8].

Understanding the neural mechanisms that contribute to the acquisition, development, and use of cognitive skills is an important goal for cognitive neuroscience research and for applications of neuroscience to work and everyday activities. Neuroergonomics, defined by the late Prof. Raja Parasuraman as the study of “the brain at work” [9,10,11,12], is an emerging interdisciplinary research field at the intersection of cognitive neuroscience, systems engineering, human factors, and psychology. By utilizing portable and wearable brain imaging sensors, unique information such as mental workload and state can be captured. This mental state information is independent of performance measures and can, in turn, be used to guide product or complex machine interface design, or adaptation during field use.

Functional Near-Infrared Spectroscopy (fNIRS [13]) is a non-invasive, safe, silent, and portable neuroimaging technology well suited for the study of speech perception and language. fNIRS measures cortical correlates of neural activity via relative changes in cortical oxygenated (HbO) and deoxygenated (HbR) hemoglobin, taking advantage of the transmissive and diffusive properties of tissue at near-infrared wavelengths [14]. The technique allows research to be conducted practically in real-world settings and has become increasingly popular in auditory research [15, 16] and applied research [17], linking operational characteristics with the underlying cognitive functions.
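For reference, the modified Beer-Lambert law used to recover these hemoglobin changes (see Sect. 2.4) relates, in its standard form, the change in optical density at wavelength \( \lambda \) to the concentration changes as \( \Delta OD^{\lambda} = \left( {\varepsilon_{HbO}^{\lambda} \Delta \left[ {\text{HbO}} \right] + \varepsilon_{HbR}^{\lambda} \Delta \left[ {\text{HbR}} \right]} \right) \cdot d \cdot DPF^{\lambda} \), where \( \varepsilon \) denotes the wavelength-specific extinction coefficients, \( d \) the source-detector separation, and \( DPF \) the differential pathlength factor; measuring at two wavelengths allows both concentration changes to be solved for.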

In this neuroergonomics study, the influence of synthetic speech quality during a sentence comprehension and quality assessment task was assessed using self-reported, behavioral, and fNIRS measures. Participants listened to topical sentences from real-world audio interfaces employed in car driving scenarios, then answered questions regarding the content of the messages and rated the quality of the audio to assess perceived Intelligibility and Naturalness. Three levels of synthetic speech quality (low, medium, and high) were assessed in addition to naturally recorded speech to identify cognitive considerations in synthetic speech systems.

2 Methods

2.1 Participants

Eight right-handed participants (7 male, 1 female) between the ages of 18 and 35 volunteered for this study. Participants reported no hearing impairment and no neurological or psychiatric history. All participants were medication-free, with normal or corrected-to-normal vision. Participants gave written informed consent for the study, which was approved by the Institutional Review Board at Drexel University, and were paid for their participation.

2.2 Experiment Protocol

Synthetic speech recordings used in the study were originally developed at Intel Labs, while the experimental protocol itself was implemented in a custom protocol presentation package. After a 5 s baseline period, subjects were asked to listen to short 5–10 s sentences with topics adapted from real-world audio interfaces employed in car driving scenarios. Following audio presentation, subjects were asked a question regarding the content of the message as a measure of comprehension and then asked to evaluate the quality of the audio.

Participants listened to 5 different sentence categories under 4 levels of audio quality (natural + 3 levels of synthetic voice). Synthetic speech quality reflected the system resources each synthesizer required for operation: synthesizer S1 required 250 MB of system memory, S2 required 50 MB, and S3 only 1 MB. Comprehension questions were varied such that no two questions were repeated. Voice, Category, and Comprehension question order were pseudo-randomized to account for order effects. Quality was evaluated on a 1 to 5 scale for the metrics of Intelligibility (1: Understood none – 5: Understood all) and Naturalness (1: Nothing like a human – 5: Exactly like a human). Each evaluation period lasted approximately 30 s, as presented in Fig. 1. The entire listening task lasted about 20 min. Audio was presented to the subject using a professional headphone amplifier (Head Acoustics HPS IV), transducer, and high-fidelity headphones (Sennheiser HD 600).
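As an illustrative sketch of the pseudo-randomization described above, the following Python snippet generates one possible counterbalanced trial order (the no-immediate-repeat constraint on Voice is an assumption for illustration, not the exact scheme used in the study):

```python
import random

# Sentence categories and voice-quality levels from the study design.
CATEGORIES = ["Calendar", "Email", "Navigation", "SMS", "Weather"]
VOICES = ["Natural", "S1", "S2", "S3"]

def pseudo_randomize(seed=None):
    """Shuffle the full Category x Voice design (20 trials), re-drawing until
    no two consecutive trials share the same Voice."""
    rng = random.Random(seed)
    trials = [(c, v) for c in CATEGORIES for v in VOICES]
    while True:
        rng.shuffle(trials)
        if all(trials[i][1] != trials[i + 1][1] for i in range(len(trials) - 1)):
            return trials

if __name__ == "__main__":
    for i, (category, voice) in enumerate(pseudo_randomize(seed=1), start=1):
        print(f"Trial {i:2d}: {category:<10s} {voice}")
```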

Fig. 1. Trial timeline block diagram

2.3 fNIRS Acquisition

Prior to starting the task, subjects were fitted with a continuous wave fNIRS system (fNIR1100; fNIR Devices LLC; www.fnirdevices.com). The fNIRS system includes a flexible sensor pad with 4 dual-wavelength (730 nm, 850 nm) LED light sources and 10 light detectors arranged spatially with a 2.5 cm separation and time-multiplexed to allow 16 measurement locations [14]. The sensor was placed directly above the eyebrows over the forehead to allow measurement of the cortical areas directly underlying it, as seen in Fig. 2. Due to experimental setup conditions, the sensor was centered over the midline and then offset horizontally by 1.58 cm, corresponding to an offset of one optode.

Fig. 2. Functional Near-Infrared Spectroscopy sensor (headband) and optode locations visualized on an anterior-view brain surface image [18].

COBI Studio [19] software was used for data acquisition and visualization. fNIRS data was collected continuously at a 2 Hz sampling rate. A serial cable between the presentation computer and the fNIRS acquisition computer relayed event markers that synchronized the neuroimaging data with audio stimulation onsets and other events.
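A minimal sketch of how such event markers might be relayed over the serial link is shown below (using the pyserial library; the port name, baud rate, and marker codes are hypothetical placeholders, not those of the actual setup):

```python
import serial  # pyserial

# Hypothetical single-byte marker codes for trial events (not the study's codes).
MARKERS = {"baseline": 1, "audio_onset": 2, "question": 3, "rating": 4}

def send_marker(port: serial.Serial, event: str) -> None:
    """Write one event-marker byte to the fNIRS acquisition computer."""
    port.write(bytes([MARKERS[event]]))
    port.flush()

if __name__ == "__main__":
    # "COM3" and 9600 baud are placeholders for the actual serial link settings.
    with serial.Serial("COM3", baudrate=9600, timeout=1) as link:
        send_marker(link, "audio_onset")
```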

2.4 Data Analysis

fNIRS data was processed using Matlab (R2016b) while R (3.3.2) was used for statistical testing. Channels were assessed for basic data quality, and channels contaminated by excessive light saturation, insufficient signal, or subject motion were rejected prior to fNIRS evaluation. Each participant's raw fNIRS data was low-pass filtered with a finite impulse response, linear-phase filter of order 20 and cut-off frequency of 0.25 Hz to attenuate noise from high-frequency sources [18]. Motion artifacts were rejected automatically using a sliding motion artifact rejection technique [20]. Relative changes in blood oxygenation were calculated using the modified Beer-Lambert law (mBLL) [21] from changes in optical density measured relative to the pre-trial baseline period. The average change in oxygenation [Oxy], calculated as the difference between [HbO] and [HbR] from 4 to 7 s after the initiation of the audio presentation period, was used as the dependent measure, according to the estimated delay in the hemodynamic response function [22].
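The per-channel processing chain can be sketched roughly as follows (in Python/NumPy/SciPy rather than the Matlab pipeline actually used; the mBLL step is simplified, and the extinction-coefficient matrix and differential pathlength factors are placeholders rather than study values):

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 2.0  # fNIRS sampling rate (Hz)

def lowpass(raw):
    """Order-20, linear-phase FIR low-pass filter with a 0.25 Hz cutoff."""
    taps = firwin(21, 0.25, fs=FS)  # 21 taps = order 20
    return lfilter(taps, [1.0], raw, axis=0)

def mbll(intensity, baseline, ext_inv, dpf, distance_cm=2.5):
    """Simplified modified Beer-Lambert law: per-wavelength optical-density
    changes relative to the pre-trial baseline are converted to [HbO]/[HbR]
    changes. `ext_inv` (inverse 2x2 extinction-coefficient matrix) and `dpf`
    (differential pathlength factors) are placeholders, not study values."""
    delta_od = -np.log10(intensity / baseline)   # shape: (samples, 2 wavelengths)
    delta_od = delta_od / (np.asarray(dpf) * distance_cm)
    return delta_od @ ext_inv.T                  # columns: delta[HbO], delta[HbR]

def trial_oxy(hb, audio_onset_s):
    """Dependent measure: mean [Oxy] = [HbO] - [HbR] over 4-7 s after audio onset."""
    start, stop = int((audio_onset_s + 4) * FS), int((audio_onset_s + 7) * FS)
    window = hb[start:stop]
    return float(np.mean(window[:, 0] - window[:, 1]))
```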

Main effects for the dependent measures, including self-reported, behavioral, and biomarker measurements, were analyzed using repeated measures ANOVA. Subject and Category were used as fixed effects to control for topic, audio length, and subject variability. Tests of linear hypotheses were corrected for multiplicity using the False Discovery Rate. A criterion of \( \alpha = 0.05 \) was designated as the threshold of statistical significance.
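A rough Python equivalent of this statistical procedure (the study itself used R) is sketched below; the data-frame column names are assumptions for illustration, and Category is assumed to have been averaged out beforehand:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

def analyze(df: pd.DataFrame):
    """Repeated measures ANOVA with Voice as the within-subject factor,
    followed by FDR correction of follow-up p-values at alpha = 0.05.
    `df` is expected to hold one row per Subject x Voice cell with columns
    ["Subject", "Voice", "Oxy"] (illustrative names)."""
    anova = AnovaRM(df, depvar="Oxy", subject="Subject", within=["Voice"]).fit()
    print(anova)

    # Placeholder post-hoc p-values corrected with the Benjamini-Hochberg FDR.
    pvals = [0.001, 0.020, 0.200]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return reject, p_adj
```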

3 Results

3.1 Self-reported Measures

The subjective measures of Intelligibility and Naturalness were assessed using a repeated measures one-way ANOVA that adjusted for the content Category (Calendar, Email, Navigation, SMS, Weather) of the audio presented. There was a significant within-subject main effect of Voice on Naturalness (F3,117 = 156.5, p < 0.001). Subjects readily ordered Voices by quality, with Natural speech rated highest and Synthetic levels 1 to 3 (S1, S2, S3) showing decreasing levels of naturalness. Post-hoc tests determined that Natural speech was significantly different from all levels of Synthetic speech and that all levels of Synthetic speech differed significantly from each other (q(117,0.05/3) = −2.84 to −12.84). Average levels of Naturalness for each Voice are presented in Fig. 3a.

Fig. 3. Average self-reported ratings of Naturalness (left) and Intelligibility (right) for each Voice. Error bars are standard error of the mean (SEM).

Despite most audio presentations in this study being highly intelligible (> 3), Voice showed a significant main effect on Intelligibility (F3,117 = 11.34, p < 0.001). Post-hoc tests showed that Natural speech was the most intelligible Voice, with Synthetic speech decreasing in Intelligibility according to level. Natural speech was significantly different from both S2 and S3 (q(117,0.05/3) = −4.69..−5.33) and showed a trend-level difference with S1 (q(117,0.05/3) = −1.71). S1 was also significantly different from both S2 and S3 (q(117,0.05/3) = −2.99..−3.63). Levels S2 and S3 appeared to continue the decreasing trend in Intelligibility, but were not statistically different after correction for multiple hypotheses. Average rated Intelligibility for each Voice is presented in Fig. 3b.

3.2 Behavioral Results

Workload during sentence comprehension can often be assessed behaviorally, since processing and responding to more difficult material typically requires more time. The response time (RT) during the Comprehension Question phase was therefore used as a metric of the cognitive demand required for each Category and Voice combination. A significant within-subjects effect was observed for Voice on Question RT (F3,102 = 3.504, p = 0.018) when adjusted for Category. Post-hoc tests showed that the Natural and S1 Voices were significantly different from S3 (q(102,0.05/3) = 3.10..3.16) and showed trend-level differences with S2 (q(102,0.05/3) = 1.84) after adjusting for Subject differences and Category. Average RT during the sentence comprehension period for each Voice is presented in Fig. 4.

Fig. 4. Average response time for each speech quality level. Error bars are ± standard error of the mean (SEM).

3.3 fNIRS Measures

One-way repeated measures ANOVAs were performed separately for each Optode to assess the impact of Voice, Intelligibility, and Naturalness on the fNIRS biomarker, oxygenation change [Oxy]. A main effect for Voice was observed in Optodes 3 [F(3,61) = 2.983, p = 0.038] and 14 [F(3,40) = 2.841, p = 0.049]. Post-hoc Tukey tests showed that in Optode 14, the response to the Natural Voice was significantly different from S1 (q(61,0.05/3) = 2.72) and S3 (q(61,0.05/3) = 2.73), with a trend-level difference for S2 (q(61,0.05/3) = 2.235). Responses among the Synthetic voices, however, were undifferentiated.

When examining Naturalness, Optode 14 also showed a significant main effect [F(4,39) = 4.03, p = 0.008]. Post-hoc tests showed that highly Natural ratings (5) were associated with decreased activity relative to low Natural ratings (2, 3) (q(39,0.05/4) = −4.11 to −2.64), with a trend-level difference for the lowest rating (1) (q(39,0.05/4) = −2.325). Average changes in [Oxy] for Optode 14 for different Voice groups and across varying levels of self-reported Naturalness are presented in Fig. 5.

Fig. 5. Average oxygenation changes in Optode 14 across different Voices (left) and different rated levels of Naturalness (right). Error bars are standard error of the mean (SEM).

Optode 11 revealed a significant main effect for Intelligibility [F(3,74) = 3.85, p = 0.013]. Tukey post-hoc tests indicated that highly Intelligible (4–5) presentations were significantly different from presentations rated lower in Intelligibility (2–3) (q(76,0.05) = −3.68). Average changes in [Oxy] measured in Optode 11 across varied values of self-reported Intelligibility are presented in Fig. 6.

Fig. 6. Average oxygenation changes in Optode 11 across different rated levels of Intelligibility. Error bars are standard error of the mean (SEM).

4 Discussion

Extensive behavioral research has been conducted to assess the cognitive factors associated with the differential perception of synthetic and natural speech [4]. Researchers have largely demonstrated that natural and synthetic speech differ not only in the somewhat subjective quality of the voice itself, but also in its accurate perception and comprehension. It is unsurprising that natural speech is typically viewed as superior to synthetic speech, as natural speech features a number of characteristics that even high-quality synthetic speech cannot replicate. In human speech, prosody is frequently varied, adapting pace, emphasis, and even emotion in a manner that is listener-, content-, and context-dependent, whereas synthetic speech has a more “mechanical” quality, typical of rule-based generation, which is largely inflexible to changing listening conditions. These unnatural characteristics may be further exacerbated in low-quality synthetic speech systems.

This study sought to identify the behavioral and cortical responses to varying levels of synthetic audio quality during the comprehension of contextual audio frequently found in hands-free systems. Behavioral responses, in terms of increased response time, suggested that comprehension of lower-quality synthetic speech was more difficult than that of highly natural synthetic speech or natural speech. Our fNIRS results indicate that these trends can be anticipated by cortical evoked activity during active listening. Optodes 11 and 14, both located near the middle frontal gyrus, presented significantly different evoked responses depending on the quality of the presented audio. Increased cortical activity at Optode 14 associated with comprehension of synthetic speech suggested that natural speech was less cognitively demanding than its synthetic counterparts. A second comparison with rated Naturalness showed a clearer trend towards reduced demand with increased Naturalness. Audio with lower ratings of Intelligibility appeared to recruit additional cognitive resources, as evidenced by increased activity in Optode 11. These results describe a relationship among the qualitative rating of Naturalness, the quality levels of the Voices, and their associated cognitive demand. Specifically, the use of natural and highly natural synthetic speech appeared to reduce the cognitive workload required to correctly comprehend the sentence, as measured in both behavioral and biomarker representations.

The task used in this study was an active listening task which required listeners to both comprehend the audio content and verify comprehension by answering simple questions about the content. While cortical roles associated with language perception are heavily left-lateralized, results from PET studies have suggested that reduced synthetic speech quality recruits additional domains in the prefrontal cortex, including the right MFG [23], as also reported in this study. Previous behavioral studies have suggested that comprehension tests impose more cognitive load under synthetic speech than under natural speech [4, 24, 25]. Working memory (WM), which is necessarily recruited during comprehension tasks, is increasingly taxed under synthetic speech, where misleading acoustic-phonetic structures and absent natural cues increase phrase ambiguity [26]. As WM is a shared and limited resource, increasing WM demands have also been reported to further compound the penalties associated with low-quality synthetic speech [24]. These findings describe a situation where the presence of low-quality synthetic speech imposes a baseline cognitive burden and necessitates the minimization of cognitive workload in the design of auditory interfaces.

Since the original development of text-to-speech (TTS) systems, speech synthesis has advanced substantially, as have the contexts in which it is used. In the past, TTS systems were used primarily for accessibility devices, but the recent introduction of conversational assistants such as Apple's Siri, Microsoft's Cortana, Google Assistant, and Amazon's Alexa has again changed the nature of speech systems. Although auditory speech systems try to emulate natural language, at present these interfaces limit the ways in which the user can interact, meaning that users are already constrained to use these systems in ways completely unlike natural conversation. These new roles for synthetic speech systems require new cognitive design considerations, as proper assessment of human-computer interaction requires an understanding of both the functional purpose and the context in which the device is being used. However, as every scenario cannot be anticipated, design choices must be made regarding synthetic speech systems which optimize not just intelligibility [27] but overall user experience in the face of limited adaptability.

The results presented in this study suggest that the design of synthetic speech affects not only behavioral performance, but can also impact the listener's cognitive load directly during message comprehension. This study expands the research on the comprehension of synthetic speech and illustrates a role for neuroimaging techniques in assessing the baseline cognitive requirements of such systems. The flexibility of fNIRS as an emerging portable and wearable neuroimaging technique enables comprehensive exploration of the cognitive demands of modern computer interfaces. Consistent with the Neuroergonomic approach, such fNIRS sensors could be used to guide the design of speech synthesizers and audio interfaces not only in artificial laboratory contexts, but also to monitor the way in which users actually interact with these systems in real-world settings.