Modeling drivers’ speech under stress
Introduction
Much of the current effort in the study of speech under stress has been aimed at detecting stress conditions in order to improve the robustness of speech recognizers; typical research on speech under stress has targeted perceptual (e.g. the Lombard effect), psychological (e.g. timed tasks), and physical stressors (e.g. roller-coaster rides, high G forces) (Steeneken and Hansen, 1999). In this work we are interested in modeling speech in the context of driving under varying conditions of cognitive load, hypothesized to induce a level of stress in the driver. The results of this research may be relevant not only to building recognition systems that are more robust in this context, but also to applications that attempt to infer the underlying affective state of an utterance. We have chosen the scenario of driving while talking on the phone as an application in which knowledge of the driver’s state may provide benefits ranging from a more fluid interaction with a speech interface to safer vehicle responses.
The recent literature discussing the effects of stress on speech applies the label of stress to different phenomena. Some work views stress as any broad deviation from normal speech production (Hansen and Womack, 1996; Sarikaya and Gowdy, 1998). In discussing the SUSAS database for the study of speech under stress, Hansen et al. (1998) describe various types of stress in speech. These include the effects that speaking styles, noise, and G forces have on the speaker’s output, as well as the effects of states that are often described under the label of emotions elsewhere in the literature (e.g., anxiety, fear, or anger).
In an attempt to unify the often diverging views of stress invoked by research in this field, Murray et al. (1996) reviewed various definitions of stress and proposed a description of the phenomenon based on the character of the stressors. They hypothesized four levels of stressors that can affect the speech production process. At the lowest levels, these include direct changes to the vocal apparatus (zero-order stressors), unconscious physiological changes (first-order stressors), and conscious physiological changes (second-order stressors). At the highest level, changes can also be brought about by stimuli external to the speech production process, by the speaker’s cognitive reinterpretation of the context in which the speech is being produced, and by the speaker’s underlying affective conditions (third-order stressors). In this paper, we follow this taxonomy and investigate whether it is possible to discriminate between spoken utterances produced under third-order stressors of varying degrees of stress.
Speech corpus and data annotation
The speech data used in the research presented in this paper was collected from subjects driving in a simulator at Nissan’s Cambridge Research Lab. Subjects were asked to drive through a course while engaged in a simulated phone task: while the subject drove, a speech synthesizer prompted the driver with a math question that required adding two numbers whose sum was less than 100. We controlled for the number of additions with and without carries in order to maintain an approximately
Feature extraction
Non-linear features of the speech waveform have received much attention in studies of speech under stress; in particular, the Teager energy operator (TEO) has been proposed as robust to noisy environments and useful for stress classification (Zhou et al., 1998, 1999; Jabloun and Cetin, 1999). Another useful approach to the analysis of speech and stress has been subband decomposition, or multiresolution analysis via wavelet transforms (Sarikaya and Gowdy, 1997, 1998
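The discrete-time Teager energy operator referred to above is commonly defined as Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch below (a minimal illustration, not the feature pipeline used in this study) shows why the operator is attractive for tracking modulations: for a pure sinusoid it returns a constant that depends on both amplitude and frequency.

```python
import math

def teager_energy(x):
    """Discrete-time Teager energy operator:
    psi[n] = x[n]^2 - x[n-1] * x[n+1], defined for 1 <= n <= len(x) - 2."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For x[n] = A * cos(w * n) the operator yields the constant A^2 * sin(w)^2,
# so its output responds to changes in both amplitude and frequency.
A, w = 2.0, 0.3
x = [A * math.cos(w * n) for n in range(100)]
psi = teager_energy(x)
```

In stress-feature work the operator is typically applied per subband before computing summary statistics, since the constant-output property illustrated here holds only for single-component signals.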
Graphical models
In this section we model the dynamic evolution of the utterance features in order to discriminate between the different categories of driver stress, and consider a family of graphical models for time-series classification. One of the most extensively studied models in the time-series classification literature is the hidden Markov model (HMM). An HMM is often represented as a state transition diagram. Such a representation is suitable for expressing first-order transition probabilities; it does
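As a concrete illustration of HMM-based classification (a toy sketch with made-up parameters, not the models trained in this work), one forward pass per category yields a log-likelihood, and an utterance is assigned to the category whose model scores highest:

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.
    pi: initial state probabilities, A[i][j]: transition probabilities,
    B[i][k]: probability that state i emits symbol k."""
    states = range(len(pi))
    alpha = [pi[s] * B[s][obs[0]] for s in states]
    loglik = 0.0
    for o in obs[1:]:
        z = sum(alpha)
        loglik += math.log(z)
        alpha = [a / z for a in alpha]      # rescale to avoid underflow
        alpha = [sum(alpha[sp] * A[sp][s] for sp in states) * B[s][o]
                 for s in states]
    return loglik + math.log(sum(alpha))

# Two toy 2-state models: "sticky" dynamics vs. memoryless dynamics.
pi = [0.5, 0.5]
A_sticky  = [[0.9, 0.1], [0.1, 0.9]]
A_uniform = [[0.5, 0.5], [0.5, 0.5]]
B = [[0.9, 0.1], [0.1, 0.9]]                # state 0 favors symbol 0, etc.
obs = [0, 0, 0, 1, 1, 1]
# A sequence with long runs is better explained by the sticky model.
best = max(["sticky", "uniform"],
           key=lambda m: forward_loglik(
               obs, pi, A_sticky if m == "sticky" else A_uniform, B))
```

A first-order HMM of this form captures exactly the transition structure a state diagram expresses; richer graphical models such as the factorial HMMs and hidden Markov decision trees cited in the references extend this basic computation.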
Modeling features at the utterance-level
Modeling of linguistic phenomena requires that we choose an adequate time scale to capture relevant details. For speech recognition, a suitable time scale might be one that allows representing phonemes. For the supralinguistic phenomena we are interested in modeling, however, we wish to investigate whether a coarser time scale suffices. The database used in this study consists of short and simple utterances (with presumably simpler structures than those found in unconstrained speech), and
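A coarse, utterance-level representation can be obtained by collapsing a frame-level feature track into a few global statistics. The particular statistics below are illustrative assumptions, not the paper's exact feature set:

```python
def utterance_stats(track):
    """Collapse a per-frame feature track (e.g. a subband energy contour)
    into a fixed-length utterance-level summary, discarding fine dynamics."""
    n = len(track)
    mean = sum(track) / n
    var = sum((v - mean) ** 2 for v in track) / n
    return {"mean": mean, "std": var ** 0.5,
            "min": min(track), "max": max(track),
            "range": max(track) - min(track)}

summary = utterance_stats([1.0, 2.0, 3.0, 4.0])
```

Such a summary trades temporal resolution for a much smaller, fixed-dimensional input, which suits short, structurally simple utterances like those in this database.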
Results and discussion
The speech data of the four subjects was first divided into training and testing sets comprising approximately 80% and 20% of the data set, respectively. The following labels will be used to denote the four categories of data: FF, SF, FS, SS. The first letter denotes whether the data came from a fast (F) or slow (S) driving-speed condition; the second indicates the frequency with which the driver was presented with an arithmetic task: every 4 s (fast, F) or every 9 s (slow, S). We applied the
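The labeling scheme and split described above can be made explicit with two small hypothetical helpers (the function names and the deterministic split are illustrative, not the paper's exact procedure):

```python
def condition_label(speed, task_period_s):
    """First letter: driving speed (F = fast, S = slow).
    Second letter: task frequency (F = a question every 4 s, S = every 9 s)."""
    first = "F" if speed == "fast" else "S"
    second = "F" if task_period_s == 4 else "S"
    return first + second

def split_80_20(items):
    """Hold out roughly the last 20% of items for testing."""
    cut = int(round(0.8 * len(items)))
    return items[:cut], items[cut:]

train, test = split_80_20(list(range(10)))
```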
Conclusions
In this paper we have investigated the use of features based on subband decompositions and the TEO for the classification of stress categories in speech produced by four subjects driving at variable speeds while engaged in mental tasks of variable cognitive load. We investigated the performance of several classifiers on two representations of the speech waveforms: a feature set representing intra-utterance dynamics and a sparser set consisting of more global
Acknowledgements
The authors would like to thank Nissan’s CBR Lab and Elias Vyzas for their help with data collection and Thomas Minka for valuable technical discussions and for suggesting the significance test.
References (24)
- Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
- Cowie, R., 2000. Describing the emotional states expressed in speech. In: Proceedings of the ISCA ITRW on Speech and...
- Daubechies, I., 1992. Ten Lectures on Wavelets. Regional Conference Series in Applied Mathematics. SIAM, Philadelphia,...
- Ghahramani, Z., Jordan, M.I., 1997. Factorial hidden Markov models. Machine Learning.
- Gunn, S., 1998. Support vector machines for classification and regression. Technical Report, Image, Speech and...
- Hansen, J.H.L., Womack, B.D., 1996. Feature analysis and neural network-based classification of speech under stress. IEEE Transactions on Speech and Audio Processing.
- Hansen, J., Bou-Ghazale, S.E., Sarikaya, R., Pellom, B., 1998. Getting started with the SUSAS: Speech Under Simulated...
- Jabloun, F., Cetin, A.E., 1999. The Teager energy based feature parameters for robust speech recognition in car noise.
- Jensen, F.V., 1996. An Introduction to Bayesian Networks.
- Jordan, M.I., Ghahramani, Z., Saul, L.K., 1997. Hidden Markov decision trees. In: Advances in Neural Information Processing Systems.
- McGilloway, S., et al., 2000. Approaching automatic recognition of emotion from voice: a rough benchmark.