Modeling drivers’ speech under stress

https://doi.org/10.1016/S0167-6393(02)00080-8

Abstract

We explore the use of features derived from multiresolution analysis of speech and the Teager energy operator for classifying drivers’ speech under stressed conditions. We apply this feature set to a database of short speech utterances to create user-dependent discriminants of four stress categories. In addition, we address the problem of choosing a suitable temporal scale for representing categorical differences in the data. This leads to two modeling approaches. In the first, the dynamics of the feature set within the utterance are assumed to be important for the classification task; these features are classified using dynamic Bayesian network models as well as a model consisting of a mixture of hidden Markov models (M-HMM). In the second, we define an utterance-level feature set by taking the mean value of each feature across the utterance; this feature set is then modeled with a support vector machine and a multilayer perceptron classifier. We compare performance on the sparser and full dynamic representations against a chance-level performance of 25% and obtain the best performance with the speaker-dependent mixture model (96.4% on the training set and 61.2% on a separate testing set). We also investigate how these models perform on the speaker-independent task. Although the performance of the speaker-independent models degrades with respect to the models trained on individual speakers, the mixture model still outperforms the competing models and achieves recognition significantly better than chance (80.4% on the training set and 51.2% on a separate testing set).

Introduction

Much of the current effort on studying speech under stress has been aimed at detecting stress conditions for improving the robustness of speech recognizers; typical research of speech under stress has targeted perceptual (e.g. Lombard effect), psychological (e.g. timed tasks), as well as physical stressors (e.g. roller-coaster rides, high G forces) (Steeneken and Hansen, 1999). In this work we are interested in modeling speech in the context of driving under varying conditions of cognitive load hypothesized to induce a level of stress on the driver. The results of this research may be relevant not only to building recognition systems that are more robust in the context described, but also to applications that attempt to infer the underlying affective state of an utterance. We have chosen the scenario of driving while talking on the phone as an application in which knowledge of the driver’s state may provide benefits ranging from a more fluid interaction with a speech interface to improvement of safety in the response of the vehicle.

The recent literature on the effects of stress on speech applies the label of stress to different phenomena. Some work views stress as any broad deviation of speech production from its normal state (Hansen and Womack, 1996; Sarikaya and Gowdy, 1998). In discussing the SUSAS database for the study of speech under stress, Hansen et al. (1998) describe various types of stress in speech. These include the effects that speaking styles, noise, and G forces have on the speaker’s output, as well as the effect of states that are often described under the label of emotions elsewhere in the literature (e.g., anxiety, fear, or anger).

In an attempt to unify the often diverging views of stress that are being invoked by the research in this field, Murray et al. (1996) have reviewed various definitions of stress, and proposed a description of this phenomenon based on the character of the stressors. They have hypothesized four levels of stressors that can affect the speech production process. At the lowest level, these include direct changes on the vocal apparatus (zero-order stressors), unconscious physiological changes (first-order stressors), and conscious physiological changes (second-order stressors). At the highest level, changes can also be brought about by stimuli that are external to the speech production process, by the speaker’s cognitive reinterpretation of the context in which the speech is being produced, as well as by the speaker’s underlying affective conditions (third-order stressors). In this paper, we follow this taxonomy and investigate whether it is possible to discriminate between spoken utterances that have been produced under the influence of third-order stressors with a varying degree of stress.

Section snippets

Speech corpus and data annotation

The speech data used in the research presented in this paper was collected from subjects driving in a simulator at Nissan’s Cambridge Research Lab. Subjects were asked to drive through a course while engaged in a simulated phone task: while the subject drove, a speech synthesizer prompted the driver with a math question consisting of adding two numbers whose sum was less than 100. We controlled for the number of additions with and without carries in order to maintain an approximately …

Feature extraction

Non-linear features of the speech waveform have received much attention in studies of speech under stress; in particular, the Teager energy operator (TEO) has been proposed as being robust to noisy environments and useful for stress classification (Zhou et al., 1998, 1999; Jabloun and Cetin, 1999). Another useful approach to the analysis of speech and stress has been subband decomposition, or multiresolution analysis, via wavelet transforms (Sarikaya and Gowdy, 1997, 1998 …)
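
As a concrete illustration of the non-linear operator named above, the discrete Teager energy operator can be sketched in a few lines. The test signal and its parameters below are illustrative, not drawn from the paper's corpus:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].
    Defined for interior samples, so the output is two samples shorter."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone x[n] = A*cos(w*n), the operator returns the constant
# A^2 * sin^2(w): it jointly tracks amplitude and frequency, the property
# that motivates its use as a stress-sensitive feature.
n = np.arange(1000)
x = 0.5 * np.cos(0.1 * n)
psi = teager_energy(x)    # every entry equals 0.25 * sin(0.1)**2
```

In practice the operator would be applied per subband of the wavelet decomposition rather than to the raw waveform.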

Graphical models

In this section we use the dynamic evolution of the utterance features to discriminate between the different categories of driver stress, and consider a family of graphical models for time series classification. One of the most extensively studied models in the time series classification literature is the hidden Markov model (HMM). An HMM is often represented as a state transition diagram. Such a representation is suitable for expressing first-order transition probabilities; it does …
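
To make classification by likelihood concrete, the following is a minimal sketch of class-conditional HMM scoring via the scaled forward algorithm, assuming discrete (vector-quantized) observations. The paper's M-HMM mixes several HMMs per class, but the decision rule is the same in spirit; all parameter values below are toy numbers, not estimates from the data:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log p(obs | HMM) for a discrete-output HMM
    with initial distribution pi, transition matrix A, emission table B."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()          # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

def classify(obs, class_models):
    """Assign obs to the stress class whose HMM scores it highest."""
    return max(class_models, key=lambda c: forward_loglik(obs, *class_models[c]))

# Two toy 2-state HMMs whose emission tables favor different symbols:
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B_low = np.array([[0.8, 0.2], [0.8, 0.2]])    # favors symbol 0
B_high = np.array([[0.2, 0.8], [0.2, 0.8]])   # favors symbol 1
models = {"SS": (pi, A, B_low), "FF": (pi, A, B_high)}
label = classify([0, 0, 1, 0, 0], models)     # -> "SS"
```

Training one such model per stress category and picking the highest-likelihood class is the standard generative recipe; the dynamic Bayesian network variants generalize the same factorization.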

Modeling features at the utterance-level

Modeling linguistic phenomena requires choosing an adequate time scale to capture relevant details. For speech recognition, a suitable time scale might be one that allows representing phonemes. For the supralinguistic phenomena we are interested in modeling, however, we wish to investigate whether a coarser time scale suffices. The database used in this study consists of short and simple utterances (with presumably simpler structures than those found in unconstrained speech), and …
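
The mean-pooling step that produces the utterance-level feature set can be sketched as follows; the frame matrix and its values are hypothetical:

```python
import numpy as np

def utterance_features(frame_feats):
    """Collapse a (n_frames, n_features) matrix of frame-level features
    (e.g. per-subband TEO statistics) into one fixed-length
    utterance-level vector by averaging over time."""
    return np.asarray(frame_feats, dtype=float).mean(axis=0)

# A hypothetical 3-frame utterance with 2 features per frame:
vec = utterance_features([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]])   # -> [2.0, 2.0]
```

The resulting fixed-length vectors are what a static classifier consumes; with scikit-learn, for instance, they could be passed directly to an `SVC` or `MLPClassifier`.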

Results and discussion

The speech data of four subjects was first divided into training and testing sets comprising approximately 80% and 20% of the data, respectively. The following labels denote the four categories of data: FF, SF, FS, SS. The first letter indicates whether the data came from a fast (F) or slow (S) driving-speed condition; the second indicates the frequency with which the driver was presented with an arithmetic task: every 4 s (F, fast) or every 9 s (S, slow). We applied the …
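
The acknowledgements credit Thomas Minka with suggesting the significance test actually used, which is not reproduced here. Purely as an illustration of how a recognition rate can be tested against the 25% four-class chance level, a one-sided exact binomial test could look like this (the counts below are hypothetical):

```python
from math import comb

def p_above_chance(n_correct, n_total, p_chance=0.25):
    """One-sided exact binomial p-value: probability of scoring at least
    n_correct out of n_total by guessing at the chance rate p_chance."""
    return sum(comb(n_total, k) * p_chance**k * (1 - p_chance)**(n_total - k)
               for k in range(n_correct, n_total + 1))

# E.g. 51 correct out of 100 four-class trials sits roughly six standard
# deviations above the chance mean of 25, so the tail probability is tiny:
p = p_above_chance(51, 100)   # p << 0.001
```

Such a test assumes independent trials; utterances from the same session are correlated, which is one reason a more careful test may be preferred.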

Conclusions

In this paper we have investigated the use of features based on subband decompositions and the TEO for the classification of stress categories in speech produced while driving at variable speeds and engaged in mental tasks of variable cognitive load, for a set of four subjects. We investigated the performance of several classifiers on two representations of the speech waveforms: a feature set representing intra-utterance dynamics and a sparser set consisting of more global …

Acknowledgements

The authors would like to thank Nissan’s CBR Lab and Elias Vyzas for their help with data collection and Thomas Minka for valuable technical discussions and for suggesting the significance test.

References (24)

  • Bishop, C.M., 1995. Neural Networks for Pattern Recognition.
  • Cowie, R., 2000. Describing the emotional states expressed in speech. In: Proceedings of the ISCA ITRW on Speech and…
  • Daubechies, I., 1992. Ten Lectures on Wavelets. Regional Conference Series in Applied Mathematics, SIAM, Philadelphia,…
  • Ghahramani, Z., et al., 1997. Factorial hidden Markov models. Machine Learning.
  • Gunn, S., 1998. Support vector machines for classification and regression. Technical Report, Image, Speech and…
  • Hansen, J.H.L., et al., 1996. Feature analysis and neural network-based classification of speech under stress. IEEE Transactions on Speech and Audio Processing.
  • Hansen, J., Bou-Ghazale, S.E., Sarikaya, R., Pellom, B., 1998. Getting started with the SUSAS: Speech Under Simulated…
  • Jabloun, F., et al. The Teager energy based feature parameters for robust speech recognition in car noise.
  • Jensen, F.V., 1996. An Introduction to Bayesian Networks.
  • Jordan, M.I., et al. Hidden Markov decision trees.
  • McGilloway, S., et al. Automatic recognition of emotion from voice: a rough benchmark.
  • Minka, T.P., 1999. Bayesian inference of a multinomial distribution. Available from…