Modeling drivers’ speech under stress

https://doi.org/10.1016/S0167-6393(02)00080-8

Abstract

We explore the use of features derived from multiresolution analysis of speech and the Teager energy operator for classifying drivers’ speech under stressed conditions. We apply this feature set to a database of short speech utterances to create user-dependent discriminants of four stress categories. In addition, we address the problem of choosing a suitable temporal scale for representing categorical differences in the data. This leads to two modeling approaches. In the first, the dynamics of the feature set within the utterance are assumed to be important for the classification task; these features are classified using dynamic Bayesian network models as well as a model consisting of a mixture of hidden Markov models (M-HMM). In the second, we define an utterance-level feature set by taking the mean value of each feature across the utterance; this feature set is then modeled with a support vector machine and a multilayer perceptron classifier. We compare performance on the sparser and full dynamic representations against a chance-level performance of 25% and obtain the best performance with the speaker-dependent mixture model (96.4% on the training set and 61.2% on a separate testing set). We also investigate how these models perform on the speaker-independent task. Although the performance of the speaker-independent models degrades with respect to the models trained on individual speakers, the mixture model still outperforms the competing models and achieves recognition significantly better than chance (80.4% on the training set and 51.2% on a separate testing set).

Introduction

Much of the current effort on studying speech under stress has been aimed at detecting stress conditions for improving the robustness of speech recognizers; typical research of speech under stress has targeted perceptual (e.g. Lombard effect), psychological (e.g. timed tasks), as well as physical stressors (e.g. roller-coaster rides, high G forces) (Steeneken and Hansen, 1999). In this work we are interested in modeling speech in the context of driving under varying conditions of cognitive load hypothesized to induce a level of stress on the driver. The results of this research may be relevant not only to building recognition systems that are more robust in the context described, but also to applications that attempt to infer the underlying affective state of an utterance. We have chosen the scenario of driving while talking on the phone as an application in which knowledge of the driver’s state may provide benefits ranging from a more fluid interaction with a speech interface to improvement of safety in the response of the vehicle.

The recent literature on the effects of stress on speech applies the label of stress to different phenomena. Some work views stress as any broad deviation of speech production from its normal state (Hansen and Womack, 1996; Sarikaya and Gowdy, 1998). In discussing the SUSAS database for the study of speech under stress, Hansen et al. (1998) describe various types of stress in speech. These include the effects that speaking styles, noise, and G forces have on the speaker’s output, as well as the effect of states that are often described under the label of emotions elsewhere in the literature (e.g., anxiety, fear, or anger).

In an attempt to unify the often diverging views of stress that are being invoked by the research in this field, Murray et al. (1996) have reviewed various definitions of stress, and proposed a description of this phenomenon based on the character of the stressors. They have hypothesized four levels of stressors that can affect the speech production process. At the lowest level, these include direct changes on the vocal apparatus (zero-order stressors), unconscious physiological changes (first-order stressors), and conscious physiological changes (second-order stressors). At the highest level, changes can also be brought about by stimuli that are external to the speech production process, by the speaker’s cognitive reinterpretation of the context in which the speech is being produced, as well as by the speaker’s underlying affective conditions (third-order stressors). In this paper, we follow this taxonomy and investigate whether it is possible to discriminate between spoken utterances that have been produced under the influence of third-order stressors with a varying degree of stress.

Section snippets

Speech corpus and data annotation

The speech data used in the research presented in this paper was collected from subjects driving in a simulator at Nissan’s Cambridge Research Lab. Subjects were asked to drive through a course while engaged in a simulated phone task: while the subject drove, a speech synthesizer prompted the driver with a math question consisting of adding two numbers whose sum was less than 100. We controlled for the number of additions with and without carries in order to maintain an approximately …

Feature extraction

Non-linear features of the speech waveform have received much attention in studies of speech under stress; in particular, the Teager energy operator (TEO) has been proposed as being robust to noisy environments and useful for stress classification (Zhou et al., 1998, 1999; Jabloun and Cetin, 1999). Another useful approach to the analysis of speech and stress has been subband decomposition, or multiresolution analysis, via wavelet transforms (Sarikaya and Gowdy, 1997, 1998 …)
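
As a concrete illustration of the non-linear operator named above, the discrete Teager energy operator can be sketched in a few lines. The test signal and its parameters below are illustrative, not drawn from the paper's corpus:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].
    Defined for interior samples, so the output is two samples shorter."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone x[n] = A*cos(w*n), the operator returns the constant
# A^2 * sin^2(w): it jointly tracks amplitude and frequency, the property
# that motivates its use as a stress-sensitive feature.
n = np.arange(1000)
x = 0.5 * np.cos(0.1 * n)
psi = teager_energy(x)    # every entry equals 0.25 * sin(0.1)**2
```

In practice the operator would be applied per subband of the wavelet decomposition rather than to the raw waveform.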

Graphical models

In this section we use the dynamic evolution of the utterance features to discriminate between the different categories of driver stress, and consider a family of graphical models for time series classification. One of the most extensively studied models in the time series classification literature is the hidden Markov model (HMM). An HMM is often represented as a state transition diagram. Such a representation is suitable for expressing first-order transition probabilities; it does …
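
To make classification by likelihood concrete, the following is a minimal sketch of class-conditional HMM scoring via the scaled forward algorithm, assuming discrete (vector-quantized) observations. The paper's M-HMM mixes several HMMs per class, but the decision rule is the same in spirit; all parameter values below are toy numbers, not estimates from the data:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log p(obs | HMM) for a discrete-output HMM
    with initial distribution pi, transition matrix A, emission table B."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()          # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

def classify(obs, class_models):
    """Assign obs to the stress class whose HMM scores it highest."""
    return max(class_models, key=lambda c: forward_loglik(obs, *class_models[c]))

# Two toy 2-state HMMs whose emission tables favor different symbols:
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B_low = np.array([[0.8, 0.2], [0.8, 0.2]])    # favors symbol 0
B_high = np.array([[0.2, 0.8], [0.2, 0.8]])   # favors symbol 1
models = {"SS": (pi, A, B_low), "FF": (pi, A, B_high)}
label = classify([0, 0, 1, 0, 0], models)     # -> "SS"
```

Training one such model per stress category and picking the highest-likelihood class is the standard generative recipe; the dynamic Bayesian network variants generalize the same factorization.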

Modeling features at the utterance-level

Modeling linguistic phenomena requires choosing an adequate time scale to capture relevant details. For speech recognition, a suitable time scale might be one that allows representing phonemes. For the supralinguistic phenomena we are interested in modeling, however, we wish to investigate whether a coarser time scale suffices. The database used in this study consists of short and simple utterances (with presumably simpler structures than those found in unconstrained speech), and …
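
The mean-pooling step that produces the utterance-level feature set can be sketched as follows; the frame matrix and its values are hypothetical:

```python
import numpy as np

def utterance_features(frame_feats):
    """Collapse a (n_frames, n_features) matrix of frame-level features
    (e.g. per-subband TEO statistics) into one fixed-length
    utterance-level vector by averaging over time."""
    return np.asarray(frame_feats, dtype=float).mean(axis=0)

# A hypothetical 3-frame utterance with 2 features per frame:
vec = utterance_features([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]])   # -> [2.0, 2.0]
```

The resulting fixed-length vectors are what a static classifier consumes; with scikit-learn, for instance, they could be passed directly to an `SVC` or `MLPClassifier`.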

Results and discussion

The speech data of four subjects was first divided into training and testing sets comprising approximately 80% and 20% of the data, respectively. The following labels denote the four categories of data: FF, SF, FS, SS. The first letter indicates whether the data came from a fast (F) or slow (S) driving-speed condition; the second indicates the frequency with which the driver was presented with an arithmetic task: every 4 s (F, fast) or every 9 s (S, slow). We applied the …
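
The acknowledgements credit Thomas Minka with suggesting the significance test actually used, which is not reproduced here. Purely as an illustration of how a recognition rate can be tested against the 25% four-class chance level, a one-sided exact binomial test could look like this (the counts below are hypothetical):

```python
from math import comb

def p_above_chance(n_correct, n_total, p_chance=0.25):
    """One-sided exact binomial p-value: probability of scoring at least
    n_correct out of n_total by guessing at the chance rate p_chance."""
    return sum(comb(n_total, k) * p_chance**k * (1 - p_chance)**(n_total - k)
               for k in range(n_correct, n_total + 1))

# E.g. 51 correct out of 100 four-class trials sits roughly six standard
# deviations above the chance mean of 25, so the tail probability is tiny:
p = p_above_chance(51, 100)   # p << 0.001
```

Such a test assumes independent trials; utterances from the same session are correlated, which is one reason a more careful test may be preferred.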

Conclusions

In this paper we have investigated the use of features based on subband decompositions and the TEO for the classification of stress categories in speech produced while driving at variable speeds and engaged in mental tasks of variable cognitive load, for a set of four subjects. We investigated the performance of several classifiers on two representations of the speech waveforms: a feature set representing intra-utterance dynamics and a sparser set consisting of more global …

Acknowledgements

The authors would like to thank Nissan’s CBR Lab and Elias Vyzas for their help with data collection and Thomas Minka for valuable technical discussions and for suggesting the significance test.

References (24)

  • Bishop, C.M., 1995. Neural Networks for Pattern Recognition.
  • Cowie, R., 2000. Describing the emotional states expressed in speech. In: Proceedings of the ISCA ITRW on Speech and…
  • Daubechies, I., 1992. Ten Lectures on Wavelets. Regional Conference Series in Applied Mathematics, SIAM, Philadelphia,…
  • Ghahramani, Z., et al., 1997. Factorial hidden Markov models. Machine Learning.
  • Gunn, S., 1998. Support vector machines for classification and regression. Technical Report, Image, Speech and…
  • Hansen, J.H.L., et al., 1996. Feature analysis and neural network-based classification of speech under stress. IEEE Transactions on Speech and Audio Processing.
  • Hansen, J., Bou-Ghazale, S.E., Sarikaya, R., Pellom, B., 1998. Getting started with the SUSAS: Speech Under Simulated…
  • Jabloun, F., et al. The Teager energy based feature parameters for robust speech recognition in car noise.
  • Jensen, F.V., 1996. An Introduction to Bayesian Networks.
  • Jordan, M.I., et al. Hidden Markov decision trees.
  • McGilloway, S., et al. Automatic recognition of emotion from voice: a rough benchmark.
  • Minka, T.P., 1999. Bayesian inference of a multinomial distribution. Available from…