Pattern Recognition Letters

Volume 84, 1 December 2016, Pages 1-7

Feature optimisation for stress recognition in speech

https://doi.org/10.1016/j.patrec.2016.07.017

Highlights

  • An evolutionary algorithm for the optimisation of filter banks.

  • Filter banks better suited to stress and emotion classification were obtained.

  • New speech features were obtained through optimised filter banks.

  • The optimised features improved the results in stressed speech classification.

Abstract

Mel-frequency cepstral coefficients introduced biologically-inspired features into speech technology and became the most widely used representation for speech, speaker and emotion recognition, and even for applications in music. Popular as this representation is, it is unrealistic to expect it to provide the best results for every application, since it was not designed for any specific objective. This work proposes a methodology for learning a speech representation from data by optimising a filter bank, in order to improve results in the classification of stressed speech. Since population-based metaheuristics have proved successful in related applications, an evolutionary algorithm is designed to search for a filter bank that maximises classification accuracy. For the encoding, spline functions shape the filter banks, which reduces the number of parameters to be optimised. The filter banks obtained with the proposed methodology improve the results in stressed and emotional speech classification.

Introduction

The most widely used speech representation is based on the mel-frequency cepstral coefficients (MFCCs) [2], [19], which rely on the linear model of voice production and use a psycho-acoustic scale to mimic the frequency response of the human ear [9]. MFCC features have been used extensively for speech [24], [44], speaker [21], emotion [16], [35], [45] and language recognition [12], and even for applications unrelated to speech, such as music information retrieval [18]. However, the auditory system is not yet fully understood and the shape of the truly optimal filter bank is unknown. Moreover, which part of the information in the signal is relevant depends on the application, so it is unlikely that a single filter bank would provide the best performance for every kind of task. In fact, many alternative representations have been developed, some of which are modifications of the mel-scaled filter bank [44]. For example, a scheme for determining filter bandwidth was presented in [32], showing improvements in speech recognition over traditional features. Auditory features based on Gammatone filters were developed for robust speech recognition [30]. Different approaches that consider the noise energy in each mel band have also been proposed in order to define MFCC weighting parameters [41], [46], and the compression of filter-bank energies according to the signal-to-noise ratio in each band was proposed in [15]. Other adjustments to the classical representation have been introduced as well [40]. For stressed speech classification in particular, new time-frequency features have been presented [42]. Although these alternative features improve recognition results in particular tasks, to our knowledge no methodology has been proposed for automatically obtaining an optimised filter bank for speech emotion classification.
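As a point of reference for these variants, the standard mel-scaled triangular filter bank that the MFCC pipeline applies to the power spectrum can be sketched in pure Python. This is a minimal illustration, not the paper's configuration: the filter count, FFT size and 8 kHz sample rate are assumptions chosen only to match the corpus description later in the paper.

```python
import math

def hz_to_mel(f):
    # Standard psycho-acoustic mel-scale mapping used in MFCC extraction
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=23, n_fft=512, sample_rate=8000):
    """Triangular filters spaced uniformly on the mel scale.

    Returns n_filters rows of n_fft // 2 + 1 weights each; row i ramps
    up from edge point i to a peak of 1.0 at point i + 1, then back down
    to zero at point i + 2.
    """
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points: each filter spans three consecutive points
    mel_points = [low_mel + i * (high_mel - low_mel) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int(round(mel_to_hz(m) * n_fft / sample_rate))
            for m in mel_points]

    bank = []
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):          # rising edge of the triangle
            row[k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge of the triangle
            row[k] = (right - k) / max(right - centre, 1)
        bank.append(row)
    return bank
```

In the full MFCC pipeline these weights multiply the power spectrum of each frame, the filter outputs are log-compressed, and a discrete cosine transform decorrelates them into the cepstral coefficients.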

Another common strategy for speech recognition is to optimise the feature extraction process so as to maximise the discrimination capability for a given corpus [7]. Along these lines, the use of deep neural networks for learning filter banks was presented in [22], while other works introduced linear discriminant analysis [6], [43]. Genetic algorithms have been applied to the design of wavelet-based representations [36], and evolutionary strategies have been proposed for feature selection in other tasks [37]. Moreover, several approaches to the optimisation of speech features have been based on evolutionary algorithms [38], [39], and an evolutionary approach for the generation of novel features has been proposed [25]. For stressed speech classification, genetic algorithms are also among the most successful feature selection techniques [8]. Nevertheless, no attempts have been made to optimise filter banks specifically for emotion or stress classification.

Evolutionary algorithms have proved effective in many complex optimisation problems [14]. Hence, to tackle this challenging optimisation problem, we propose the use of an evolutionary algorithm to learn a filter bank from speech data. Building on this filter-bank optimisation approach, this work addresses the classification of different emotions and stress types in speech. The evolutionary algorithm optimises the filter bank involved in the extraction of cepstral features, with spline interpolation used to encode the parameters. Our method aims to provide an alternative speech representation that improves on the classical MFCC for stress and emotion classification. A classifier is used to evaluate the evolved individuals, and its classification accuracy is assigned as their fitness. In contrast to previous work [39], in which the temporal dynamics of each class were modelled, here we introduce a static classification approach based on a single feature vector per utterance.
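The classifier-in-the-loop fitness just described can be illustrated with a toy stand-in. Here the candidate chromosome is a vector of per-dimension gains that re-weights each static, utterance-level feature vector, and a leave-one-out nearest-centroid classifier supplies the accuracy used as fitness. Both choices are placeholders: the paper's actual classifier and feature pipeline are not specified in this snippet.

```python
def accuracy_fitness(gains, features, labels):
    """Fitness of one chromosome: re-weight every per-utterance feature
    vector with the candidate gains, then return leave-one-out
    nearest-centroid classification accuracy in [0, 1]."""
    data = [[g * x for g, x in zip(gains, f)] for f in features]
    correct = 0
    for i, (vec, true_lab) in enumerate(zip(data, labels)):
        # Build per-class centroids from all utterances except the held-out one
        groups = {}
        for j, (other, lab) in enumerate(zip(data, labels)):
            if j != i:
                groups.setdefault(lab, []).append(other)
        best_lab, best_d = None, float("inf")
        for lab, rows in groups.items():
            centroid = [sum(col) / len(rows) for col in zip(*rows)]
            d = sum((a - b) ** 2 for a, b in zip(vec, centroid))
            if d < best_d:
                best_lab, best_d = lab, d
        correct += best_lab == true_lab
    return correct / len(labels)
```

An evolutionary search would call this function once per individual per generation, so in practice the evaluation classifier must be cheap to train, which is one motivation for the static single-vector-per-utterance representation.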

The remainder of this paper is organised as follows. In Section 2, a short overview of evolutionary algorithms is given, and also the feature extraction process for the MFCC is explained. Then, the proposal of this work is presented in Section 3 and the results obtained are discussed in Section 4. Finally, conclusions and proposals for future work are given in Section 5.

Section snippets

Evolutionary algorithms

Evolutionary algorithms (EAs) are heuristic methods inspired by the process of biological evolution, which are useful for a wide range of optimisation problems [3], [17], [23]. The evolution is typically performed by means of natural operations like selection, mutation, crossover and replacement [4]. The selection operator assigns a reproduction probability to each individual in the population, favouring those with high fitness, in order to simulate natural selection. Mutation introduces random
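The operator loop outlined above can be written compactly. The sketch below is a generic generational EA with tournament selection, one-point crossover, Gaussian mutation and elitist replacement; all parameter values (population size, mutation rate, mutation scale) are illustrative defaults, not the configuration used in the paper.

```python
import random

def evolve(fitness, genome_len, pop_size=20, generations=50,
           mutation_rate=0.1, seed=0):
    """Minimal generational EA over real-valued genomes in [0, 1]."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]

    def tournament():
        # Binary tournament: favour the fitter of two random individuals
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        elite = max(pop, key=fitness)           # elitism: best always survives
        children = [elite]
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, genome_len)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g + rng.gauss(0, 0.05)     # Gaussian mutation per gene
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children                          # generational replacement
    return max(pop, key=fitness)
```

In the filter-bank setting, the genome would hold the encoded filter-bank parameters and `fitness` would be the classification accuracy obtained with the decoded features.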

Evolutionary filter bank optimisation

Several parameters could be taken into account in the search for an optimal filter bank, such as the number of filters, filter shape and filter gain. However, as the number of parameters increases, the problem becomes extremely complex, so there is a trade-off between optimisation complexity and flexibility. In previous works, three parameters were considered for each triangular filter in the filter bank, corresponding to the frequency values at which the triangle begins, reaches its maximum
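The spline-based encoding that keeps this trade-off manageable can be sketched as follows: a short chromosome of control gains is expanded by spline interpolation into one gain per filter, so far fewer parameters are optimised than with per-filter triangle vertices. A Catmull-Rom spline is used here as one concrete interpolant; the snippet does not specify which spline variant the paper adopts, so this choice is an assumption.

```python
def spline_gains(control, n_filters):
    """Expand a short chromosome of control gains (at least two) into
    one gain per filter via Catmull-Rom spline interpolation.

    End control points are duplicated so the curve passes through both
    endpoints; needs n_filters >= 2.
    """
    pts = [control[0]] + list(control) + [control[-1]]
    n_seg = len(control) - 1
    gains = []
    for i in range(n_filters):
        u = i * n_seg / (n_filters - 1)   # position along the spline
        seg = min(int(u), n_seg - 1)
        t = u - seg
        p0, p1, p2, p3 = pts[seg], pts[seg + 1], pts[seg + 2], pts[seg + 3]
        # Catmull-Rom basis: interpolates p1 at t = 0 and p2 at t = 1
        gains.append(0.5 * (2 * p1
                            + (-p0 + p2) * t
                            + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
                            + (-p0 + 3 * p1 - 3 * p2 + p3) * t * t * t))
    return gains
```

With, say, six control points shaping a 23-filter bank, the chromosome shrinks from dozens of vertex frequencies to six values, which narrows the search space while still allowing smooth, diverse filter-bank shapes.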

Materials

In the experiments, the FAU Aibo Emotion Corpus [5], [33] and a simulated stressed speech corpus in the Hindi language [31] were used. The Hindi database consists of stressed speech signals recorded from fifteen speakers (ten male, five female). The speech utterances were sampled at 8 kHz and include neutral speech and four acted stress conditions: anger, happiness, Lombard and sadness. Each recorded signal consisted of a keyword, which was uttered within a sentence and then isolated.

Conclusion and future work

In this work, an evolutionary optimisation method has been proposed to improve stressed and emotional speech classification results. The chromosome codification based on splines allowed the number of optimisation parameters to be reduced while maintaining the quality and diversity of possible solutions. This encoding also simplified the filter bank optimisation problem, making it possible to speed up the convergence of the EA to good solutions. Also, we proposed a static

Acknowledgements

This work was supported by the Argentinian Ministerio de Ciencia, Tecnología e Innovación Productiva and by the Indian Department of Science and Technology, under project IN1103. Also, the authors wish to acknowledge the support provided by Agencia Nacional de Promoción Científica y Tecnológica (with projects PICT 2011-2440, PICT 2014-1442), Universidad Nacional de Litoral (with projects CAID 2011-519, -525 and PACT 2011-058) and Consejo Nacional de Investigaciones Científicas y Técnicas from

References (46)

  • F. Zheng et al., Comparison of different implementations of MFCC, J. Comput. Sci. Technol. (2001)

  • E. Albornoz et al., Emotion recognition in never-seen languages using a novel ensemble method with emotion profiles, IEEE Trans. Affective Comput. (2015)

  • E.M. Albornoz et al., Spoken emotion recognition using hierarchical classifiers, Comput. Speech Language (2011)

  • A. Batliner et al., Private emotions versus social interaction: a data-driven approach towards analysing emotion in speech, User Model. User-Adapt. Interact. (2008)

  • L. Burget et al., Data driven design of filter bank for speech recognition, Text, Speech and Dialogue, Lecture Notes in Computer Science (2001)

  • H. Bořil et al., Data-driven design of front-end filter bank for Lombard speech recognition, Proc. of INTERSPEECH 2006 – ICSLP, Pittsburgh, Pennsylvania (2006)

  • S. Casale et al., Multistyle classification of speech under stress using feature subset selection based on genetic algorithms, Speech Commun. (2007)

  • A. Engelbrecht, Computational Intelligence: An Introduction (2007)

  • F. Eyben et al., openSMILE: the Munich versatile and fast open-source audio feature extractor, Proc. of the Int. Conf. on Multimedia, ACM, New York, NY, USA (2010)

  • C.L. Huang et al., Feature normalization using MVAW processing for spoken language recognition, Signal and Information Processing Association Annual Summit and Conference (APSIPA), Asia-Pacific (2013)

  • T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, vol. 30 (2000)

  • C.D. Lin et al., Using genetic algorithms to design experiments: a review, Qual. Reliab. Eng. Int. (2015)

  • B. Nasersharif et al., SNR-dependent compression of enhanced mel sub-band energies for compensation of noise effects on MFCC features, Pattern Recognit. Lett. (2007)

    This paper has been recommended for acceptance by J. Yang.
