
Speech Communication

Volume 119, May 2020, Pages 36-45

Automatic classification of infant vocalization sequences with convolutional neural networks

https://doi.org/10.1016/j.specom.2020.03.003

Highlights

  • Small bottlenecks after the convolutional stage primarily raise CNN performance.

  • Global average pooling layers are most efficient for creating small bottlenecks.

  • Tuning the convolutional receptive field size is the second most important factor.

  • Vocalization classes are primarily confused according to their affective similarity.

Abstract

In this study we investigated Convolutional Neural Networks (CNNs) for the classification of infant vocalization sequences. The target classes were ‘crying’, ‘fussing’, ‘babbling’, ‘laughing’ and ‘vegetative vocalizations’. The general case of this classification task is of importance for applications which require a qualitative evaluation of general infant vocalizations, such as pain assessment or assessment of language acquisition. The classification procedure was based on representing audio segments as spectrograms which are input to a conventional CNN architecture scheme. We systematically analyzed the influence of network features on the classification performance to derive guidelines for designing effective CNN architectures for the task. We show that CNNs should be modeled to have a small bottleneck between the convolutional stage and the fully connected stage, achieved through broad aggregation of convolutional feature maps across the time and frequency axes. The best performing CNN configuration yielded a balanced accuracy of 72%. We conclude that conventional CNN architectures can reach satisfactory performance for this task even with small amounts of training data, as long as certain network features are ensured.

Introduction

Automatic classification of infant vocalizations is a promising field of research to support areas which require a qualitative assessment of infant vocal expressions. Examples of such areas are pain assessment in paediatric wards or assessment of language acquisition. Automatic systems can aid in increasing coverage when human surveillance is infeasible.

Most publications in automatic infant vocalization recognition have been based on conventional audio recognition approaches. In these, audio signals are represented through hand-crafted and often task-specific feature sets such as mel frequency cepstral coefficients, fundamental frequency etc. They are used to train conventional classifiers such as support vector machines, hidden Markov models or multilayer perceptrons (Wagner et al., 2018; Virtanen et al., 2018, chapter 4). These methods have been widely used for infant vocalization recognition, e.g. in the studies of Abdulaziz and Ahmad (2010), Xie et al. (1996), Zhang et al. (2018), Naithani et al. (2018), Rodriguez and Caluya (2017), and Ntalampiras (2015).
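For illustration, the following is a minimal sketch of such a conventional pipeline, assuming librosa for feature extraction and scikit-learn for the classifier. The chosen features (mean and standard deviation of 13 MFCCs per segment) and the SVM configuration are illustrative placeholders, not the setups used in the cited studies.

    import numpy as np
    import librosa
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def handcrafted_features(path, sr=16000, n_mfcc=13):
        # Summarize each MFCC coefficient over time by its mean and
        # standard deviation, yielding one fixed-length vector per segment.
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # 'paths' and 'labels' are placeholders for a labelled set of audio segments.
    # X = np.stack([handcrafted_features(p) for p in paths])
    # clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # clf.fit(X, labels)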

More recently, ‘deep learning’ or ‘end-to-end’ approaches emerged as an alternative to conventional approaches. They are based on replacing hand-crafted feature sets with more basic, task-unspecific audio representations such as spectrograms or raw waveforms. Those representations are fed into neural networks, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) (Wagner et al., 2018; Virtanen et al., 2018, chapter 4). Recently, end-to-end systems significantly outperformed conventional systems for general audio recognition tasks such as audio scene classification or audio event detection, as the DCASE competitions demonstrated (Mesaros et al., 2018; Mesaros et al., 2017).

Consequently, researchers began to apply end-to-end systems to infant vocalization recognition tasks as well: Chang and Li (2016) employed CNNs with spectrogram inputs for cry reason classification and reached a validation accuracy of 78.5%. Lavner et al. (2016) also employed CNNs with spectrogram inputs, for cry sound detection, and compared them to a conventional approach based on logistic regression. The CNN outperformed the conventional approach for lower false-positive rates. The Interspeech 2018 computational paralinguistics challenge (Schuller et al., 2018) proposed a competition for classifying infant vocalizations into ‘crying’, ‘fussing’ and ‘neutral’. The organizing team entered a baseline system employing a CNN-RNN with raw waveform inputs which reached a test balanced accuracy of 63%. Two of seven participating teams entered end-to-end systems as well: Turan and Erzin (2018) proposed a CNN variant called “capsule net” with spectrogram inputs and reached a balanced accuracy of 71.6%. Wagner et al. (2018) reached 67% through an RNN with spectrogram as well as raw waveform inputs.

These studies established the general feasibility of infant vocalization recognition through end-to-end systems. While they provide performance indications for specific system configurations, they did not yet provide an in-depth analysis of which system choices and parameterizations have the greatest impact on the performance. Such analysis is necessary to explore the true performance potential of end-to-end systems. For example, analysis and optimization of architectural CNN hyperparameters led to significant performance improvements in the area of image classification with CNNs (Krizhevsky et al., 2012; Zeiler and Fergus, 2014; Simonyan and Zisserman, 2014). To our knowledge, there is only one study in the area of infant vocalization recognition which provided an in-depth analysis of an end-to-end system component: Wagner et al. (2018) systematically compared various feature set types for a fixed RNN architecture and found that hand-crafted features outperformed more basic audio representations. They concluded that, for infant vocalization classification, RNNs are better suited to hand-crafted features.

In this study we investigated the classification of infant vocalizations through CNNs with mel-spectrogram inputs. We adopted this approach as it is currently the prevailing end-to-end paradigm for general audio classification tasks (Mesaros et al., 2018; Mesaros et al., 2017; Hershey et al., 2017). Our investigation focused on optimizing the CNN architecture to increase the classification performance. Our goal was not only to identify the best-performing CNN configurations, but also to determine which CNN features have the greatest impact on the classification performance in general. Consequently, our key contribution is the identification of the most relevant CNN traits for infant vocalization classification. While there are additional system components which typically also contribute to raising performance, such as data augmentation techniques (Virtanen et al., 2018, chapter 5), our investigation focuses exclusively on the influence of the CNN architecture, as it forms the essential core of any end-to-end system.
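As an illustration of this input representation, the sketch below computes a log-scaled mel spectrogram with librosa (McFee et al., 2015, cited below). All parameter values (sampling rate, FFT size, hop length, number of mel bands) are assumed for the example and are not the values used in this study.

    import numpy as np
    import librosa

    def mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=64):
        # Load the audio segment and compute a log-scaled mel spectrogram
        # as the CNN input; returns an array of shape (n_mels, n_frames).
        y, _ = librosa.load(path, sr=sr)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(S, ref=np.max)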

The methodology of the investigation was as follows: We specified a vocalization classification task by defining a target class set and constructing a corresponding acoustic database. We then defined a CNN architecture scheme representative of conventional VGG-like CNNs (Simonyan and Zisserman, 2014) and produced numerous CNN configurations of this scheme drawn from a parameter space. Each configuration was trained and evaluated on the database. We finally analyzed the relation between architectural CNN features and the classification performance through statistical methods to identify the most influential ones.
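The following sketch outlines, under stated assumptions, what such a configurable VGG-like scheme could look like in Keras: stacked convolution/pooling blocks, a separator layer between the convolutional and fully connected stages, and a dense classification head. The parameter names and default values are hypothetical and only illustrate a parameter space from which configurations can be drawn; they are not the configurations evaluated in the paper.

    from tensorflow.keras import layers, models

    def build_vgg_like_cnn(input_shape, n_classes, n_blocks=3, n_filters=32,
                           kernel_size=3, pool_size=2, separator="gap",
                           n_dense_units=64):
        # Hypothetical configuration parameters spanning a VGG-like scheme.
        inputs = layers.Input(shape=input_shape)
        x = inputs
        for b in range(n_blocks):
            x = layers.Conv2D(n_filters * 2 ** b, kernel_size,
                              padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(pool_size)(x)
        if separator == "gap":
            x = layers.GlobalAveragePooling2D()(x)   # small bottleneck
        else:
            x = layers.Flatten()(x)                  # large bottleneck
        x = layers.Dense(n_dense_units, activation="relu")(x)
        outputs = layers.Dense(n_classes, activation="softmax")(x)
        return models.Model(inputs, outputs)

    # Example: mel-spectrogram input with 64 mel bands and 128 time frames.
    model = build_vgg_like_cnn((64, 128, 1), n_classes=5)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])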

The remainder of this paper is structured as follows: Section 2.1 describes the target classes and Section 2.2 the acoustic database. Section 2.3 summarizes the deep learning approach. Section 2.4 describes the CNN architecture scheme as well as the configurations. Section 2.5 describes the evaluation procedure for measuring the performance of the CNN configurations. The results section (Section 3) summarizes the overall system performance and presents the analysis of the relation between CNN features and the performance.

Section snippets

Target classes

Table 1 summarizes the target classes and their definitions. The class set was selected to represent a middle ground between various infant monitoring scenarios: The classes ‘fussing’ and ‘crying’ are frequently required for tracking an infant's distress state in medical and domestic monitoring (McGrath et al., 2013, chapter 37; James-Roberts et al., 1996). The classes ‘babbling’, ‘laughing’ and ‘vegetative vocalizations’ are typically of interest in language development monitoring, which …

Results

First, the overall system performance and the analysis of the class-wise performance are presented in Section 3.1. A detailed analysis of the relation between the CNN features and the classification performance is presented in Section 3.2.
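For reference, balanced accuracy as used here is the unweighted mean of per-class recalls (Brodersen et al., 2010). A minimal sketch of its computation with scikit-learn on placeholder labels and predictions (the class indices 0-4 are assumed, not the paper's encoding):

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix

    # Placeholder ground-truth labels and predictions for five classes.
    y_true = np.array([0, 0, 1, 1, 2, 2, 3, 4])
    y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 4])

    # Balanced accuracy = unweighted mean of per-class recalls.
    print(balanced_accuracy_score(y_true, y_pred))
    # The confusion matrix shows which classes are confused with each other.
    print(confusion_matrix(y_true, y_pred))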

Discussion

We summarize that the CNN features with the strongest overall impact on the performance were the choice of separator layer as well as the input size to the fully connected layers. The most efficient CNN configurations employed a 2D global average pooling separator layer and had receptive fields of  ≈ 0.65 s and  ≈ 37% of the provided frequency range. CNNs with flatten separator layers additionally benefited from large pooling kernel sizes/strides in conjunction with large receptive fields.
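To illustrate how the receptive field of the convolutional stage relates to seconds and spectrogram bins, the following sketch computes the receptive field of a stack of convolution/pooling layers along one axis using the standard recurrence. The layer stack, hop length and sampling rate are assumed values for the example, not those of the reported configurations.

    def receptive_field(layer_stack):
        # layer_stack: list of (kernel_size, stride) tuples along one axis.
        # Returns the receptive field of one output unit in input bins,
        # via the recurrence r <- r + (k - 1) * j, j <- j * s.
        r, j = 1, 1
        for k, s in layer_stack:
            r += (k - 1) * j
            j *= s
        return r

    # Illustrative stack: three conv(3, stride 1) / max-pool(2, stride 2) blocks.
    rf_bins = receptive_field([(3, 1), (2, 2)] * 3)
    # Assumed spectrogram parameters: hop length 256 samples at 16 kHz.
    print(rf_bins, "frames ≈", rf_bins * 256 / 16000, "s along the time axis")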

From …

Conclusion

In this study we investigated the influence of architectural CNN features on the classification performance for infant vocalization sequences when using spectrogram inputs. We discovered that the primary factor for raising the classification performance is designing CNNs to have a small bottleneck between the convolutional stage and the fully connected stage, ideally achieved through global pooling between those stages. The secondary factor is management of the size of the receptive field. The …

CRediT authorship contribution statement

Franz Anders: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft. Mario Hlawitschka: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing. Mirco Fuchs: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the European Union as part of the ESF-Program, grant number K-7531.20/434-11; SAB-Nr. 100316843.

References (56)

  • A. Mesaros et al.

DCASE 2017 challenge setup: tasks, datasets and baseline system

    DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events

    (2017)
  • Y. Abdulaziz et al.

    Infant cry recognition system: a comparison of system performance based on mel frequency and linear prediction cepstral coefficients

2010 International Conference on Information Retrieval & Knowledge Management (CAMP)

    (2010)
  • K.H. Brodersen et al.

    The balanced accuracy and its posterior distribution

    2010 20th International Conference on Pattern Recognition

    (2010)
  • E.H. Buder et al.

    An acoustic phonetic catalog of prespeech vocalizations from a developmental perspective

    Comprehensive perspectives on child speech development and disorders: pathways from linguistic theory to clinical practice. Hauppauge, NY: NOVA

    (2013)
  • E. Cakır et al.

    Convolutional recurrent neural networks for polyphonic sound event detection

    IEEE/ACM Trans. Audio Speech Lang. Process.

    (2017)
  • C.-Y. Chang et al.

    Application of deep learning for recognizing infant cries

2016 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW)

    (2016)
  • F. Chollet

    Xception: deep learning with depthwise separable convolutions

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2017)
  • G. Esposito et al.

    Judgment of infant cry: the roles of acoustic characteristics and sociodemographic characteristics

    Japanese Psychol. Res.

    (2015)
  • T. Fuhr et al.

Comparison of supervised-learning models for infant cry classification / Vergleich von Klassifikationsmodellen zur Säuglingsschreianalyse

    Int. J. Health Professions

    (2015)
  • S.A. Fulop

    Speech Spectrum Analysis

    (2011)
  • I. Goodfellow et al.

    Deep learning

    (2016)
  • G. Gosztolya et al.

    General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats

    Proc. Interspeech 2018

    (2018)
  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2016)
  • G. Heinzel et al.

Spectrum and spectral density estimation by the discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows

    (2002)
  • S. Hershey et al.

CNN architectures for large-scale audio classification

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2017)
  • G. Huang et al.

    Densely connected convolutional networks

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2017)
  • S. Ioffe et al.

Batch normalization: accelerating deep network training by reducing internal covariate shift

    (2015)
  • G. James et al.

    An Introduction to Statistical Learning

    (2013)
  • I.S. James-Roberts et al.

    Bases for maternal perceptions of infant crying and colic behaviour.

    Arch. Dis. Child.

    (1996)
  • A. Kershenbaum et al.

    Acoustic sequences in non-human animals: a tutorial review and prospectus

    Biol. Rev.

    (2016)
  • D.P. Kingma et al.

    Adam: a method for stochastic optimization

    (2014)
  • A. Krizhevsky et al.

    Imagenet classification with deep convolutional neural networks

    Advances in neural information processing systems

    (2012)
  • Y. Lavner et al.

    Baby cry detection in domestic environment using deep learning

    (2016)
  • H. Le et al.

    What are the receptive, effective receptive, and projective fields of neurons in convolutional neural networks?

    (2017)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proc. IEEE

    (1998)
  • H.-C. Lin et al.

Infants' expressive behaviors to mothers and unfamiliar partners during face-to-face interactions from 4 to 10 months

    Infant Behav. Dev.

    (2009)
  • M. Lin et al.

    Network in network

    (2013)
  • B. McFee et al.

    librosa: audio and music signal analysis in python

    Proceedings of the 14th python in science conference

    (2015)