Automatic classification of infant vocalization sequences with convolutional neural networks
Introduction
Automatic classification of infant vocalizations is a promising field of research to support areas that require a qualitative assessment of infant vocal expressions, such as pain assessment in paediatric wards or assessment of language acquisition. Automatic systems can increase coverage where continuous human surveillance is infeasible.
Most publications on automatic infant vocalization recognition have been based on conventional audio recognition approaches. In these, audio signals are represented through hand-crafted and often task-specific feature sets such as mel frequency cepstral coefficients (MFCCs) or the fundamental frequency. These features are used to train conventional classifiers such as support vector machines, hidden Markov models or multilayer perceptrons (Wagner et al., 2018; Virtanen et al., 2018, chapter 4). Such methods have been widely used for infant vocalization recognition, e.g. by Abdulaziz and Ahmad (2010), Xie et al. (1996), Zhang et al. (2018), Naithani et al. (2018), Rodriguez and Caluya (2017) and Ntalampiras (2015). A minimal sketch of this conventional pipeline is given below.
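To make the conventional pipeline concrete, the following Python sketch combines hand-crafted features with a conventional classifier, assuming librosa and scikit-learn. The feature settings (13 MFCCs summarized by mean and standard deviation) and the SVM configuration are illustrative assumptions, not the settings of any cited study.

```python
import librosa
import numpy as np
from sklearn.svm import SVC

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Summarize a clip as a fixed-length vector of hand-crafted features."""
    y, _ = librosa.load(path, sr=sr)                        # load and resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # frame-wise MFCCs
    # collapse the time axis with simple statistics (mean and std per coefficient)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# X = np.stack([mfcc_features(p) for p in paths]); y = labels
# clf = SVC(kernel="rbf").fit(X, y)                         # conventional classifier
```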
More recently, ‘deep learning’ or ‘end-to-end’ approaches have emerged as an alternative to conventional approaches. They replace hand-crafted feature sets with more basic, task-unspecific audio representations such as spectrograms or raw waveforms, which are fed into neural networks, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (Wagner et al., 2018; Virtanen et al., 2018, chapter 4). End-to-end systems have recently outperformed conventional systems by a significant margin on general audio recognition tasks such as audio scene classification and audio event detection, as the DCASE competitions demonstrated (Mesaros et al., 2018; Mesaros et al., 2017).
Consequently, researchers began to apply end-to-end systems to infant vocalization recognition tasks as well: Chang and Li (2016) employed CNNs with spectrogram inputs for cry reason classification and reached a validation accuracy of 78.5%. Lavner et al. (2016) also employed CNNs with spectrogram inputs, in their case for cry sound detection, and compared them to a conventional approach based on logistic regression; the CNN outperformed the conventional approach at lower false-positive rates. The Interspeech 2018 Computational Paralinguistics Challenge (Schuller et al., 2018) posed a competition for classifying infant vocalizations into ‘crying’, ‘fussing’ and ‘neutral’. The organizing team provided a baseline system employing a CNN-RNN with raw waveform inputs, which reached a test balanced accuracy of 63%. Two of the seven participating teams entered end-to-end systems as well: Turan and Erzin (2018) proposed a CNN variant called a “capsule net” with spectrogram inputs and reached a balanced accuracy of 71.6%, while Wagner et al. (2018) reached 67% with an RNN operating on spectrogram as well as raw waveform inputs.
These studies established and emphasized the general feasibility of infant vocalization recognition through end-to-end systems. While they provide performance indications for specific system configurations, they do not yet provide an in-depth analysis of which system choices and parameterizations have the greatest impact on performance. Such analysis is necessary to explore the true performance potential of end-to-end systems. For example, the analysis and optimization of architectural CNN hyperparameters led to significant performance improvements in image classification with CNNs (Krizhevsky et al., 2012; Zeiler and Fergus, 2014; Simonyan and Zisserman, 2014). To our knowledge, only one study in the area of infant vocalization recognition has provided an in-depth analysis of an end-to-end system component: Wagner et al. (2018) systematically compared various feature set types for a fixed RNN architecture and found that hand-crafted features outperformed more basic audio representations, concluding that RNNs are better suited to hand-crafted features for infant vocalization classification.
In this study we investigated the classification of infant vocalizations through CNNs with mel-spectrogram inputs. We adopted this approach as it is currently the prevailing end-to-end paradigm for general audio classification tasks (Mesaros et al., 2018; Mesaros et al., 2017; Hershey et al., 2017). Our investigation focused on optimizing the CNN architecture to increase classification performance. The goal was not only to identify the best-performing CNN configurations, but to determine which CNN features have the greatest impact on classification performance in general. Consequently, our key contribution is the identification of the most relevant CNN traits for infant vocalization classification. While additional system components typically also contribute to raising performance, such as data augmentation techniques (Virtanen et al., 2018, chapter 5), our investigation focuses exclusively on the influence of the CNN architecture, as it forms the essential core of any end-to-end system.
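For illustration, the following is a minimal sketch of how such a mel-spectrogram input can be computed with librosa. The sampling rate, window, hop and mel-band settings are assumed example values, not necessarily those used in this study.

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=16000, n_fft=512, hop_length=256, n_mels=64):
    """Convert an audio file into a log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)   # load audio, resample to a fixed rate
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # logarithmic compression; result has shape (n_mels, n_frames)
    return librosa.power_to_db(S, ref=np.max)
```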
The methodology of the investigation was as follows: we specified a vocalization classification task by defining a target class set and constructing a corresponding acoustic database. We then defined a CNN architecture scheme representative of conventional VGG-like CNNs (Simonyan and Zisserman, 2014) and generated numerous CNN configurations of this scheme, drawn from a parameter space. Each configuration was trained and evaluated on the database. Finally, we analyzed the relation between architectural CNN features and classification performance through statistical methods to identify the most influential ones.
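To illustrate the kind of architecture scheme meant here, the PyTorch sketch below builds one hypothetical point in such a configuration space: a stack of convolution/pooling blocks (the convolutional stage), a separator layer, and a fully connected classifier. The block count, channel widths, input size and the choice between a global-average-pooling and a flatten separator are assumed example parameters, not the exact scheme evaluated in this paper.

```python
import torch.nn as nn

def vgg_like_cnn(n_classes, channels=(32, 64, 128), use_gap=True,
                 input_shape=(64, 64)):
    """Build one example configuration of a VGG-like CNN."""
    layers, in_ch = [], 1                       # one input channel: mel spectrogram
    for out_ch in channels:                     # convolutional stage
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]             # halve time and frequency axes
        in_ch = out_ch
    if use_gap:                                 # separator: 2D global average pooling
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten()]
        fc_in = in_ch
    else:                                       # separator: plain flatten
        layers += [nn.Flatten()]
        h = input_shape[0] // 2 ** len(channels)
        w = input_shape[1] // 2 ** len(channels)
        fc_in = in_ch * h * w
    layers += [nn.Linear(fc_in, n_classes)]     # fully connected stage
    return nn.Sequential(*layers)
```

Note how the separator choice determines the input size of the fully connected stage: global average pooling reduces it to the channel count, whereas flattening multiplies it by the remaining spatial grid.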
The remainder of this paper is structured as follows: Section 2.1 describes the target classes and Section 2.2 the acoustic database. Section 2.3 summarizes the deep learning approach, and Section 2.4 describes the CNN architecture scheme and its configurations. Section 2.5 describes the evaluation procedure for measuring the performance of the CNN configurations. Section 3 presents the results: the overall system performance and the analysis of the relation between CNN features and performance.
Target classes
Table 1 summarizes the target classes and their definitions. The class set was selected to represent the middle ground between various infant monitoring scenarios: the classes ‘fussing’ and ‘crying’ are frequently required for tracking an infant's distress state in medical and domestic monitoring (McGrath et al., 2013, chapter 37; James-Roberts et al., 1996). The classes ‘babbling’, ‘laughing’ and ‘vegetative vocalizations’ are typically of interest in language development monitoring.
Results
First, the overall system performance and the analysis of the class-wise performance are presented in Section 3.1. A detailed analysis of the relation between the CNN features and the classification performance follows in Section 3.2.
Discussion
In summary, the CNN features with the strongest overall impact on performance were the choice of separator layer and the input size to the fully connected layers. The most efficient CNN configurations employed a 2D global average pooling separator layer and had receptive fields of ≈ 0.65 s and ≈ 37% of the provided frequency range. CNNs with flatten separator layers additionally benefited from large pooling kernel sizes and strides in conjunction with large receptive fields.
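As a side note on the receptive field figures above: for a stack of convolution and pooling layers, the receptive field can be computed directly from the kernel sizes and strides. The helper below is a minimal sketch of this computation; the example layer list is illustrative and does not reproduce the configurations evaluated here.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) pairs per axis, ordered input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * current step
        jump *= s              # strides compound the sampling step
    return rf

# Three blocks of 3x3 convolution (stride 1) + 2x2 max pooling (stride 2):
print(receptive_field([(3, 1), (2, 2)] * 3))   # -> 22 input bins per axis
```

Multiplying the resulting frame count by the spectrogram hop duration then yields the temporal extent of the receptive field in seconds.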
Conclusion
In this study we investigated the influence of architectural CNN features on the classification performance for infant vocalization sequences when using spectrogram inputs. We discovered that the primary factor for raising the classification performance is designing CNNs to have a small bottleneck between the convolutional stage and the fully connected stage, ideally achieved through global pooling between those stages. The secondary factor is management of the size of the receptive field.
CRediT authorship contribution statement
Franz Anders: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft. Mario Hlawitschka: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing. Mirco Fuchs: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the European Union as part of the ESF-Program, grant number K-7531.20/434-11; SAB-Nr. 100316843.
References (56)
- Mesaros, Heittola, Diment, Elizalde, Shah, Vincent, Raj, Virtanen (2017). DCASE 2017 challenge setup: tasks, datasets and baseline system. In: DCASE 2017 Workshop on Detection and Classification of Acoustic Scenes and Events.
- Abdulaziz, Ahmad (2010). Infant cry recognition system: a comparison of system performance based on mel frequency and linear prediction cepstral coefficients. In: 2010 International Conference on Information Retrieval & Knowledge Management (CAMP).
- Brodersen, Ong, Stephan, Buhmann (2010). The balanced accuracy and its posterior distribution. In: 2010 20th International Conference on Pattern Recognition.
- Buder, Warlaumont, Oller (2013). An acoustic phonetic catalog of prespeech vocalizations from a developmental perspective. In: Comprehensive Perspectives on Child Speech Development and Disorders: Pathways from Linguistic Theory to Clinical Practice. NOVA, Hauppauge, NY.
- Cakir, Parascandolo, Heittola, Huttunen, Virtanen (2017). Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- Chang, Li (2016). Application of deep learning for recognizing infant cries. In: 2016 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW).
- Chollet (2017). Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Judgment of infant cry: the roles of acoustic characteristics and sociodemographic characteristics. Japanese Psychological Research (2015).
- Fuhr, Reetz, Wegener (2015). Comparison of supervised-learning models for infant cry classification / Vergleich von Klassifikationsmodellen zur Säuglingsschreianalyse. International Journal of Health Professions.
- Fulop (2011). Speech Spectrum Analysis. Springer.
- Goodfellow, Bengio, Courville (2016). Deep Learning. MIT Press.
- Gosztolya et al. (2018). General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats. In: Proc. Interspeech 2018.
- He, Zhang, Ren, Sun (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Heinzel, Rüdiger, Schilling (2002). Spectrum and spectral density estimation by the discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows. Technical report, Max-Planck-Institut für Gravitationsphysik.
- Hershey et al. (2017). CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Huang, Liu, van der Maaten, Weinberger (2017). Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Ioffe, Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning.
- James, Witten, Hastie, Tibshirani (2013). An Introduction to Statistical Learning. Springer.
- James-Roberts et al. (1996). Bases for maternal perceptions of infant crying and colic behaviour. Archives of Disease in Childhood.
- Kershenbaum et al. (2016). Acoustic sequences in non-human animals: a tutorial review and prospectus. Biological Reviews.
- Kingma, Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Krizhevsky, Sutskever, Hinton (2012). ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems.
- Lavner et al. (2016). Baby cry detection in domestic environment using deep learning. In: 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE).
- Le, Borji (2017). What are the receptive, effective receptive, and projective fields of neurons in convolutional neural networks? arXiv preprint.
- LeCun, Bottou, Bengio, Haffner (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
- Infants' expressive behaviors to mothers and unfamiliar partners during face-to-face interactions from 4 to 10 months. Infant Behavior and Development.
- Lin, Chen, Yan (2013). Network in network. arXiv preprint arXiv:1312.4400.
- McFee et al. (2015). librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference.