
Speech Communication

Volume 119, May 2020, Pages 36-45

Automatic classification of infant vocalization sequences with convolutional neural networks

https://doi.org/10.1016/j.specom.2020.03.003

Highlights

  • Small bottlenecks after the convolutional stage primarily raise CNN performance.

  • Global average pooling layers are most efficient for creating small bottlenecks.

  • Tuning the convolutional receptive field size is the second most important factor.

  • Vocalization classes are primarily confused according to their affective similarity.

Abstract

In this study we investigated Convolutional Neural Networks (CNNs) for the classification of infant vocalization sequences. The target classes were ‘crying’, ‘fussing’, ‘babbling’, ‘laughing’ and ‘vegetative vocalizations’. The general case of this classification task is of importance for applications which require a qualitative evaluation of general infant vocalizations, such as pain assessment or assessment of language acquisition. The classification procedure was based on representing audio segments as spectrograms which are input to a conventional CNN architecture scheme. We systematically analyzed the influence of network features on the classification performance to derive guidelines for designing effective CNN architectures for the task. We show that CNNs should be modeled to have a small bottleneck between the convolutional stage and the fully connected stage, achieved through broad aggregation of convolutional feature maps across the time and frequency axes. The best performing CNN configuration yielded a balanced accuracy of 72%. We conclude that conventional CNN architectures can reach satisfactory performance for this task even with small amounts of training data, as long as certain network features are ensured.

Introduction

Automatic classification of infant vocalizations is a promising field of research to support areas which require a qualitative assessment of infant vocal expressions. Examples of such areas are pain assessment in paediatric wards or assessment of language acquisition. Automatic systems can aid in increasing coverage when human surveillance is infeasible.

Most publications in automatic infant vocalization recognition have been based on conventional audio recognition approaches. In these, audio signals are represented through hand-crafted and often task-specific feature sets such as mel frequency cepstral coefficients, fundamental frequency etc. They are used to train conventional classifiers such as support vector machines, hidden Markov models or multilayer perceptrons (Wagner et al., 2018; Virtanen et al., 2018, chapter 4). These methods have been widely used for infant vocalization recognition, e.g. in the studies of Abdulaziz and Ahmad (2010), Xie et al. (1996), Zhang et al. (2018), Naithani et al. (2018), Rodriguez and Caluya (2017), and Ntalampiras (2015).
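For illustration, the following is a minimal sketch of such a conventional pipeline, assuming librosa for feature extraction and scikit-learn for the classifier. The chosen features (mean and standard deviation of 13 MFCCs per segment) and the SVM configuration are illustrative placeholders, not the setups used in the cited studies.

    import numpy as np
    import librosa
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def handcrafted_features(path, sr=16000, n_mfcc=13):
        # Summarize each MFCC coefficient over time by its mean and
        # standard deviation, yielding one fixed-length vector per segment.
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # 'paths' and 'labels' are placeholders for a labelled set of audio segments.
    # X = np.stack([handcrafted_features(p) for p in paths])
    # clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # clf.fit(X, labels)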

More recently, ‘deep learning’ or ‘end-to-end’ approaches emerged as an alternative to conventional approaches. They are based on replacing hand-crafted feature sets with more basic, task-unspecific audio representations such as spectrograms or raw waveforms. Those representations are fed into neural networks, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) (Wagner et al., 2018; Virtanen et al., 2018, chapter 4). Recently, end-to-end systems significantly outperformed conventional systems for general audio recognition tasks such as audio scene classification or audio event detection, as the DCASE competitions demonstrated (Mesaros et al., 2018; Mesaros et al., 2017).

Consequently, researchers began to apply end-to-end systems to infant vocalization recognition tasks as well: Chang and Li (2016) employed CNNs with spectrogram inputs for cry reason classification and reached a validation accuracy of 78.5%. Lavner et al. (2016) also employed CNNs with spectrogram inputs, for cry sound detection, and compared them to a conventional approach based on logistic regression. The CNN outperformed the conventional approach for lower false-positive rates. The Interspeech 2018 computational paralinguistics challenge (Schuller et al., 2018) proposed a competition for classifying infant vocalizations into ‘crying’, ‘fussing’ and ‘neutral’. The organizing team entered a baseline system employing a CNN-RNN with raw waveform inputs which reached a test balanced accuracy of 63%. Two of seven participating teams entered end-to-end systems as well: Turan and Erzin (2018) proposed a CNN variant called “capsule net” with spectrogram inputs and reached a balanced accuracy of 71.6%. Wagner et al. (2018) reached 67% through an RNN with spectrogram as well as raw waveform inputs.

These studies established the general feasibility of infant vocalization recognition through end-to-end systems. While they provide performance indications for specific system configurations, they did not yet provide an in-depth analysis of which system choices and parameterizations have the greatest impact on the performance. Such analysis is necessary to explore the true performance potential of end-to-end systems. For example, analysis and optimization of architectural CNN hyperparameters led to significant performance improvements in the area of image classification with CNNs (Krizhevsky et al., 2012; Zeiler and Fergus, 2014; Simonyan and Zisserman, 2014). To our knowledge, there is only one study in the area of infant vocalization recognition which provided an in-depth analysis of an end-to-end system component: Wagner et al. (2018) systematically compared various feature set types for a fixed RNN architecture and found that hand-crafted features outperformed more basic audio representations. They concluded that, for infant vocalization classification, RNNs are better suited to hand-crafted features.

In this study we investigated the classification of infant vocalizations through CNNs with mel-spectrogram inputs. We adopted this approach as it is currently the prevailing end-to-end paradigm for general audio classification tasks (Mesaros et al., 2018; Mesaros et al., 2017; Hershey et al., 2017). Our investigation focused on optimizing the CNN architecture to increase the classification performance. Our goal was not only to identify the best-performing CNN configurations, but also to determine which CNN features have the greatest impact on the classification performance in general. Consequently, our key contribution is the identification of the most relevant CNN traits for infant vocalization classification. While there are additional system components which typically also contribute to raising performance, such as data augmentation techniques (Virtanen et al., 2018, chapter 5), our investigation focuses exclusively on the influence of the CNN architecture, as it forms the essential core of any end-to-end system.
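As an illustration of this input representation, the sketch below computes a log-scaled mel spectrogram with librosa (McFee et al., 2015, cited below). All parameter values (sampling rate, FFT size, hop length, number of mel bands) are assumed for the example and are not the values used in this study.

    import numpy as np
    import librosa

    def mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=64):
        # Load the audio segment and compute a log-scaled mel spectrogram
        # as the CNN input; returns an array of shape (n_mels, n_frames).
        y, _ = librosa.load(path, sr=sr)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(S, ref=np.max)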

The methodology of the investigation was as follows: We specified a vocalization classification task by defining a target class set and constructing a corresponding acoustic database. We then defined a CNN architecture scheme representative of conventional VGG-like CNNs (Simonyan and Zisserman, 2014) and produced numerous CNN configurations of this scheme drawn from a parameter space. Each configuration was trained and evaluated on the database. We finally analyzed the relation between architectural CNN features and the classification performance through statistical methods to identify the most influential ones.
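The following sketch outlines, under stated assumptions, what such a configurable VGG-like scheme could look like in Keras: stacked convolution/pooling blocks, a separator layer between the convolutional and fully connected stages, and a dense classification head. The parameter names and default values are hypothetical and only illustrate a parameter space from which configurations can be drawn; they are not the configurations evaluated in the paper.

    from tensorflow.keras import layers, models

    def build_vgg_like_cnn(input_shape, n_classes, n_blocks=3, n_filters=32,
                           kernel_size=3, pool_size=2, separator="gap",
                           n_dense_units=64):
        # Hypothetical configuration parameters spanning a VGG-like scheme.
        inputs = layers.Input(shape=input_shape)
        x = inputs
        for b in range(n_blocks):
            x = layers.Conv2D(n_filters * 2 ** b, kernel_size,
                              padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(pool_size)(x)
        if separator == "gap":
            x = layers.GlobalAveragePooling2D()(x)   # small bottleneck
        else:
            x = layers.Flatten()(x)                  # large bottleneck
        x = layers.Dense(n_dense_units, activation="relu")(x)
        outputs = layers.Dense(n_classes, activation="softmax")(x)
        return models.Model(inputs, outputs)

    # Example: mel-spectrogram input with 64 mel bands and 128 time frames.
    model = build_vgg_like_cnn((64, 128, 1), n_classes=5)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])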

The remainder of this paper is structured as follows: Section 2.1 describes the target classes and Section 2.2 the acoustic database. Section 2.3 summarizes the deep learning approach. Section 2.4 describes the CNN architecture scheme as well as the configurations. Section 2.5 describes the evaluation procedure for measuring the performance of the CNN configurations. The results section (Section 3) summarizes the overall system performance and presents the analysis of the relation between CNN features and the performance.

Section snippets

Target classes

Table 1 summarizes the target classes and their definitions. The class set was selected to represent a middle ground between various infant monitoring scenarios: The classes ‘fussing’ and ‘crying’ are frequently required for tracking an infant's distress state in medical and domestic monitoring (McGrath et al., 2013, chapter 37; James-Roberts et al., 1996). The classes ‘babbling’, ‘laughing’ and ‘vegetative vocalizations’ are typically of interest in language development monitoring, which …

Results

First, the overall system performance and the analysis of the class-wise performance are presented in Section 3.1. A detailed analysis of the relation between the CNN features and the classification performance is presented in Section 3.2.
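For reference, balanced accuracy as used here is the unweighted mean of per-class recalls (Brodersen et al., 2010). A minimal sketch of its computation with scikit-learn on placeholder labels and predictions (the class indices 0-4 are assumed, not the paper's encoding):

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix

    # Placeholder ground-truth labels and predictions for five classes.
    y_true = np.array([0, 0, 1, 1, 2, 2, 3, 4])
    y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 4])

    # Balanced accuracy = unweighted mean of per-class recalls.
    print(balanced_accuracy_score(y_true, y_pred))
    # The confusion matrix shows which classes are confused with each other.
    print(confusion_matrix(y_true, y_pred))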

Discussion

We summarize that the CNN features with the strongest overall impact on the performance were the choice of separator layer as well as the input size to the fully connected layers. The most efficient CNN configurations employed a 2D global average pooling separator layer and had receptive fields of  ≈ 0.65 s and  ≈ 37% of the provided frequency range. CNNs with flatten separator layers additionally benefited from large pooling kernel sizes/strides in conjunction with large receptive fields.
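To illustrate how the receptive field of the convolutional stage relates to seconds and spectrogram bins, the following sketch computes the receptive field of a stack of convolution/pooling layers along one axis using the standard recurrence. The layer stack, hop length and sampling rate are assumed values for the example, not those of the reported configurations.

    def receptive_field(layer_stack):
        # layer_stack: list of (kernel_size, stride) tuples along one axis.
        # Returns the receptive field of one output unit in input bins,
        # via the recurrence r <- r + (k - 1) * j, j <- j * s.
        r, j = 1, 1
        for k, s in layer_stack:
            r += (k - 1) * j
            j *= s
        return r

    # Illustrative stack: three conv(3, stride 1) / max-pool(2, stride 2) blocks.
    rf_bins = receptive_field([(3, 1), (2, 2)] * 3)
    # Assumed spectrogram parameters: hop length 256 samples at 16 kHz.
    print(rf_bins, "frames ≈", rf_bins * 256 / 16000, "s along the time axis")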

From …

Conclusion

In this study we investigated the influence of architectural CNN features on the classification performance for infant vocalization sequences when using spectrogram inputs. We discovered that the primary factor for raising the classification performance is designing CNNs to have a small bottleneck between the convolutional stage and the fully connected stage, ideally achieved through global pooling between those stages. The secondary factor is management of the size of the receptive field. The …

CRediT authorship contribution statement

Franz Anders: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft. Mario Hlawitschka: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing. Mirco Fuchs: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the European Union as part of the ESF-Program, grant number K-7531.20/434-11; SAB-Nr. 100316843.

References (56)

  • A. Mesaros et al.

DCASE 2017 challenge setup: tasks, datasets and baseline system

    DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events

    (2017)
  • Y. Abdulaziz et al.

    Infant cry recognition system: a comparison of system performance based on mel frequency and linear prediction cepstral coefficients

2010 International Conference on Information Retrieval & Knowledge Management (CAMP)

    (2010)
  • K.H. Brodersen et al.

    The balanced accuracy and its posterior distribution

    2010 20th International Conference on Pattern Recognition

    (2010)
  • E.H. Buder et al.

    An acoustic phonetic catalog of prespeech vocalizations from a developmental perspective

    Comprehensive perspectives on child speech development and disorders: pathways from linguistic theory to clinical practice. Hauppauge, NY: NOVA

    (2013)
  • E. Cakır et al.

    Convolutional recurrent neural networks for polyphonic sound event detection

    IEEE/ACM Trans. Audio Speech Lang. Process.

    (2017)
  • C.-Y. Chang et al.

    Application of deep learning for recognizing infant cries

2016 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW)

    (2016)
  • F. Chollet

    Xception: deep learning with depthwise separable convolutions

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2017)
  • G. Esposito et al.

    Judgment of infant cry: the roles of acoustic characteristics and sociodemographic characteristics

    Japanese Psychol. Res.

    (2015)
  • T. Fuhr et al.

Comparison of supervised-learning models for infant cry classification / Vergleich von Klassifikationsmodellen zur Säuglingsschreianalyse

    Int. J. Health Professions

    (2015)
  • S.A. Fulop

    Speech Spectrum Analysis

    (2011)
  • I. Goodfellow et al.

    Deep learning

    (2016)
  • G. Gosztolya et al.

    General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats

    Proc. Interspeech 2018

    (2018)
  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2016)
  • G. Heinzel et al.

Spectrum and spectral density estimation by the discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows

    (2002)
  • S. Hershey et al.

CNN architectures for large-scale audio classification

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2017)
  • G. Huang et al.

    Densely connected convolutional networks

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2017)
  • S. Ioffe et al.

Batch normalization: accelerating deep network training by reducing internal covariate shift

    (2015)
  • G. James et al.

    An Introduction to Statistical Learning

    (2013)
  • I.S. James-Roberts et al.

    Bases for maternal perceptions of infant crying and colic behaviour.

    Arch. Dis. Child.

    (1996)
  • A. Kershenbaum et al.

    Acoustic sequences in non-human animals: a tutorial review and prospectus

    Biol. Rev.

    (2016)
  • D.P. Kingma et al.

    Adam: a method for stochastic optimization

    (2014)
  • A. Krizhevsky et al.

    Imagenet classification with deep convolutional neural networks

    Advances in neural information processing systems

    (2012)
  • Y. Lavner et al.

    Baby cry detection in domestic environment using deep learning

    (2016)
  • H. Le et al.

    What are the receptive, effective receptive, and projective fields of neurons in convolutional neural networks?

    (2017)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proc. IEEE

    (1998)
  • H.-C. Lin et al.

Infants' expressive behaviors to mothers and unfamiliar partners during face-to-face interactions from 4 to 10 months

    Infant Behav. Dev.

    (2009)
  • M. Lin et al.

    Network in network

    (2013)
  • B. McFee et al.

    librosa: audio and music signal analysis in python

    Proceedings of the 14th python in science conference

    (2015)