Detecting pertussis in the pediatric population using respiratory sound events and CNN

https://doi.org/10.1016/j.bspc.2021.102722

Highlights

  • Classification of pertussis and non-pertussis subjects based on respiratory sound events (cough and whooping) and deep learning.

  • Convolutional neural network models trained on three time-frequency image-like representations: mel-spectrogram, wavelet scalogram, and cochleagram.

  • Time-frequency image augmentation during training using mixup and late fusion to combine learning from different time-frequency representations.

  • Achieved an overall accuracy of 90.48% (AUC = 0.9501), outperforming various baseline methods.

  • Promising results demonstrate that automated respiratory sound analysis may be useful for non-invasive detection of pertussis.

Abstract

Background and objective

Pertussis (whooping cough), a respiratory tract infection, is a significant cause of morbidity and mortality in children. The classic presentation of pertussis includes paroxysmal coughs followed by a high-pitched intake of air that sounds like a whoop. Although these respiratory sounds can be useful in making a diagnosis in clinical practice, distinguishing these sounds by ear can be subjective. This work aims to objectively analyze these respiratory sounds using signal processing and deep learning techniques to detect pertussis in the pediatric population.

Methods

Various time-frequency representations of the respiratory sound signals are formed and used as a direct input to convolutional neural networks, without the need for feature engineering. In particular, we consider the mel-spectrogram, wavelet scalogram, and cochleagram representations, which reveal spectral characteristics at different frequencies. The method is evaluated on a dataset of 42 recordings, containing 542 respiratory sound events, from pertussis and non-pertussis children. We use data augmentation to prevent model overfitting on the relatively small dataset and late fusion to combine the learning from the different time-frequency representations for more informed predictions.

Results

The proposed method achieves an accuracy of 90.48% (AUC = 0.9501) in distinguishing pertussis subjects from non-pertussis subjects, outperforming several baseline techniques.

Conclusion

Our results suggest that detecting pertussis using automated respiratory sound analysis is feasible. It could potentially be implemented as a non-invasive screening tool, for example, in smartphones, that parents/carers in the community could use, thereby increasing the diagnostic utility for this disease.

Introduction

Pertussis, commonly known as whooping cough, is a respiratory tract infection caused by the coccobacillus Bordetella pertussis. It spreads via airborne droplets and is highly contagious [18]. The number of pertussis cases has decreased since the development of a vaccine. However, neither immunization nor previous infection provides lifelong immunity to the disease [2]. There is a resurgence of pertussis infections, which is attributed to waning immunity and bacterial mutation [23,34]. While pertussis affects all age groups, it is a significant cause of morbidity and mortality in young children [35], especially in developing countries, where access to timely diagnoses may not be available.

Following an incubation period, pertussis typically progresses through three distinct stages: the catarrhal phase, the paroxysmal phase, and the convalescent phase [18]. The characteristics of the catarrhal phase are similar to those of other upper respiratory tract infections. This is followed by the paroxysmal phase, in which the cough increases in severity, developing into a paroxysmal or hacking cough followed by a high-pitched intake of air that sounds like a whoop, hence the name whooping cough [35]. A residual cough can persist for weeks to months in the convalescent phase. In severe cases in infants, pertussis can lead to respiratory failure and death [20].

People with pertussis remain infectious for weeks but, if given appropriate antibiotic treatment, the infectious period and spread are reduced and complications may be prevented [4]. Early treatment of pertussis is, therefore, crucial for managing this disease. We posit that the paroxysmal coughing and whooping sounds can be useful for screening pertussis, especially in the pediatric population, which remains the most vulnerable age group. However, parents/carers of the child may be unable to recognize these respiratory sounds reliably, and in clinical practice this recognition depends on the skills and training of the clinicians.

In this work, we aim to develop an objective computational method for detecting the respiratory sound events associated with pertussis, that is, the hacking cough and the whoop, in the pediatric population. If disseminated widely, for example, as a smartphone application, such an objective method could serve as a screening tool for parents/carers. It could also be useful in developing countries and remote communities that lack access to health facilities and clinicians.

Detecting respiratory diseases from digital respiratory sounds, cough sounds in particular, has generated interest recently, for example, in detecting childhood pneumonia [16], monitoring chronic obstructive pulmonary disease [9], and detecting croup, which is common in children between the ages of 6 months and 6 years and produces a distinctive barking cough [30]. Various signal processing and machine learning techniques have been proposed for the analysis and detection of cough sounds. Being a relatively new area of research, a number of techniques are inspired by other audio classification tasks such as speech recognition. One such measure is mel-frequency cepstral coefficients (MFCCs) [8]. MFCCs utilize mel-filters, which are effective in revealing the perceptually significant characteristics of the speech spectrum in small time windows. Speech and cough share some similarities in the generation process and physiology, which could explain the widespread use and effectiveness of MFCCs in cough sound analysis tasks [10,16,27,29,30,37].
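To make this concrete, a minimal MFCC extraction sketch in Python is given below, using the librosa library; the file path, sampling rate, and frame settings are illustrative assumptions rather than the configurations used in the cited studies.

```python
import librosa

# Load a cough recording (placeholder path) and resample to 16 kHz.
y, sr = librosa.load("cough.wav", sr=16000)

# 20 MFCCs over short windows (25 ms frames, 10 ms hop), a common speech-style setup.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=20,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr)
)
print(mfcc.shape)  # (20, number_of_frames)
```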

It is common practice to complement MFCCs with other techniques. In [10,16,29], various temporal and spectral analysis techniques are employed for this purpose. In addition, the wavelet transform is applied in [16] in the analysis of cough sounds for detecting pneumonia. Wavelets are effective at decomposing non-stationary signals in both the time and frequency domains and, in [16], the focus is particularly on picking up the crackle sounds in pneumonia coughs.
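As a rough illustration of the scalogram idea, the sketch below uses the PyWavelets library with a Morlet wavelet; the wavelet choice and the number of scales are illustrative assumptions, not the settings of [16].

```python
import numpy as np
import pywt

def wavelet_scalogram(y, sr, n_scales=64):
    """Magnitude of the continuous wavelet transform over a range of scales."""
    scales = np.arange(1, n_scales + 1)
    coeffs, freqs = pywt.cwt(y, scales, "morl", sampling_period=1.0 / sr)
    return np.abs(coeffs)  # shape: (n_scales, len(y))
```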

Furthermore, the spectral information contained in cough sounds is more dominant at low frequencies than at high frequencies. The human auditory system likewise offers higher resolution at low frequencies than at high frequencies. In [30], this frequency selectivity property of the human cochlea is modeled using gammatone filters to differentiate the barking cough of croup subjects from the coughs of other respiratory diseases. A similar approach is also taken in [37].
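A simple NumPy sketch of a gammatone filterbank and cochleagram, based on the standard Glasberg and Moore ERB model, is given below; the channel count, frame settings, and filter order are illustrative assumptions, not the configuration of [30].

```python
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth of the auditory filter (Glasberg & Moore).
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_space(fmin, fmax, n):
    # Centre frequencies equally spaced on the ERB-rate scale.
    e = np.linspace(21.4 * np.log10(4.37 * fmin / 1000 + 1),
                    21.4 * np.log10(4.37 * fmax / 1000 + 1), n)
    return (10 ** (e / 21.4) - 1) * 1000 / 4.37

def gammatone_ir(fc, sr, duration=0.05, order=4):
    # Impulse response of a 4th-order gammatone filter centred at fc.
    t = np.arange(0, duration, 1.0 / sr)
    return (t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(fc) * t)
            * np.cos(2 * np.pi * fc * t))

def cochleagram(y, sr, n_filters=64, frame=0.025, hop=0.010):
    # Filter the signal with each channel and take frame-wise log energies.
    frame_n, hop_n = int(frame * sr), int(hop * sr)
    rows = []
    for fc in erb_space(50.0, sr / 2.0, n_filters):
        out = np.convolve(y, gammatone_ir(fc, sr), mode="same")
        rows.append([np.log(np.sum(out[i:i + frame_n] ** 2) + 1e-10)
                     for i in range(0, len(out) - frame_n, hop_n)])
    return np.array(rows)  # shape: (n_filters, n_frames)
```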

Audio analysis, including cough sound analysis, is typically carried out in small time windows at different frequency localizations. This results in high-dimensional data which conventional classification methods may be unable to handle. A common approach is to reduce this data to a smaller feature set using statistical methods. With MFCCs, for example, the mean and standard deviation of the coefficients have been used [30]. Similarly, the slope of the wavelet coefficients is used as the wavelet feature (WF) in [16]. In [30], the time-frequency representation formed using gammatone filters, referred to as the gammatone spectrogram or cochleagram, is divided into blocks, and the second and third central moments of the blocks are used as the cochleagram image features (CIF). In [16,29,30], feature extraction is followed by feature selection to further reduce the feature dimension and select the most dominant features for classification.
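The sketch below illustrates these two styles of statistical reduction on a generic time-frequency matrix; the block grid size is an illustrative assumption.

```python
import numpy as np

def summary_stats(tf_rep):
    # Per-band mean and standard deviation, as with MFCC summary features.
    return np.concatenate([tf_rep.mean(axis=1), tf_rep.std(axis=1)])

def block_moments(tf_rep, n_rows=8, n_cols=8):
    # Second and third central moments of equal-sized image blocks,
    # in the spirit of the cochleagram image features (CIF) of [30].
    H, W = tf_rep.shape
    bh, bw = H // n_rows, W // n_cols
    feats = []
    for i in range(n_rows):
        for j in range(n_cols):
            block = tf_rep[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            mu = block.mean()
            feats += [((block - mu) ** 2).mean(), ((block - mu) ** 3).mean()]
    return np.array(feats)  # length: 2 * n_rows * n_cols
```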

The use of conventional feature engineering techniques inevitably leads to the loss of some information, which can cause poor classification performance and misdetection of respiratory diseases. More recently, these methods have been superseded by deep learning techniques due to their superior classification results. One such deep learning technique is the convolutional neural network (CNN) [17]. The CNN was originally developed for image classification and has the ability to learn distinguishing image characteristics directly from the raw image through various mathematical operations. In audio signal classification tasks, this arrangement is typically realized by transforming the signal into an image-like representation [21,32]. Time-frequency representation of the audio signal is the most common approach for this purpose, such as the conventional spectrogram formed using the short-time Fourier transform (STFT).
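For illustration, a minimal CNN of this kind in Keras might look as follows; the layer sizes and input shape are illustrative assumptions and do not reproduce the architecture used in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(input_shape=(64, 128, 1)):
    # Two conv/pool stages over a (bands, frames, 1) time-frequency "image",
    # followed by a softmax over the two classes.
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(2, activation="softmax"),  # pertussis vs. non-pertussis
    ])
```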

An overview of the proposed approach is given in Fig. 1. We take inspiration from conventional feature extraction techniques and state-of-the-art CNNs for detecting pertussis using respiratory sounds. In particular, we represent the one-dimensional respiratory sound signals as two-dimensional time-frequency representations for classification using CNNs. Our approach to forming the time-frequency representations is based on the feature extraction techniques from [16,29,30]. In particular, we use mel-filters, as used in computing MFCCs, to form the mel-spectrogram; the wavelet transform, as used in computing the WF, to form the wavelet scalogram; and gammatone filters, as used in computing the CIF, to form the cochleagram.
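As an example of one of the three representations, a log mel-spectrogram can be formed as sketched below; the path, sampling rate, mel-band count, and frame settings are illustrative assumptions.

```python
import librosa

y, sr = librosa.load("cough_event.wav", sr=16000)  # placeholder path
S = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=64,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr)
)
log_S = librosa.power_to_db(S)  # log-scaled (n_mels, n_frames) "image" for the CNN
```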

Furthermore, different time-frequency representations reveal spectral characteristics at different frequencies. In conventional machine learning, this information is combined, for example, using feature vector concatenation, to improve classification performance. With CNNs, this can be achieved using late fusion, whereby the outputs of CNN models trained on different representations are combined. This can be realized either by averaging the output scores [39] or by using the output scores to train a secondary classifier [41]. In this work, we use late fusion to combine the CNN learning from the different time-frequency representations, aiming to make more accurate predictions.
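A minimal sketch of the score-averaging variant of late fusion is given below, assuming each of the three representation-specific CNNs outputs per-class softmax scores of the same shape.

```python
import numpy as np

def late_fusion_average(scores_mel, scores_scal, scores_coch):
    # Average the per-class softmax scores of the three CNNs; training a
    # secondary classifier on the stacked scores is the alternative above.
    return (scores_mel + scores_scal + scores_coch) / 3.0

# Example use: predicted class per sample from the fused scores.
# prediction = np.argmax(late_fusion_average(p_mel, p_scal, p_coch), axis=1)
```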

The proposed approach is evaluated on a dataset of respiratory sounds from children with suspected or confirmed pertussis and other respiratory diseases. Collecting physiological data is time consuming, expensive, and may require patient cooperation, which can be difficult with children. However, the rapid rise in the use of digital technology has prompted researchers to collect self-reported data from the public. In a similar study [29], researchers composed a dataset of respiratory diseases using online sources, while researchers at Microsoft used web search queries of users with self-identified conditions [36]. More recently, researchers at the University of Cambridge collected COVID-19-related sounds from users with self-reported disease status through a website and a smartphone application. In this work, we use a dataset of respiratory sounds collated from the YouTube online video sharing platform and reviewed by a clinician.

In total, the dataset contains 42 recordings, each with multiple respiratory sounds. This makes it a relatively small dataset, and CNN models trained on small datasets can be prone to overfitting. One method to reduce overfitting is mixup [40], which augments the dataset by mixing the features and labels of pairs of training samples, possibly from different classes. It is a simple yet effective method with very low computational cost. In this work, we extend the mixup data augmentation technique to time-frequency representations of respiratory sounds.
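A minimal sketch of mixup on time-frequency images, following the formulation of [40], is given below; the alpha value is a common default rather than the paper's reported setting.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Convex combination of two time-frequency images and their one-hot
    # labels; lambda is drawn from a Beta(alpha, alpha) distribution.
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```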

The rest of the paper is organized as follows. An overview of the dataset and the proposed method is given in Section 2. The experimental setup and results are provided in Section 3 and discussion of the results and conclusions are in Section 4.

Section snippets

Dataset

The dataset used in this work was collated from YouTube. Various search terms were used to identify respiratory sound recordings from children with the following respiratory conditions: pertussis, asthma, bronchiolitis, croup, and pneumonia. The diagnosis of pertussis and other respiratory conditions in the videos was attributed based on the information provided in the title and/or description of the videos and later checked by a clinician to assess the plausibility of the sounds and the reported …

Experimental setup

In this work, we use stratified 7-fold cross-validation, which we found to give a good compromise between the number of training and validation samples in each fold. As such, in each fold, 3 pertussis and 3 non-pertussis recordings are used for validating the model and the remaining recordings are used for training the model. The respiratory sounds from a recording/subject are present either in the training or the validation dataset, but not in both.

The number of respiratory sounds per recording …
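A subject-independent, recording-level split of this kind could be sketched with scikit-learn as follows; the label vector is illustrative (3 + 3 recordings per validation fold over 7 folds implies a 21/21 pertussis/non-pertussis split), and the assignment of sound events to recordings is left schematic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# One label per recording: 1 = pertussis, 0 = non-pertussis (illustrative).
recording_ids = np.arange(42)
recording_labels = np.array([1] * 21 + [0] * 21)

skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(recording_ids.reshape(-1, 1), recording_labels):
    train_recs, val_recs = recording_ids[train_idx], recording_ids[val_idx]
    # Every respiratory sound event inherits the split of its parent
    # recording, so no subject appears in both training and validation.
    print("validation recordings:", val_recs)
```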

Discussion and conclusions

The dataset used in this work was recorded in natural environments with SNR as low as 16 dB. The recordings are believed to have been made using smartphones of different manufacturers and models, and the training and validation procedure followed in this work is subject independent. All of these factors increase the difficulty and complexity of the task. Despite these constraints, our method is empirically shown to achieve strong classification performance at the cough level and, particularly, the subject level. In …

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

CRediT authorship contribution statement

Roneel V. Sharan: Conceptualization, Data curation, Methodology, Software, Investigation, Visualization, Writing - original draft, Writing - review & editing. Shlomo Berkovsky: Supervision, Writing - review & editing. David Fraile Navarro: Data curation, Writing - review & editing. Hao Xiong: Data curation, Writing - review & editing. Adam Jaffe: Writing - review & editing.

Declaration of Competing Interest

The authors report no declarations of interest.

References (41)

  • C. Cortes et al.

    Support-vector networks

    Mach. Learn.

    (1995)
  • J.S. Cramer

    The Origins of Logistic Regression

    Discussion Paper 2002-119/4

    (2002)
  • S. Davis et al.

    Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

    IEEE Trans. Acoustics Speech Signal Process.

    (1980)
  • D.D. Greenwood

    A cochlear frequency-position function for several species - 29 years later

    J. Acoust. Soc. Am.

    (1990)
  • S. Ioffe et al.

    Batch normalization: accelerating deep network training by reducing internal covariate shift

    arXiv preprint arXiv:1502.03167

    (2015)
  • A.K. Jain

    Fundamentals of Digital Image Processing

    (1989)
  • K. Jarrett et al.

    What is the best multi-stage architecture for object recognition?

    IEEE International Conference on Computer Vision (ICCV)

    (2009)

  • D.P. Kingma et al.

    Adam: a method for stochastic optimization

    arXiv preprint arXiv:1412.6980

    (2014)
  • K. Kosasih et al.

    Wavelet augmented cough analysis for rapid childhood pneumonia diagnosis

    IEEE Trans. Biomed. Eng.

    (2015)
  • A. Krizhevsky et al.

ImageNet classification with deep convolutional neural networks

    Advances in Neural Information Processing Systems (NIPS)

    (2012)