Abstract
The boost in speech technologies that we have witnessed over the last decade has allowed us to go from a state of the art in which correctly recognizing strings of words was a major target, to one in which we aim well beyond words. We aim to extract meaning, but also all the other cues that are conveyed by the speech signal. In fact, we can estimate bio-relevant traits such as height, weight, gender, age, and physical and mental health. We can also estimate language, accent, emotional and personality traits, and even environmental cues. This wealth of information, which one can now extract with recent advances in machine learning, has motivated an exponentially growing number of speech-based applications that go well beyond the transcription of what a speaker says. In particular, it has motivated many health-related applications, namely those aiming at the non-invasive diagnosis and monitoring of diseases that affect speech.
Most of the recent work on speech-based diagnosis tools addresses the extraction of features and/or the development of sophisticated machine learning classifiers [5, 7, 12, 13, 14, 17]. The results have shown remarkable progress, boosted by several joint paralinguistic challenges, but most are obtained from limited training data acquired in controlled conditions.
This talk covers two emerging concerns related to this growing trend. One is the collection of large in-the-wild datasets and the effects of this extended, uncontrolled collection on the results [4]. The other is how diagnosis may be done without compromising patient privacy [18].
As a proof of concept, we will discuss these two aspects and show our results for two target diseases, Depression and Cold, a selection motivated by the availability of corresponding lab datasets distributed in paralinguistic challenges. These lab datasets allowed us to build a baseline system for each disease, using a simple neural network trained with common features that have not been optimized for either disease. Given the modular architecture adopted, each component of the system can be individually improved at a later stage, although the limited amount of data does not, for now, justify deeper networks.
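For concreteness, the following is a minimal sketch of such a baseline in PyTorch. The feature dimensionality (here, the size of a ComParE-style functional feature set) and the hidden layer size are illustrative assumptions, not the actual configuration used in our systems.

```python
import torch
import torch.nn as nn

# Minimal sketch of a baseline classifier over utterance-level features.
# The dimensions below are illustrative assumptions only.
class BaselineClassifier(nn.Module):
    def __init__(self, n_features=6373, n_hidden=128, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = BaselineClassifier()
logits = model(torch.randn(4, 6373))  # a batch of 4 feature vectors
```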
Our mining effort has focused on video blogs (vlogs) featuring a single speaker who, at some point, states that he/she is currently affected by a given disease. Retrieving vlogs with the target disease involves not only a simple query (e.g. "depression vlog"), but also a post-filtering stage to exclude videos that do not correspond to our target of first-person, present-tense accounts (lectures, in particular, are relatively frequent). This filtering stage combines multimodal features automatically extracted from the video and its metadata, using mostly off-the-shelf tools; a sketch of such a filter follows.
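As a hedged illustration of how such a post-filter might combine per-video cues: the field names, thresholds, and decision rule below are hypothetical placeholders for the outputs of off-the-shelf extractors, not the actual filter.

```python
# Hypothetical post-filter: keep vlogs that look like first-person,
# present-tense accounts by a single on-screen speaker.
# All fields and thresholds here are illustrative assumptions.
def keep_video(video: dict) -> bool:
    if video["num_faces_detected"] != 1:   # single-speaker vlogs only
        return False
    if video["duration_s"] > 30 * 60:      # lectures tend to run long
        return False
    transcript = video["asr_transcript"].lower()
    # Crude first-person, present-tense cue from the ASR transcript.
    return any(p in transcript for p in ("i have", "i've got", "i am"))
```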
We collected a large dataset for each target disease from YouTube, and manually labelled a small subset, which we named the in-the-Wild Speech Medical (WSM) corpus. Although our mining efforts made use of relatively simple techniques, based mostly on existing toolkits, they proved effective. The best performing models achieved a precision of 88% and 93%, and a recall of 97% and 72%, for the Cold and Depression datasets, respectively, in the task of filtering videos containing these speech-affecting diseases.
We compared the performance of our baseline neural network classifiers, trained with data collected in controlled conditions, in tests with corresponding in-the-wild data. For the Cold datasets, the baseline neural network achieved an Unweighted Average Recall (UAR) of 66.9% for the controlled dataset, and 53.1% for the manually labelled subset of the WSM corpus. For the Depression datasets, the corresponding values were 60.6% and 54.8%, respectively (at interview level, the UAR increased to 61.9% for the vlog corpus). The performance degradation that we had anticipated for in-the-wild data may be due to greater variability both in recording conditions (e.g. microphone, noise) and in the effects of speech-altering diseases on the subjects' speech. Our current work with vlog datasets attempts to estimate the quality of the predicted labels of a very large set in an unsupervised way, using noisy models.
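For reference, the UAR used throughout is the unweighted mean of the per-class recalls, which makes it robust to the class imbalance typical of these datasets. A minimal sketch:

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, with every class weighted equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# A majority-class predictor on an imbalanced test set scores only 0.5:
print(unweighted_average_recall([0, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```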
The second aspect we addressed was patient privacy. Privacy is an emerging concern among users of voice-activated digital assistants, sparked by the awareness of devices that must always be in listening mode. Despite this growing concern, the potential for misuse of health-related cues in speech has not yet been fully recognized. This is the motivation for adopting secure computation frameworks, in which cryptographic techniques are combined with state-of-the-art machine learning algorithms. Privacy in speech processing is an interdisciplinary topic, which was first applied to speaker verification, using Secure Multi-Party Computation and Secure Modular Hashing techniques [1, 15], and later to speech emotion recognition, also using hashing techniques [6]. The most recent efforts on privacy-preserving speech processing have followed the progress in secure machine learning, combining neural networks and Fully Homomorphic Encryption (FHE) [3, 8, 9].
In this work, we applied an encrypted neural network, following the FHE paradigm, to the problem of secure detection of pathological speech. This was done by developing an encrypted version of a neural network trained with unencrypted data, in order to produce encrypted predictions of health-related labels. As a proof of concept, we used the same two target diseases mentioned above, and compared the performance of the simple neural network classifiers with their encrypted counterparts on datasets collected in controlled conditions. For the Cold dataset, the baseline neural network achieved a UAR of 66.9%, whereas the encrypted network achieved 66.7%. For the Depression dataset, the baseline value was 60.6%, whereas the encrypted network achieved 60.2% (67.9% at interview level). The negligible difference in results shows the validity of our secure approach.
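To make the recipe concrete, the sketch below follows the CryptoNets-style approach [8]: train in the clear, then evaluate only FHE-friendly operations (linear layers plus a low-degree polynomial activation, here x²) on encrypted inputs. The TenSEAL library, the CKKS parameters, and the tiny layer sizes are illustrative assumptions for this summary, not the toolchain actually used.

```python
import numpy as np
import tenseal as ts  # assumed FHE library, for illustration only

# Illustrative CKKS parameters; a real deployment would tune these.
context = ts.context(ts.SCHEME_TYPE.CKKS,
                     poly_modulus_degree=16384,
                     coeff_mod_bit_sizes=[60, 40, 40, 40, 60])
context.global_scale = 2 ** 40
context.generate_galois_keys()  # needed for encrypted vector-matrix products

# Plaintext weights of a tiny network trained in the clear
# (the dimensions are illustrative, not those of the actual system).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 2)), rng.normal(size=2)

features = rng.normal(size=16)                      # client-side features
enc_x = ts.ckks_vector(context, features.tolist())  # encrypted by the client

# Server-side encrypted inference: linear -> square -> linear.
enc_h = enc_x.mm(W1.tolist()) + b1.tolist()
enc_h.square_()                                     # FHE-friendly activation
enc_out = enc_h.mm(W2.tolist()) + b2.tolist()

print(enc_out.decrypt())  # only the key holder can decrypt the scores
```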
This approach relies on the computation of features on the client side before encryption, with only the inference stage being computed in an encrypted setting. Ideally, an end-to-end approach would overcome this limitation, but combining convolutional neural networks with FHE imposes severe limitations on their size. Likewise, the use of recurrent layers such as LSTMs (Long Short-Term Memory) requires a number of operations too large for current FHE frameworks, making them computationally infeasible as well.
FHE schemes, by construction, only work with integers, whilst neural networks work with real numbers. By using encoding methods to convert real weights to integers, we forgo the FHE batching techniques that would allow us to compute several predictions at the same time on the same encrypted value. Recent advances in machine learning have pushed towards the “quantization” and “discretization” of neural networks, so that models occupy less space and operations consume less power. Some works have already implemented these techniques using homomorphic encryption, such as Binarized Neural Networks [10, 11, 16] and Discretized Neural Networks [2]. The talk will also cover our recent efforts in applying this type of approach to the detection of health-related cues in speech signals, while discretizing the network and maximizing the throughput of its encrypted counterpart.
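A minimal sketch of the fixed-point encoding in question: scale real weights by a constant, round to integers, and track the accumulated scale so that results can be rescaled after (homomorphic) integer arithmetic. The scale factor is an illustrative choice; too few bits costs accuracy, while too many exhaust the plaintext modulus.

```python
import numpy as np

SCALE = 2 ** 10  # illustrative fixed-point scale (10 fractional bits)

def encode(w: np.ndarray) -> np.ndarray:
    """Encode real values as integers, as integer-only FHE schemes require."""
    return np.round(w * SCALE).astype(np.int64)

def decode(x: np.ndarray, depth: int) -> np.ndarray:
    """Undo the scaling; every multiplication also multiplies the scales."""
    return x / (SCALE ** depth)

w = np.array([0.731, -0.052, 1.204])   # real-valued weights
x = np.array([0.500, -1.000, 0.250])   # real-valued inputs
prod = encode(w) * encode(x)           # what the scheme computes on integers
print(decode(prod, depth=2))           # ≈ w * x, up to rounding error
```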
Beyond presenting our recent work on these two aspects of speech analysis for medical applications, this talk intends to point out directions for future work on two relatively unexplored topics that are by no means exhausted in this summary.
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references UID/CEC/50021/2013, and SFRH/BD/103402/2014.
References
Boufounos, P., Rane, S.: Secure binary embeddings for privacy preserving nearest neighbors. In: International Workshop on Information Forensics and Security (WIFS) (2011)
Bourse, F., Minelli, M., Minihold, M., Paillier, P.: Fast homomorphic evaluation of deep discretized neural networks. IACR Cryptology ePrint Archive 2017, 1114 (2017)
Chabanne, H., de Wargny, A., Milgram, J., Morel, C., et al.: Privacy-preserving classification on deep neural network. IACR Cryptology ePrint Archive 2017, 35 (2017)
Correia, J., Raj, B., Trancoso, I., Teixeira, F.: Mining multimodal repositories for speech affecting diseases. In: INTERSPEECH (2018)
Cummins, N., Scherer, S., Krajewski, J., Schnieder, S., Epps, J., Quatieri, T.F.: A review of depression and suicide risk assessment using speech analysis. Speech Commun. 71, 10–49 (2015)
Dias, M., Abad, A., Trancoso, I.: Exploring hashing and cryptonet based approaches for privacy-preserving speech emotion recognition. In: ICASSP. IEEE (2018)
Dibazar, A.A., Narayanan, S., Berger, T.W.: Feature analysis for automatic detection of pathological speech. In: 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society EMBS/BMES Conference, vol. 1, pp. 182–183. IEEE (2002)
Gilad-Bachrach, R., Dowlin, N., Laine, K., et al.: CryptoNets: applying neural networks to encrypted data with high throughput and accuracy. In: ICML, JMLR Workshop and Conference Proceedings, vol. 48, pp. 201–210 (2016)
Hesamifard, E., Takabi, H., Ghasemi, M.: CryptoDL: deep neural networks over encrypted data. CoRR abs/1711.05189 (2017)
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 4107–4115. Curran Associates, Inc., New York (2016)
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 187:1–187:30 (2017)
López-de-Ipiña, K., et al.: On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature. Cogn. Comput. 7(1), 44–55 (2015)
López-de-Ipiña, K., et al.: On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors 13(5), 6730–6745 (2013)
Orozco-Arroyave, J.R., et al.: Characterization methods for the detection of multiple voice disorders: neurological, functional, and laryngeal diseases. IEEE J. Biomed. Health Inform. 19(6), 1820–1828 (2015)
Pathak, M.A., Raj, B.: Privacy-preserving speaker verification and identification using Gaussian mixture models. IEEE Trans. Audio Speech Lang. Process. 21(2), 397–406 (2013)
Sanyal, A., Kusner, M.J., Gascón, A., Kanade, V.: TAPAS: tricks to accelerate (encrypted) prediction as a service. CoRR abs/1806.03461 (2018)
Schuller, B., et al.: The INTERSPEECH 2017 computational paralinguistics challenge: addressee, cold & snoring. In: INTERSPEECH (2017)
Teixeira, F., Abad, A., Trancoso, I.: Patient privacy in paralinguistic tasks. In: INTERSPEECH (2018)