ABSTRACT
Network speakers are increasingly installed in public spaces such as train stations, schools, and hospitals, where they are used for announcements and background music. Network performance can degrade the quality of announcement speech heard from a network speaker. This study proposes a deep neural network method for live monitoring of speaker output quality as perceived by occupants of the public space. A single-ended speech quality assessment method is used because, in this application, no reference speech is available for comparison. The network node (end point) of a network speaker usually has limited memory and computing resources. Therefore, a compact deep neural network architecture and post-training quantization were examined as compression techniques to save memory and accelerate computation. PESQ, an end-to-end assessment method, served as the baseline for comparing two single-ended methods: the proposed method and ITU-T P.563. The Pearson correlation coefficient of the estimated mean opinion score with the baseline was 0.710 for the proposed method and 0.40 for P.563, and the mean squared error was 0.154 and 0.319, respectively. The proposed method thus outperformed the ITU-T recommended P.563 method.
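The two evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the sample score lists are hypothetical, and it assumes each utterance has one predicted mean opinion score and one PESQ baseline score.

```python
import math


def pearson_correlation(predicted, baseline):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(predicted)
    if n != len(baseline) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_p = sum(predicted) / n
    mean_b = sum(baseline) / n
    cov = sum((p - mean_p) * (b - mean_b) for p, b in zip(predicted, baseline))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_b = sum((b - mean_b) ** 2 for b in baseline)
    if var_p == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_b)


def mean_squared_error(predicted, baseline):
    """Average squared difference between predicted and baseline scores."""
    if len(predicted) != len(baseline) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum((p - b) ** 2 for p, b in zip(predicted, baseline)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical scores on the 1-5 MOS scale, for illustration only.
    pesq_scores = [1.8, 2.5, 3.1, 3.9, 4.2]
    model_scores = [2.0, 2.4, 3.3, 3.6, 4.4]
    print("Pearson r:", pearson_correlation(model_scores, pesq_scores))
    print("MSE:", mean_squared_error(model_scores, pesq_scores))
```

A higher correlation and a lower mean squared error against the PESQ baseline both indicate closer agreement, which is the sense in which the abstract ranks the proposed method above P.563.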