On artificial bandwidth extension of telephone speech
Introduction
The limited acoustic bandwidth of today's public telephone networks originates from the former analogue transmission techniques. The limitation to a frequency range of about 0.3– causes the typical sound of the narrowband telephone speech. In the transition to digital transmission, the upper frequency limit of has been retained (passband up to , sampling frequency , whereas the lower frequency limit may be somewhat below [14].
Listening experiments have shown that the acoustic bandwidth of speech signals contributes significantly to the perceived speech quality [21], [39], which is measured in terms of the mean opinion score (MOS). In comparison to telephone speech, typical wideband speech with a frequency range of – yields a considerable gain of up to about 1.3 MOS points.
Although the sentence intelligibility of clean telephone speech is about 99%, the intelligibility of meaningless syllables is only about 90%. As a result, we sometimes need a spelling alphabet to communicate words that cannot be inferred from the context, such as unknown names. Improving syllable intelligibility makes communication more comfortable and less strenuous in many cases, i.e., the listening effort is reduced.
True digital wideband speech communication can be achieved by redesigning the transmission link, i.e., by introducing new speech codecs on both sides of the link. Several wideband speech coding schemes have in fact been developed for the increased acoustic bandwidth –. The G.722 codec was standardized for teleconferencing and ISDN telephony as early as the 1980s [15], but it has not yet been widely adopted in ISDN. More recently, the adaptive multi-rate wideband (AMR-WB) speech codec was developed and standardized for mobile radio systems such as GSM and UMTS [11]. A gradual introduction of wideband terminals can therefore be expected. For a long transitional period, however, mixed telephone networks with both narrowband and wideband terminals will exist for economic reasons.
One approach to enhancing the perceived acoustic bandwidth, based only on the information in the available narrowband speech, is artificial bandwidth extension (BWE) [4], [5], [6], [7], [29], [36] at the receiving end. The problem of BWE is illustrated in Fig. 1: the original wideband (wb) signal s_wb is band-pass filtered prior to analogue-to-digital conversion and transmission over the telephone network. At the receiving terminal only the narrowband (nb) signal s_nb is available. Artificial bandwidth extension produces an estimate of the wideband speech by adding artificial low- and/or high-frequency signal components. Although true wideband speech quality cannot be obtained by artificial bandwidth extension, BWE represents a very attractive enhancement of any receiving wideband terminal as long as there are sending narrowband terminals in the network. This paper addresses the bandwidth extension of speech signals towards higher frequencies. The high-frequency band is called the extension band (eb) in the following.
The following conventions are used to denote quantities: capital bold letters refer to matrices, e.g., A, bold letters refer to vectors, e.g., a, and scalars are not bold, e.g., a. Estimated quantities are labeled with a tilde, e.g., , quantized variables are marked by a hat, e.g., , and mean values are labeled by a bar, e.g., .
Bandwidth extension algorithm
The key point of the bandwidth extension algorithm is to exploit implicit redundancy of the speech production process as proposed in the pioneering approaches [4], [5], [16]. The linear source-filter model of speech, widely used in speech coding and recognition, consists of an auto-regressive (AR) filter (corresponding to the vocal tract) and a source producing a spectrally flat excitation (cf. Fig. 2). According to this model the algorithm for bandwidth extension is divided into two tasks,
Extension of the excitation signal
According to the simplifying linear model of speech production the excitation signal u(k) is spectrally flat: for voiced sounds it contains sinusoids at multiples of the fundamental (pitch) frequency of the speech segment where all harmonics have almost the same amplitude; during unvoiced sounds the excitation is more or less white noise.
Due to these properties the missing high-frequency components of the excitation signal can be produced by modulation, i.e., by a frequency shift by [4], [10]
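The modulation step can be illustrated with a small numerical sketch. It is not the authors' implementation: the function name `modulate`, the 16 kHz sampling rate, the 4 kHz shift and the single-harmonic test signal are all assumptions chosen for illustration, since the shift frequency is not given in the text above. Multiplying by a cosine translates every spectral component both upwards and downwards by the modulation frequency; any subsequent filtering that keeps only the extension band is not shown.

```python
import numpy as np

def modulate(u_nb, f_shift_hz, fs_hz):
    """Shift the spectrum of u_nb by +/- f_shift_hz via cosine modulation."""
    k = np.arange(len(u_nb))
    return 2.0 * u_nb * np.cos(2.0 * np.pi * f_shift_hz / fs_hz * k)

# Illustrative values only (assumptions, not taken from the paper).
fs = 16000.0          # sampling rate of the extended signal in Hz
f_shift = 4000.0      # modulation frequency in Hz
k = np.arange(1024)

# Toy "excitation": a single harmonic at 1 kHz.
u = np.cos(2.0 * np.pi * 1000.0 / fs * k)
u_shift = modulate(u, f_shift, fs)

# Locate the two strongest spectral lines of the modulated signal.
spectrum = np.abs(np.fft.rfft(u_shift))
freqs = np.fft.rfftfreq(len(u_shift), d=1.0 / fs)
peaks = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(peaks)  # the 1 kHz line reappears at f_shift - 1 kHz and f_shift + 1 kHz
```

Because the modulated signal contains copies of the narrowband harmonics, a spectrally flat narrowband excitation yields an approximately flat excitation in the shifted band as well, which is the property the paragraph above relies on.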
Extension of the spectral envelope
The procedure of estimating the wideband spectral envelope, i.e. the AR coefficient set , is related to pattern recognition techniques. We use true wideband speech signals in a training phase and narrowband signals during the application phase.
In our algorithm the estimated wideband spectral envelope is utilized both in the analysis filter to estimate the narrowband excitation signal and in the synthesis filter for spectral shaping of the extended excitation signal. Hence, the
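The role of using one coefficient set in both filters can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed estimator: the AR coefficients are computed directly from a toy signal by the autocorrelation method with the Levinson-Durbin recursion, standing in for the estimated wideband envelope, and the helper names `lpc`, `analysis_filter` and `synthesis_filter` as well as the model order are hypothetical. The sketch shows that an analysis filter A(z) followed by a synthesis filter 1/A(z) with the same coefficients reproduces the input signal.

```python
import numpy as np

def lpc(x, order):
    """AR coefficients [1, a1, ..., ap] via autocorrelation + Levinson-Durbin."""
    r = np.array([np.dot(x[:len(x) - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        refl = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + refl * prev[i - j]
        a[i] = refl
        err *= 1.0 - refl * refl
    return a

def analysis_filter(x, a):
    """FIR filter A(z): returns the prediction residual (excitation estimate)."""
    return np.convolve(x, a)[:len(x)]

def synthesis_filter(e, a):
    """All-pole filter 1/A(z): shapes an excitation with the AR envelope."""
    p = len(a) - 1
    y = np.zeros(len(e))
    for n in range(len(e)):
        past = sum(a[j] * y[n - j] for j in range(1, min(p, n) + 1))
        y[n] = e[n] - past
    return y

# Toy signal (assumption): white noise shaped by a stable second-order AR filter.
rng = np.random.default_rng(0)
x = synthesis_filter(rng.standard_normal(400), np.array([1.0, -1.3, 0.8]))

a = lpc(x, order=2)            # stand-in for the estimated envelope
e = analysis_filter(x, a)      # excitation estimate
y = synthesis_filter(e, a)     # re-synthesis with the same coefficients
print(np.max(np.abs(x - y)))   # reconstruction error, close to machine precision
```

Since the two filters are exact inverses of each other, any error in the coefficient set cancels for the unmodified signal path; only components added to the excitation between the two filters are shaped by the envelope.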
Performance evaluation
Different modeling and estimation methods have been evaluated both by instrumental performance measures and by informal listening tests. Starting from typical “telephone speech” with frequency components between and , the extension of high-frequency components above was investigated.
The statistical model was trained with diverse parameterizations and a 15-dimensional composite feature vector x (see Section 4.3). The complexity of the HMM was varied between N_S = 2, …, 64 states,
Discussion
In this paper an algorithm for artificial bandwidth extension has been proposed that is based on a linear source-filter model of the speech signal. According to the two-stage structure of the source-filter model, the bandwidth extension algorithm is divided into two sub-systems that are mutually independent to a large extent [4]. The BWE algorithm proposed in the paper inherently guarantees transparency of the system with respect to the narrowband input signal.
The principal part of the
Acknowledgements
The authors would like to thank Siemens AG, Mobile Phones, for supporting this project and for providing access to the BAS SI100 speech corpus.
References (39)
- et al., Acoustic–phonetic features for the automatic classification of fricatives, J. Acoust. Soc. Amer. (May 2001)
- C. Avendano, H. Hermansky, E.A. Wan, Beyond Nyquist: towards the recovery of broad-bandwidth speech from...
- et al., Optimal decoding of linear codes for minimizing symbol error rate, IEEE Trans. Inform. Theory (March 1974)
- H. Carl, Untersuchung verschiedener Methoden der Sprachkodierung und eine Anwendung zur Bandbreitenvergrößerung von...
- H. Carl, U. Heute, Bandwidth enhancement of narrow-band speech signals, in: Proceedings of the EUSIPCO, Vol. 2,...
- et al., Statistical recovery of wideband speech from narrowband speech, IEEE Trans. Speech Audio Process. (October 1994)
- M.G. Croll, Sound-quality improvement of broadcast telephone calls, Technical Report 1972/26, The British Broadcasting...
- J. Epps, W.H. Holmes, A new technique for wideband enhancement of coded narrowband speech, in: Proceedings of the IEEE...
- T. Fingscheidt, Softbit-Sprachdecodierung in digitalen Mobilfunksystemen, Ph.D. Thesis, Aachen University (RWTH); P....
- et al., Techniques for the regeneration of wideband speech from narrowband speech, EURASIP J. Appl. Signal Process. (December 2001)