Non-intrusive single-ended speech quality assessment in VoIP
Introduction
Voice over Internet protocol (VoIP) is a promising technology that is expected to replace the traditional public switched telephone network (PSTN) in the next few years (Beritelli et al., 2002, Ilk and Güler, 2006). Although VoIP is efficient, its speech quality is still lower than what telephone users are accustomed to, owing to various new impairments introduced by the Internet and IP terminals. Many techniques have been implemented to enhance speech quality in VoIP, leading to a large number of services offered at different levels of price and quality. Assessing VoIP speech quality is an area of intense research interest (Gierlich and Kettler, 2006); it is also essential for network design and optimization, as network operators need to maintain certain levels of service quality by monitoring live calls and taking corrective measures whenever necessary.
Speech quality is inherently subjective, as it is determined by the listener’s perception. Therefore, the most reliable approach for assessing speech quality is through subjective tests. Defined in (ITU-T Rec. P.800, 1996), the absolute category rating (ACR) test is one of the widely accepted norms for subjective speech quality rating. In the test, listeners express their opinions on the quality of the speech materials in terms of five categories: excellent, good, fair, poor and bad, with a corresponding integer score: 5, 4, 3, 2, and 1, respectively. The ratings are averaged and the result is usually known as mean opinion score (MOS). Subjective testing is time-consuming and expensive. It is quite complicated; the design of the test is strongly influenced by both human subjects and elaborate testing settings. Moreover, it provides little information on the causes of speech quality degradation from technical aspects. In general, subjective testing is impractical for automated, frequent testing purposes, such as routine network monitoring.
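As a minimal illustration of the ACR scoring described above, the sketch below averages category ratings into a MOS. The listener ratings and the function name `mean_opinion_score` are hypothetical and serve only to show the arithmetic; a real P.800 test involves controlled listening conditions that are not modeled here.

```python
from statistics import mean

# ACR category labels and their integer scores, as listed in the text.
ACR_SCORES = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}


def mean_opinion_score(ratings):
    """Average a list of ACR category labels into a single MOS value."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ACR_SCORES[label] for label in ratings)


# Hypothetical ratings from five listeners for one speech sample.
sample_ratings = ["good", "fair", "good", "excellent", "poor"]
mos = mean_opinion_score(sample_ratings)  # (4 + 3 + 4 + 5 + 2) / 5
```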
In recent years, objective methods have been developed to produce good estimates of subjective MOS. They are machine-executable and require little human involvement. Objective methods can be classified into two categories (Möller and Raake, 2002), intrusive and non-intrusive, depending on whether a reference speech signal is needed.
In intrusive methods, MOS is measured by comparing a reference signal with the degraded one, which is the output of the system under test. The most widely used algorithms include perceptual analysis/measurement system (PAMS, Rix and Hollier, 2000) and perceptual evaluation of speech quality (PESQ, ITU-T Rec. P.862, 2001). Intrusive methods can achieve relatively accurate MOS estimates. However, they are not suitable for on-line, live call quality monitoring purposes, as in this case the reference speech is unavailable.
On the other hand, non-intrusive methods (sometimes called single-ended or output-based methods) utilize the degraded speech signal only or rely on statistics collected from the network, without having to remove the test channel from service and inject test calls. Although promising, this technique is quite challenging in that the original signal is unknown. Speech quality can be estimated through voice payload analysis (El-Hennawey and Lee, 2004, Falk and Chan, 2006, Gray et al., 2000, Kim, 2005, ITU-T, 2004), through Internet protocol analysis, as in PsyVoIP (Broom, 2003), or from a transmission rating model such as the E-model (ITU-T Rec. G.107, 2005). The embedded voice quality estimation module (EVQEM, El-Hennawey and Lee, 2004) utilizes a small portion of the bandwidth available during silences to transmit a reference signal and then uses an intrusive algorithm to measure the speech quality. In (Falk and Chan, 2006), the difference between feature vectors of degraded speech and those of artificial references, which are constructed from a high-quality, clean speech database, is measured and mapped to MOS. In (Gray et al., 2000), auditory non-intrusive quality estimation (Kim, 2005), and (ITU-T, 2004), vocal tract models or auditory models are developed. For the Internet protocol analysis approach, PsyVoIP extracts statistical descriptors of a call and maps them to MOS. In comparison, the E-model is not a measurement tool, but rather a computational model that covers a wide range of parameters affecting conversational quality in narrowband telephone networks. The E-model is primarily used for transmission planning purposes.
In short, the voice payload analysis models mentioned above are listening-only models; they do not consider factors affecting conversational quality, such as echo and delay. On the other hand, the protocol analysis models hardly cover impairments directly linked to the speech signal itself, such as temporal clipping and noise. These two approaches complement each other; it is advantageous to combine their merits in a new non-intrusive model, which is further substantiated by the E-model for listening-only or conversational quality.
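For orientation, the E-model summarizes all impairments in a single transmission rating factor R, which ITU-T Rec. G.107 converts into an estimated MOS with a fixed cubic mapping. The sketch below implements only that published R-to-MOS conversion; the function name `r_to_mos` is a hypothetical label, and the computation of R itself from network and terminal parameters is not shown.

```python
def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (ITU-T Rec. G.107)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    # Cubic mapping defined in G.107 for 0 < R < 100.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# 93.2 is the G.107 default rating for an unimpaired narrowband connection.
default_mos = r_to_mos(93.2)
```

Because the mapping is nonlinear, equal reductions in R do not translate into equal reductions in MOS.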
In this paper, a parametric, non-intrusive VoIP speech quality assessment algorithm is proposed, based on the structures developed by El-Hennawey et al. (2006). It adopts a three-step strategy. First, a particular impairment is detected; then, its effects on speech quality are quantified; finally, an overall assessment model is developed. The algorithm covers several major speech quality impairments in VoIP. It relies on the voice payload to analyze the occurrences of temporal clipping, echo and noise. A packet loss profile is mainly derived by exploiting Internet protocols; the voice payload is also processed to help the silence/unvoiced/voiced (S/U/V) classification of the lost packets.
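One way to picture deriving a packet loss profile from protocol information is sketched below. It is not the paper's algorithm: it assumes, purely for illustration, that packets carry 16-bit RTP-style sequence numbers, that they are observed in order without duplicates, and that a gap between consecutive numbers marks a loss burst. The function name `loss_profile` is hypothetical.

```python
def loss_profile(received_seq, modulus=1 << 16):
    """Return (loss_rate, burst_lengths) from in-order received sequence numbers.

    Assumes 16-bit sequence numbers that wrap around, no reordering,
    and no duplicates (illustrative assumptions, not from the paper).
    """
    bursts = []
    for prev, cur in zip(received_seq, received_seq[1:]):
        # Number of packets missing between two consecutively received packets.
        gap = (cur - prev) % modulus - 1
        if gap > 0:
            bursts.append(gap)
    lost = sum(bursts)
    total = len(received_seq) + lost
    loss_rate = lost / total if total else 0.0
    return loss_rate, bursts


# Hypothetical trace: packets 65535 and 0 are lost across the wrap-around,
# and packets 3 and 4 are lost later.
rate, bursts = loss_profile([65533, 65534, 1, 2, 5])
```

The burst lengths matter as much as the overall rate, since consecutive losses are generally harder to conceal than isolated ones.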
This paper mainly focuses on modeling the effects of packet loss and the combined listening-only quality resulting from packet loss, temporal clipping and noise. Algorithms for deriving the packet loss profile, classifying the type of lost packets, and detecting noise are also presented. The proposed listening-only algorithm can be further extended into a conversational model by using the E-model to account for the effect of echo, where echo detection and parameter measurement are achieved with our previously developed methods (Ding et al., 2006). The proposed non-intrusive algorithm can be implemented at the receive-end media gateway or IP terminal, at low cost. It is suitable as a tool for network design, identification of root causes of speech quality degradation, and speech quality assessment in VoIP. The three-step structure allows efficient implementation of the algorithm and dynamic allocation of processing resources.
The rest of the paper is organized as follows. Section 2 reviews background information on the VoIP impairments we examined, as well as their detection and modeling challenges. A brief introduction to the E-model is also given. The proposed non-intrusive algorithm is then presented in Section 3. In Section 4, the experimental setup and simulation design are described, followed by evaluation results and discussions in Section 5. Finally, Section 6 concludes the paper and suggests future research directions.
Speech quality impairments in VoIP
The IP network, which was originally designed for non-real time data communications, only offers best-effort service with no quality of service (QoS) guarantee. Speech quality impairments, including packet loss, temporal clipping and noise, are analyzed in this paper. Their impacts, detection and effect modeling challenges are discussed.
The proposed algorithm
The structure of the proposed non-intrusive algorithm is presented in this section, followed by detailed detection and modeling algorithms for each impairment. Finally, the overall assessment model is given.
Experimental setup and simulation design
In this section, the experimental setup, simulation and measurement designs are described, including the speech library, impairment introductions, MOS measurement and performance analysis methods.
Results
The results for the speech quality prediction algorithms are first presented, followed by the performance evaluation.
Conclusion
Assessing VoIP speech quality in a non-intrusive manner is a challenging task. This paper develops a parametric, non-intrusive speech quality assessment algorithm suitable for VoIP environments. It adopts a three-step strategy: impairment detection, individual effect modeling, and an overall model. The algorithm combines the merits of the voice payload analysis and Internet protocol analysis approaches, and further incorporates the noise perception model in the ITU-T E-model to build an overall
Acknowledgements
This research was supported by Nortel. The authors would like to thank Dr. Leigh Thorpe of Nortel for her invaluable discussions.
References (41)
- et al. Hybrid multimode/multirate CS-ACELP speech coding for adaptive voice over IP. Speech Commun. (2002)
- et al. Advanced speech quality testing of modern telecommunication equipment: an overview. Signal Process. (2006)
- et al. Adaptive time scale modification of speech for graceful degrading voice quality in congested networks for VoIP applications. Signal Process. (2006)
- et al. Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Commun. (2002)
- et al. Speech enhancement using fourth-order cumulants and optimum filters in the subband domain. Speech Commun. (2002)
- et al. Speech enhancement for personal communication using an adaptive gain equalizer. Signal Process. (2005)
- Statistical Inference for Markov Processes (1961)
- End-to-end packet delay and loss behavior in the Internet. ACM SIGCOMM Comput. Commun. Rev. (1993)
- Borella, M.S., Swider, D., Uludag, S., Brewster, G.B., 1998. Internet packet loss: measurement and implications for...
- Broom, S., 2003. High level description of Psytechnics ITU-T P.VTQ candidate, ITU-T Study Group 12, Delayed...