Elsevier

Speech Communication

Volume 49, Issue 6, June 2007, Pages 477-489

Non-intrusive single-ended speech quality assessment in VoIP

https://doi.org/10.1016/j.specom.2007.04.003

Abstract

Evaluating speech quality in voice over Internet protocol (VoIP) in a non-intrusive manner is challenging because it relies on the degraded speech signal only. In this paper, a parametric, non-intrusive VoIP speech quality assessment algorithm is proposed, which adopts a three-step strategy: impairment detection, individual effect modeling and an overall model. Mainly based on voice payload analysis, the algorithm also combines the Internet protocol analysis approach and the ITU-T E-model. It quantifies the individual contributions to speech quality from several major VoIP impairments, including packet loss, temporal clipping and noise. Also, an overall assessment model is developed. The performance is evaluated through intensive simulations, and the results show that the algorithm is effective and accurate. For the overall model, the correlation between prediction and measurement is 0.90, and the root mean square error (RMSE) is 0.27 mean opinion score (MOS). The algorithm is intended for implementation at the receive-end media gateway or IP terminal, both for identifying the root causes of speech quality degradation and for quality assessment in VoIP.

Introduction

Voice over Internet protocol (VoIP) is a very promising technology, and it is expected to replace the traditional public switched telephone network (PSTN) in the next few years (Beritelli et al., 2002; Ilk and Güler, 2006). Although VoIP is efficient, its speech quality is still below what telephone users are accustomed to, due to various new impairments introduced by the Internet and IP terminals. Many techniques have been implemented to enhance the speech quality in VoIP, leading to a large number of services offered at different levels of price and quality in the market. Assessing VoIP speech quality is an area of intense research interest (Gierlich and Kettler, 2006); it is also an imperative task for network design and optimization, as network operators need to maintain certain levels of service quality by monitoring live calls and taking corrective measures whenever necessary.

Speech quality is inherently subjective, as it is determined by the listener’s perception. Therefore, the most reliable approach for assessing speech quality is through subjective tests. Defined in ITU-T Rec. P.800 (1996), the absolute category rating (ACR) test is one of the widely accepted norms for subjective speech quality rating. In the test, listeners express their opinions on the quality of the speech materials in terms of five categories: excellent, good, fair, poor and bad, with corresponding integer scores of 5, 4, 3, 2 and 1, respectively. The ratings are averaged, and the result is known as the mean opinion score (MOS). Subjective testing is time-consuming and expensive. It is also complicated: the design of the test is strongly influenced by both human subjects and elaborate testing settings. Moreover, it provides little information on the technical causes of speech quality degradation. In general, subjective testing is impractical for automated, frequent testing purposes, such as routine network monitoring.
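As a small illustration of the ACR averaging described above, the following sketch maps category labels to their integer scores and averages them. The function name and the example ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean

# ACR categories and their integer scores, as defined in the text above.
ACR_SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}


def mean_opinion_score(ratings):
    """Average a list of ACR category labels into a MOS value."""
    if not ratings:
        raise ValueError("at least one rating is required")
    # Convert each label to its integer score, then take the arithmetic mean.
    return mean([ACR_SCALE[label.lower()] for label in ratings])


# Hypothetical ratings from five listeners for one speech sample.
example_ratings = ["good", "fair", "excellent", "good", "poor"]
example_mos = mean_opinion_score(example_ratings)
print(f"MOS = {example_mos:.2f}")
```

A real subjective test would also control listener selection, material and listening conditions, which this sketch does not model.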

In recent years, objective methods have been developed to produce good estimates of subjective MOS. They are machine-executable and require little human involvement. Objective methods can be classified into two categories (Möller and Raake, 2002), intrusive and non-intrusive, according to whether a reference speech signal is needed.

In intrusive methods, MOS is measured by comparing a reference signal with the degraded one, which is the output of the system under test. The most widely used algorithms include perceptual analysis/measurement system (PAMS, Rix and Hollier, 2000) and perceptual evaluation of speech quality (PESQ, ITU-T Rec. P.862, 2001). Intrusive methods can achieve relatively accurate MOS estimates. However, they are not suitable for on-line, live call quality monitoring purposes, as in this case the reference speech is unavailable.

On the other hand, non-intrusive methods (sometimes called single-ended or output-based methods) use the degraded speech signal only, or rely on statistics collected from the network, without having to remove the test channel from service and inject test calls. Although promising, this technique is challenging because the original signal is unknown. Speech quality can be estimated through voice payload analysis (El-Hennawey and Lee, 2004; Falk and Chan, 2006; Gray et al., 2000; Kim, 2005; ITU-T, 2004), through Internet protocol analysis, such as PsyVoIP (Broom, 2003), or from a transmission rating model, such as the E-model (ITU-T Rec. G.107, 2005). The embedded voice quality estimation module (EVQEM, El-Hennawey and Lee, 2004) utilizes a small portion of the bandwidth available during silences to transmit a reference signal and then uses an intrusive algorithm to measure the speech quality. In Falk and Chan (2006), the difference between feature vectors of degraded speech and those of artificial references, which are constructed from a high-quality, clean speech database, is measured and mapped to MOS. In Gray et al. (2000), auditory non-intrusive quality estimation (Kim, 2005) and ITU-T (2004), vocal tract models or auditory models are developed. For the Internet protocol analysis approach, PsyVoIP extracts statistical descriptors of a call and maps them to MOS. In comparison, the E-model is not a measurement tool, but rather a computational model that covers a wide range of parameters affecting the conversation quality in narrow-band telephone networks. The E-model is primarily used for transmission planning purposes.
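For context on how the E-model relates to MOS: its output is a transmission rating factor R, which ITU-T Rec. G.107 converts to an estimated MOS with a fixed cubic mapping. The sketch below implements only that standard R-to-MOS conversion; it does not compute R from impairment parameters, and the function name is an illustrative choice rather than anything defined in the paper.

```python
def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0   # lower bound of the MOS scale
    if r >= 100:
        return 4.5   # upper bound of the estimated MOS
    # Cubic mapping used for 0 < R < 100.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# Illustrative R values; 93.2 is the G.107 default rating with all
# parameters at their default values.
for rating in (50.0, 70.0, 93.2):
    print(f"R = {rating:5.1f} -> MOS = {r_to_mos(rating):.2f}")
```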

In short, the voice payload analysis models mentioned above are listening-only models; they do not consider factors affecting conversational quality, such as echo and delay. On the other hand, the protocol analysis models hardly cover impairments directly linked to the speech signal itself, such as temporal clipping and noise. These two approaches complement each other; it is advantageous to combine their merits in a new non-intrusive model, which is further substantiated by the E-model for listening-only or conversational quality.

In this paper, a parametric, non-intrusive VoIP speech quality assessment algorithm is proposed, based on the structures developed by El-Hennawey et al. (2006). It adopts a three-step strategy. First, a particular impairment is detected; then, its effects on speech quality are quantified; finally, an overall assessment model is developed. The algorithm covers several major speech quality impairments in VoIP. It relies on voice payload to analyze the occurrences of temporal clipping, echo and noise. A packet loss profile is mainly derived by exploiting Internet protocols; voice payload is also processed to help with the silence/unvoiced/voiced (S/U/V) classification of the lost packets.
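The text above does not say which protocol fields are used to derive the packet loss profile. As a hedged illustration only, the sketch below assumes the profile is built from 16-bit RTP sequence numbers observed at the receiver, with no reordering or duplication; the function name and the example trace are hypothetical and this is not the paper's algorithm.

```python
def loss_profile(received_seq, modulus=65536):
    """Return (loss_rate, burst_lengths) from sequence numbers in arrival order.

    Assumption: packets arrive in order without duplicates, so any gap
    between consecutive sequence numbers is a burst of lost packets.
    """
    if len(received_seq) < 2:
        return 0.0, []
    bursts = []
    for prev, cur in zip(received_seq, received_seq[1:]):
        # Modular difference handles the 16-bit sequence number wraparound.
        gap = (cur - prev) % modulus - 1
        if gap > 0:
            bursts.append(gap)
    lost = sum(bursts)
    expected = len(received_seq) + lost
    return lost / expected, bursts


# Hypothetical trace: packets 65535 and 0 are lost across the wraparound,
# then packets 3 and 4 are lost.
rate, bursts = loss_profile([65533, 65534, 1, 2, 5])
print(f"loss rate = {rate:.3f}, burst lengths = {bursts}")
```

Burst lengths are kept alongside the overall rate because consecutive losses and isolated losses can be expected to affect perceived quality differently.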

This paper mainly focuses on modeling the effects of packet loss and the combined listening-only quality resulting from packet loss, temporal clipping and noise. Algorithms for deriving the packet loss profile, classifying the lost packet type, and detecting noise are also presented. The proposed listening-only algorithm can be further extended into a conversational model by accounting for the effect of echo through the E-model, where echo detection and parameter measurement are achieved with our previously developed methods (Ding et al., 2006). The proposed non-intrusive algorithm can be implemented at the receive-end media gateway or IP terminal at low cost. It is suitable as a tool for network design, identification of root causes of speech quality degradation, and speech quality assessment in VoIP. The three-step structure allows efficient implementation of the algorithm and dynamic allocation of processing resources.

The rest of the paper is organized as follows. Section 2 reviews background information on the VoIP impairments we examined, as well as their detection and modeling challenges. A brief introduction to the E-model is also given. The proposed non-intrusive algorithm is then presented in Section 3. In Section 4, the experimental setup and simulation design are described, followed by evaluation results and discussion in Section 5. Finally, Section 6 concludes the paper and suggests future research directions.

Speech quality impairments in VoIP

The IP network, which was originally designed for non-real time data communications, only offers best-effort service with no quality of service (QoS) guarantee. Speech quality impairments, including packet loss, temporal clipping and noise, are analyzed in this paper. Their impacts, detection and effect modeling challenges are discussed.

The proposed algorithm

The structure of the proposed non-intrusive algorithm is presented in this section, followed by detailed detection and modeling algorithms for each impairment. Finally, the overall assessment model is given.

Experimental setup and simulation design

In this section, the experimental setup, simulation and measurement designs are described, including the speech library, impairment introductions, MOS measurement and performance analysis methods.
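The abstract reports performance as a correlation and a root mean square error (RMSE) between predicted and measured MOS. The sketch below shows how these two figures can be computed, assuming the correlation is the Pearson coefficient; the function names and the sample values are hypothetical and are not results from the paper.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, measured):
    """Root mean square error between predictions and measurements."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up MOS values for illustration only.
predicted_mos = [2.0, 3.0, 4.0]
measured_mos = [2.5, 3.5, 4.5]
print(pearson_correlation(predicted_mos, measured_mos))
print(rmse(predicted_mos, measured_mos))
```

The two metrics are complementary: in this made-up example the predictions track the measurements perfectly in trend, yet are offset by a constant 0.5 MOS, which only the RMSE reveals.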

Results

The results for the speech quality prediction algorithms are first presented, followed by the performance evaluation.

Conclusion

Assessing VoIP speech quality in a non-intrusive manner is a challenging task. This paper develops a parametric, non-intrusive speech quality assessment algorithm suitable for VoIP environments. It adopts a three-step strategy: impairment detection, individual effect modeling and an overall model. The algorithm combines the merits of the voice payload analysis and Internet protocol analysis approaches, and further incorporates the noise perception model in the ITU-T E-model to build an overall

Acknowledgements

This research was supported by Nortel. The authors would like to thank Dr. Leigh Thorpe of Nortel for her invaluable discussions.

References (41)

  • Cohen, I., et al., 2002. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett.
  • De Martin, J.C., 2001. Source-driven packet marking for speech transmission over differentiated-services networks. In:...
  • Ding, L., Goubran, R.A., 2003. Speech quality prediction in VoIP using the extended E-Model. In: Proc. IEEE GLOBECOM,...
  • Ding, L., El-Hennawey, M.S., Goubran, R.A., 2005. Measurement of the effects of temporal clipping on speech quality....
  • Ding, L., et al., 2006. Nonintrusive measurement of echo-path parameters in VoIP environments. IEEE Trans. Instrum. Meas.
  • Duysburgh, B., Vanhastel, S., De Vreese, B., Petrisor, C., Demeester, P., 2001. On the influence of best-effort network...
  • El-Hennawey, M.S., Lee, D., 2004. Embedded real-time voice quality analysis system. In: GSPx: The International...
  • El-Hennawey, M.S., Goubran, R.A., Radwan, A., Ding, L., 2006. Method and apparatus for non-intrusive single-ended voice...
  • ETSI ETR 250, 1996. Transmission and Multiplexing (TM); Speech communication quality from mouth to ear for 3.1kHz...
  • Falk, T., et al., 2006. Nonintrusive speech quality estimation using Gaussian mixture models. IEEE Signal Process. Lett.