Elsevier

Speech Communication

Volume 38, Issues 1–2, September 2002, Pages 47-75

Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios

https://doi.org/10.1016/S0167-6393(01)00043-7

Abstract

This paper addresses the problem of predicting the quality of telephone speech. Starting from a definition of quality, which takes communicative as well as service-related factors into account, a new classification scheme for prediction models is proposed. It considers input and output parameters, the network components and application area the model is used for, as well as the psychoacoustic and judgment-related bases. According to this scheme, quality prediction models can be classified into signal-based comparative measures, network planning models and monitoring models. Whereas signal-based approaches have been described extensively in literature, this paper discusses the latter two approaches in detail. The underlying psychoacoustic properties of two network planning models, the E-model and the SUBMOD model, are analyzed, and combined approaches for monitoring models are developed. Quality predictions obtained from the models are compared to the results of auditory test data, and weaknesses as well as network elements that remain uncovered are identified. Possible future extensions to the models are pointed out, including wide-band scenarios and speech sound quality, non-stationary impairments as well as speech technology devices.

Abstract (translated from German)

This contribution deals with the prediction of the quality of telephone speech. Based on a definition of quality that takes communicative and service-related factors into account, a novel classification scheme for quality models is presented. It considers the input and output parameters of the models, the network components and application areas they address, and their psychoacoustic and judgment-related foundations. According to this scheme, three types of models can be distinguished: signal-based comparative measures, network planning models and monitoring models. Whereas extensive literature exists on signal-based comparative measures, this contribution focuses on the latter two approaches. The psychoacoustic foundations of two well-known network planning models (the E-model and the SUBMOD model) are analyzed, and monitoring approaches building on them are developed. To verify these models, their quality predictions are compared with the results of auditory tests. This makes it possible to identify weaknesses and network components that have not yet been taken into account. Possible extensions are discussed, in particular for wide-band transmission, non-stationary impairments, the modeling of speech sound quality, and the application of quality models in the field of speech technology.

Abstract (translated from French)

This article deals with predicting the quality of speech transmitted over a telephone line. It begins by presenting a definition of the term 'quality' that takes into account both communication factors and factors related to the telephone service, and proposes a new classification scheme. This scheme considers the input and output parameters of the model, the network components considered and the application areas for which a model is used, as well as psychoacoustic and judgment-related data. According to this scheme, quality prediction models can be grouped into three classes: signal-based comparative measures, network planning models, and network monitoring models. Since comparative measures are discussed extensively in the literature, the article is limited to the latter two types of model. The psychoacoustic bases of the two best-known planning models (the E-model and the SUBMOD model) are discussed in detail and, in combination with other models, used to develop monitoring approaches. To test the models, their predictions are compared with the results of auditory tests. This makes it possible to identify the weak points of the models and the network components that are not yet taken into account. Potential extensions of these models are discussed, which include wide-band transmission, the modeling of transmitted voice quality, predictions for non-stationary disturbances, and the application of the models to predicting the effects of telephone transmission on speech recognition and synthesis.

Introduction

Telephone Speech Quality, in the times of telephone networks administered and operated at the national level, was closely linked to a standard analogue or digital transmission channel of 300–3400 Hz bandwidth, terminated at both sides by conventionally shaped wirebound handsets. Most national and international connections featured these characteristics until the 1980s. Common impairments were transmission loss, continuous circuit noise as well as signal-correlated quantizing noise associated with waveform PCM coding processes, and these features were usually described in terms of a signal-to-noise ratio, SNR. Due to the low variability of the physical channel characteristics, users' expectations largely reflected their experiences with such connections over the years.

This situation completely changed with the advent of new coding and transmission technology, new terminal equipment, and with the establishment of mobile and IP-based networks on a larger scale. Telephone speech quality is no longer necessarily linked to a specific transmission channel nor to a specific piece of terminal equipment. Rather, a specific transmission channel may be accessed through different types of terminal equipment (e.g. handset phones, hands-free terminals, headset-operated computer terminals), or one specific terminal (e.g. a standard wirebound handset phone) serves as a gate to different transmission channels (wirebound or mobile telephony, IP-based telephony). For the user, it is often not obvious what kind of service he/she is using, nor what terminal or background noise conditions the communication partner is encountering. As a consequence, a specific telecommunication service often does not live up to user expectations. The result is quality that the user perceives as inadequate, even though the planner of the network can assure a relatively high quality level (compared to similar configurations).

These examples show that there may be a gap between the planning quality – which has to be provided by the planner of telecommunication networks – and the quality perceived by the service user. Using the terminology of Jekosch (2000), the planner has quality elements at hand, which allow him to set up a network (and finally a telephone service) with desirable physical transmission characteristics. Such quality elements may include a particular type of terminal equipment to be used, a specific coder–decoder pair (codec), the introduction of an echo canceller at a specific point in the network, etc. Unfortunately, the quality elements are not directly reflected in the quality features, which are the dimensions perceived by the user that contribute to the overall quality of a service. The quality which is experienced by the human service user can be seen as the result of a perception and assessment process, during which the user implicitly establishes a relationship between what he/she perceives, and what he/she expects or desires (Jekosch, 2000). Thus, the quality of a telephone service does not exist in an absolute sense – rather, quality is attributed to the service by the user in a specific situation, and also reflects the user's expectations, motivation and attitude.

So far no distinction was made between speech transmission or speech communication quality on the one hand, and the quality to be associated with a speech communication service on the other. In a more analytical picture, speech communication quality is just one aspect contributing to the usability, utility and finally to the acceptability of a service. From the planner's point of view, other important components are the service performance, i.e., service support, service operability, service security and serveability (ITU-T Rec. E.800, 1994), as well as the performance of specific terminal equipment for operating the service. Studies report on a psychological effect with respect to terminals, namely that a user attributes different quality levels to physically identical connections, depending on the type of network the terminal equipment is connected to. For example, a higher rating was found for mobile services compared to wirebound ones (Hollier and Cosier, 1996), or lower ratings for handset-based computer terminals (see Section 5.4). On the other hand, the terminal equipment obviously has a direct impact on the physical characteristics of the transmission path from mouth to ear.

All the quality elements that are directly related to the transmission channel (mouth-to-ear, including terminals) and are to be used in a bi-directional conversation can be subsumed under the term 'speech communication quality'. Voice transmission quality is just one aspect of speech communication quality; it refers to the ability of the channel to transmit voice-coded information in a one-way sense, i.e. it is directly linked to the auditory percept. Additionally, and particularly from a user's point of view, conversation effectiveness (the ability of the channel to enable a bi-directional exchange of information to take place) as well as the ease of communication (aspects related to the communication partners) contribute to the user's percept of speech communication quality, which is called communication efficiency here. Apart from the speech transmission related quality features, aspects like comfort and costs have to be taken into account when concepts for usability, utility or acceptability are formulated. These aspects will be largely neglected in this paper. A more detailed discussion of the relevant features can be found, e.g., in Gleiss (1992).

Although the quality of the transmission channel alone does not ensure an acceptable service, speech communication quality is a necessary prerequisite for it. It is important that networks are planned in order to achieve optimum quality for the user. This does not necessarily mean that maximum transmission performance is required for all the transmission elements. In fact, if all the transmission components performed perfectly, considerable costs for the network set-up and operation would be incurred, which would, in turn, lead to a service that is unacceptable from the user's point of view because it is too expensive. Thus, a compromise between optimum user quality and network over-engineering has to be found.

Unfortunately, no simple relationship exists between the quality elements (which are in the hands of the network planner) and the quality features perceived by the human user. Nevertheless, since the 1960s efforts have been made to establish such relationships so that networks can be planned to meet the users' requirements. The basic idea is to define transformation laws from instrumentally measurable characteristics of the transmission channel into estimations of user quality percepts. Transformation laws are derived by matching quality prediction indices with auditory test data. Once the transformation laws have been defined, these so-called quality prediction models can then be used to estimate quality for future network scenarios.
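
The idea of deriving a transformation law by matching a prediction index with auditory test data can be illustrated with a minimal sketch. It assumes, purely for illustration, that the law is a straight line fitted by least squares and that judgments are collected on a 1 to 5 mean opinion score (MOS) scale; neither assumption is taken from the text, and the models discussed here use their own, generally non-linear, transformations. The function names and the sample numbers are hypothetical placeholders, not measured results.

```python
def fit_linear_law(index_values, auditory_mos):
    """Least-squares fit of mos ~ slope * index + intercept.

    index_values: instrumentally derived prediction indices (hypothetical).
    auditory_mos: mean judgments from an auditory test for the same conditions.
    """
    n = len(index_values)
    if n != len(auditory_mos) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(index_values) / n
    mean_y = sum(auditory_mos) / n
    sxx = sum((x - mean_x) ** 2 for x in index_values)
    if sxx == 0:
        raise ValueError("index values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(index_values, auditory_mos))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_mos(index_value, slope, intercept):
    """Apply the fitted law to a new condition, clamped to the assumed 1..5 scale."""
    return min(5.0, max(1.0, slope * index_value + intercept))


if __name__ == "__main__":
    # Placeholder numbers chosen to lie exactly on a line; NOT real test data.
    indices = [0.0, 10.0, 20.0, 30.0]
    mos = [4.4, 3.6, 2.8, 2.0]
    slope, intercept = fit_linear_law(indices, mos)
    print(slope, intercept, predict_mos(25.0, slope, intercept))
```

Once such a law has been fitted on known conditions, `predict_mos` can be applied to index values computed for a planned scenario, which is the step the paragraph above describes as estimating quality for future networks.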

In order to make quality predictions for new, unknown networks, a reference telephone connection has to be agreed upon which describes all the relevant elements of the transmission channel, from a network planning point of view. This reference connection should ideally cover all the present and future elements between the mouth of a speaker and the ear of a listener which may have an impact on the speech communication quality in a bi-directional conversation situation (i.e., with the talker and listener changing roles). The entities of the reference connection should be the quality elements that can be defined by the network planner.

Such a reference connection has been agreed upon on a worldwide level by the International Telecommunication Union, ITU-T (ITU-T Rec. G.107, 2000), and is depicted in Fig. 1. It shows the relevant network planning parameters for a 2-/4-wire analogue or digital handset-terminated connection. Besides the main speech transmission path through the network, both the electrical sidetone path (the coupling of the talker's own voice) as well as the talker and listener echo paths are taken into account. All the paths are described in terms of their contribution to speech or noise loudness (the so-called loudness ratings, see Section 3), as well as in terms of their contribution to signal delay. Alternatively, if linearity is assumed, transfer functions of the different network paths can be measured. Circuit as well as ambient room noise are modeled by ideal noise sources of equivalent (A- or psophometrically weighted) noise power. Non-linear speech codecs are not modeled in detail, but are described either by the amount of signal-correlated quantizing noise they introduce (for PCM waveform coders), or as a black box in terms of an equipment impairment factor (for non-waveform low-bitrate codecs). The concept of impairment factors as well as the single parameters of the reference connection will be discussed in more detail in Section 3.
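
As a rough illustration of how impairment factors enter a planning calculation, the sketch below follows the additive structure of the E-model as recalled from ITU-T Rec. G.107: a transmission rating R = Ro - Is - Id - Ie + A, followed by the recommendation's mapping from R to an estimated mean opinion score. The formulas are quoted from the recommendation rather than from this text, the computation of the individual terms from loudness ratings, noise levels and delays is omitted, and the input values in the example are hypothetical.

```python
def transmission_rating(ro, i_s, i_d, i_e, a=0.0):
    """Additive E-model rating (assumed form, after ITU-T Rec. G.107).

    ro  : basic signal-to-noise ratio term
    i_s : impairments occurring simultaneously with the speech signal
    i_d : impairments caused by delay
    i_e : equipment impairment factor (e.g. low-bitrate codecs)
    a   : advantage (expectation) factor
    """
    return ro - i_s - i_d - i_e + a


def rating_to_mos(r):
    """Map a transmission rating R to an estimated MOS (G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


if __name__ == "__main__":
    # Hypothetical planning values, for illustration only.
    r = transmission_rating(ro=90.0, i_s=5.0, i_d=10.0, i_e=15.0)
    print(r, rating_to_mos(r))
```

Treating a codec as a black box then amounts to changing only `i_e` while the remaining terms stay fixed, which keeps comparative calculations between candidate configurations simple.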

The given reference connection currently refers to synchronously operating networks which are terminated by handset telephones that have a traditional shape. However, it can in principle be extended to include other network scenarios. Headsets and hands-free terminals can be described by their transfer characteristics between a human or an artificial head and the electric interface of the terminal element. In mobile and packet-based transmission, time-variant impairments may occur which have to be taken into account in quality prediction. A first simplified approach is to treat the complete coding–channel–decoding chain as a black box, and then to derive a quality or degradation prediction for mean values of the time-variant impairment, integrating codec distortions and possibly any packet or frame loss that may occur. The quality estimate can then be combined with other impairments to form a prediction for the whole transmission chain, including the terminals.

By means of the reference connection in Fig. 1, the application areas of different quality prediction models can be distinguished. If all the planning values are known beforehand, or if appropriate estimations can be obtained, quality prediction models will give an impression of the overall quality a user of an actual connection might encounter. Comparative calculations for different connections facilitate decisions on whether to insert a specific network element, e.g. an echo canceller, or whether to use a specific codec. On the other hand, some of the planning parameters may be measured for an existing connection, and then quality estimations can be based on measured physical transmission characteristics. If measurements can be obtained on-line, then it becomes possible to monitor the quality of networks in operation. In the case of network problems, such quality monitoring models help network operators to decide which part of the network has caused the problems they have encountered, and whether action has to be taken. During the development phase of network elements (e.g. codec or terminal equipment development), it is not necessary to have quality predictions for the whole transmission path mouth-to-ear. But it is desirable to obtain relative quality estimations for the network element under test, e.g. in comparison to similar elements. Quality prediction models built for this purpose may use actual speech signals as an input, because the quality expert has full access to the network element under consideration.

The aim of the present paper is to discuss the different modeling approaches in more detail. It will focus on models that consider the whole transmission chain, from the mouth of the talker to the ear of a listener, in a conversational situation. In Section 2, a new classification scheme is proposed which describes different types of models in terms of their input and output parameters, the considered network elements and application domains, as well as the amount of psychoacoustic knowledge they incorporate. Two specific types of model are discussed in more detail in the following. The first type involves the network planning models, which allow quality predictions for the whole transmission chain to be obtained in the planning phase, even before a telephone network has been set up (Section 3). Starting from the basic idea of network planning models, a second, new class of quality monitoring models performs quality predictions, based on characteristics of existing networks that have actually been measured. These models are analyzed in Section 4, which also points out how combinations of different modeling approaches can be used to enhance their quality estimations and to extend their application scope. Section 5 presents the results of auditory evaluation experiments on selected quality aspects, which are important for modern network configurations. The final outlook identifies configurations where quality predictions are still not available, and shows the limitations of the current approaches.

Classification of prediction models

It has been pointed out that quality prediction models may serve different purposes, depending on the phase (network planning, set-up or operation) when quality becomes the center of attention. In each phase different input parameters are made available to the quality expert. During the planning stage, no actual signals can be measured, because it is, in general, too expensive to install a network for test purposes. Thus, quality predictions have to be based on planning values, or on averaged

Network planning models

According to a classification made by British Telecom (see ITU-T Suppl. 3 to P-Series Rec., 1993), network planning models can be distinguished by the degree to which they explicitly model the human perception process. Models which describe the perception process as a cause-and-effect relationship between input parameters (listener's hearing characteristics, emitted speech spectrum, sensitivities of the speech transmission path, noise spectra, etc.) and one or several output parameters (which

Monitoring models

In contrast to signal-based comparative measures, network planning models base their quality predictions on input parameters which are estimated, or on planning targets. This fact leads to a certain degree of impreciseness – because the future connection characteristics will differ from the estimated ones – but it cannot be avoided when models are to be used before the network has been set up. In existing networks, it is possible to directly measure input parameters and base quality predictions

Auditory evaluation of prediction models

When quality prediction models are used in communication network design, implementation or operation, it is important to know how well the predicted values correspond to quality impressions of actual users, in terms of accuracy (i.e., how well the model predicts what it predicts) and validity (i.e., how well the model predicts what it should predict). For that purpose, a comparison between model predictions and user quality judgments for a specific connection has to be made. In real-life

Conclusions and outlook

Due to the diversification of transmission networks and terminal equipment, quality planning and assessment has become an important element in the planning, implementation and operation phases of modern telecommunication networks. In this paper, we presented a systematic approach to classify quality prediction models for telephone speech.

Three basic classes of models have to be distinguished. The first class contains the well-known signal-based comparative measures, which aim at predicting

Acknowledgements

The auditory experiment regarding the effects of pure delay was performed at IKA (J. Blauert, U. Jekosch) within a collaboration with Deutsche Telekom Berkom GmbH, D-Berlin. Auditory experiments investigating time-variant impairments were funded by Tektronix Padova SpA, Italy. The authors would like to thank Jens Berger, Stefano Galetto, Pietro Paglierani and Edoardo Rizzi for their collaboration and for the kind publication permission, as well as their anonymous reviewers for helpful comments

References (51)

  • Allnatt, J.W., 1975. Subjective rating and apparent magnitude. Int. J. Man Mach. Stud.
  • Bappert, V., et al., 1994. Auditory quality evaluation of speech-coding systems. Acta Acustica.
  • Beerends, J.G., 1995. Measuring the quality of speech and music codecs: an integrated psychoacoustic approach. In:...
  • Berger, J., 1998. Instrumentelle Verfahren zur Sprachqualitätsschätzung – Modelle auditiver Tests, Doctoral...
  • Bodden, M., Jekosch, U., 1996. Entwicklung und Durchführung von Tests mit Versuchspersonen zur Verifizierung von...
  • Collard, J., 1929. A theoretical study of the articulation and intelligibility of a telephone circuit. Electr. Commun.
  • ETSI Technical Report ETR 250, 1996. Transmission and multiplexing (TM); Speech communication quality from mouth to ear...
  • Euler, S., Zinke, J., 1994. The influence of speech coding algorithms on automatic speech recognition. In: Proc....
  • Fletcher, H., et al., 1937. Relation between loudness and masking. J. Acoust. Soc. Am.
  • Gleiss, N., 1992. Usability – Concepts and evaluation. TELE (English Edition) 2/92, Swedish Telecommunications...
  • Hansen, M., 1998. Assessment and prediction of speech transmission quality with an auditory processing model. Doctoral...
  • Hauenstein, M., 1997. Psychoakustisch motivierte Maße zur instrumentellen Sprachgütebeurteilung, Doctoral dissertation,...
  • Hollier, M.P., et al., 1996. Assessing human perception. BT Technol. J.
  • ITU-R Rec. BS.1387, 1998. Method for objective measurements of perceived audio quality. International Telecommunication...
  • ITU-T Contribution COM 12-003, 2000. Future enhancement of the E-model – Introduction of a special impairment factor...
  • ITU-T Contribution COM 12-71, 1998. Estimation of impairment factors with TOSQA. International Telecommunication Union,...
  • ITU-T Delayed Contribution D.069, 1998. Accuracy analysis of mapping non-intrusive measurements to the E-model....
  • ITU-T Delayed Contribution D.071, 1995. Impairment factors for speech clipping. International Telecommunication Union,...
  • ITU-T Delayed Contribution D.110, 1999. Subjective results on impairment effects of IP packet loss. International...
  • ITU-T Draft Recommendation P.562, 2000. Analysis and interpretation of INMD voice-service measurements. Published as...
  • ITU-T Draft Recommendation P.833, 2000. Methodology for derivation of equipment impairment factors from subjective...
  • ITU-T Draft Recommendation P.862, 2000. Perceptual evaluation of speech quality (PESQ), an objective method for...
  • ITU-T Recommendation E.800, 1994. Terms and definitions related to quality of service and network performance including...
  • ITU-T Recommendation G.107, 2000. The E-model, a computational model for use in transmission planning. International...
  • ITU-T Recommendation G.109, 1999. Definition of categories of speech transmission quality. International...