Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios
Introduction
In the times when telephone networks were administered and operated at the national level, telephone speech quality was closely linked to a standard analogue or digital transmission channel of 300–3400 Hz bandwidth, terminated at both sides by conventionally shaped wirebound handsets. Most national and international connections featured these characteristics until the 1980s. Common impairments were transmission loss, continuous circuit noise, and signal-correlated quantizing noise associated with waveform PCM coding processes; these were usually described in terms of a signal-to-noise ratio, SNR. Because the physical channel characteristics varied little, users' expectations largely reflected their experiences with such connections over the years.
This situation changed completely with the advent of new coding and transmission technology, new terminal equipment, and the large-scale establishment of mobile and IP-based networks. Telephone speech quality is no longer necessarily linked to a specific transmission channel or to a specific piece of terminal equipment. Rather, a specific transmission channel may be accessed through different types of terminal equipment (e.g. handset phones, hands-free terminals, headset-operated computer terminals), or one specific terminal (e.g. a standard wirebound handset phone) may serve as a gate to different transmission channels (wirebound or mobile telephony, IP-based telephony). For users, it is often not obvious what kind of service they are using, nor what terminal or background noise conditions the communication partner is encountering. As a consequence, a specific telecommunication service often does not live up to user expectations. The user then perceives the quality as inadequate, although the planner of the network can assure a relatively high quality level (compared to similar configurations).
These examples show that there may be a gap between the planning quality – which has to be provided by the planner of telecommunication networks – and the quality perceived by the service user. Using the terminology of Jekosch (2000), the planner has quality elements at hand, which allow him to set up a network (and finally a telephone service) with desirable physical transmission characteristics. Such quality elements may include a particular type of terminal equipment to be used, a specific coder–decoder pair (codec), the introduction of an echo canceller at a specific point in the network, etc. Unfortunately, the quality elements are not directly reflected in the quality features, which are the dimensions perceived by the user that contribute to the overall quality of a service. The quality which is experienced by the human service user can be seen as the result of a perception and assessment process, during which the user implicitly establishes a relationship between what he/she perceives, and what he/she expects or desires (Jekosch, 2000). Thus, the quality of a telephone service does not exist in an absolute sense – rather, quality is attributed to the service by the user in a specific situation, and also reflects the user's expectations, motivation and attitude.
So far no distinction was made between speech transmission or speech communication quality on the one hand, and the quality to be associated with a speech communication service on the other. In a more analytical picture, speech communication quality is just one aspect contributing to the usability, utility and finally to the acceptability of a service. From the planner's point of view, other important components are the service performance, i.e., service support, service operability, service security and serveability (ITU-T Rec. E.800, 1994), as well as the performance of specific terminal equipment for operating the service. Studies report on a psychological effect with respect to terminals, namely that a user attributes different quality levels to physically identical connections, depending on the type of network the terminal equipment is connected to. For example, a higher rating was found for mobile services compared to wirebound ones (Hollier and Cosier, 1996), or lower ratings for handset-based computer terminals (see Section 5.4). On the other hand, the terminal equipment obviously has a direct impact on the physical characteristics of the transmission path from mouth to ear.
All the quality elements that are directly related to the transmission channel (mouth-to-ear, including terminals) and are to be used in a bi-directional conversation can be subsumed under the term 'speech communication quality'. Voice transmission quality is just one aspect of speech communication quality; it refers to the ability of the channel to transmit voice-coded information in a one-way sense, i.e. it is directly linked to the auditory percept. Additionally, and particularly from a user's point of view, conversation effectiveness (the ability of the channel to enable a bi-directional exchange of information to take place) as well as the ease of communication (aspects related to the communication partners) contribute to the user's percept of speech communication quality, which is called communication efficiency here. Apart from the speech transmission related quality features, aspects like comfort and costs have to be taken into account when concepts for usability, utility or acceptability are formulated. These aspects will be largely neglected in this paper. A more detailed discussion of the relevant features can be found e.g. in (Gleiss, 1992).
Although the quality of the transmission channel alone does not ensure an acceptable service, speech communication quality is a necessary prerequisite for it. It is important that networks are planned in order to achieve optimum quality for the user. This does not necessarily mean that maximum transmission performance is required for all the transmission elements. In fact, if all the transmission components performed perfectly, considerable costs for the network set-up and operation would be incurred, which would, in turn, lead to an unacceptable (because too expensive) service from the user's point of view. Thus, a compromise between optimum user quality and network over-engineering has to be found.
Unfortunately, no simple relationship exists between the quality elements (which are in the hands of the network planner) and the quality features perceived by the human user. Nevertheless, since the 1960s efforts have been made to establish such relationships so that networks can be planned to meet the users' requirements. The basic idea is to define transformation laws from instrumentally measurable characteristics of the transmission channel into estimations of user quality percepts. Transformation laws are derived by matching quality prediction indices with auditory test data. Once the transformation laws have been defined, these so-called quality prediction models can then be used to estimate quality for future network scenarios.
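The matching step described above can be sketched as a curve fit: a one-dimensional instrumental index is mapped onto mean auditory ratings by least squares. This is only a minimal sketch under stated assumptions. A straight line is used purely for simplicity (the actual transformation laws need not be linear), the function names are hypothetical, and the numbers are made up for illustration rather than taken from any auditory test.

```python
def fit_linear_law(index_values, mean_ratings):
    """Least-squares fit of: rating ~ slope * index + intercept."""
    n = len(index_values)
    if n != len(mean_ratings) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(index_values) / n
    mean_y = sum(mean_ratings) / n
    sxx = sum((x - mean_x) ** 2 for x in index_values)
    if sxx == 0:
        raise ValueError("index values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(index_values, mean_ratings))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(law, index_value):
    """Apply a fitted law to the index of a new, untested connection."""
    slope, intercept = law
    return slope * index_value + intercept


# Made-up illustrative pairs (assumption, not measured data):
# instrumental index of a connection vs. mean auditory rating.
index_values = [10.0, 20.0, 30.0, 40.0]
mean_ratings = [1.8, 2.6, 3.3, 4.1]

law = fit_linear_law(index_values, mean_ratings)
estimate = predict(law, 25.0)
print(f"slope={law[0]:.3f}, intercept={law[1]:.2f}, estimate={estimate:.2f}")
```

Once fitted, the law is applied to index values of connections that were never tested auditorily, which is where the prediction error of such a model becomes relevant.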
In order to make quality predictions for new, unknown networks, a reference telephone connection has to be agreed upon which describes all the relevant elements of the transmission channel, from a network planning point of view. This reference connection should ideally cover all the present and future elements between the mouth of a speaker and the ear of a listener which may have an impact on the speech communication quality in a bi-directional conversation situation (i.e., with the talker and listener changing roles). The entities of the reference connection should be the quality elements that can be defined by the network planner.
Such a reference connection has been agreed upon on a worldwide level by the International Telecommunication Union, ITU-T (ITU-T Rec. G.107, 2000), and is depicted in Fig. 1. It shows the relevant network planning parameters for a 2-/4-wire analogue or digital handset-terminated connection. Besides the main speech transmission path through the network, the electrical sidetone path (the coupling of the talker's own voice) and the talker and listener echo paths are taken into account. All the paths are described in terms of their contribution to speech or noise loudness (the so-called loudness ratings, see Section 3), as well as in terms of their contribution to signal delay. Alternatively, if linearity is assumed, transfer functions of the different network paths can be measured. Circuit noise and ambient room noise are modeled by ideal noise sources of equivalent (A- or psophometrically weighted) noise power. Non-linear speech codecs are not modeled in detail, but are described either by the amount of signal-correlated quantizing noise they introduce (for PCM waveform coders), or as a black box in terms of an equipment impairment factor (for non-waveform low-bitrate codecs). The concept of impairment factors as well as the single parameters of the reference connection will be discussed in more detail in Section 3.
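As a rough illustration of how impairment factors can be combined into a single quality estimate, the sketch below assumes the additive transmission rating form of the E-model in ITU-T Rec. G.107, R = Ro − Is − Id − Ie + A, together with the conversion from R to an estimated mean opinion score given in that Recommendation. The function names and the example values are hypothetical and serve only as an illustration; how each term is derived from the parameters of the reference connection is not shown here.

```python
def transmission_rating(ro, i_s, i_d, i_e, a=0.0):
    """Combine impairments additively on the transmission rating scale.

    ro  : basic signal-to-noise ratio term
    i_s : impairments occurring simultaneously with the speech signal
    i_d : impairments caused by delay
    i_e : equipment impairment factor (e.g. low-bitrate codecs)
    a   : advantage factor
    """
    return ro - i_s - i_d - i_e + a


def rating_to_mos(r):
    """Map a transmission rating R to an estimated mean opinion score."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# Hypothetical example values (assumption, not taken from the paper).
r = transmission_rating(ro=94.0, i_s=1.0, i_d=0.0, i_e=10.0)
mos = rating_to_mos(r)
print(f"R = {r:.1f}, estimated MOS = {mos:.2f}")
```

Because the impairments are subtracted on a common scale, comparing two planned configurations reduces to comparing their R values, for instance with and without a codec that carries a given equipment impairment factor.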
The given reference connection currently refers to synchronously operating networks which are terminated by handset telephones that have a traditional shape. However, it can in principle be extended to include other network scenarios. Headsets and hands-free terminals can be described by their transfer characteristics between a human or an artificial head and the electric interface of the terminal element. In mobile and packet-based transmission, time-variant impairments may occur which have to be taken into account in quality prediction. A first simplified approach is to treat the complete coding–channel–decoding chain as a black box, and then to derive a quality or degradation prediction for mean values of the time-variant impairment, integrating codec distortions and possibly any packet or frame loss that may occur. The quality estimate can then be combined with other impairments to form a prediction for the whole transmission chain, including the terminals.
By means of the reference connection in Fig. 1, the application areas of different quality prediction models can be distinguished. If all the planning values are known beforehand, or if appropriate estimations can be obtained, quality prediction models will give an impression of the overall quality a user of an actual connection might encounter. Comparative calculations for different connections facilitate decisions on whether to insert a specific network element, e.g. an echo canceller, or whether to use a specific codec. On the other hand, some of the planning parameters may be measured for an existing connection, and then quality estimations can be based on measured physical transmission characteristics. If measurements can be obtained on-line, then it becomes possible to monitor the quality of networks in operation. In the case of network problems, such quality monitoring models help network operators to decide which part of the network has caused the problems they have encountered, and whether action has to be taken. During the development phase of network elements (e.g. codec or terminal equipment development), it is not necessary to have quality predictions for the whole mouth-to-ear transmission path; it is, however, desirable to obtain relative quality estimations for the network element under test, e.g. in comparison to similar elements. Quality prediction models built for this purpose may use actual speech signals as an input, because the quality expert has full access to the network element under consideration.
The aim of the present paper is to discuss the different modeling approaches in more detail. It will focus on models that consider the whole transmission chain, from the mouth of the talker to the ear of a listener, in a conversational situation. In Section 2, a new classification scheme is proposed which describes different types of models in terms of their input and output parameters, the considered network elements and application domains, as well as the amount of psychoacoustic knowledge they incorporate. Two specific types of model are discussed in more detail in the following. The first type involves the network planning models, which allow quality predictions for the whole transmission chain to be obtained in the planning phase, even before a telephone network has been set up (Section 3). Starting from the basic idea of network planning models, a second, new class of quality monitoring models performs quality predictions, based on characteristics of existing networks that have actually been measured. These models are analyzed in Section 4, which also points out how combinations of different modeling approaches can be used to enhance their quality estimations and to extend their application scope. Section 5 presents the results of auditory evaluation experiments on selected quality aspects, which are important for modern network configurations. The final outlook identifies configurations where quality predictions are still not available, and shows the limitations of the current approaches.
Classification of prediction models
It has been pointed out that quality prediction models may serve different purposes, depending on the phase (network planning, set-up or operation) when quality becomes the center of attention. In each phase different input parameters are made available to the quality expert. During the planning stage, no actual signals can be measured, because it is, in general, too expensive to install a network for test purposes. Thus, quality predictions have to be based on planning values, or on averaged
Network planning models
According to a classification made by British Telecom (see ITU-T Suppl. 3 to P-Series Rec., 1993), network planning models can be distinguished by the degree to which they explicitly model the human perception process. Models which describe the perception process as a cause-and-effect relationship between input parameters (listener's hearing characteristics, emitted speech spectrum, sensitivities of the speech transmission path, noise spectra, etc.) and one or several output parameters (which
Monitoring models
In contrast to signal-based comparative measures, network planning models base their quality predictions on input parameters which are estimated, or on planning targets. This leads to a certain degree of imprecision – because the future connection characteristics will differ from the estimated ones – but it cannot be avoided when models are to be used before the network has been set up. In existing networks, it is possible to directly measure input parameters and base quality predictions
Auditory evaluation of prediction models
When quality prediction models are used in communication network design, implementation or operation, it is important to know how well the predicted values correspond to quality impressions of actual users, in terms of accuracy (i.e., how well the model predicts what it predicts) and validity (i.e., how well the model predicts what it should predict). For that purpose, a comparison between model predictions and user quality judgments for a specific connection has to be made. In real-life
Conclusions and outlook
Due to the diversification of transmission networks and terminal equipment, quality planning and assessment has become an important element in the planning, implementation and operation phases of modern telecommunication networks. In this paper, we presented a systematic approach to classify quality prediction models for telephone speech.
Three basic classes of models have to be distinguished. The first class contains the well-known signal-based comparative measures, which aim at predicting
Acknowledgements
The auditory experiment regarding the effects of pure delay was performed at IKA (J. Blauert, U. Jekosch) within a collaboration with Deutsche Telekom Berkom GmbH, D-Berlin. Auditory experiments investigating time-variant impairments were funded by Tektronix Padova SpA, Italy. The authors would like to thank Jens Berger, Stefano Galetto, Pietro Paglierani and Edoardo Rizzi for their collaboration and for the kind publication permission, as well as their anonymous reviewers for helpful comments
References (51)
- Subjective rating and apparent magnitude. Int. J. Man Mach. Stud. (1975).
- et al. Auditory quality evaluation of speech-coding systems. Acta Acustica (1994).
- Beerends, J.G., 1995. Measuring the quality of speech and music codecs: an integrated psychoacoustic approach. In:...
- Berger, J., 1998. Instrumentelle Verfahren zur Sprachqualitätsschätzung – Modelle auditiver Tests, Doctoral...
- Bodden, M., Jekosch, U., 1996. Entwicklung und Durchführung von Tests mit Versuchspersonen zur Verifizierung von...
- A theoretical study of the articulation and intelligibility of a telephone circuit. Electr. Commun. (1929).
- ETSI Technical Report ETR 250, 1996. Transmission and multiplexing (TM); Speech communication quality from mouth to ear...
- Euler, S., Zinke, J., 1994. The influence of speech coding algorithms on automatic speech recognition. In: Proc....
- et al. Relation between loudness and masking. J. Acoust. Soc. Am. (1937).
- Gleiss, N., 1992. Usability – Concepts and evaluation. TELE (English Edition) 2/92, Swedish Telecommunications...