Perceptual-based quality assessment for audio–visual services: A survey

https://doi.org/10.1016/j.image.2010.02.002

Abstract

Accurate measurement of the perceived quality of audio–visual services at the end-user is becoming a crucial issue in digital applications due to the growing demand for compression and transmission of audio–visual services over communication networks. Content providers strive to offer the best quality of experience for customers linked to their different quality of service (QoS) solutions. Therefore, developing accurate, perceptual-based quality metrics is a key requirement in multimedia services. In this paper, we survey state-of-the-art signal-driven perceptual audio and video quality assessment methods independently, and investigate relevant issues in developing joint audio–visual quality metrics. Experiments against subjective quality results have been conducted to analyze and compare the performance of the quality metrics. We consider emerging trends in audio–visual quality assessment, and propose feasible solutions for future work on perceptual-based audio–visual quality metrics.

Introduction

Multimedia services have recently experienced tremendous growth in popularity due to the evolution of digital communication systems. Two main media modalities, namely audio and video signals, constitute the core content in most digital systems. The quality of audio–visual signals can be degraded by lossy compression and by transmission through error-prone communication networks. Consequently, accurately measuring the quality of distorted audio–visual signals plays an important role in digital applications, for example, when evaluating the performance of codecs and networks, helping to improve coding abilities, or adjusting network settings based on a strategy of maximizing the perceived quality at the end-user.

Subjective assessment of audio–visual quality is considered to be the most accurate method reflecting human perception [1]. It is, however, time-consuming and cannot be done in real time. Thus, the International Telecommunication Union (ITU) has released requirements for an objective perceptual multimedia quality model [2]. Currently, most studies regarding the understanding of human quality perception of multimedia systems have focused on individual modalities, i.e., audio and video separately. These investigations have led to considerable progress in developing objective models based on the human perceptual system for both audio and video. A brief introduction to signal-driven audio and video quality metrics is given in the following paragraphs; the main focus of this paper is on full-reference quality models for general audio and video signals.

Perceptual audio quality assessment has been investigated for several decades. Most audio quality models are designed to handle coding distortions only, and this paper likewise focuses on audio quality metrics for coding distortions. Traditional objective measurement methods, such as signal-to-noise ratio (SNR) or total harmonic distortion (THD), have never really been shown to relate reliably to the perceived audio quality. A number of methods for objective perceptual assessment of audio quality have been developed after the ITU identified an urgent need to establish a standard in this area. The level difference between the masked threshold and the noise signal is evaluated in the noise-to-mask ratio (NMR) measurement method presented by Brandenburg [3]. In the method proposed by Beerends and Stemerdink [4], the difference between internal representations of the reference and distorted audio signals was transformed with a cognitive mapping to the subjective perceptual audio quality. A perceptual evaluation developed by Paillard et al. [5] first modeled the transfer characteristics of the middle and inner ear to form an internal representation inside the head of the subject, an estimate of the information available to the human brain for comparing signals; the difference between the representations of the reference and distorted signals was taken as a measure of perceptual quality. By comparing internal basilar representations of the reference and distorted signals, the perceptual objective measurement (POM) proposed by Colomes and Rault [6] quantified several degradations, including the probability of detecting a distortion and a so-called basilar distance. Sporer [7] introduced a filter bank with 241 filters to analyze and compare the reference and distorted signals. A perceptual measurement method (DIX: disturbance index) proposed by Thiede and Kabot [8] is based on an auditory filter bank that yields a high temporal resolution and thus enables more precise modeling of temporal effects such as pre- and post-masking. These six perceptual models [3], [4], [5], [6], [7], [8], combined with some toolbox functions, were integrated into the ITU recommendation BS.1387 [9]. In this recommendation, the method for objective measurement of perceived audio quality (PEAQ) is used to predict the perceived quality of wide-band audio signals with small impairments.
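The NMR idea above can be illustrated with a minimal sketch: a per-band ratio of coding-noise energy to masked threshold, averaged in dB. The per-band energies here are assumed inputs; the psychoacoustic model that produces the masked thresholds is outside this snippet.

```python
import math

def mean_nmr_db(noise_energy, masked_threshold):
    """Mean noise-to-mask ratio in dB across critical bands.

    Both arguments are assumed per-band energies: the coding-noise energy
    and the masked threshold from a psychoacoustic model (not shown here).
    Negative values mean the noise stays below the masking curve, i.e. it
    should be inaudible.
    """
    per_band = [10.0 * math.log10(n / m)
                for n, m in zip(noise_energy, masked_threshold)]
    return sum(per_band) / len(per_band)
```

A band whose noise energy is one-tenth of its masked threshold contributes -10 dB, so noise well below the masking curve drives the mean NMR strongly negative.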

However, some limitations have been discovered in PEAQ. Most notably, PEAQ has been shown to be unreliable for signals with large impairments resulting from low-bitrate coding [10]. Furthermore, PEAQ is limited to a maximum of two channels. Consequently, improvements to PEAQ have been developed. Barbedo and Lopes [11] proposed a new cognitive model and new distortion parameters. The two-channel limitation was addressed by the development of an expert system to assist with the optimization of multichannel audio systems [12]. Creusere et al. [10], [13] presented an energy equalization quality metric (EEQM), which can be used to predict audio quality across a wide range of impairments. Furthermore, a variable called the energy equalization threshold (EET), used in EEQM, can also be appended to PEAQ as a complementary model output variable (MOV) to give a more accurate quality prediction [14].

A widely used objective video quality metric, peak signal-to-noise ratio (PSNR), has been found to correlate poorly with the perceived quality, since it does not take the characteristics of the human visual system (HVS) into account [15]. A number of objective methods for measuring the perceived video quality have been proposed, and many of them have been studied by the Video Quality Experts Group (VQEG) [16]. Validation phases conducted by VQEG between 1997 and 2008 have helped the ITU produce two recommendations for objective video quality assessment using full-reference models: ITU-T J.144 [17] recommends models for digital television pictures (i.e., coding impairments), and ITU-T J.247 [18] is intended for multimedia-type video (QCIF, CIF, and VGA) transmitted over error-prone networks (i.e., coding impairments and transmission errors). The conducted validations are also reported in VQEG reports [19], [20], [21].
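For reference, PSNR itself is straightforward to compute. A minimal pure-Python sketch over flattened sequences of 8-bit pixel values (the function name and flattened-input convention are illustrative choices, not a standard API):

```python
import math

def psnr_db(reference, distorted, peak=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)
```

Its simplicity is exactly why it correlates poorly with perception: every pixel error counts equally, regardless of masking, contrast sensitivity, or where in the frame it occurs.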

Objective video quality metrics are generally classified into three categories based on the availability of reference information: full-reference (FR), reduced-reference (RR), and no-reference (NR) [22]. FR metrics have access to the reference signal. They have been studied widely and usually have the best performance in predicting the perceived quality, but the drawback is that they cannot be used for all services, for example IPTV monitoring. A typical HVS-based FR quality metric considers the following aspects: color processing, multi-channel decomposition, perceived contrast and adaptation to a specific luminance or color, contrast sensitivity, spatial and temporal masking, and error pooling over the various channels within the primary visual cortex [23]. The perceptual distortion metric (PDM) proposed by Winkler [24] exploited the main elements of an HVS-based model and exhibited promising performance in the VQEG FR-TV Phase I evaluation. RR metrics analyze a number of quality features extracted from the reference and distorted videos and integrate them into a single predictive result. For example, Lee et al. [25] extracted a few edge pixels in each frame and used them to compute PSNR around edge pixels. The task of NR metrics is very complex, as no information about the reference medium is available; an NR method is therefore an absolute measurement of features and properties of the distorted video. Most current NR metrics focus on certain distortion features, such as blockiness [26], blurring [27], and the analysis of coding parameter settings [28].

Compared to the extensive studies on quality assessment of individual modalities, relatively little work on joint audio–visual quality assessment has been performed. Fundamental research on multi-modal perception is required to study the mutual influence between auditory and visual stimuli, as well as other influencing factors in audio–visual quality assessment [29]. Some experiments, reviewed below, have demonstrated that there is a significant mutual influence between the auditory and the visual domain in the perceived overall audio–visual quality. To explain the relationship between audio quality (AQ), video quality (VQ), and the overall audio–visual quality (AVQ), five combinations of stimulus types and assessment tasks, presented in Table 1, have been suggested. In the table, AQ_V denotes the audio quality in the presence of a visual stimulus, and VQ_A denotes the video quality in the presence of an auditory stimulus. Earlier studies have shown that when a combined auditory–visual stimulus was given, the judgment of the quality in one modality was influenced by the presence of the other modality [30], [31]. Other experiments have studied how to derive AVQ from AQ and VQ alone; in these experiments, three subjective assessments corresponding to AQ, VQ, and AVQ in Table 1 were conducted. Most studies have shown that VQ dominates AVQ in general [30], [32], [33], while Hands [34] suggested that AQ is more important than VQ in a teleconference setup, because human attention is mainly focused on the auditory stimulus. Winkler's experiments [35], [36] have shown that more bits should be allocated to audio to achieve a higher overall quality under very low total bitrate budgets. Moreover, the relationship between AQ, VQ, and AVQ is also influenced by other factors, such as the attention of subjects, the audio–visual content itself, the usage context, and the experiment environment [37], [38]. It has been proposed that the overall audio–visual quality can be derived from a linear combination and a multiplication of AQ and VQ, where the product of AQ and VQ correlates very highly with the overall quality [33].
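The multiplicative form mentioned above can be sketched as a one-line predictor. The coefficient values below are hypothetical placeholders for illustration; in practice they must be fitted (e.g., by least squares) to subjective scores for the target content type and viewing context.

```python
def predict_avq(aq, vq, alpha=0.25, beta=0.15):
    """Predict overall audio-visual quality from per-modality qualities.

    Implements the multiplicative model AVQ = alpha + beta * AQ * VQ.
    alpha and beta are hypothetical values, not fitted coefficients from
    any of the cited studies.
    """
    return alpha + beta * aq * vq
```

The product term is what lets a severe degradation in either modality pull the overall score down, which a purely additive model cannot capture as directly.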

When assessing the overall audio–visual quality, synchronization between audio and video (e.g., lip sync) may be an important issue. It is known that the perception of asynchrony depends on the type of content and on the task [39]. The overall quality is degraded when audio and video do not form a single spatio-temporally coherent stream. Dixon and Spitz [40] claimed that the perceived quality degrades rapidly as asynchrony increases. Massaro et al. [41] also reported that intelligibility decreases when audio and video are not in sync. When the audio is about 150 ms earlier than the video, subjects may find the asynchrony annoying [42], [43]. When the video is earlier than the audio, the same degradation is perceived only for asynchronies about twice as large. A large number of audio–video synchronization methods have been proposed [44], [45]. In this survey paper, we will briefly introduce related issues in audio–video synchronization, while mainly focusing on objective quality metrics for spatio-temporally coherent audio–visual systems.
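The asymmetric tolerance just described can be expressed as a simple rule of thumb. The 150 ms figure and the factor of two follow the numbers quoted above, but this function is an illustrative sketch, not a normative threshold model.

```python
def asynchrony_annoying(skew_ms, audio_lead_limit_ms=150.0):
    """Rough annoyance check for audio-video skew.

    skew_ms > 0 means the audio leads the video; skew_ms < 0 means the
    video leads. Per the asymmetry reported in the literature, roughly
    twice as much skew is tolerated when the video leads. The default
    limit is an illustrative figure taken from the surrounding text.
    """
    if skew_ms >= 0:
        return skew_ms > audio_lead_limit_ms
    return -skew_ms > 2.0 * audio_lead_limit_ms
```

The asymmetry is often attributed to everyday experience: since sound travels more slowly than light, viewers are accustomed to audio arriving late, never early.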

Another issue that cannot be neglected when developing audio–visual quality metrics is the relationship between the semantic importance of audio–visual content and the perceived quality. Little work has been done on this topic. As mentioned earlier, the mutual influence of AQ and VQ is related to the audio–visual content; for example, it can be assumed that AQ is more important than VQ for typical teleconference sequences: the visual content consists principally of the head and shoulders of speakers, while the audio content is semantically more important as it may convey more information. Moreover, there is a significant relation between semantic audio–visual importance and the perceived quality. For instance, when the same quality degradation occurs in two audio–visual segments with different importance levels, subjects might give different quality judgments for the two segments. Although most existing quality metrics take the audio and video contents into account latently, the relationship between semantic audio–visual importance and the perceived quality has not been studied adequately. There are two challenging problems in integrating semantic importance into quality assessment. Firstly, semantics is a highly subjective concept, so it is challenging to construct a generic semantic importance model for audio–visual content. Existing semantic analysis methods mainly focus on certain types of multimedia content, such as sports video, and typically exploit audio–visual features; for example, we have proposed a semantic analysis framework for video content based on visual perception [46]. Secondly, because semantics is such a subjective concept, it is difficult to define an order of semantic importance among pieces of audio–visual content: a sports sequence might be important for sports fans, whereas a child may consider a cartoon sequence more important. Thus, rather than comparing different content items, the semantic importance of different temporal segments within one audio–video sequence is usually compared. Taking a football sequence as an example, goal segments are usually important for most subjects; consequently, the quality scoring on the goal segments is potentially different from that on other scenes in the same sequence.
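One natural way to use such per-segment importance, once estimated, is importance-weighted temporal pooling of segment quality scores. This is a sketch of the pooling step only; how the importance weights themselves are obtained is exactly the open problem discussed above.

```python
def pooled_quality(segment_scores, importance_weights):
    """Importance-weighted temporal pooling of per-segment quality scores.

    importance_weights is a hypothetical per-segment semantic-importance
    estimate (e.g. higher for goal segments in a football sequence);
    deriving such weights automatically is an open research problem.
    """
    total = sum(importance_weights)
    return sum(q * w for q, w in zip(segment_scores, importance_weights)) / total
```

With uniform weights this reduces to the plain temporal average; skewed weights let a badly degraded goal segment dominate the sequence-level score.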

As a basis of developing an audio–visual quality model, we first study the popular objective quality metrics for single auditory and visual modality, and analyze the characteristics and performance of these metrics. Some issues related to audio–visual quality assessment are then studied, including audio–video synchronization, the interactive process of multi-modalities (i.e. audio and video in this paper), semantic multimedia analysis and its relation with quality assessment, temporal averaging methods for long-term audio–visual sequences, etc. In addition, we briefly review subjective quality assessment methodologies for audio–visual signals. To make this survey paper more specific, we will focus on the general audio signals and FR models.

The rest of this paper is organized as follows: we introduce alignment methods for audio signals, video signals, and audio–video synchronization in Section 2. The audio quality metrics, especially PEAQ, are introduced in Section 3. Section 4 introduces some well-known FR and RR video quality metrics and presents the algorithm analysis and experimental results. Subjective quality assessment methodologies for audio–visual signals are reviewed in Section 5. The mutual influence of audio, video and audio–visual qualities, as well as some relevant issues and trends of audio–visual quality assessment, are investigated in Section 6. Finally, some conclusions are drawn in Section 7.

Section snippets

Alignment for audio–video signals

Alignment between distorted audio–visual signals and original signals has a significant influence on quality assessment. Slight misalignment may not affect subjective quality evaluation by humans, but it can greatly reduce the accuracy of an objective quality metric. In addition, audio–video synchronization is another important issue for audio–visual quality assessment. This section will briefly investigate the related issues in the alignment of audio–video signals, and approaches for

Perceptual evaluation of audio quality

Objective audio quality models that incorporate properties of the human auditory system have existed since the 1970s and were mainly applied to speech codecs. As mentioned earlier, a number of psychoacoustic models have been proposed to measure the perceived audio quality, and six of them [3], [4], [5], [6], [7], [8] were extracted and integrated by the ITU into the standard method, PEAQ [9]. Although there are many other metrics that have good performance, we concentrate on PEAQ and the

Objective video quality metrics

The goal of objective video quality metrics is to give quality predictions, which are in accordance with subjective assessment results. Thus, a good objective metric should take the psychophysical process of the human vision and perception system into account. Main characteristics of the HVS include modeling of contrast and orientation sensitivity, frequency selection, spatial and temporal pattern masking and color perception [24]. Due to their generality, HVS-based metrics can in principle be

Subjective methodology for audio–visual quality assessment

As introduced in the preceding sections, ITU has come up with a number of normative recommendations of how to perform subjective assessments of perceived quality in general. These suggestions are internationally recognized and allow a comparison of results of assessments carried out in different laboratories. They define the test conditions as well as the form of presentation. Suggestions on rating scales to be used as well as on the classification of the test material are given. Several

Perceptual-based audio–visual quality metrics

In the preceding sections, we have investigated and analyzed a number of audio quality and video quality metrics. These methods assume only a single modality, either video or audio. Nevertheless, it has been shown by subjective tests (e.g. in [30]) that there is a strong mutual influence between audio and video on the experienced overall quality. At present, there is no reliable metric available for measuring the audio–visual quality automatically. In order to produce a reliable and accurate

Conclusions

We surveyed perceptual-based audio and video quality assessment methods in this paper. The main existing methods were introduced and analyzed, and the experimental results with respect to subjective assessment were presented. Alignment of audio and video signals and audio–video synchronization were reviewed. For audio quality metrics, we mainly concentrated on PEAQ and some improvement methods as well as two simple metrics. The basic version of PEAQ was implemented and validated by the

Acknowledgement

The authors would like to thank three anonymous reviewers and the editor for their valuable and constructive comments on this paper.

References (119)

  • X.K. Yang et al.

    Just noticeable distortion model and its applications in video coding

    Signal Process.: Image Commun.

    (2005)
  • Z. Wang et al.

    Video quality assessment based on structural distortion measurement

    Signal Process.: Image Commun.

    (2004)
  • ITU-T Recommendation P.911, Subjective audiovisual quality assessment methods for multimedia applications, ITU...
  • ITU-T Recommendation J.148, Requirements for an objective perceptual multimedia quality model, ITU Telecommunication...
  • K. Brandenburg, Evaluation of quality for audio encoding at low bit rates, in: Proceedings of the Contribution to the...
  • J.G. Beerends et al.

    A perceptual audio quality measure based on a psychoacoustics sound representation

    J. Audio Eng. Soc.

    (1992)
  • B. Paillard et al.

    Perceval: perceptual evaluation of the quality of audio signals

    J. Audio Eng. Soc.

    (1992)
  • C. Colomes et al.

    A perceptual model applied to audio bit-rate reduction

    J. Audio Eng. Soc.

    (1995)
  • T. Sporer, Objective audio signal evaluation – applied psychoacoustics for modeling the perceived quality of digital...
  • T. Thiede, E. Kabot, A new perceptual quality measure for bit rate reduced audio, in: Proceedings of the Contribution...
  • ITU-R Recommendation BS.1387-1, Method for objective measurement of perceived audio quality, ITU Telecommunication...
  • C.D. Creusere et al.

    An objective metric for human subjective audio quality optimized for a wide range of audio fidelities

    IEEE Trans. Audio Speech Lang. Process.

    (2008)
  • J. Barbedo et al.

    A new cognitive model for objective assessment of audio quality

    J. Audio Eng. Soc.

    (2005)
  • S. Zielinski et al.

    Development and initial validation of a multichannel audio quality expert system

    J. Audio Eng. Soc.

    (2005)
  • R. Vanam, C.D. Creusere, Scalable perceptual metric for evaluating audio quality, in: Proceedings of the Conference...
  • C.D. Creusere et al.

    Understanding perceptual distortion in MPEG scalable audio coding

    IEEE Trans. Speech Audio Process.

    (2005)
  • B. Girod

    What’s wrong with mean-squared error?

  • Video Quality Experts Group,...
  • ITU-T Recommendation J.144, Objective perceptual video quality measurement techniques for digital cable television in...
  • ITU-T Recommendation J.247, Objective perceptual multimedia video quality measurement in the presence of a full...
  • VQEG, Final report from the Video Quality Experts Group on the validation of objective models of video quality...
  • VQEG, Final report from the Video Quality Experts Group on the validation of objective models of video quality...
  • VQEG, Final report from the Video Quality Experts Group on the validation of objective models of multimedia quality...
  • S. Winkler

    Video quality and beyond

    Proc. Eur. Signal Process. Conf.

    (September 2007)
  • S. Winkler

    Perceptual video quality metrics – A review

  • S. Winkler

    Digital Video Quality: Vision Models and Metrics

    (2005)
  • C. Lee et al.

    Objective video quality assessment

    Opt. Eng.

    (2006)
  • S. Liu et al.

    Efficient DCT-domain blind measurement and reduction of blocking artifacts

    IEEE Trans. Circuits Syst. Video Technol.

    (2002)
  • P. Marziliano et al.

    A no-reference perceptual blur metric

    Proc. IEEE Int. Conf. Image Process.

    (September 2002)
  • M. Ries, O. Nemethova, M. Rupp, Reference-free video quality metric for mobile streaming applications, in: Proceedings...
  • M.P. Hollier et al.

    Multi-modal perception

    BT Technol. J.

    (1999)
  • J.G. Beerends et al.

    The influence of video quality on perceived audio quality and vice versa

    J. Audio Eng. Soc.

    (1999)
  • N. Kitawaki, Y. Arayama, T. Yamada, Multimedia opinion model based on media interaction of audio–visual communications,...
  • C. Jones, D.J. Atkinson, Development of opinion-based audiovisual quality models for desktop...
  • ANSI-Accredited Committee T1 Contribution, T1A1.5/94-124, Combined A/V model with multiple audio and video impairments,...
  • D.H. Hands

    A basic multimedia quality model

    IEEE Trans. Multimedia

    (2004)
  • S. Winkler, C. Faller, Maximizing audiovisual quality at low bitrates, in: Proceedings of the Workshop on Video...
  • S. Winkler et al.

    Perceived audiovisual quality of low-bitrate multimedia content

    IEEE Trans. Multimedia

    (2006)
  • M.R. Frater et al.

    Impact of audio on subjective assessment of video quality in videoconference application

    IEEE Trans. Circuits Syst. Video Technol.

    (2001)
  • ITU-T Contribution COM 12-61-E, Study of the influence of experimental context on the relationship between audio, video...
  • R. van Eijk et al.

    Audiovisual synchrony and temporal order judgments: effects of experimental method and stimulus type

    Percept. Psychophys.

    (2008)
  • N.F. Dixon et al.

    The detection of auditory visual desynchrony

    Perception

    (1980)
  • D.W. Massaro et al.

    Perception of asynchronous and conflicting visual and auditory speech

    J. Acoust. Soc. Am.

    (1980)
  • ITU-R SG11 11A/55, Evaluation of the subjective effects of timing errors between sound and vision signals in...
  • S. Rihs, The influence of audio on perceived picture quality and subjective audio–video delay tolerance in RACE MOSAIC...
  • G. Blakowski et al.

    A media synchronization survey: Reference model, specification, and case studies

    IEEE J. Selected Areas Commun.

    (1996)
  • R. Steinmetz

    Human perception of jitter and media synchronization

    IEEE J. Selected Areas Commun.

    (1996)
  • J. You et al.

    A multiple visual models based perceptive analysis framework for multilevel video summarization

    IEEE Trans. Circuits Systems Video Technol.

    (2007)
  • Y. Gao

    Audio coding standard overview: MPEG4-AAC, HE-AAC, and HE-AAC v2

  • Sonic Visualiser Software, [online] available: 〈http://www.sonicvisualiser.org/〉, MATCH Vamp plugin, [online]...

    This work was supported in part by the Academy of Finland (application number 213462, Finnish Program for Centres of Excellence in Research 2006–2011) while Junyong You worked at Tampere University of Technology.

    1. Centre for Quantifiable Quality of Service in Communication Systems, Centre of Excellence, appointed by the Research Council of Norway, funded by the Research Council, NTNU and UNINETT.

    EURASIP member.
