
Displays

Volume 70, December 2021, 102096

Spatiotemporal just noticeable difference modeling with heterogeneous temporal visual features

https://doi.org/10.1016/j.displa.2021.102096

Highlights

  • This paper extracts content-aware temporal features of video for JND modeling.

  • We propose statistical probability models to quantitatively depict feature parameters.

  • We quantify saliency and uncertainty to ease the difficulty of fusing heterogeneous features.

Abstract

Developing accurate just-noticeable difference (JND) models is challenged by the complicated characteristics of the HVS and the nonstationary features of video sequences. Great efforts have been devoted to JND modeling, and inspiring performance improvements have been reported in the literature, especially for spatial JND models. However, there is not only an urgent requirement but also technical potential for improving temporal JND models that fully account for temporal perception characteristics. Temporal JND modeling faces two challenges: how to extract perceptual feature parameters from the source video, and how to quantitatively characterize the interaction between these feature parameters and HVS characteristics. Firstly, this work extracts content-aware temporal feature parameters that have predominant impacts on visual perception, including foreground and background motion, pixel-correspondence duration and inter-frame residue fluctuation intensity along the temporal trajectory, and investigates the HVS responses to these four heterogeneous feature parameters. Secondly, this work proposes respective probability density functions (PDF) in the perception sense to quantitatively depict the attention and suppression responses to the feature parameters, accounting for temporal perception characteristics. Using these PDF models, we fuse the heterogeneous feature parameters in a uniform dimension, i.e. self-information-measured visual attention and information-entropy-measured masking uncertainty, achieving heterogeneous parameter homogenization. Thirdly, with the self-information and entropy results, this work proposes a temporal weight model, striking a balance between visual attention and masking suppression, to adjust the spatial JND threshold, and thereby develops an improved spatiotemporal JND model. Intensive simulation results verify the effectiveness of the proposed spatiotemporal JND profile, which achieves competitive model accuracy compared with state-of-the-art candidate models.

Introduction

Video signals undergo a series of processing steps before being presented for human viewing, such as enhancement [1], compression, transmission, and storage [2]. The ultimate judge of video quality is the human eye, so human visual system (HVS) characteristics should be taken into account in distortion and video quality assessment [3], [4]. The just-noticeable difference (JND) refers to the minimum visibility threshold of the HVS [2]. Computational JND models have been widely applied in image and video compression [5], perceptual quality evaluation [6], digital watermarking [7], image enhancement [8], etc.

JND models can generally be divided into two categories according to their operating domains: pixel-domain and transform-domain JND profiles [9]. Pixel-domain JND models are constructed using the spatial context features of an image within a local support, with which the spatial characteristics of the HVS are imitated and exploited. They are usually used for quality assessment [10], image processing and spatially adaptive image coding, e.g. quantization and bit allocation [11]. In transform-domain JND models, frequency-domain HVS characteristics, such as the contrast sensitivity function (CSF), are described using the frequency-domain context features of the transformed image and exploited for JND modeling. Transform-domain JND models are widely employed in hybrid image and video compression [3].

JND models should be developed according to HVS characteristics such as luminance adaptation (LA), contrast masking (CM), foveated masking (FM), the spatio-temporal contrast sensitivity function (CSF), temporal masking (TM) and visual attention. We model the human eye as a signal processing system with a perception response function that explicitly and quantitatively describes the HVS characteristics. Input video sequences, characterized by diversified signal features, are perceived with a certain perceptual quality or distortion. This stimulus-response process should be quantitatively described in a signal-and-system manner. Thus, there are two considerations in JND modeling: on the one hand, we need to quantitatively measure the perception characteristics of the HVS; on the other hand, we need to extract the feature parameters characterizing the input video signal.

LA represents the sensitivity of the human eye under different background luminance levels. The CM effect represents how HVS perception varies over backgrounds with different texture complexity, which can be measured by features such as edges and textures. The CSF indicates a bandpass sensitivity property of the HVS with respect to the spatial frequency distribution of the image. The FM effect depicts how HVS sensitivity varies with retinal eccentricity from the fixation point of the human eye, which can be described by the position distribution of gaze points and the corresponding visual attention intensities. The TM effect is mainly due to rapid temporal changes, which degrade the detail resolution ability of the human eye; it is usually modeled based on the temporal-domain frequency and the retinal image velocity [12]. Compared with the perception suppression in masking effects, visual attention is an important cognitive property of the HVS [13], [14], which is widely used in image quality assessment [15], [16], JND modeling [17], etc. From the viewpoint of the bottom-up perception mechanism, attention is activated by hierarchical stimuli in signals, such as color, luminance, direction and other primary visual features of the image [18].
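For concreteness, the LA effect above is often quantified as a piecewise visibility-threshold curve over background luminance. The following minimal sketch (Python/NumPy) follows the widely cited Chou and Li formulation [24]; the constants are the commonly reported ones and are not necessarily the values used in this paper.

```python
import numpy as np

def luminance_adaptation_threshold(bg):
    """Visibility threshold versus mean background luminance bg in [0, 255],
    following the classic pixel-domain formulation of Chou and Li [24]."""
    bg = np.asarray(bg, dtype=np.float64)
    dark = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0   # dark backgrounds: larger threshold
    bright = (3.0 / 128.0) * (bg - 127.0) + 3.0       # bright backgrounds: slowly rising threshold
    return np.where(bg <= 127, dark, bright)

# Example: thresholds at a few background luminance levels
print(luminance_adaptation_threshold([0, 64, 127, 200, 255]))
```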

In the literature, several works have been proposed to develop spatial JND profiles, in which image feature parameters such as background luminance, edges, direction, texture and color are taken into account. However, temporal characteristics have not been elaborately and fully considered in spatiotemporal JND models, mainly because of the complicated interaction between temporal content-aware features and the perception responses they induce. The reported JND profiles either suffer from insufficient model accuracy, due to incomplete consideration of HVS characteristics, or carry a computational complexity that makes them difficult to integrate. As a result, JND thresholds are often overestimated or underestimated, and there is still a long way to go in the evolution of JND modeling, especially with respect to temporal HVS characteristics.

Therefore, in order to accurately estimate a JND threshold that fully characterizes temporal perception, it is necessary to extract discriminative temporal feature parameters and to quantify the perceptual activation they induce. We investigate three types of motion in video, namely absolute motion, relative motion and background motion, and analyze the visual attention induced by relative motion and the masking suppression incurred by background motion. In addition, we explore the temporal duration along the motion trajectory to measure the perceptual attention effect, accounting for the temporal-memory effect and asymmetric perception of the HVS. The intensity of the inter-frame prediction residue fluctuates along the motion trajectory, and we measure this fluctuation intensity to account for the uncertainty-incurred suppression associated with the temporal masking effect.

Different feature parameters affect perceptual distortion differently: some are associated with visual attention while others are related to masking suppression. Visual attention and masking suppression are jointly caused by the interaction and interference among diversified stimuli [19], i.e. feature parameters. Exploring the coupling effect of these stimuli has been a critical research issue in JND modeling [20]. For example, the NAMM model is commonly used in pixel-domain JND modeling to eliminate the overlapping masking effect, and the DCT-domain JND profile is generally modeled as the product of multiple masking factors. However, the interaction mechanism between visual attention and masking suppression is complicated, and there is still technical potential for further improving temporal JND models.
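As a concrete reference for the NAMM combination mentioned above, the sketch below reproduces the commonly cited form of the nonlinear additivity rule introduced by Yang et al. [25]; the overlap coefficient of 0.3 is the value usually reported in the literature and may differ from the setting used in this paper.

```python
import numpy as np

def namm_fuse(t_lum, t_tex, c_overlap=0.3):
    """NAMM-style fusion of luminance and texture masking thresholds:
    the two thresholds are summed and the overlapping masking effect
    is subtracted, weighted by c_overlap."""
    t_lum = np.asarray(t_lum, dtype=np.float64)
    t_tex = np.asarray(t_tex, dtype=np.float64)
    return t_lum + t_tex - c_overlap * np.minimum(t_lum, t_tex)
```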

It is both necessary and meaningful to develop a unified framework that fuses diversified perceptual feature parameters for JND modeling. Generally speaking, visual perception can be modeled within an information communication framework, in which the HVS is formulated as an efficient encoder or information extractor [21]. As the information source (the video signal) passes through an equivalent error-prone communication channel (the HVS), the eye-brain perception system automatically extracts hierarchical salient features that activate attention and irregular features that activate masking suppression. Image areas containing more information are more likely to attract visual attention and fixations [20], [22]. The HVS cannot perceive all image content with the same degree of certainty; different feature signals correspond to different degrees of uncertainty, i.e. different channel distortions in the eye-brain information transmission.

Inspired by the framework in [20], [23], this work proposes statistical probability models for the four perceptual feature parameters, and uses information theory to quantify the corresponding saliency (visual attention) or uncertainty (masking suppression) in the sense of visual perception. Finally, the four perceptual feature parameters are mapped to the same dimension and fused to derive an adaptive temporal weight, which is incorporated to adjust the spatial JND threshold and thus yields the improved spatiotemporal JND model.
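To make the homogenization idea concrete, the schematic sketch below shows how self-information and entropy could be computed from assumed probability models and traded off into a temporal weight. The actual PDF models and weight function are defined in Sections 3 and 4 and are not reproduced here; the exponential trade-off in temporal_weight is only a placeholder illustrating that attention should decrease the JND threshold while masking uncertainty should increase it.

```python
import numpy as np

def self_information(p):
    """Saliency of an attention-related feature value: I = -log2 p(x),
    where p is the probability assigned to the observed value by its PDF model."""
    return -np.log2(np.clip(p, 1e-12, 1.0))

def entropy(pmf):
    """Uncertainty of a masking-related feature: Shannon entropy (in bits)
    of its discretized probability distribution."""
    pmf = np.clip(np.asarray(pmf, dtype=np.float64), 1e-12, None)
    pmf = pmf / pmf.sum()
    return float(-(pmf * np.log2(pmf)).sum())

def temporal_weight(I_attention, U_masking, alpha=1.0, beta=1.0):
    """Placeholder fusion: attention (self-information of relative motion and
    duration) pulls the weight below 1, while masking uncertainty (entropy of
    background motion and residue fluctuation) pushes it above 1."""
    return np.exp(-alpha * np.asarray(I_attention) + beta * np.asarray(U_masking))
```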

The rest of the paper is organized as follows. Section 2 reviews related work on JND modeling and analyzes the motivation. Section 3 proposes the four feature parameters and describes the homogenization method for these heterogeneous feature parameters. Section 4 details the proposed spatiotemporal JND modeling method. Section 5 gives the simulation results verifying the proposed JND profile. Section 6 summarizes the paper.

Section snippets

Review of JND modeling

In the past two decades, inspiring pixel-domain JND profiles have been proposed one after another in the literature. As early as 1995, Chou and Li [24] took the luminance masking (LM) and CM effects into account and proposed a pixel-domain JND model. Following the work of Chou, Yang [25] proposed a generalized JND model in which a nonlinear additivity model for masking (NAMM) was employed to eliminate the overlapping effect between LM and CM. Liu [26] decomposed the input image into texture region and

Extraction and quantification for relative motion

The pixel-wise motion information in a video sequence can be represented as a three-dimensional field of motion vectors, denoted as v, in which the spatial and temporal indices are dropped. In this work, we consider three motion components: absolute motion, background motion and relative motion. The relative motion vector vr is the difference between the absolute motion vector va and the background motion vector vg [55], as shown in Fig. 1: vr = va − vg.
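A minimal sketch of this decomposition is given below; the median-based background motion estimate is only a stand-in, since the background motion vg is obtained in the paper with the robust parametric estimation of [55].

```python
import numpy as np

def relative_motion(flow, bg_motion=None):
    """Split a dense absolute motion field va into background and relative parts.
    flow: H x W x 2 array of per-pixel motion vectors va.
    bg_motion: global/background motion vg; if None, it is approximated here
    by the per-component median of the field (a stand-in for the robust
    parametric estimation used in the paper [55])."""
    flow = np.asarray(flow, dtype=np.float64)
    if bg_motion is None:
        bg_motion = np.median(flow.reshape(-1, 2), axis=0)
    bg_motion = np.asarray(bg_motion, dtype=np.float64)
    v_r = flow - bg_motion  # relative (object) motion: vr = va - vg
    return v_r, bg_motion
```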

The absolute motion vector va is

Proposed spatiotemporal JND model

As described in Section 3, this work measures the perceptual saliency I of the feature parameters including relative motion and temporal duration, and measures the perceptual uncertainty U of the feature parameters including background motion and inter-frame residue fluctuation intensity. Based on the information theory framework, resorting to self-information and information entropy, the proposed work unifies the four feature parameters for homogeneous fusion, and then determines a weight factor

Simulation results

In order to verify the effectiveness of the proposed JND model, we conduct intensive objective and subjective video quality evaluation experiments. A total of nine test sequences were selected, of which five (Ba, BT, CT, Ki, Pa) are 1920x1080 full-HD resolution and the other four (BD, BM, PS, RH) are WVGA resolution (832x480). For a test video, additive noise is injected under the guidance of the JND model, which is formulated similar
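As a reference for this evaluation protocol, a common form of JND-guided noise injection is sketched below: each pixel is perturbed by its JND threshold with a random sign, and the result is checked objectively with PSNR. The exact injection formula used in this paper may include additional scaling terms.

```python
import numpy as np

def inject_jnd_noise(frame, jnd_map, scale=1.0, seed=0):
    """Perturb each pixel by its JND threshold with a random sign; a more
    accurate JND model hides more noise energy at the same subjective quality."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=frame.shape)
    noisy = frame.astype(np.float64) + scale * signs * jnd_map
    return np.clip(noisy, 0, 255).astype(np.uint8)

def psnr(ref, test):
    """Objective check of the noise-injected frame against the original."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")
```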

Conclusions

This paper proposes a new temporal JND weight model by fully exploring the temporal HVS characteristics and analyzing the temporal feature parameters along the motion trajectory in video. We measure the stimulus saliency induced by relative motion and temporal duration along motion trajectory, and the masking uncertainty induced by background motion and inter-frame residual fluctuation intensity. Self-information and information entropy are used to measure the degree of stimulus saliency and

CRediT authorship contribution statement

Yafen Xing: Conceptualization, Methodology, Software, Data curation, Writing – original draft. Haibing Yin: Conceptualization, Validation, Writing – review & editing, Project administration, Funding acquisition. Yang Zhou: Writing – review & editing, Methodology. Yong Chen: Writing – review & editing, Methodology. Chenggang Yan: Writing – review & editing, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by NSFC 61972123, 61931008, 62031009 and ZJNSF LY19F020043.

References (60)

  • M.J. Black et al., The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields, Comput. Vis. Image Underst. (1996).
  • J. Wu et al., Non-local spatial redundancy reduction for bottom-up saliency estimation, J. Vis. Commun. Image Represent. (2012).
  • K. Gu et al., Automatic contrast enhancement technology with saliency preservation, IEEE Trans. Circuits Syst. Video Technol. (2014).
  • Z. Wei et al., Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain, IEEE Trans. Circuits Syst. Video Technol. (2009).
  • M. Uzair et al., Estimating just-noticeable distortion for images/videos in pixel domain, IET Image Process. (2017).
  • K. Gu et al., The analysis of image contrast: From quality assessment to automatic enhancement, IEEE Trans. Cybern. (2015).
  • C.M. Mak, K.N. Ngan, Enhancing compression rate by just-noticeable distortion model for H.264/AVC, in: Proc. IEEE Int....
  • Z. Lu et al., Modeling visual attention’s modulatory aftereffects on visual sensitivity and quality evaluation, IEEE Trans. Image Process. (2005).
  • W. Lin et al., Visual distortion gauge based on discrimination of noticeable contrast changes, IEEE Trans. Circuits Syst. Video Technol. (2005).
  • Z. Chen, H. Liu, JND modeling: Approaches and applications, in: Proc. 2014 19th Int. Conf. Digit. Signal Process.,...
  • H. Chen, R. Hu, J. Hu, Z. Wang, Temporal color just noticeable distortion model and its application for video coding,...
  • S.J. Daly, Engineering observations from spatiovelocity and spatiotemporal visual models, in: Proc. SPIE Conf. on Human...
  • H. Hadizadeh et al., Saliency-aware video compression, IEEE Trans. Image Process. (2013).
  • Z. Chen et al., Perceptually-friendly H.264/AVC video coding based on foveated just-noticeable-distortion model, IEEE Trans. Circuits Syst. Video Technol. (2010).
  • K. Gu et al., Saliency-guided quality assessment of screen content images, IEEE Trans. Multimedia (2016).
  • G. Zhai et al., Perceptual image quality assessment: a survey, Sci. China Inf. Sci. (2020).
  • Z. Zeng et al., Visual attention guided pixel-wise just noticeable difference model, IEEE Access (2019).
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998).
  • S.L. Macknik et al., Neuronal correlates of visibility and invisibility in the primate visual system, Nat. Neurosci. (1998).
  • Z. Wang et al., Video quality assessment using a statistical model of human visual speed perception, JOSA A (2007).
This paper was recommended for publication by Guangtao Zhai.
