Spatiotemporal just noticeable difference modeling with heterogeneous temporal visual features☆
Introduction
Video signals undergo a series of processing steps before being presented for human viewing, such as enhancement [1], compression, transmission, and storage [2]. The ultimate judge of a video is the human visual system (HVS), so HVS characteristics should be taken into account in distortion measurement and video quality assessment [3], [4]. Just-noticeable difference (JND) refers to the minimum visibility threshold of the HVS [2]. Computational JND models have been widely applied to image and video compression [5], perceptual quality evaluation [6], digital watermarking [7], image enhancement [8], etc.
JND models can generally be divided into two categories according to their operating domains: transform-domain and pixel-domain JND profiles [9]. Pixel-domain JND models are constructed from the spatial context features of the image within a local support, through which the spatial characteristics of the HVS are imitated and exploited. Pixel-domain JND models are typically used for quality assessment [10], image processing, and spatially adaptive image coding, such as quantization and bit allocation [11]. In transform-domain JND models, frequency-domain HVS characteristics, such as the contrast sensitivity function (CSF), are described using frequency-domain context features of the transformed image and utilized for JND modeling. Transform-domain JND models are widely employed in hybrid image and video compression [3].
JND models should be developed according to HVS characteristics: luminance adaptation (LA), contrast masking (CM), foveated masking (FM), the spatiotemporal contrast sensitivity function (CSF), temporal masking (TM), visual attention, etc. We model the human eye as a signal processing system with a perception response function that explicitly and quantitatively describes these HVS characteristics. Input video sequences, characterized by diverse signal features, are perceived with a certain perceptual quality or distortion, and this stimulus-activation-to-HVS-response process should be quantitatively described in a signal-and-system manner. Thus, there are two considerations in JND modeling: on one hand, we need to quantitatively measure the perception characteristics of the HVS; on the other hand, we need to extract the feature parameters characterizing the input video signal.
LA represents the sensitivity of the human eye under different background luminance levels. The CM effect represents the perception of the HVS over backgrounds of varying texture complexity, which can be measured by features such as edges and textures. The CSF indicates the bandpass sensitivity of the HVS with respect to the spatial frequency distribution of the image. The FM effect describes how HVS sensitivity varies with retinal eccentricity from the attention point of the eyes; it can be characterized by the position distribution of gaze points and the corresponding visual attention intensities. The TM effect is mainly due to rapid temporal changes that degrade the eye's ability to resolve detail; it is usually modeled based on the temporal-domain frequency and retinal image velocity [12]. In contrast to the perceptual suppression of masking effects, visual attention is an important cognitive property of the HVS [13], [14], widely used in image quality assessment [15], [16], JND modeling [17], etc. From the viewpoint of the bottom-up perception mechanism, attention is activated by hierarchical stimuli in the signal, such as color, luminance, direction, and other primary visual features of the image [18].
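As a concrete example of how one such characteristic is quantified, the LA visibility threshold is often modeled as a piecewise function of background luminance, in the form popularized by Chou and Li [24]; a minimal sketch (constants follow that classic model and are not specific to this paper):

```python
import math

def luminance_adaptation_threshold(bg: float) -> float:
    """Visibility threshold versus background luminance bg in [0, 255],
    following the classic piecewise model of Chou and Li."""
    if bg <= 127:
        # Dark backgrounds: the threshold rises as luminance drops
        # (the eye tolerates larger distortion in dark regions).
        return 17.0 * (1.0 - math.sqrt(bg / 127.0)) + 3.0
    # Bright backgrounds: the threshold grows linearly with luminance.
    return (3.0 / 128.0) * (bg - 127.0) + 3.0
```

For instance, the threshold is largest in the darkest regions (about 20 gray levels at bg = 0) and reaches its minimum of 3 at mid-gray (bg = 127).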
Several works in the literature have developed spatial JND profiles, in which image feature parameters such as background luminance, edges, direction, texture, and color are taken into account. However, temporal characteristics have not been elaborately and fully considered in spatiotemporal JND models, mainly because of the complicated interaction between temporal content-aware features and the perception responses they induce. Reported JND profiles either suffer from insufficient accuracy due to incomplete consideration of HVS characteristics, or carry a computational complexity that makes them difficult to integrate. JND thresholds are thus often overestimated or underestimated, and there is still a long way to go in JND modeling, especially in accounting for temporal HVS characteristics.
Therefore, to accurately estimate a JND threshold that fully characterizes temporal perception, it is necessary to extract discriminative temporal feature parameters and quantify the perceptual activation they induce. We investigate three types of motion in video, absolute motion, relative motion, and background motion, and analyze the visual attention induced by relative motion and the masking suppression incurred by background motion. In addition, we explore the temporal duration along the motion trajectory to measure the perceptual attention effect, accounting for the temporal-memory effect and asymmetric perception of the HVS. The intensities of the inter-frame prediction residue fluctuate along the motion trajectory, and we measure this fluctuation intensity to account for the uncertainty-induced suppression associated with the temporal masking effect.
Different feature parameters affect perceived distortion differently: some are associated with visual attention while others are related to masking suppression. Visual attention and masking suppression are jointly caused by the interaction and interference among diverse stimuli [19], i.e., feature parameters. Exploring the coupling effect of these stimuli has been a critical research issue in JND modeling [20]. For example, the NAMM model is commonly used in pixel-domain JND modeling to eliminate the overlapping masking effect, and DCT-domain JND profiles are generally modeled as the product of multiple masking factors. However, the interaction mechanism between visual attention and masking suppression is complicated, and there is still room to further improve temporal JND modeling.
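The NAMM overlap elimination mentioned above sums the two masking thresholds and then subtracts a fraction of their overlap, so the joint masking effect is not double-counted; a minimal sketch (the gain-reduction factor 0.3 is the value commonly used in the literature, not one verified against this paper):

```python
def namm_jnd(t_l: float, t_c: float, c_lc: float = 0.3) -> float:
    """Nonlinear additivity model for masking (NAMM):
    combine a luminance masking threshold t_l and a contrast masking
    threshold t_c, discounting their overlapping contribution."""
    return t_l + t_c - c_lc * min(t_l, t_c)
```

The subtraction matters when both effects are strong: with t_l = 10 and t_c = 4, the combined threshold is 12.8 rather than the naive sum 14.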
It is both inevitable and meaningful to develop a unified framework that fuses diverse perceptual feature parameters for JND modeling. Generally speaking, visual perception can be modeled as an information communication framework in which the HVS is formulated as an efficient encoder or information extractor [21]. As the information source (the video signal) passes through an equivalent error-prone communication channel (the HVS), the eye-brain perception system automatically extracts hierarchical salient features that activate attention, and irregular features that activate masking suppression. Image areas that contain more information are more likely to attract visual attention and fixations [20], [22]. The HVS cannot perceive all image content with the same degree of certainty; different feature signals correspond to different degrees of uncertainty, i.e., different channel distortions in the eye-brain information transmission.
Inspired by the framework in [20], [23], this work proposes statistical probability models for four perceptual feature parameters and uses information theory to quantify the corresponding saliency (visual attention) or uncertainty (masking suppression) in the sense of visual perception. Finally, the four perceptual feature parameters are mapped to the same dimension and fused to derive an adaptive temporal weight, which adjusts the spatial JND model to yield an improved spatiotemporal JND model.
The rest of the paper is organized as follows. Section 2 reviews prior work on JND modeling and analyzes the motivation. Section 3 proposes four feature parameters and describes the homogenization method for these heterogeneous feature parameters. Section 4 details the proposed spatiotemporal JND modeling method. Section 5 gives simulation results verifying the proposed JND profile. Section 6 summarizes the paper.
Review of JND modeling
In the past two decades, inspiring pixel-domain JND profiles have been proposed one after another in the literature. As early as 1995, Chou and Li [24] took the LA effect and CM effect into account and proposed a pixel-domain JND model. Following the work of Chou, Yang [25] proposed a generalized JND model in which a nonlinear additivity model for masking (NAMM) was employed to eliminate the overlapping effect between LA and CM. Liu [26] decomposed the input image into texture region and
Extraction and quantification for relative motion
The pixel-wise motion information in a video sequence can be represented as a three-dimensional field of motion vectors, where spatial and temporal indices are dropped for brevity. In this work, we consider three motion components: absolute motion, background motion, and relative motion. The relative motion vector equals the difference between the absolute motion vector and the background motion vector [55], as shown in Fig. 1.
The absolute motion vector is
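The decomposition above can be sketched as follows. Note that this is a simplified illustration: here the background (camera) motion is approximated by the component-wise median of the motion-vector field, whereas the paper follows the parametric global-motion estimation of [55].

```python
import numpy as np

def relative_motion(mv_field):
    """Split a per-block motion-vector field (H x W x 2) into background
    and relative components.  Background motion is approximated by the
    component-wise median of all vectors (a robust stand-in for a
    parametric global-motion estimate)."""
    bg = np.median(mv_field.reshape(-1, 2), axis=0)  # background motion vector
    rel = mv_field - bg                              # relative = absolute - background
    return bg, rel
```

A block moving with the camera then gets a near-zero relative motion vector, while an independently moving object keeps a large one, matching the attention analysis in this section.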
Proposed spatiotemporal JND model
As described in Section 3, this work measures the perceptual saliency I of the feature parameters relative motion and temporal duration, and the perceptual uncertainty U of the feature parameters background motion and inter-frame residue fluctuation intensity. Based on the information theory framework, resorting to self-information and information entropy, the proposed work unifies the four feature parameters for homogeneous fusion, and then determines a weight factor
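The homogenization rests on two standard information measures; how the four parameters are ultimately fused into the weight factor is specific to the paper, but the measures themselves are simply self-information (saliency of one observed feature value) and Shannon entropy (uncertainty of a local feature distribution):

```python
import numpy as np

def self_information(p):
    """Saliency of a feature value with probability p: rarer values
    carry more information, -log2(p), measured in bits."""
    return -np.log2(np.clip(p, 1e-12, 1.0))

def entropy(p):
    """Uncertainty of a local feature distribution p (Shannon entropy,
    bits); zero-probability bins contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

Both quantities share the unit of bits, which is what allows heterogeneous feature parameters to be mapped to the same dimension before fusion.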
Simulation results
In order to verify the effectiveness of the proposed JND model, we conduct intensive objective and subjective video quality evaluation experiments. A total of nine test sequences were selected, of which five (Ba, BT, CT, Ki, Pa) are 1920×1080 full-HD resolution and the other four (BD, BM, PS, RH) are WVGA resolution (832×480). For a test video, additive noise is injected under the guidance of the JND model, which is formulated similar
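The usual JND-guided noise-injection protocol perturbs each pixel by its JND threshold with a random sign, so that an accurate JND map hides the maximum amount of noise; a minimal sketch of that protocol (the paper's exact formulation, e.g. any global scaling factor, is truncated above and not assumed here):

```python
import numpy as np

def inject_jnd_noise(frame, jnd_map, rng):
    """Add +/- JND noise to an 8-bit frame: each pixel is shifted by its
    JND threshold with a random sign, then clipped to the valid range."""
    signs = rng.choice([-1.0, 1.0], size=frame.shape)
    noisy = frame.astype(np.float64) + signs * jnd_map
    return np.clip(noisy, 0.0, 255.0)
```

If the JND model is accurate, the noisy video is perceptually indistinguishable from the original despite a fixed injected-noise energy, which is what the objective and subjective tests then check.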
Conclusions
This paper proposes a new temporal JND weight model by fully exploring the temporal HVS characteristics and analyzing the temporal feature parameters along the motion trajectory in video. We measure the stimulus saliency induced by relative motion and temporal duration along motion trajectory, and the masking uncertainty induced by background motion and inter-frame residual fluctuation intensity. Self-information and information entropy are used to measure the degree of stimulus saliency and
CRediT authorship contribution statement
Yafen Xing: Conceptualization, Methodology, Software, Data curation, Writing – original draft. Haibing Yin: Conceptualization, Validation, Writing – review & editing, Project administration, Funding acquisition. Yang Zhou: Writing – review & editing, Methodology. Yong Chen: Writing – review & editing, Methodology. Chenggang Yan: Writing – review & editing, Methodology.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported in part by NSFC 61972123, 61931008, 62031009 and ZJNSF LY19F020043.
References (60)
- et al., A spatiotemporal saliency-modulated JND profile applied to video watermarking, J. Vis. Commun. Image Represent. (2018)
- et al., Disparity-based just-noticeable-difference model for perceptual stereoscopic video coding using depth of focus blur effect, Displays (2016)
- et al., Just noticeable distortion model and its applications in video coding, Signal Process.: Image Commun. (2005)
- et al., VideoSet: A large-scale compressed video quality dataset based on JND measurement, J. Vis. Commun. Image Represent. (2017)
- et al., Improved estimation for just-noticeable visual distortion, Signal Process. (2005)
- et al., Adaptive block-size transform based just-noticeable difference model for images/videos, Signal Process.: Image Commun. (2011)
- et al., No-reference video quality assessment based on modeling temporal-memory effects, Displays (2021)
- et al., Structure and function come unglued in the visual cortex, Neuron (2008)
- et al., View from the top: Hierarchies and reverse hierarchies in the visual system, Neuron (2002)
- et al., No-reference screen content video quality assessment, Displays (2021)
- The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields, Comput. Vis. Image Underst.
- Non-local spatial redundancy reduction for bottom-up saliency estimation, J. Vis. Commun. Image Represent.
- Automatic contrast enhancement technology with saliency preservation, IEEE Trans. Circuits Syst. Video Technol.
- Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain, IEEE Trans. Circuits Syst. Video Technol.
- Estimating just-noticeable distortion for images/videos in pixel domain, IET Image Process.
- The analysis of image contrast: From quality assessment to automatic enhancement, IEEE Trans. Cybern.
- Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation, IEEE Trans. Image Process.
- Visual distortion gauge based on discrimination of noticeable contrast changes, IEEE Trans. Circuits Syst. Video Technol.
- Saliency-aware video compression, IEEE Trans. Image Process.
- Perceptually-friendly H.264/AVC video coding based on foveated just-noticeable-distortion model, IEEE Trans. Circuits Syst. Video Technol.
- Saliency-guided quality assessment of screen content images, IEEE Trans. Multimedia
- Perceptual image quality assessment: A survey, Sci. China Inf. Sci.
- Visual attention guided pixel-wise just noticeable difference model, IEEE Access
- A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell.
- Neuronal correlates of visibility and invisibility in the primate visual system, Nat. Neurosci.
- Video quality assessment using a statistical model of human visual speed perception, JOSA A
☆ This paper was recommended for publication by Guangtao Zhai.