Post-processing for intra coding through perceptual adversarial learning and progressive refinement
Introduction
Video compression is vital for local storage and network transmission because of the huge data volume of video. For example, Ultra-High Definition (UHD) video (e.g. 8K × 4K@120fps) produces about 11.5 gigabytes of raw data per second, which requires highly efficient compression. In view of their excellent compression capability, state-of-the-art video compression standards all employ the block-based hybrid motion-compensation and transform-coding framework, e.g. H.264/AVC [1], High Efficiency Video Coding (HEVC) [2], and the developing Joint Exploration Model (JEM) [3].
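As a quick sanity check of the figure above, the raw data rate of such a stream works out as follows; the 8192 × 4096 resolution and 8-bit 4:4:4 sampling (3 bytes per pixel) are illustrative assumptions, not values from the standard:

```python
# Raw data rate of 8K x 4K @ 120 fps video, assuming 8192x4096
# resolution and 8-bit 4:4:4 sampling (3 bytes per pixel).
width, height, fps, bytes_per_px = 8192, 4096, 120, 3
rate_bytes = width * height * fps * bytes_per_px   # bytes per second
gib_per_s = rate_bytes / 2**30                     # binary gigabytes per second
print(gib_per_s)  # 11.25
```

This is consistent with the roughly 11.5 gigabytes per second cited above; chroma subsampling (e.g. 4:2:0) would halve the figure, which is why efficient compression is indispensable either way.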
However, this high compression rate comes at the cost of complex compression artifacts, such as blocking, ringing, blurring, and color biases [4]. Block-based hybrid video coding inevitably produces blocking artifacts, especially at low bit rates or when the input video contains rapid motion. Ringing artifacts along edges result from the coarse quantization of high-frequency components. In addition, high compression rates also lead to serious geometric distortion. These artifacts not only severely degrade the Quality of Experience (QoE), but also adversely affect low-level image processing routines that take compressed images as input, e.g. image segmentation [5] and super-resolution [6]. Reducing compression artifacts has therefore attracted increasing attention.
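To make the blocking mechanism concrete, the toy simulation below (not any actual codec) shows how coarsely quantizing per-block DCT coefficients introduces discontinuities at block boundaries; the 8 × 8 block size and the quantization step `q` are illustrative choices:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix of size n x n.
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def blockify_quantize(img, q=64, n=8):
    # Toy model of block-based transform coding at low bit rate:
    # per-block 2D DCT, uniform quantization with step q, inverse DCT.
    d = dct_matrix(n)
    out = np.empty_like(img, dtype=float)
    h, w = img.shape
    for i in range(0, h, n):
        for j in range(0, w, n):
            blk = img[i:i + n, j:j + n].astype(float)
            coef = d @ blk @ d.T
            coef = np.round(coef / q) * q        # coarse uniform quantization
            out[i:i + n, j:j + n] = d.T @ coef @ d
    return out
```

Applying `blockify_quantize` to a smooth gradient image makes each 8 × 8 block nearly flat, so visible steps appear exactly at block boundaries, which is the blocking artifact described above.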
Existing post-processing methods for suppressing video compression artifacts fall into two groups: statistical model approaches, which have gained the most popularity [7], [8], [9], [10], and deep learning approaches, which emerged recently and have attracted much interest [11], [12], [13], [14], [15], [16].
The most prominent statistical models are the in-loop filters employed by H.264/AVC, HEVC, and JEM. In-loop filtering post-processes the reconstructed images: a de-blocking filter (DF) [7] is applied first, followed by a sample adaptive offset (SAO) [8] filter. DF is specifically designed to reduce blocking artifacts; it is a non-linear filter with predefined low-pass characteristics that requires no signaling bits on the decoder side. Unlike DF, SAO targets ringing artifacts: it corrects quantization errors by sending offset values to the decoder, which reconstructs the image by adding these offsets. Zhang et al. [9], [10] incorporated low-rank regularization into the HEVC in-loop filtering algorithm, developing a non-local adaptive in-loop filter. These in-loop filtering techniques restore both the subjective and objective quality of the reconstructed image. However, such hand-crafted methods rely on detailed statistical models of signal and artifacts, which are insufficient for modeling compression artifacts and exhibit limited quality enhancement performance.
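The band-offset mode of SAO can be sketched as below. This is a simplified illustration: samples are classified into 32 intensity bands and the signaled offset for each band is added, whereas the actual HEVC syntax signals offsets for only four consecutive bands per coding tree block:

```python
import numpy as np

def sao_band_offset(rec, offsets, band_shift=3):
    # Classify each 8-bit sample into one of 32 intensity bands
    # (value >> 3) and add the offset signaled for that band.
    bands = rec.astype(int) >> band_shift
    corrected = rec.astype(int) + np.take(offsets, bands)
    return np.clip(corrected, 0, 255)
```

Because the offsets are estimated at the encoder and transmitted, SAO spends a few bits to directly compensate quantization error, in contrast to DF, which is purely decoder-side.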
Deep learning pipelines have become a widespread and very successful tool for high-level computer vision tasks [17], [18]. Recently, they have also achieved promising results on several low-level computer vision tasks, such as super-resolution [19], [20], [21], image denoising [22], [23], and image inpainting [24], [25]. Inspired by these successes, convolutional neural network (CNN) based artifact reduction algorithms have emerged in the literature and have become a hot topic in recent years.
The first notable success was reported by Dong et al. [11], who proposed an artifact reduction CNN (AR-CNN) for suppressing JPEG compression artifacts. AR-CNN consists of four convolutional layers, namely feature extraction, feature enhancement, mapping, and reconstruction layers, trained jointly in an end-to-end framework. Compared to the previously most successful de-blocking oriented methods, AR-CNN reported a remarkable jump in performance, since a CNN-based deep network learns features automatically in a data-driven manner, avoiding the limitations of hand-crafted features.
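For a sense of scale, the four-layer configuration reported for AR-CNN (9×9 kernels with 64 filters, 7×7 with 32, 1×1 with 16, and a 5×5 single-channel reconstruction layer, on a single-channel input) contains roughly 10^5 parameters, tiny by today's standards:

```python
# (out_channels, kernel_size) per AR-CNN layer: feature extraction,
# feature enhancement, mapping, and reconstruction, as reported in [11].
layers = [(64, 9), (32, 7), (16, 1), (1, 5)]
in_ch, total = 1, 0
for out_ch, k in layers:
    total += out_ch * in_ch * k * k + out_ch  # weights + biases
    in_ch = out_ch
print(total)  # 106561
```

The later networks surveyed below (e.g. MMS-net) trade this compactness for depth and capacity.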
Inspired by AR-CNN, Park and Kim [12] proposed a new in-loop filtering technique using a CNN (IFCNN) to replace both DF and SAO in HEVC. IFCNN predicts the residue between the original and the reconstructed image, so the decoder only needs to add the IFCNN output to the reconstructed image. However, IFCNN may lack generalization ability because the same sequences were used for training and testing. Subsequently, the variable-filter-size residue-learning CNN (VRCNN) proposed by Wu et al. [13] and the multi-modal multi-level convolutional neural network (MMS-net) proposed by Kang et al. [14] were designed to replace DF and SAO in HEVC. VRCNN emphasizes efficiency and adopts a relatively shallow network, whereas MMS-net exploits a very deep network for superior performance. More recently, Wang et al. [15] proposed a decoder-side scalable CNN (DS-CNN) to enhance the quality of HEVC intra coding without requiring any modification of the encoder. In addition, Jia et al. [16] proposed a spatial-temporal residue network (STResNet) based in-loop filter for HEVC inter coding that utilizes both spatial and temporal information.
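The residue-learning idea shared by IFCNN and its successors can be sketched as follows; `toy_residue` is a hypothetical stand-in for the trained network (here just the correction a 5-point mean filter would apply), used only to show the add-back step at the decoder:

```python
import numpy as np

def decode_with_cnn_filter(reconstructed, predict_residue):
    # Residue learning: the network estimates (original - reconstructed);
    # the decoder adds the prediction back and clips to the pixel range.
    return np.clip(reconstructed + predict_residue(reconstructed), 0.0, 255.0)

def toy_residue(x):
    # Hypothetical stand-in for a trained CNN: the correction a 5-point
    # mean filter would apply (smoothed image minus input).
    pad = np.pad(x, 1, mode="edge")
    smooth = (pad[:-2, 1:-1] + pad[2:, 1:-1] +
              pad[1:-1, :-2] + pad[1:-1, 2:] + x) / 5.0
    return smooth - x
```

Predicting the residue rather than the image itself is attractive because the residue is near zero almost everywhere, which makes the regression target easier to learn.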
All these works open up a new direction that adopts CNNs in video coding to further improve coding efficiency. Despite the remarkable progress, existing state-of-the-art methods produce over-smoothed results and cannot generate multi-level refinement. As a result, one needs to train a large variety of models for applications with different desired quality enhancements and computational loads. To address these drawbacks, we propose the Multi-level Progressive Refinement Network (MPRNet), as shown in Fig. 1. Here we extend and develop our preliminary ideas sketched in [26] through a perceptual adversarial training approach.
The main contributions of this paper can be summarized as follows:
- (1)
We propose a multi-level progressive refinement network (MPRNet) for the video compression artifact reduction task. To the best of our knowledge, the proposed framework is the first attempt to solve the video post-processing task through a perceptual adversarial training approach, which boosts both subjective and objective performance.
- (2)
We add a multi-level mean squared error loss to the perceptual adversarial loss, which explicitly defines the specific function of each interlayer, eliminating interlayer coupling and improving the final performance.
- (3)
A scalable structure is defined in our MPRNet, which refines the image in a coarse-to-fine fashion; this enables MPRNet to adjust its processing efficiency adaptively and to accommodate the computing resources of the hardware.
- (4)
Extensive experiments demonstrate the effectiveness of our MPRNet over state-of-the-art methods [10], [11], [12], [13], [14], [15], [27], [28] in both subjective and objective visual quality, reducing the BD-rate by 8.2%, 8.3%, and 9.5% for the Y, Cb, and Cr channels, respectively, over the HEVC baseline.
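BD-rate, the metric used in (4), compares two rate-distortion curves by fitting log-rate as a cubic polynomial of PSNR and averaging the gap over the common PSNR range. The sketch below follows this common formulation of the Bjøntegaard metric; the four-point curves in the usage example are made-up illustrations, not the paper's data:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Fit log-rate as a cubic polynomial of PSNR for each RD curve.
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Average each fitted log-rate over the overlapping PSNR range.
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    # Percentage bit-rate change of the test codec vs. the anchor.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0

rates = [100, 200, 400, 800]      # kbps, hypothetical anchor curve
psnrs = [30.0, 33.0, 35.0, 36.0]  # dB, hypothetical
half = [r / 2 for r in rates]     # same quality at half the rate
print(round(bd_rate(rates, psnrs, half, psnrs)))  # -50
```

A negative BD-rate means the test codec needs fewer bits for the same quality, so the 8.2%/8.3%/9.5% figures above correspond to BD-rates of roughly -8.2% to -9.5% versus the HEVC baseline.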
The rest of the paper is organized as follows: A brief review of related works is given in Section 2. Section 3 illustrates our proposed MPRNet architecture and loss function. In Section 4, extensive experiments are conducted to evaluate MPRNet. Finally, we conclude this work with some future directions in Section 5.
Related works
In this section, we briefly review the challenges of the video compression artifact reduction task, multi-scale inference models, and related works that integrate generative adversarial networks (GANs) with a perceptual loss.
Framework of MPRNet
In this section, we describe MPRNet for the video compression artifact reduction task. MPRNet requires no signaling bits, since the same trained model is used in both the video encoder and decoder. Fig. 1 illustrates the overall framework of the proposed MPRNet, which consists of two parts, i.e., the progressive refinement network (E) and the perceptual discriminative network (D). In the following, we detail the network architecture and the loss functions devised to optimize the network.
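A minimal sketch of the loss structure described above, assuming one MSE term per refinement level plus a weighted adversarial term from the discriminator D; the per-level `weights` and the `lam` balance are hypothetical placeholders, not the paper's values:

```python
import numpy as np

def multilevel_mse(level_preds, level_targets, weights):
    # One MSE term per refinement level (coarse -> fine), so each
    # interlayer is supervised toward its own sub-band target.
    return sum(w * np.mean((p - t) ** 2)
               for w, p, t in zip(weights, level_preds, level_targets))

def total_loss(level_preds, level_targets, adv_loss, weights, lam=0.01):
    # Combined objective: multi-level MSE plus the perceptual
    # adversarial loss produced by the discriminator D.
    return multilevel_mse(level_preds, level_targets, weights) + lam * adv_loss
```

Supervising every level directly, rather than only the final output, is what decouples the interlayers: each level has its own explicit target instead of an implicit one inherited from the next.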
Experimental results
In this section, we first introduce the implementation details. Then, experimental results are presented to validate the effectiveness of our MPRNet approach for the video compression artifact reduction task, compared with eight recent state-of-the-art approaches, i.e. NLSLF [10], AR-CNN [11], IFCNN [12], VRCNN [13], MMS-net [14], DS-CNN [15], VDSR [27], and CARGAN [28].
Conclusion
In this paper, we propose a novel multi-level progressive refinement network (MPRNet) for video coding post-processing. A progressive refinement strategy explicitly guides the interlayers of the network to predict the sub-band residues at different levels, which eliminates interlayer coupling and improves the final performance. A scalable structure is implicitly included in our MPRNet through refining the image in a coarse-to-fine fashion, which improves the flexibility of MPRNet deployment.
Declarations of interest
None.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61571285, and Shanghai Science and Technology Commission under Grants 17DZ2292400 and 18XD1423900.
Zhipeng Jin received the B.S. degree in electrical engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004, and the M.S. degree in electrical engineering from Ningbo University, Ningbo, China, in 2007. He is with the Jiaxing Vocational and Technical College, and he is currently pursuing the Ph.D. degree at Shanghai University. His research interests include image/video coding, video codec optimization, and deep learning.
References (45)
- et al., A survey of hybrid MC/DPCM/DCT video coding distortions, Signal Process. (1998)
- et al., Video quality evaluation methodology and verification testing of HEVC compression performance, IEEE Trans. Circuits Syst. Video Technol. (2016)
- et al., Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video Technol. (2013)
- et al., Algorithm description of joint exploration test model 5, JVET-E1001 (2017)
- et al., Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- et al., Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
- et al., HEVC deblocking filter, IEEE Trans. Circuits Syst. Video Technol. (2013)
- et al., Sample adaptive offset in the HEVC standard, IEEE Trans. Circuits Syst. Video Technol. (2012)
- et al., Low-rank based nonlocal adaptive loop filter for high efficiency video compression, IEEE Trans. Circuits Syst. Video Technol. (2017)
- et al., Nonlocal in-loop filter: the way toward next-generation video coding?, IEEE Multimedia (2016)
- Compression artifacts reduction by a deep convolutional network, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV)
- CNN-based in-loop filtering for coding efficiency improvement, in: Proceedings of the Image, Video, and Multidimensional Signal Processing Workshop
- A convolutional neural network approach for post-processing in HEVC intra coding, in: Proceedings of the International Conference on Multimedia Modeling
- Multi-modal/multi-level convolutional neural network based in-loop filter design for next generation video codec, in: Proceedings of the IEEE International Conference on Image Processing (ICIP)
- Decoder-side HEVC quality enhancement with scalable convolutional neural network, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME)
- Spatial-temporal residue network based in-loop filter for video coding, in: Proceedings of the IEEE Visual Communications and Image Processing (VCIP)
- Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- ImageNet classification with deep convolutional neural networks, in: Proceedings of the International Conference on Neural Information Processing Systems
- Training-free, single-image super-resolution using a dynamic convolutional network, IEEE Signal Process. Lett.
- Perceptual losses for real-time style transfer and super-resolution, in: Proceedings of the European Conference on Computer Vision (ECCV)
- Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising, IEEE Trans. Image Process.
Ping An is a professor in the video processing group at the School of Communication and Information Engineering, Shanghai University, China. She received the B.S. and M.S. degrees from Hefei University of Technology in 1990 and 1993, and the Ph.D. degree from Shanghai University in 2002. In 1993, she joined Shanghai University. Between 2011 and 2012, she was a visiting professor with the Communication Systems Group at Technische Universität Berlin, Germany. Her research interest is image and video processing, with a particular focus on 3D video processing in recent years. She has completed more than 10 projects supported by the National Natural Science Foundation of China, the National Science and Technology Ministry, and the Science & Technology Commission of Shanghai Municipality. She was awarded the Second Prize of the Shanghai Municipal Science & Technology Progress Award in 2011 and the Second Prize in Natural Sciences of the Ministry of Education in 2016.
Chao Yang received his B.S. and Ph.D. degrees from the School of Communication and Information Engineering, Shanghai University, Shanghai, China, in 2012 and 2017, respectively. He is now a Postdoctoral Fellow with the Department of Electrical Engineering, University of Southern California, Los Angeles, USA. His research interests include image and video processing.
Liquan Shen received the B.S. degree in automation control from Henan Polytechnic University, Jiaozuo, China, and the M.S. and Ph.D. degrees in communication and information systems from Shanghai University, Shanghai, China, in 2001, 2005, and 2008, respectively. He was with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA, as a visiting professor from 2013 to 2014. He has been with the Faculty of the School of Communication and Information Engineering, Shanghai University, since 2008, where he is currently a Professor. He has authored and co-authored more than 80 refereed technical papers in international journals and conferences in the field of video coding and image processing. He holds 10 patents in the areas of image/video coding and communications. His research interests include High Efficiency Video Coding, perceptual coding, video codec optimization, 3DTV, and 3D image/video quality assessment.