Machine learning based video coding optimizations: A survey
Introduction
With the development of multimedia computing, communication and display technologies, many video applications have emerged, such as TV broadcasting, movies, video-on-demand, video conferencing, mobile video, video surveillance, remote control, robotics, 3D video, free viewpoint TV and Virtual Reality (VR), as shown in Fig. 1. These applications can provide immersive telepresence and a realistic visual experience, and they are widely used in daily life, including manufacturing, communication, national security, the military, education, medicine and entertainment. Video now accounts for the majority of internet data traffic, and its volume grows explosively each year. In 2016, global IP video traffic was 70 exabytes (EB; 1 EB is one billion gigabytes [GB]) per month, which accounted for 73% of all consumer internet traffic [13]. The Cisco Visual Networking Index (VNI) forecasts that video will rise to 82% of all consumer internet traffic by 2021 [13], by which time a million minutes of video content will be delivered through the network every second. Regarding internet video, for example, 400 h of video were uploaded to YouTube every minute (i.e., about 65 years of video a day) and one billion hours of YouTube video were watched every day at the end of 2017 [121]. Besides, mobile video was expected to account for 78% of total mobile data traffic by the end of 2021 [14]. IHS Markit reported that China had about 176 million surveillance cameras in 2017, which generated 104 EB every month. To further enhance immersive and realistic visual experiences, more high-end video applications are emerging, such as High Definition (HD)/Ultra HD (UHD), holographic 3D and VR, High Dynamic Range (HDR) and Wide Color Gamut (WCG) video, which require larger data volumes to represent higher fidelity and more detail.
Meanwhile, the number of video clients and cameras in use, such as HDTVs, surveillance cameras, laptops and smartphones, has grown rapidly in recent years as video demand keeps rising. The total amount of global video data doubles every two years, which has become a bottleneck for data processing, storage and transmission.
Video coding is one of the core technologies in video applications: it structures and compresses video data so that they can be processed, transmitted and stored more effectively. It has been developed over three decades across four generations, with coding efficiency doubling every ten years. This leaves a large gap relative to the volume of global video data, which doubles every two years. Achieving much higher compression efficiency and narrowing this gap effectively have become urgent missions for video coding. Machine learning is a field of study concerned with algorithms that learn from data, discover hidden patterns and make data-driven decisions. Owing to its superior performance in learning from data, many recent works have applied machine learning algorithms to video coding to further improve coding performance, which has become one of the most promising directions in both the academic and industrial communities.
In this paper, we aim to provide a comprehensive overview of machine learning based video coding optimization. The main contributions of this work are: 1) We summarize the representations and redundancies of video and identify three key challenging issues in video coding. 2) We review the recent advances in learning based low complexity video coding optimization, which are categorized into statistical, machine learning based and end-to-end learning based schemes. Their decision problems, representative features, workflows, advantages and disadvantages are analyzed. 3) We review learning based high efficiency video coding in terms of four key problems: predictive coding, transform coding, entropy coding and enhancement. Their problem formulations, representative schemes and coding performances are presented. 4) We conduct a comprehensive survey of subjective visual quality assessment and learning based visual quality prediction, which is the key to perceptual video coding. Quality prediction methods are summarized and reviewed in four categories based on the functions of the learning models in feature extraction and fusion. 5) We identify the challenging issues and potential research opportunities in learning based video coding optimization.
The paper is organized as follows. In Section 2, the representations and redundancies of videos are first analyzed. Subsequently, the milestones of video coding standards and challenging issues are presented in Section 3. In Sections 4–6, the recent technical advances on machine learning based coding optimization are further analyzed from three key aspects, including low complexity optimization, high efficiency coding tools design and perceptual encoding optimization. Meanwhile, their workflows, advantages and disadvantages are analyzed in detail. Finally, we draw the conclusions and identify future research opportunities in Section 7.
Section snippets
Representations of video data
The 3D world scene (P) can be modelled as a plenoptic function [5] with 7 parameters,
$$P = P(V_x, V_y, V_z, \phi, \theta, \lambda, t),$$
where $V_x$, $V_y$, $V_z$ indicate the horizontal, vertical and depth viewing positions in the 3D world coordinates, $\phi$ and $\theta$ represent the viewing directions, $\lambda$ is the spectral wavelength and $t$ is the time sampling for a dynamic scene. It can also be presented in Cartesian coordinates as [5]
$$P = P(V_x, V_y, V_z, x, y, \lambda, t),$$
where $x$ and $y$ are coordinates on an image plane. With the development of video technologies,
Milestones of video coding standards
Researchers and organizations worldwide, such as the Moving Picture Experts Group (MPEG) of ISO/IEC and the Video Coding Experts Group (VCEG) of ITU-T, have made significant contributions to video coding standardization and to advances in coding technologies. Fig. 5 shows the evolution of the coding standards, in which five leading standards in four generations (H.261, MPEG-4, H.264/AVC [108], HEVC and VVC, marked with red rectangles) have been issued over the last three decades. H.264/AVC standard is one of the
Mode decision in video coding
Refined variable block-size partitioning improves prediction accuracy, which in turn reduces the coding residue and improves coding efficiency in predictive coding. Table 1 shows the evolution of block modes in standards from MPEG-1, 2 to H.264/AVC, H.265 and the ongoing VVC. We can observe that only one kind of block, i.e., the 16 × 16 macroblock, is available in H.261 and MPEG-4. In H.264/AVC, there are seven variable block-size partitioning
Learning based high efficiency coding optimization
The current video coding standards follow a block-based hybrid framework, which includes three major components: predictive coding, transform coding and entropy coding, as shown in Fig. 9. Predictive coding exploits the view-spatial-temporal correlations of the video signal, which account for the major part of video redundancy. Transform coding compacts the signal energy in the frequency domain; larger quantization scales are then applied to the frequencies to which the human visual system (HVS) is less sensitive, such as
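The prediction, transform and quantization stages described above can be illustrated with a toy sketch. This is not the pipeline of any particular standard: the 4 × 4 block size, the flat predictor, the helper names and the rule that the quantization step grows with the frequency index are all assumptions made for illustration, and entropy coding is omitted (the count of non-zero levels serves only as a crude proxy for the bits it would spend).

```python
import math

N = 4  # toy block size (assumption); real codecs support a range of sizes


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th frequency."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


C = dct_matrix(N)
CT = transpose(C)


def quant_steps(base, n=N):
    """Hypothetical step matrix: coarser steps for higher frequencies."""
    return [[base * (1 + u + v) for v in range(n)] for u in range(n)]


def encode_block(block, prediction, steps):
    # Predictive coding: only the residual is passed on.
    residual = [[b - p for b, p in zip(rb, rp)]
                for rb, rp in zip(block, prediction)]
    # Transform coding: 2-D DCT computed as C * residual * C^T.
    coeffs = matmul(matmul(C, residual), CT)
    # Quantization: divide by the per-frequency step and round.
    return [[round(c / q) for c, q in zip(rc, rq)]
            for rc, rq in zip(coeffs, steps)]


def decode_block(levels, prediction, steps):
    coeffs = [[l * q for l, q in zip(rl, rq)]
              for rl, rq in zip(levels, steps)]
    residual = matmul(matmul(CT, coeffs), C)  # inverse 2-D DCT
    return [[p + r for p, r in zip(rp, rr)]
            for rp, rr in zip(prediction, residual)]


# Made-up sample data: a smooth gradient block and a flat predictor.
block = [[100, 102, 104, 106],
         [101, 103, 105, 107],
         [102, 104, 106, 108],
         [103, 105, 107, 109]]
prediction = [[104] * N for _ in range(N)]

steps = quant_steps(base=2.0)
levels = encode_block(block, prediction, steps)
recon = decode_block(levels, prediction, steps)
nonzero = sum(1 for row in levels for level in row if level != 0)
print("non-zero levels:", nonzero)
print("reconstruction:", [[round(v, 2) for v in row] for row in recon])
```

Because the transform is orthonormal, the squared reconstruction error in the pixel domain equals the squared quantization error in the coefficient domain, so coarser steps trade fidelity for fewer non-zero levels.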
Learning based visual quality assessment (VQA)
The objective of video coding is to minimize the distortion (D) or maximize the quality (Q) subject to bit rate (R) constraints, which can be presented as
$$\min D \quad \text{or} \quad \max Q, \qquad \text{s.t. } R \le R_T,$$
where $R_T$ is a target bit rate. Nowadays, the distortion D is still measured with the mean squared error (MSE) while the quality Q is measured with the peak signal-to-noise ratio (PSNR), both of which are based on the pixel-by-pixel difference between the original and reconstructed images. PSNR and MSE are simple but can hardly reflect the real quality perceived by the human visual system (HVS). To fully exploit the
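As a minimal sketch of the pixel-by-pixel measures mentioned above, the following Python computes MSE and PSNR for two small pixel arrays. The function names, the sample pixel values and the peak value of 255 (8-bit samples) are illustrative assumptions, not taken from the survey.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized 2-D pixel arrays."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of rows")
    total = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        if len(row_o) != len(row_r):
            raise ValueError("rows must have the same length")
        for o, r in zip(row_o, row_r):
            total += (o - r) ** 2
            count += 1
    return total / count


def psnr(original, reconstructed, peak=255):
    """PSNR in dB; `peak` is the maximum sample value (255 assumes 8 bits)."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / err)


# Made-up sample pixels for illustration only.
original = [[52, 55, 61, 66],
            [70, 61, 64, 73]]
reconstructed = [[54, 55, 60, 66],
                 [70, 63, 64, 70]]

print("MSE :", mse(original, reconstructed))
print("PSNR:", round(psnr(original, reconstructed), 2), "dB")
```

Both measures depend only on per-pixel differences, so two reconstructions with the same MSE receive the same PSNR regardless of where in the image the errors fall.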
Conclusion
In this article, we present a systematic survey of the recent advances and challenges in machine learning based video coding optimization, which aims to provide researchers with a strong foundation and to broaden the horizon for data-driven video signal processing. This survey is presented from three key aspects: learning based low complexity optimization, learning based high efficiency coding optimization and learning based visual quality assessment. In each part, the problem
Declaration of Competing Interest
I declare that I have no conflict of interest with this submission.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61672443, 61772344 and 61871372, in part by Guangdong Natural Science Foundation for Distinguished Young Scholar under Grant 2016A030306022, in part by the Key Project for Guangdong Provincial Science and Technology Development under Grant 2017B010110014, in part by RGC General Research Fund (GRF) 9042322, 9042489 (CityU 11200116, 11206317), Shenzhen International Collaborative Research Project under
References (140)
- et al., Recent advances in omnidirectional video coding for virtual reality: projection and evaluation, Signal Process. (2018)
- et al., Interpretable convolutional neural networks via feedforward design, J. Vis. Commun. Image R. (2019)
- et al., Reinforcement learning based coding unit early termination algorithm for high efficiency video coding, J. Vis. Commun. Image R. (2019)
- et al., MCL-V: a streaming video quality assessment database, J. Vis. Commun. Image R. (2015)
- et al., Fast TU size decision algorithm for HEVC encoders using Bayesian theorem detection, Signal Process. Image Commun. (2015)
- et al., Combining sparse coding with structured output regression machine for single image super-resolution, Inform. Sci. (2018)
- J. An, H. Huang, K. Zhang, Y.-W. Huang, S. Lei, Quad-tree plus binary tree structure integration with JEM tools, JVET...
- et al., Mode-dependent transform competition for HEVC
- et al., Deep reinforcement learning: a brief survey, IEEE Signal Process. Mag. (2017)
- et al., Recurrent and dynamic models for predicting streaming video quality of experience, IEEE Trans. Image Process. (2018)