Elsevier

Information Sciences

Volume 506, January 2020, Pages 395-423
Machine learning based video coding optimizations: A survey

https://doi.org/10.1016/j.ins.2019.07.096

Abstract

Video data has become the largest source of data consumed globally. Driven by the rapid growth of video applications and the rising demand for higher quality video services, video data volume has been increasing explosively worldwide, which poses the most severe challenge for multimedia computing, transmission and storage. Video coding, which compresses videos into a much smaller size, is one of the key solutions; however, although the compression ratio has grown continuously over the last three decades, its development has become saturated to some extent. Machine learning algorithms, especially those employing deep learning, are capable of discovering knowledge from massive unstructured data and providing data-driven predictions, and thus offer new opportunities for further upgrading video coding technologies. In this article, we present a review of machine learning based video encoding optimization, aiming to provide researchers with a strong foundation and to inspire future developments in data-driven video coding. Firstly, we analyze the representations and redundancies of video data. Secondly, we review the development of video coding standards and their key requirements. Subsequently, we present a systematic survey of the recent advances and challenges associated with machine learning based video coding optimizations from three key aspects: high efficiency, low complexity and high visual quality. Their workflows, representative schemes, performances, advantages and disadvantages are analyzed in detail. Finally, the challenges and opportunities are identified, which may provide the academic and industrial communities with groundwork and potential directions for future research.

Introduction

With the development of multimedia computing, communication and display technologies, many video applications have emerged, such as TV broadcasting, movies, video-on-demand, video conferencing, mobile video, video surveillance, remote control, robotics, 3D video, free viewpoint TV and Virtual Reality (VR), as shown in Fig. 1, which can provide immersive telepresence and a realistic visual experience. These video applications are widely used in many areas of daily life, such as manufacturing, communication, national security, military, education, medicine and entertainment. Video data now accounts for the majority of data traffic over the internet, and its volume grows explosively each year. In 2016, global IP video traffic was about 70 exabytes [EB] per month (one EB is one billion gigabytes [GB]), which accounted for 73% of all consumer internet traffic [13]. The Cisco Visual Networking Index (VNI) forecasts that video traffic will rise to 82% of all consumer internet traffic by 2021 [13]. By then, a million minutes of video content will be delivered through the network every second. Regarding Internet video, for example, 400 h of video were uploaded to YouTube every minute (i.e., about 65 years of video a day) and one billion hours of YouTube videos were watched every day at the end of 2017 [121]. Besides, mobile video was expected to account for a staggering 78% of total mobile data traffic by the end of 2021 [14]. IHS Markit reported that there were about 176 million surveillance cameras in China in 2017, which generated 104 EB every month. To further enhance the immersive and realistic visual experience, more high-end video applications are emerging, such as High Definition (HD)/Ultra HD (UHD), holographic 3D and VR, High Dynamic Range (HDR) and Wide Color Gamut (WCG) videos, which require a larger data volume to represent higher fidelity and more details. Meanwhile, the number of video clients and cameras in use, such as HDTVs, surveillance cameras, laptops and smart phones, has grown rapidly as video demand has kept rising in recent years. The total amount of global video data doubles every two years, which has become the bottleneck for data processing, storage and transmission.

Video coding is one of the core technologies in video applications; it structures and compresses video data more effectively for computing, transmission and storage. It has been developed for over three decades across four generations, with coding efficiency doubling every ten years. However, this lags far behind the growth of global video data, which doubles every two years. Achieving much higher compression efficiency and narrowing this gap effectively have therefore become urgent tasks for video coding. Machine learning is a field of study in which algorithms learn from data, discover hidden patterns and make data-driven decisions. Owing to its superior performance in learning from data, many recent works have applied machine learning algorithms to video coding to further improve coding performance, which has become one of the most promising directions in both the academic and industrial communities.
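To make the size of this gap concrete, the following is a minimal sketch that compares the two doubling periods quoted above. It assumes idealized exponential growth for both quantities, which is a simplification for illustration; the function name `growth_gap` and its parameters are hypothetical and do not come from the article.

```python
def growth_gap(years: float,
               data_doubling_years: float = 2.0,
               coding_doubling_years: float = 10.0) -> float:
    """Factor by which the compressed data volume grows after `years`.

    Assumes (as a simplification) that raw video volume doubles every
    `data_doubling_years` and coding efficiency doubles every
    `coding_doubling_years`, both as smooth exponentials.
    """
    data_growth = 2.0 ** (years / data_doubling_years)      # raw volume factor
    coding_gain = 2.0 ** (years / coding_doubling_years)    # compression factor
    return data_growth / coding_gain


if __name__ == "__main__":
    for years in (2, 10, 20):
        print(f"after {years:>2} years: compressed volume x{growth_gap(years):.1f}")
```

Under these assumptions, one decade gives a 32-fold increase in raw volume against a 2-fold gain in coding efficiency, so `growth_gap(10)` evaluates to 16: the compressed volume still grows sixteen-fold.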

In this paper, we aim to provide a comprehensive overview of machine learning based video coding optimization. The main contributions of this work are: 1) We summarize the representations and redundancies of video and identify three key challenging issues in video coding. 2) We give an overview of the recent advances in learning based low complexity video coding optimization, which are categorized into statistical, machine learning based and end-to-end learning based schemes. Their decision problems, representative features, workflows, advantages and disadvantages are analyzed. 3) We review learning based high efficiency video coding with respect to four key problems: predictive coding, transform coding, entropy coding and enhancement. Their problem formulations, representative schemes and coding performances are presented. 4) We conduct a comprehensive survey of subjective visual quality assessment and learning based visual quality prediction, which is the key to perceptual video coding. Quality prediction is summarized and reviewed in four categories based on the functionalities of the learning models in feature extraction and fusion. 5) The challenging issues and potential research opportunities in learning based video coding optimizations are identified.

The paper is organized as follows. In Section 2, the representations and redundancies of videos are first analyzed. Subsequently, the milestones of video coding standards and challenging issues are presented in Section 3. In Sections 4–6, the recent technical advances on machine learning based coding optimization are further analyzed from three key aspects, including low complexity optimization, high efficiency coding tools design and perceptual encoding optimization. Meanwhile, their workflows, advantages and disadvantages are analyzed in detail. Finally, we draw the conclusions and identify future research opportunities in Section 7.

Section snippets

Representations of video data

The 3D world scene ($P$) can be modelled as a plenoptic function [5] with 7 parameters, $P = F_7(\varphi, \theta, \lambda, t, V_x, V_y, V_z)$, where $V_x$, $V_y$, $V_z$ indicate the horizontal, vertical and depth viewing position in the 3D world coordinates, $\varphi$ and $\theta$ represent viewing directions, $\lambda$ is the spectrum wave length and $t$ is the time sampling for dynamic scene. It can also be presented in Cartesian coordinates as [5] $P = G_7(x, y, \lambda, t, V_x, V_y, V_z)$, where $x$ and $y$ are coordinates on an image plane. With the development of video technologies,

Milestones of video coding standards

Researchers and organizations worldwide, such as the Moving Picture Experts Group (MPEG) from ISO/IEC and the Video Coding Experts Group (VCEG) from ITU-T, have made significant contributions to video coding standardization and to advances in coding technologies. Fig. 5 shows the evolution of the coding standards, in which five leading standards (H.261, MPEG-4, H.264/AVC [108], HEVC and VVC in red rectangles) in four generations have been issued in the last three decades. H.264/AVC standard is one of the

Mode decision in video coding

Refined variable block size partitioning can improve the prediction accuracy, which in turn reduces the coding residue and improves the coding efficiency of predictive coding. Table 1 shows the evolution of block modes in standards from MPEG-1/2 to H.264/AVC, H.265 and the ongoing VVC. We can observe that only one kind of block, i.e., the 16 × 16 macroblock, is available in H.261 and MPEG-4. In H.264/AVC, there are seven variable block-size partitioning

Learning based high efficiency coding optimization

The current video coding standards follow a block-based hybrid framework, which includes three major components: predictive coding, transform coding and entropy coding, as shown in Fig. 9. Predictive coding exploits the view-spatial-temporal correlations of the video signal, which account for the major part of video redundancies. Transform coding applies a transform to compact the energy in the frequency domain, and a larger quantization scale is then applied to the frequencies to which the human visual system (HVS) is less sensitive, such as

Learning based visual quality assessment (VQA)

The objective of video coding is to minimize the distortion ($D$) or maximize the quality ($Q$) subject to bit rate ($R$) constraints, which can be presented as $\min D, \ \text{s.t.} \ R \le R_T$, where $R_T$ is a target bit rate. Nowadays, the distortion $D$ is still measured with MSE while the quality $Q$ is measured with PSNR, which are based on the pixel-by-pixel difference between the original and reconstructed images. PSNR and MSE are simple but can hardly reflect the real perceived quality of HVS. To fully exploit the
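The pixel-by-pixel measures named in this section can be illustrated with a minimal sketch. It assumes 8-bit samples (peak value 255), images flattened to plain Python sequences, and a rate constraint of the form "bit rate must not exceed the target"; the helper names `mse`, `psnr` and `select_operating_point` are hypothetical and only serve the illustration, not any particular codec.

```python
import math
from typing import Optional, Sequence, Tuple


def mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between two equally sized sample sequences."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    return total / len(original)


def psnr(original: Sequence[float], reconstructed: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def select_operating_point(points: Sequence[Tuple[float, float]],
                           target_rate: float) -> Optional[Tuple[float, float]]:
    """Pick the (rate, distortion) pair with the least distortion among
    those whose rate does not exceed `target_rate`; None if none qualify."""
    feasible = [p for p in points if p[0] <= target_rate]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p[1])
```

For example, given the hypothetical candidates `[(100, 9.0), (200, 4.0), (300, 1.0)]` and a target rate of 250, `select_operating_point` picks `(200, 4.0)`, since the lowest-distortion candidate exceeds the rate budget. Because both measures depend only on sample differences, two reconstructions with the same MSE can look very different to a viewer, which is the limitation the section points out.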

Conclusion

In this article, we present a systemic survey on the recent advances and challenges associated with machine learning video coding optimization, which aims to provide researchers with a strong foundation and open the horizon for data-driven video signal processing. This survey is mainly presented from three key aspects, including learning based low complexity optimization, learning based high efficiency coding optimization and learning based visual quality assessment. In each part, the problem

Declaration of Competing Interest

I declared that I have no conflict of interest with this submission.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61672443, 61772344 and 61871372, in part by Guangdong Natural Science Foundation for Distinguished Young Scholar under Grant 2016A030306022, in part by the Key Project for Guangdong Provincial Science and Technology Development under Grant 2017B010110014, in part by RGC General Research Fund (GRF) 9042322, 9042489 (CityU 11200116, 11206317), Shenzhen International Collaborative Research Project under

References (140)

  • J. Berent et al., Plenoptic manifolds, IEEE Signal Process. Mag. (2007)
  • T. Biatek et al., Adaptive transforms for inter-predicted residuals in post-HEVC video coding
  • S. Bosse et al., Deep neural networks for no-reference and full-reference image quality assessment, IEEE Trans. Image Process. (2018)
  • P. Carrillo et al., Low complexity H.264 video encoder design using machine learning techniques, Proc. ICCE (2010)
  • Z. Chen, T. He, X. Jin, F. Wu, Learning for video compression, arXiv:1804.09869v1 [cs.MM],...
  • J.C. Chiang et al., A fast H.264/AVC-based stereo video encoding algorithm based on hierarchical two stage neural classification, IEEE J. Sel. Topics Signal Process. (2011)
  • Y.W. Chiou et al., Efficient image/video deblocking via sparse representation
  • Cisco Visual Networking Index: forecast and methodology 2016-2021 (2017)...
  • Cisco Visual Networking Index: global mobile data traffic forecast update, 2016-2021 White Paper, (2017)...
  • G. Correa et al., Fast HEVC encoding decisions using data mining, IEEE Trans. Circuits Syst. Video Technol. (2015)
  • Q. Dai et al., Recent advances in computational photography, Chin. J. Electron. (2019)
  • W. Dai et al., Sparse representation with spatio-temporal online dictionary learning for promising video coding, IEEE Trans. Image Process. (2016)
  • W. Dai et al., Progressive dictionary learning with hierarchical predictive structure for low bit-rate scalable video coding, IEEE Trans. Image Process. (2017)
  • Y. Dai et al., A convolutional neural network approach for post-processing in HEVC intra coding
  • L. Ding et al., Rate-performance-loss optimization for inter-frame deep feature coding from video, IEEE Trans. Image Process. (2017)
  • C. Dong et al., Image super-resolution using deep convolutional networks, IEEE Trans. Patt. Anal. Mach. Intell. (2016)
  • F. Duanmu et al., Fast mode and partition decision using machine learning for intra frame coding in HEVC screen content coding extension, IEEE J. Emerg. Sel. Topics Circuits Syst. (2016)
  • E.M. Enriquez et al., A two level classification based approach to inter mode decision in H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. (2011)
  • C. Fan et al., No reference image quality assessment based on multi-expert convolutional neural networks, IEEE Access (2018)
  • K. Goswami et al., A design of fast high efficiency video coding (HEVC) scheme based on Markov chain Monte Carlo model and Bayesian classifier, IEEE Trans. Indust. Electron. (2018)
  • Q. Hu et al., Neyman-Pearson-based early mode decision for HEVC encoding, IEEE Trans. Multimedia (2016)
  • S. Hu et al., Objective video quality assessment based on perceptually weighted mean squared error, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • ITU-R BT.1438-0. Subjective assessment of stereoscopic television pictures...
  • ITU-R BT.500-11. Methodology for the subjective assessment of the quality of television pictures...
  • ITU-R BT.710-4. Subjective assessment methods for image quality in high-definition television...
  • ITU-R Rec. BT.2020. Parameter values for ultra-high definition television systems for production and international...
  • C. Jia et al., Content-aware convolutional neural network for in-loop filtering in high efficiency video coding, IEEE Trans. Image Process. (2019)
  • F. Jiang et al., An end-to-end compression framework based on convolutional neural networks, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • Z. Jin et al., CNN oriented fast QTBT partition algorithm for JVET intra coding
  • Z. Jin et al., Fast QTBT partition algorithm for JVET intra coding based on CNN
  • Joint call for proposals on video compression with capability beyond HEVC
  • J.W. Kang, Structured sparse representation of residue in screen content video coding, Electron. Lett. (2015)
  • J.W. Kang et al., Sparse/DCT (S/DCT) two layered representation of prediction residuals for video coding, IEEE Trans. Image Process. (2013)
  • L. Kang et al., Convolutional neural networks for no-reference image quality assessment
  • A. Kappeler et al., Super-resolution of compressed videos using convolutional neural networks
  • H. Kim et al., Deep virtual reality image quality assessment with human perception guider for omnidirectional image, IEEE Trans. Circuits Syst. Video Technol. (2019)
  • H.-S. Kim et al., Fast CU partitioning algorithm for HEVC using an online-learning based Bayesian decision rule, IEEE Trans. Circuits Syst. Video Technol. (2016)
  • J. Kim et al., Deep CNN-based blind image quality predictor, IEEE Trans. Neur. Netw. (2019)
  • J. Kim et al., Accurate image super-resolution using very deep convolutional networks
  • J. Kim et al., Fully deep blind image quality predictor, IEEE J. Sel. Topics Signal Process. (2017)