Elsevier

Parallel Computing

Volume 38, Issues 4–5, April–May 2012, Pages 226-244
Cluster-based optimized parallel video transcoding

https://doi.org/10.1016/j.parco.2012.02.001

Abstract

Video transcoding is a popular technique for delivering video content of varying quality and size to diverse audiences.

In this paper we pursue an analytical approach to the optimization of a large collection of parallel transcoding techniques based on temporal partitioning. The key elements in the design of such techniques are identified, allowing them to be enumerated and classified. Closed-form solutions to the partitioning/scheduling problem (and optimum operation sequencing where necessary) are derived for the most important of these methods, under CBR input media conditions. Subsequently, appropriate heuristics allow the solution of the partitioning problem under VBR input media conditions.

The paper concludes with an extensive battery of tests of the most significant strategies on several feature-length video streams. The tests not only reveal how one of the proposed strategies, namely NPWFVBR, strikes a good balance between efficiency and distortion minimization on heterogeneous platforms, but also allow us to derive guidelines for deploying transcoding solutions.

Highlights

  • Optimized transcoding of both CBR and VBR media on heterogeneous platforms.

  • Closed-form solutions for the data partitioning problem.

  • A theorem establishes the optimum operation sequence for single-port setups.

  • Extensive experimental results are reported, based on feature-length movies.

  • The proposed NPWFVBR method balances execution time and distortion.

Introduction

Video-on-Demand (VoD) is becoming an increasingly popular service as video compression and delivery technologies mature. However, a persistent problem is that the associated media are typically too large for contemporary communication infrastructures. The shift to HD-quality media and the associated hike in media bitrates, despite advanced codecs such as H.264/MPEG4-AVC, has not helped the situation. Additionally, service providers need to cater for very diverse audiences, ranging from mobile devices to ADSL-connected desktops, which requires media quality/size adaptations.

A number of techniques have been suggested for offering video of varying quality and/or size [24]. Scalable Video Coding (SVC) provides this capability by encoding the video in two or more layers. The base layer provides the elementary video stream in its most basic quality. Each additional layer can increase the resolution (spatial scalability), frame rate (temporal scalability) or level of detail (quality or SNR scalability) of the base layer. Spatial, temporal and SNR scalability can also be combined [31]. In [16] scalable video is discussed in association with multicasting by multiple replicated servers to multiple clients.

Recent advances in SVC coding have resulted in only a small degradation in bitrate (negligible for temporal and 10% for spatial and SNR scalability [31]) over single-layer media of equivalent quality. However, given the current extremely low cost of secondary memory, an equally good alternative to the complicated (client- and system-wise) solution of SVC is to maintain multiple versions of the same content at different quality levels.

Because video transcoding is typically a compute-intensive process that can take hours for feature-length movies even on the latest microprocessors, and because a movie stream can easily be partitioned into disjoint data sets, transcoding can be parallelized to reduce its total cost.
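As a minimal sketch of what partitioning a movie stream into disjoint data sets can look like, the Python function below splits a frame range into contiguous, non-overlapping pieces. The function name and the `shares` weights are hypothetical, and the sketch ignores codec constraints such as cutting only at GOP boundaries.

```python
def temporal_partition(num_frames, shares):
    """Split frames [0, num_frames) into contiguous, disjoint ranges.

    shares: one positive weight per node (hypothetical, e.g. relative speed).
    Returns a list of (start, end) pairs with end exclusive.
    """
    if num_frames < 0 or not shares or any(s <= 0 for s in shares):
        raise ValueError("need num_frames >= 0 and positive shares")
    total = sum(shares)
    bounds = [0]
    running = 0
    for s in shares:
        running += s
        # Rounding the cumulative boundary (not each piece) guarantees
        # that the pieces cover every frame exactly once.
        bounds.append(round(num_frames * running / total))
    return list(zip(bounds[:-1], bounds[1:]))


parts = temporal_partition(100, [1, 1, 2])
```

Each node would then transcode only its own `(start, end)` range, and the partial outputs would be concatenated in order.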

In this paper we analyze and explore the performance of a taxonomy of different methods that could be used to perform video transcoding on a Network-of-Workstations (NoW) using static (or semi-adaptive) scheduling, in the sense that a model is used to determine a priori the optimum or close-to-optimum load of each node. While dynamic partitioning can be an effective tool for load-balancing purposes, it poses substantial disadvantages for video transcoding: communication costs increase substantially, as a single work request to a compute node has to be replaced by a large number of requests from the master (load-originating) node. For raw movie media, i.e. media that require sequential decoding, dynamic partitioning would also require the repeated decoding of the input movie stream until the frames to be processed are reached.
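The repeated-decoding penalty can be illustrated with a rough count. The sketch below assumes, as a simplification, that every work request must decode and discard all frames from the start of the stream up to the first frame of its chunk; the function name and the numbers are hypothetical.

```python
def frames_decoded_to_seek(num_frames, chunk_size):
    """Frames decoded and discarded just to reach the start of every chunk.

    Assumes each chunk request decodes sequentially from frame 0.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Chunk k starts at frame k * chunk_size, so that many frames are skipped.
    return sum(range(0, num_frames, chunk_size))


# Static partitioning: 4 nodes, one large piece each.
static_overhead = frames_decoded_to_seek(1000, 250)
# Dynamic partitioning: many small work requests.
dynamic_overhead = frames_decoded_to_seek(1000, 50)
```

Under this assumption the overhead grows roughly in proportion to the number of requests, which is why a single a priori assignment per node is attractive for such media.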

A subset of the methods discussed here has been examined in [6] for Constant-Bit-Rate (CBR) media only. In this paper, the taxonomy and cost models of [6] are refined and extended to cover both CBR and Variable-Bit-Rate (VBR) content, while also covering a bigger set of methods.

In summary, the major contributions of this paper are:

  • Systematic analysis of a large collection of transcoding methods targeting heterogeneous platforms, for both CBR and VBR input media.

  • Closed-form solutions are provided for the partitioning problem in the case of CBR input media and these are accompanied by appropriate heuristics for VBR media. The optimal ordering of operations for single-port schemes is also rigorously proven.

  • Extensive experimental results are reported, focusing not only on the execution time but also on the characteristics of the produced output.

  • The reported performance metrics are based on actual transcoding of feature-length movies, totaling ∼3.7 h and ∼325,000 frames, and not mere simulations or runs on token media (100–200 frames or less) as is typically the case in the literature. This enhances the value of our results and brings out in far greater detail the advantages and drawbacks of temporal partitioning methodologies.

Our analysis is based on a systematic study of the costs incurred during transcoding. This study has shown that the decoding and subsequent encoding costs can either be predicted from the frame sizes of the original movie stream or be considered constant for each type of output frame. These results, together with the nature of the problem as discussed above, allow the application of Divisible Load Theory (DLT) [36] for optimizing the data partitioning and the associated computational schedule.
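To make the idea concrete, here is a minimal sketch of cost-driven partitioning in the spirit of DLT, where all nodes should finish at about the same time. It is not the paper's closed-form solution: it assumes a hypothetical linear cost model `cost = a * size + b` per frame, ignores communication costs entirely, and uses illustrative names throughout.

```python
from bisect import bisect_left
from itertools import accumulate


def partition_by_cost(frame_sizes, speeds, a=1.0, b=0.0):
    """Assign contiguous frame ranges so estimated work matches node speed.

    frame_sizes: size of each input frame (e.g. in bytes).
    speeds: relative speed of each node (higher is faster).
    a, b: coefficients of the assumed cost model, cost = a * size + b.
    Returns a list of (start, end) pairs with end exclusive.
    """
    costs = [a * s + b for s in frame_sizes]
    cumulative = list(accumulate(costs))
    total_cost = cumulative[-1] if cumulative else 0.0
    total_speed = sum(speeds)
    ranges, start, running = [], 0, 0.0
    for i, speed in enumerate(speeds):
        running += speed
        if i == len(speeds) - 1:
            end = len(costs)  # the last node takes whatever is left
        else:
            # Smallest prefix whose cost reaches this node's cumulative share.
            target = total_cost * running / total_speed
            end = bisect_left(cumulative, target) + 1
            end = max(start, min(end, len(costs)))
        ranges.append((start, end))
        start = end
    return ranges


# CBR-like input on a 3:1 heterogeneous pair, then VBR-like input on equal nodes.
cbr_ranges = partition_by_cost([10] * 8, [3, 1])
vbr_ranges = partition_by_cost([30, 10, 10, 10, 10, 10, 10, 10], [1, 1])
```

With equal-sized frames the split follows node speed directly, while with variable-sized frames the boundary moves so that each node receives a similar amount of estimated work rather than a similar number of frames.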

Additionally, although both spatial and temporal partitioning are possible, the former complicates the merging of the partial output streams and is more appropriate for shared-memory platforms such as multicore CPUs or GPUs. This is the reason why only temporal partitioning is targeted throughout this paper.

It should be stressed that in this paper we consider the case of one-pass encoding only. While a two-pass procedure can be accommodated within the proposed framework (possibly requiring a different partitioning between passes), two-pass encoding is beyond the scope of this paper.

Our paper is organized as follows: in Section 2 we present related work, while in Section 3 we present the model formulation that we base our analysis on. In Section 4 we break down the cost of transcoding and establish models for the constituent costs. In Section 5 we present closed form solutions for CBR input streams and an algorithmic approach for solving the VBR input stream case. The paper is concluded by a rigorous experimental study in Section 6.

Section snippets

Related work

The problem of parallelizing video processing has been extensively studied in the past decade. In [32] the two principal strategies in parallel movie coding are identified: spatial and temporal parallelism. Spatial parallelism partitions a single movie frame in an attempt to speed up the motion compensation step. The intra-frame partitioning can be performed at various levels [26]. Temporal parallelism partitions the movie stream into disjoint groups of frames that are subsequently processed by a

Method taxonomy

We base the taxonomy of the different parallelization approaches in this domain on the following attributes:

  • Data decomposition. Parallelism is achieved by data partitioning in the:

    • Temporal Domain: corresponds to inter-frame partitioning. This is typically unsuitable for live content, unless a delay is introduced.

    • Spatial Domain: corresponds to intra-frame partitioning. Because of the elevated communication requirement entailed, spatial decomposition lends itself primarily to shared-memory

Cost modeling

The transcoding process typically involves the repetition of the following sequence for each movie frame:

  1. Demultiplexing (reading from a stream).

  2. Decoding (decompression).

  3. Preprocessing (filtering, clipping, spatial and temporal resolution changes).

  4. Encoding (compression).

  5. Multiplexing (writing to a stream).
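The five steps above can be sketched as a chain of per-frame stages with a timer around each one. The stage functions below are trivial placeholders (hypothetical, operating on byte strings); a real transcoder would call a codec library at each step.

```python
import time


def transcode(frames, stages):
    """Run every frame through the stages in order and time each stage.

    stages: list of (name, function) pairs applied in sequence.
    Returns (outputs, seconds_per_stage).
    """
    seconds = {name: 0.0 for name, _ in stages}
    outputs = []
    for frame in frames:
        for name, fn in stages:
            t0 = time.perf_counter()
            frame = fn(frame)
            seconds[name] += time.perf_counter() - t0
        outputs.append(frame)
    return outputs, seconds


# Placeholder stages, in the order listed above.
STAGES = [
    ("demultiplex", lambda packet: packet),
    ("decode", lambda packet: list(packet)),
    ("preprocess", lambda pixels: pixels[::2]),  # stand-in for a resolution change
    ("encode", lambda pixels: bytes(pixels)),
    ("multiplex", lambda payload: payload),
]

outputs, seconds = transcode([bytes([1, 2, 3, 4])], STAGES)
```

Accumulating time per stage in this way is one simple means of checking which step dominates on a given platform before deciding what to parallelize.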

The most time-consuming of the above steps is the encoding part. Parallelizing the above sequence depends on deciding which steps will be done in parallel, especially in light of how much

Closed-form solution

In an effort to build realistic models for each of the examined strategies, we assume that computation can overlap communication, i.e. we have stream-type tasks that can process while receiving their input. When exactly the processing starts depends on the nature of the input.
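The effect of overlapping computation with communication can be illustrated with a small chunk-level simulation. This is a sketch under simplifying assumptions (equal-sized chunks, fixed per-chunk receive and processing times, hypothetical names) and not the cost model derived in the paper.

```python
def finish_time(num_chunks, recv_time, proc_time, overlap=True):
    """Time at which a node finishes processing all of its chunks.

    overlap=True models a stream-type task: a chunk can be processed as soon
    as it has arrived and the previous chunk is done.
    overlap=False models a task that must receive everything first.
    """
    if not overlap:
        return num_chunks * (recv_time + proc_time)
    received = 0.0   # arrival time of the current chunk
    processed = 0.0  # completion time of the previous chunk
    for _ in range(num_chunks):
        received += recv_time
        processed = max(received, processed) + proc_time
    return processed


with_overlap = finish_time(4, recv_time=1, proc_time=2)
without_overlap = finish_time(4, recv_time=1, proc_time=2, overlap=False)
```

When processing is slower than communication, only the arrival of the first chunk delays the node; when communication is slower, only the processing of the last chunk is added on top of the transfer time.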

In the solutions provided below, the time to deliver the encoded movie pieces back to the master is not explicitly considered in the NP schemes, as it is assumed that the output data collection is done concurrently with

Experimental study

Simulation studies have shown that the NPWFCBR approach is superior to the SPWFCBR one [6]. Based on these findings, we implemented and measured the performance of the NPWFVBR strategy. It should be noted that all the tests conducted were single-pass ones, i.e. the input movie was transcoded in a single pass.

The application of any model-based static data partitioning and scheduling approach usually stumbles on the accurate estimation of the model parameters. A model or simple hierarchy of

Conclusion

In this paper we present a novel approach to analyzing and optimizing parallel video transcoding for heterogeneous clusters of machines. Our contributions include the closed-form solution of the partitioning and scheduling problem for a large collection of parallel designs. We prove the Optimum Sequence Theorem, which establishes the optimum configuration under an SP setup and CBR media. Additionally, NPWFVBR has been tested on feature-length VBR media under a variety of circumstances and against

References (42)

  • Gerassimos Barlas et al., Quantized load distribution for tree and bus connected processors, Parallel Computing (2004)
  • J. Berlinska et al., Heuristics for multi-round divisible loads scheduling with limited memory, Parallel Computing (2010)
  • J. Berlinska et al., Scheduling divisible mapreduce computations, Journal of Parallel and Distributed Computing (2011)
  • FFMPEG Library Version 0.4.9-pre1. <http://ffmpeg.org/> (accessed...
  • Threaded I/O bench for Linux Version 0.3.3. <http://sourceforge.net/projects/tiobench/> (accessed...
  • Ismail Assayad, Philippe Gerner, Sergio Yovine, Valerie Bertin, Modelling, analysis and parallel implementation of an...
  • Cyril Banino et al., Scheduling strategies for master-slave tasking on heterogeneous processor platforms, IEEE Transactions on Parallel & Distributed Systems (2004)
  • Denilson M. Barbosa, Joao Paulo Kitajima, Wagner Meira Jr, Parallelizing MPEG video encoding using multiprocessors, in:...
  • Gerassimos Barlas, Cluster-based optimized parallel video trans/en-coding: a taxonomy and a DLT-based solution, in:...
  • Gerassimos Barlas, An analytical approach to optimizing parallel image registration/retrieval, IEEE Transactions on Parallel & Distributed Systems (2010)
  • Gerassimos D. Barlas, Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees, IEEE Transactions on Parallel & Distributed Systems (1998)
  • Olivier Beaumont et al., Scheduling divisible loads on star and tree networks: Results and open problems, IEEE Transactions on Parallel & Distributed Systems (2005)
  • Ricardo Fernandez, Jose M. Garcia, Gregorio Bernabe, Manuel E. Acacio, Optimizing a 3D-FWT video encoder for SMPs and...
  • Ian Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering (1995)
  • Horacio Gonzalez-Velez et al., Adaptive statistical scheduling of divisible workloads in heterogeneous systems, Journal of Scheduling (2010)
  • A. Hamosfakidis, Y. Paker, J. Cosmas, A study of concurrency in MPEG-4 video encoder, in: IEEE International Conference...
  • Yong He, Ishfaq Ahmad, Ming L. Liou, MPEG-4 based interactive video using parallel processing, in: International...
  • Akihito Hiromori, Hirozumi Yamaguchi, Keiichi Yasumoto, Teruo Higashino, Kenichi Taniguchi, A selection technique for...
  • Chung-Ming Huang, Chung-Wei Lin, Chia-Ching Yang, Chung-Heng Chang, Hao-Hsiang Ku, An SVC-MDC video coding scheme using...
  • Jui Tsun Hung et al., Scheduling nonlinear computational loads, IEEE Transactions on Aerospace and Electronic Systems (July 2008)
  • Jingxi Jia et al., Scheduling multi-source divisible loads on arbitrary networks, IEEE Transactions on Parallel & Distributed Systems (2010)