An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC

https://doi.org/10.1016/j.image.2011.04.006

Abstract

Motion estimation (ME) is the most critical component of a video coding standard. H.264/AVC adopts variable block size motion estimation (VBSME) to obtain excellent coding efficiency, but its high computational complexity makes the design difficult. This paper presents an effective processor chip for integer motion estimation (IME) in H.264/AVC based on the full-search block-matching algorithm (FSBMA). It uses an architecture with a configurable 2D systolic array to obtain high data reuse of the search area. This systolic array supports a three-direction scan format in which only one row of pixels changes between two adjacent subblocks, thus reducing memory accesses and saving clock cycles. A computing array of 64 PEs calculates the SAD of the basic 4×4 subblocks, and a modified Lagrangian cost is used as the matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. Finally, a mode decision module uses a serial data flow to find the best mode by comparing the total minimum Lagrangian costs. The IME processor chip was designed in UMC 0.18 μm technology, resulting in a circuit with only 32.3 k gates and 6 RAMs (59 kBits of on-chip memory in total). In typical working conditions (25 °C, 1.8 V), a clock frequency of 300 MHz is estimated, with enough processing capacity for HDTV (1920×1088 @ 30 fps) and a search range of 32×32.

Highlights

► Full-search VLSI processor chip for integer motion estimation in H.264/AVC for 1080HD real-time video. ► Configurable 2D systolic array with high data reuse for variable search area. ► Modified Lagrangian cost for parallel architecture. ► Best mode and best motion vectors computed in a parallel pipeline architecture. ► Reduced area (32.3 k gates and 4.4 kBytes RAM) operating at 300 MHz.

Introduction

H.264/AVC, developed by the Joint Video Team (JVT), is drawing considerable attention because of its network friendliness and the high video quality it achieves at both low and high bit rates [1]. Compared with previous video standards, H.264/AVC can provide up to 50% higher coding efficiency across different bit rates and video definitions [2]. Fig. 1 presents a simplified block diagram of the H.264 encoder with the following main blocks: motion estimation (ME), motion compensation (MC), intra prediction, forward transform (FT), forward quantization (FQ), inverse quantization (IQ), inverse transform (IT), entropy coding and de-blocking filter. Initially, most of the work done on H.264 was oriented toward its software implementation [3]. However, in recent years the contributions to VLSI hardware implementations of H.264 have increased greatly, in order to enable fast architectures for real-time video applications [4], [5], [6].

ME based on a block-matching strategy is the most important part of H.264/AVC for exploiting the temporal redundancy between successive frames, but it is also the most time-consuming part of the coding framework. It requires large amounts of computation and accounts for 60–90% of the encoding time. In H.264, a video frame is first split into macroblocks (MBs) of size 16×16, which are then processed with a VBSME approach [1]. This approach provides a better estimation of small and irregular motion fields and allows a better adaptation of the motion boundaries to object boundaries, all with the aim of reducing the number of bits required for coding prediction errors. In VBSME, each MB may be segmented into subblocks of different sizes, as illustrated in Fig. 2. ME is carried out in 7 different modes: one 16×16 MB (Mode 1), two 16×8 subblocks (Mode 2), two 8×16 subblocks (Mode 3) and four 8×8 subblocks (Mode 4). In turn, each 8×8 subblock can also be split into two 8×4 subblocks (Mode 5), two 4×8 subblocks (Mode 6) or four 4×4 subblocks (Mode 7). The total number of possible partitions is 41 (1 + 2 + 2 + 4 + 8 + 8 + 16). ME refines the best candidate for each subblock in the hierarchy in two phases: Integer Motion Estimation (IME) and Fractional Motion Estimation (FME). IME finds the best integer motion vector (MV) for all 41 variable-size blocks. FME refines those MVs to quarter-pixel precision using a 6-tap filter and an MV bit-rate estimation. A pipeline architecture is the best solution to implement IME and FME [4], [6], [7].
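The partition count above, and the way larger blocks can be assembled from the sixteen 4×4 subblocks, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware described in this paper: the name `merge_sads` and the row-major 4×4 grid layout are hypothetical, and the sketch relies on the fact that, for one fixed candidate displacement, the SAD of a larger block is the sum of the SADs of the 4×4 subblocks it contains.

```python
# Partitions per macroblock for each block size (width, height), as listed above.
MODES = {
    (16, 16): 1, (16, 8): 2, (8, 16): 2, (8, 8): 4,
    (8, 4): 8, (4, 8): 8, (4, 4): 16,
}

def merge_sads(sad4x4):
    """Build the SADs of all 41 blocks from a 4x4 grid of 4x4-subblock SADs.

    sad4x4[r][c] is the SAD of the 4x4 subblock in row r, column c of the
    macroblock (hypothetical row-major layout), all for the same candidate
    displacement. Returns {(width, height): [sad, ...]}.
    """
    out = {}
    for (w, h) in MODES:
        bw, bh = w // 4, h // 4              # block size in units of 4x4 subblocks
        sads = []
        for r in range(0, 4, bh):
            for c in range(0, 4, bw):
                sads.append(sum(sad4x4[r + i][c + j]
                                for i in range(bh) for j in range(bw)))
        out[(w, h)] = sads
    return out

# Made-up subblock SADs, only to exercise the function.
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
sads = merge_sads(grid)
total_blocks = sum(len(v) for v in sads.values())
```

With these inputs `total_blocks` is 41, matching the count given in the text, and every block size sums to the same macroblock total.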

There are many proposed IME algorithms and architectures based on the criteria of search, matching simplification, bit-width reduction, memory access or area/power, among others, from which the system designer can choose the best trade-off for a specific application [8], [9]. FSBMA in the VBSME guarantees the best results by performing exhaustive matching of all candidate blocks. Most hardware video encoder designs adopt FSBMA in IME due to its excellent quality and data flow regularity, although it has the highest computational complexity and memory bandwidth requirements. Systolic arrays [10], [11] composed of locally connected processing elements (PEs) have been considered the best option to implement FSBMAs for several reasons: pipelined data flow via the local connections does not require any control overhead; reuse of data by each PE as it propagates through the array significantly reduces the memory bandwidth; and high clock frequencies and high processing speeds are achieved due to the small load capacitances. 1D systolic arrays are attractive in low-end products because of their low hardware cost and adequate processing capability, but they are costly in terms of latency and efficiency [12], [13], [14]; Ref. [15] presents a 2D architecture based on 1D arrays to attain low latency and high throughput. Indeed, many VLSI implementations adopt 2D systolic arrays as more suitable for high-end real-time usage. References which adopt the complete FSBMA include the 2D architecture with a simple regular control in [16], the novel memory-access scheme with minimum off-chip memory bandwidth in [19], the high-performance reconfigurable architecture that supports a scan format for high data reuse within the search area in [20], the bit-serial architecture in [22] and the high-throughput design in [39].
Modifications of the FSBMA to reduce either hardware or computing time, at the cost of introducing some video quality loss, can be found in the soft algorithm to simplify the predicted MV and the early termination of motion search used in [21], the multi-resolution IME algorithm presented in [23], the adaptive size of the search area depending on the degree of motion activity in [24], [25], the modified algorithm to reduce hardware based on data dependency of motion vector prediction, pixel truncation and subsampling proposed in [18], the IP with coarse and fine searches in [26] and the inter-candidate 4-parallel data reuse scheme with 16 2D PE-arrays in [27].
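The exhaustive matching that FSBMA performs can be illustrated with a small software model. The sketch below is a plain, unoptimized Python version for a single fixed-size block. It is not the systolic-array implementation of any of the cited designs, and the function names, the toy frame contents and the search parameter `p` are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in [-p, p] x [-p, p] that keeps the candidate
    block inside the reference frame; return (best_sad, (dx, dy))."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w and
                      0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

# Toy frames: the current frame is the reference moved right by 2 and down by 1,
# so the matching reference block lies at displacement (-2, -1).
ref = [[(7 * x + 13 * y + x * y) % 251 for x in range(16)] for y in range(16)]
cur = [[ref[(y - 1) % 16][(x - 2) % 16] for x in range(16)] for y in range(16)]
best_sad, best_mv = full_search(cur, ref, bx=6, by=6, n=4, p=3)
```

Every candidate costs the same amount of work and is visited in a fixed order, which is the data flow regularity that makes the algorithm attractive for systolic arrays, and also the reason for its high computation and bandwidth demands.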

Finally, other fast ME algorithms, which are alternatives to FSBMA, have been developed with the aim of reducing the computational complexity and establishing a trade-off between efficiency and image quality [28], [29]. However, this reduction often comes at the cost of losses in visual quality and/or irregularities in data flow. This means that these algorithms lead to irregular memory access and difficulty in data reuse, as well as introducing losses in terms of peak signal-to-noise ratio (PSNR). The irregular local search patterns can often lead to a local minimum (a suboptimal result), as opposed to the global minimum found by FSBMA. Examples of fast ME algorithms based on decimation of checking points are the three-step search [30], hexagon search [31] and diamond search [32]. FSBMA architectures are typically implemented with systolic mesh-connected arrays, which are not suitable for the fast ME algorithms because of their unpredictable data flow and hard-to-parallelize sequential control. Although the fast ME implementations achieve good efficiency compared with the FSBMA architectures, they are too rigid for a broad range of applications [33].
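To make the trade-off concrete, the following Python sketch implements a generic three-step search over an abstract cost function and compares it with an exhaustive search. It is a textbook-style illustration, not the algorithm of [30] as published; the function names and both cost surfaces (`bowl` and `trap`) are hypothetical and chosen only to show one case where the fast search succeeds and one where it stops at a local minimum.

```python
def three_step_search(cost, step=4):
    """Three-step search over a cost function cost(dx, dy).

    Starts at (0, 0); at each step it tests the 8 neighbours at the current
    step size, moves to the best strictly better one, then halves the step.
    Returns (best_mv, best_cost, number_of_cost_evaluations).
    """
    cx, cy = 0, 0
    best = cost(cx, cy)
    evals = 1
    while step >= 1:
        nx, ny = cx, cy
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue                  # centre was already evaluated
                c = cost(cx + dx, cy + dy)
                evals += 1
                if c < best:
                    best, nx, ny = c, cx + dx, cy + dy
        cx, cy = nx, ny
        step //= 2
    return (cx, cy), best, evals

def bowl(dx, dy):
    """Smooth surface with a single minimum at (5, -3)."""
    return (dx - 5) ** 2 + (dy + 3) ** 2

def trap(dx, dy):
    """Local minimum at the origin, isolated global minimum at (3, 3)."""
    if (dx, dy) == (3, 3):
        return 0
    return 5 if (dx, dy) == (0, 0) else 10

mv, c, n = three_step_search(bowl)
mv_fast, c_fast, _ = three_step_search(trap)
c_full = min(trap(dx, dy) for dy in range(-7, 8) for dx in range(-7, 8))
```

On `bowl` the search reaches the minimum with 25 cost evaluations instead of the 225 an exhaustive ±7 search would need. On `trap` it never visits (3, 3) and stays at the origin with cost 5, while the exhaustive search finds cost 0. The positions visited also depend on earlier results, which is the data-dependent control flow mentioned above.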

This paper presents a high-performance VLSI processor chip for IME based on the full-search VBSME algorithm for the H.264/AVC video coding standard. The configurable 2D systolic array architecture supports a three-direction scan format for high data reuse of the search area, an array of 16×4 PEs computes the SAD of the basic 4×4 subblocks, and a modified Lagrangian cost is used as the matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. The mode decision module selects the best mode and best MVs by comparing the total minimum Lagrangian costs. A prototype of the IME processor chip was designed in UMC 0.18 μm technology using a standard cell methodology. It achieves enough processing capacity for HDTV (1920×1088 @ 30 fps) with a search range of 32×32 in a high-performance circuit operating at 300 MHz and occupying a reduced area. The remainder of this paper is organized as follows. Section 2 describes the proposed Lagrangian cost to be implemented in a parallel architecture. Section 3 presents a detailed description of the IME architecture, including its main modules: the processing unit, motion estimation, computation of the Lagrangian cost and the mode decision unit. The results and comparisons of the VLSI processor chip implementation are listed in Section 4. Finally, the conclusions are stated in Section 5.
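The real-time target quoted above (1920×1088 @ 30 fps at 300 MHz) implies a cycle budget per macroblock, which the short Python calculation below works out. It is back-of-the-envelope arithmetic rather than a figure reported by the paper, and it assumes that a 32×32 search range corresponds to 1024 candidate positions per macroblock.

```python
FRAME_W, FRAME_H, FPS = 1920, 1088, 30   # HDTV format quoted in the text
MB = 16                                   # macroblock edge in pixels
CLOCK_HZ = 300_000_000                    # 300 MHz clock
CANDIDATES = 32 * 32                      # assumption: 32x32 range = 1024 positions

mbs_per_frame = (FRAME_W // MB) * (FRAME_H // MB)
mbs_per_second = mbs_per_frame * FPS
cycles_per_mb = CLOCK_HZ / mbs_per_second
cycles_per_candidate = cycles_per_mb / CANDIDATES

print(mbs_per_frame, mbs_per_second,
      round(cycles_per_mb), round(cycles_per_candidate, 2))
```

Under these assumptions there are 8160 macroblocks per frame and 244,800 per second, leaving about 1225 clock cycles per macroblock, or roughly 1.2 cycles per candidate position. A full-search design therefore has to evaluate close to one candidate per cycle on average, with little slack left for memory loading or pipeline fill.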

Section snippets

Proposed Lagrangian cost in IME

The reference software JM of H.264, which is available on-line at [3], basically supports two methods to carry out a mode decision in terms of the cost calculation criteria: Motion Vector (MV) cost and Rate Distortion Optimization (RDO) cost. RDO cost is mainly used for selection prediction mode and MV cost in ME. RDO involves forward integer transform, quantization, dequantization or scaling, inverse integer transform and entropy coding. As a result, most real-time hardware encoders do not

Hardware architecture of the IME

Most FSBMA architectures exploit the large amount of overlapped data among adjacent blocks. The proposed architecture for inter-prediction uses a 2-D systolic array to compute the SAD of 4×4 blocks. To exploit that overlapped data, the search area adopts a scan format scheme that supports three scan directions, unlike traditional 2-D systolic arrays with one scan direction. This data reuse saves roughly 30% of the memory access cycles achieving high throughput rate, low memory

ASIC implementation and comparisons

A prototype of the IME processor chip based on the architecture in Fig. 7 has been designed using standard cells in a semicustom methodology. Initially, the processor was described in Verilog. The test bench was made by simulating the design with NC-VHDL from Cadence® and comparing the results obtained with those provided by the rewritten JM reference software to simulate the proposed Lagrangian cost for different input samples and values of QP. This processor was synthesized with Synopsys®

Conclusions

In this paper, we propose a high-performance VLSI processor chip for IME in H.264/AVC based on the full-search block matching algorithm (FSBMA) with enough processing capacity for 1080HD real-time video streaming with a search range of 32×32. The proposed design benefits greatly from a configurable 2D systolic array to obtain a high data reuse of the search area. It supports a three-direction scan format, a computing array of 64 PEs and a modified Lagrangian cost as matching criterion to find

Acknowledgment

We wish to acknowledge the financial help from the Spanish Ministry of Education and Science through TEC2006-12438/TCM.

References (40)

  • J.F. Lopez et al.

    Design of a 270 MHz/340 mW processing element for high performance motion estimation systems application

    Microelectronics Journal

    (2002)
  • Y.L. Xi et al.

    A fast block-matching algorithm based on adaptive search area and its VLSI architecture for H.264/AVC

    Signal Processing: Image Communication

    (2006)
  • ITU-T Rec., H.264/ISO/IEC 14496-10, “Advanced Video Coding,” Final Committee, Document JVTG050,...
  • T. Wiegand et al.

    Overview of H.264/AVC video coding standard

    IEEE Transactions on Circuits and Systems for Video Technology

    (2003)
  • K. Sühring, H.264/AVC Software Coordination, 〈http://iphome.hhi.de/suehring/tml/〉, Fraunhofer Institute for...
  • T.C. Chen et al.

    Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder

    IEEE Transactions on Circuits and Systems for Video Technology

    (2006)
  • T.C. Chen et al.

    Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC

    IEEE Transactions on Circuits and Systems for Video Technology

    (2007)
  • Z. Liu et al.

    HDTV1080p H.264/AVC encoder chip design and performance analysis

    IEEE Journal of Solid-State Circuits

    (2009)
  • Y.K. Lin et al.

    A hardware-efficient H.264/AVC motion-estimation design for high-definition video

    IEEE Transactions on Circuits and Systems-I

    (2008)
  • Y.W. Huang et al.

    Survey on block matching motion estimation algorithms and architectures with new results

    Journal of VLSI Signal Processing

    (2006)
  • J.C. Tuan et al.

    On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture

    IEEE Transactions on Circuits and Systems for Video Technology

    (2002)
  • T. Komarek et al.

    Array architectures for block matching algorithms

    IEEE Transactions on Circuits and Systems

    (1989)
  • L. de Vos et al.

    Parametrizable VLSI architectures for the full-search block-matching algorithm

    IEEE Transactions on Circuits and Systems

    (1989)
  • S.Y. Yap et al.

    A VLSI architecture for variable block size video motion estimation

    IEEE Transactions on Circuits and Systems

    (2004)
  • C.M. Ou et al.

    An efficient VLSI architecture for H.264 variable block size motion estimation

    IEEE Transactions on Consumer Electronics

    (2005)
  • Y. Song et al.

    A VLSI architecture for variable block size motion estimation in H.264/AVC with low cost memory organization

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences

    (2006)
  • M. Kim, I. Hwang, S.I. Chae, A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4...
  • L. Deng et al.

    An efficient hardware implementation for motion estimation of AVC standard

    IEEE Transactions on Consumer Electronics

    (2005)
  • T.C. Chen et al.

    Analysis and architecture design of variable block-size motion estimation for H.264/AVC

    IEEE Transactions on Circuits and Systems-I

    (2006)
  • D.X. Li et al.

    Architecture design for H.264/AVC integer motion estimation with minimum memory bandwidth

    IEEE Transactions on Consumer Electronics

    (2007)