An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC
Highlights
► Full search VLSI processor chip of Integer Motion Estimation in H.264/AVC for 1080HD real-time video. ► Configurable 2D systolic array with high data reuse for variable search area. ► Modified Lagrangian cost for parallel architecture. ► Best mode and best motion vectors computed in a parallel pipeline architecture. ► Reduced area (32.3 k gates and 4.4 kBytes RAM) operating at 300 MHz.
Introduction
H.264/AVC, developed by the Joint Video Team (JVT), is nowadays drawing considerable attention because of its network friendliness and the high video quality achieved at both low and high bit rates [1]. Compared with previous video standards, H.264/AVC can provide up to a 50% gain in coding efficiency for different bit rates and video definitions [2]. Fig. 1 presents a simplified block diagram of the H.264 encoder with the following main blocks: motion estimation (ME), motion compensation (MC), intra prediction, forward transform (FT), forward quantization (FQ), inverse quantization (IQ), inverse transform (IT), entropy coding and de-blocking filter. Initially, most of the work done on H.264 was oriented toward its software implementation [3]. However, in recent years the contributions to the VLSI hardware implementation of H.264 have increased greatly in order to enable the implementation of fast architectures for real-time video applications [4], [5], [6].
ME based on a block-matching strategy is the most important part of H.264/AVC in exploiting the temporal redundancy between successive frames, but it is also the most time-consuming part of the coding framework. It requires large amounts of computation and accounts for 60–90% of encoding time. In H.264, a video frame is first split into macroblocks (MBs) of size 16×16, which are then processed with a variable block size motion estimation (VBSME) approach [1]. This approach provides a better estimation of small and irregular motion fields and allows a better adaptation of the motion boundaries to object boundaries, all with the aim of reducing the number of bits required for coding prediction errors. In VBSME, each MB may be segmented into subblocks of different sizes, as illustrated in Fig. 2. ME is carried out in 7 different modes: one 16×16 MB (Mode 1), two 16×8 subblocks (Mode 2), two 8×16 subblocks (Mode 3) and four 8×8 subblocks (Mode 4). In turn, each 8×8 subblock is also split up into two 8×4 subblocks (Mode 5), two 4×8 subblocks (Mode 6) and four 4×4 subblocks (Mode 7). The total number of possible partitions is 41. ME finds the best candidate for each subblock in this hierarchy in two phases: Integer Motion Estimation (IME) and Fractional Motion Estimation (FME). IME finds the best integer motion vector (MV) for all 41 variable-size blocks. FME refines those MVs to quarter-pixel precision using a 6-tap filter and an MV bit-rate estimation. A pipeline architecture is the best solution to implement IME and FME [4], [6], [7].
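The count of 41 partitions follows directly from the seven modes listed above. The short Python sketch below enumerates them for one 16×16 macroblock; the names `MODES` and `partitions` are illustrative only and do not come from the paper or from the JM software.

```python
# Partition size (width, height) for each of the seven VBSME modes.
MODES = {
    1: (16, 16),
    2: (16, 8),
    3: (8, 16),
    4: (8, 8),
    5: (8, 4),
    6: (4, 8),
    7: (4, 4),
}


def partitions(mb_size=16):
    """Return (mode, x, y, width, height) for every partition of one macroblock."""
    result = []
    for mode, (width, height) in MODES.items():
        # Tile the macroblock with blocks of this mode's size, in raster order.
        for y in range(0, mb_size, height):
            for x in range(0, mb_size, width):
                result.append((mode, x, y, width, height))
    return result


def count_by_mode(parts):
    """Return a dict mapping mode number to the number of partitions in that mode."""
    counts = {}
    for mode, *_ in parts:
        counts[mode] = counts.get(mode, 0) + 1
    return counts
```

Calling `count_by_mode(partitions())` gives 1, 2, 2, 4, 8, 8 and 16 blocks for Modes 1 to 7, which sum to the 41 blocks for which IME must find a motion vector.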
There are many proposed IME algorithms and architectures based on the criteria of search, matching simplification, bit-width reduction, memory access or area/power, among others, where the system designer can choose the best trade-off for a specific application [8], [9]. The full-search block-matching algorithm (FSBMA) in VBSME guarantees the best results by performing exhaustive matching of all candidate blocks. Most hardware video encoder designs adopt FSBMA in IME due to its excellent quality and data flow regularity, although it has the highest computational complexity and memory bandwidth requirements. Systolic arrays [10], [11] composed of locally connected processing elements (PEs) have been considered the best option to implement FSBMAs for several reasons: pipelined data flow via the local connections does not require any control overhead; reuse of data for each PE by propagating it through the array significantly reduces the memory bandwidth; and high clock frequencies and high processing speeds are achieved due to the small load capacitances. 1D systolic arrays are attractive in low-end products because of their low hardware cost and adequate processing capability, but they are costly in terms of latency and efficiency [12], [13], [14]; Ref. [15] presents a 2D architecture based on 1D arrays to attain low latency and high throughput. Indeed, many VLSI implementations propose 2D systolic arrays as more suitable for high-end real-time usage. References which adopt the complete FSBMA include the 2D architecture with a simple regular control in [16], the novel memory-access scheme with minimum off-chip memory bandwidth in [19], the high-performance reconfigurable architecture to support a scan format for a high data reuse within the search area in [20], the bit-serial architecture in [22] and the high-throughput design in [39].
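To make the exhaustive matching concrete, the following is a minimal software sketch of full-search block matching with the sum of absolute differences (SAD) as the matching criterion. It is not the architecture of any of the cited designs. The function names, the convention that a search range R covers displacements in [-R, R), and the synthetic test frames are all assumptions made for illustration.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """SAD between the n x n block of `cur` at (bx, by) and of `ref` at (rx, ry)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Test every displacement (dx, dy) in [-search_range, search_range).

    Returns (best_sad, (dx, dy)), keeping the first candidate found on ties.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range):
        for dx in range(-search_range, search_range):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


def pixel(x, y):
    """Synthetic, non-repeating sample pattern (not real video data)."""
    return 3 * x * x + 5 * y * y + 7 * x * y


# Reference frame, and a current frame whose content is the reference
# displaced by (2, 1), so the true motion vector is known in advance.
ref = [[pixel(x, y) for x in range(16)] for y in range(16)]
cur = [[pixel(x + 2, y + 1) for x in range(16)] for y in range(16)]

result = full_search(cur, ref, bx=4, by=4, n=4, search_range=3)
```

Every candidate position costs one full SAD evaluation, which is why the regular, data-parallel structure of the loop nest maps well onto systolic arrays and why the computational load grows with the square of the search range.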
Modifications of the FSBMA to reduce either hardware or computing time, at the cost of introducing some video quality loss, can be found in the soft algorithm to simplify the predicted MV and the early termination of motion search used in [21], the multi-resolution IME algorithm presented in [23], the adaptive size in the search area depending on the degree of motion activity in [24], [25], the modified algorithm to reduce hardware based on data dependency of motion vector prediction, pixel truncation and subsample proposed in [18], the IP with coarse and fine searches in [26] and the inter-candidate 4-parallel data reuse scheme with 16 2D PE-arrays in [27].
Finally, other fast ME algorithms, which are alternatives to FSBMA, have been developed with the aim of reducing the computational complexity and establishing a trade-off between efficiency and image quality [28], [29]. However, this reduction often comes at the cost of losses in visual quality and/or irregularities in data flow. This means that these algorithms lead to irregular memory access and difficulty in data reuse, as well as introducing losses in terms of peak signal-to-noise ratio (PSNR). The irregular local search patterns can easily lead to a local minimum (a suboptimal result), as opposed to the global minimum found by FSBMA. Examples of fast ME algorithms based on decimation of checking points are the three-step search [30], hexagon search [31] and diamond search [32]. FSBMA architectures are typically implemented with systolic mesh-connected arrays, which are not suitable for the fast ME algorithms because of their unpredictable data flow and sequential control that is hard to parallelize. Although the fast ME implementations achieve good efficiency compared with the FSBMA architectures, they are too rigid for a broad range of applications [33].
This paper presents a high-performance VLSI processor chip for IME based on the full-search VBSME algorithm for the H.264/AVC video coding standard. The configurable 2D systolic array architecture supports a three-direction scan format for a high data reuse of the search area, an array of 16×4 PEs computes the SAD of basic 4×4 subblocks and a modified Lagrangian cost is used as matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. The mode decision module selects the best mode and best MVs by comparing the total minimum Lagrangian costs. A prototype of the IME processor chip was designed in UMC 0.18 μm technology using a standard cell methodology. It achieves enough processing capacity for HDTV (1920×1088 @ 30 fps) with a search range of 32×32 in a high-performance circuit operating at 300 MHz and occupying reduced area. The remainder of this paper is organized as follows. Section 2 describes the proposed Lagrangian cost to be implemented in a parallel architecture. Section 3 presents a detailed description of the IME architecture including its main modules such as the processing unit, motion estimation, computation of the Lagrangian cost and motion decision unit. The results and comparisons of VLSI chip processor implementation are listed in Section 4. Finally, the conclusions are stated in Section 5.
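The reason for computing SADs only at the 4×4 level is that the SAD of any larger partition is the sum of the 4×4 SADs it contains, so all 41 values for one candidate position can be produced by an adder tree. The Python sketch below shows this widely used SAD-reuse idea in software. It is an assumed illustration, not the paper's actual tree pipeline, and it leaves out the Lagrangian rate term. The function name `merge_sads` and the `s[row][col]` layout are hypothetical.

```python
def merge_sads(s):
    """Combine sixteen 4x4 SADs into the SADs of all 41 partitions.

    `s` is a 4x4 grid with s[row][col] holding the SAD of the 4x4 subblock
    in that position of the macroblock. The result maps (width, height) to
    a list of SADs in raster order.
    """
    sad4x4 = [s[r][c] for r in range(4) for c in range(4)]

    # Two horizontally adjacent 4x4 blocks form one 8x4 block.
    sad8x4 = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]
    # Two vertically adjacent 4x4 blocks form one 4x8 block.
    sad4x8 = [s[r][c] + s[r + 1][c] for r in (0, 2) for c in range(4)]

    # Two vertically stacked 8x4 blocks form one 8x8 block.
    # sad8x4 holds two entries per row of 4x4 blocks: index = row * 2 + column.
    sad8x8 = [
        sad8x4[(2 * br) * 2 + bc] + sad8x4[(2 * br + 1) * 2 + bc]
        for br in range(2)
        for bc in range(2)
    ]  # order: top-left, top-right, bottom-left, bottom-right

    sad16x8 = [sad8x8[0] + sad8x8[1], sad8x8[2] + sad8x8[3]]  # top, bottom
    sad8x16 = [sad8x8[0] + sad8x8[2], sad8x8[1] + sad8x8[3]]  # left, right
    sad16x16 = [sad16x8[0] + sad16x8[1]]

    return {
        (4, 4): sad4x4,
        (8, 4): sad8x4,
        (4, 8): sad4x8,
        (8, 8): sad8x8,
        (16, 8): sad16x8,
        (8, 16): sad8x16,
        (16, 16): sad16x16,
    }
```

In this sketch each level reuses the sums of the level below, so the absolute differences themselves are computed only once per candidate position.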
Section snippets
Proposed Lagrangian cost in IME
The reference software JM of H.264, which is available on-line at [3], basically supports two methods to carry out a mode decision in terms of the cost calculation criteria: Motion Vector (MV) cost and Rate Distortion Optimization (RDO) cost. RDO cost is mainly used for selecting the prediction mode, and MV cost for ME. RDO involves forward integer transform, quantization, dequantization or scaling, inverse integer transform and entropy coding. As a result, most real-time hardware encoders do not
Hardware architecture of the IME
Most FSBMA algorithms exploit the large amounts of overlapped data among the adjacent blocks. The proposed architecture for inter-prediction uses a 2-D systolic array to compute the SAD of 4×4 blocks. To exploit that overlapped data, the search area adopts a scan format scheme that supports three scan directions, unlike traditional 2-D systolic arrays with one scan direction. This reused data saves roughly 30% of the memory access cycles achieving high throughput rate, low memory
ASIC implementation and comparisons
A prototype of the IME processor chip based on the architecture in Fig. 7 has been designed using standard cells in a semicustom methodology. Initially, the processor was described in Verilog. The test bench was made by simulating the design with NC-VHDL from Cadence® and comparing the results obtained with those provided by the rewritten JM reference software to simulate the proposed Lagrangian cost for different input samples and values of QP. This processor was synthesized with Synopsys®
Conclusions
In this paper, we propose a high-performance VLSI processor chip for IME in H.264/AVC based on the full-search block matching algorithm (FSBMA) with enough processing capacity for 1080HD real-time video streaming with a search range of 32×32. The proposed design benefits greatly from a configurable 2D systolic array to obtain a high data reuse of the search area. It supports a three-direction scan format, a computing array of 64 PEs and a modified Lagrangian cost as matching criterion to find
Acknowledgment
We wish to acknowledge the financial help from the Spanish Ministry of Education and Science through TEC2006-12438/TCM.
References (40)
- et al. Design of a 270 MHz/340 mW processing element for high performance motion estimation systems application. Microelectronics Journal (2002)
- et al. A fast block-matching algorithm based on adaptative search area and its VLSI architecture for H.264/AVC. Signal Processing: Image Communication (2006)
- ITU-T Rec., H.264/ISO/IEC 14496-10, “Advanced Video Coding,” Final Committee, Document JVTG050,...
- et al. Overview of H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology (2003)
- K. Sühring, H.264/AVC Software Coordination, 〈http://iphome.hhi.de/suehring/tml/〉, Fraunhofer Institute for...
- et al. Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Transactions on Circuits and Systems for Video Technology (2006)
- et al. Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology (2007)
- et al. HDTV1080p H.264/AVC encoder chip design and performance analysis. IEEE Transactions of Solid-State Circuits (2009)
- et al. A hardware-efficient H.264/AVC motion-estimation design for high-definition video. IEEE Transactions on Circuits and Systems-I (2008)
- et al. Survey on block matching motion estimation algorithms and architectures with new results. Journal of VLSI Signal Processing (2006)
- On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture. IEEE Transactions on Circuits and Systems for Video Technology
- Array architectures for block matching algorithms. IEEE Transactions on Circuits and Systems
- Parametrizable VLSI architectures for the full-search block-matching algorithm. IEEE Transactions on Circuits and Systems
- A VLSI architecture for variable block size video motion estimation. IEEE Transactions on Circuits and Systems
- An efficient VLSI architecture for H.264 variable block size motion estimation. IEEE Transactions on Consumer Electronics
- A VLSI architecture for variable block size motion estimation in H.264/AVC with low cost memory organization. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
- An efficient hardware implementation for motion estimation of AVC standard. IEEE Transactions on Consumer Electronics
- Analysis and architecture design of variable block-size motion estimation for H.264/AVC. IEEE Transactions on Circuits and Systems-I
- Architecture design for H.264/AVC integer motion estimation with minimum memory bandwidth. IEEE Transactions on Consumer Electronics