1 Introduction

The H.264/MPEG-4 AVC video-coding standard was developed by a Joint Video Team (JVT) consisting of experts from the Moving Picture Experts Group (MPEG) of the ISO/IEC and the Video Coding Experts Group (VCEG) of the ITU-T. The standard is used worldwide because of its superior performance compared to the previous standard. With PSNR 2 and 3 dB higher than MPEG-4 and H.263, respectively, it has become a favourite video-coding standard. Furthermore, with the same video quality, H.264 can save 50 % bit-rate compared to MPEG-4 [13].

However, these advantages come at the cost of an increase in computational complexity: the encoding and decoding times increase three and two times, respectively, compared to H.263. Energy consumption also increases. For H.264, motion estimation (ME) consumes 70 % (1 reference frame) to 90 % (5 reference frames) of the total encoding time of H.264 [46].

This paper discusses the hardware implementation of the improved motion estimation algorithm used in H.264 reference software, known as the unsymmetrical multi-hexagon search algorithm (UMHexagonS). In our proposed method, the UMHexagonS algorithm is modified by implementing a quadrant-based multi-octagon-grid search algorithm (QBMO) in the fourth stage of the UMHexagonS algorithm. This method is able to reduce the number of candidates, and hence also the energy consumption.

The rest of this paper is organized as follows. Section 2 discusses the existing motion estimation using the UMHexagonS algorithm. Section 3 outlines our proposed architecture. The result obtained from our design is discussed in Sect. 4. Section 5 concludes the paper.

2 Literature review

UMHexagonS is a fast-search motion-estimation algorithm that consists of several fast-search techniques. The steps are initial-search-point prediction, unsymmetrical cross search, small full search, uneven multi-hexagon-grid search, extended hexagon-based search, and small diamond search [79]. Several methods have been proposed for ME implementation in H.264, since this is crucial to determine the quality and speed of the encoder. One of the earliest UMHexagonS architectures was proposed in [10], as shown in Fig. 1. It makes use of the ability of H.264 to process five reference frames in real time, and consists of six processing units (PU), five buffers of 16 × 15 bytes each, one reference block memory of 16 × 16 bytes, one 4 × 16 reference block buffer and a comparison unit. The reference memory is made up of shift registers to allow data to be moved vertically or horizontally, depending on the required data direction. The PU architecture, as shown in Fig. 2, contains 16 processing elements, 12 delay registers and 37 adders. The 16 processing elements are divided into four groups. The values of sum of absolute differences (SAD) are calculated by the adder tree using the output of each PU. Then, the output is sent to the comparison unit for motion vector (MV) calculation.

Fig. 1
figure 1

Architecture proposed in [10]

Fig. 2
figure 2

Processing unit of architecture proposed in [10]

The method proposed in [11] focuses on the trade-off between small gate counts and high throughput. The block diagram of the architecture is shown in Fig. 3. The architecture contains dual-port SRAM modules (DPSRAM), 16 simple buffers, 16 basic search units (BSU) and a comparator. The DPSRAM module is used as a buffer to store the current and reference frame. Each BSU contains 4 × 4 processing elements (PE) to calculate the SAD of a single 4 × 4 block at a time.

Fig. 3
figure 3

Architecture proposed in [11]

The proposed architecture in [12] uses a modified UMHexagonS algorithm known as simplified unified multi-hexagon (SUMH) [13] to outperform the UMHexagonS. As illustrated in the top-level diagram shown in Fig. 4, the architecture consists of search window (SW) memory, current macroblock (CMB) memory, an address generation unit (AGU), a control unit, a block of processing units (PUs), a SAD combination tree, a comparison unit, and a register for storing the 41 minimum SADs and their associated MVs. Both internal memories in the architecture use dual-port block RAMS (BRAMS) to store the SW and CMB. The SW memory consists of N 16 × 16 BRAMS, where N represents the numbers of macroblock (MB) candidates. Figure 5 details the implementation of the processing unit. From Fig. 5, it can be seen that each PU contains 16 PE, and each PE calculates a single 4 × 4 block.

Fig. 4
figure 4

Top-level diagram proposed in [12]

Fig. 5
figure 5

Processing unit of the architecture in [12]

3 Proposed algorithm and architecture

This work aims to reduce the energy consumed by video compression. This is achieved by reducing the complexity of the ME by directly reducing the search point without sacrificing the video quality. As discussed in the previous section, the conventional UMHexagonS algorithm consists of six steps: (a) initial search point prediction, (b) unsymmetrical-cross search, (c) small full search, (d) uneven multi-hexagon-grid search, (e) extended hexagon-based search, and (f) small diamond search, with the number of search points for each step being 1, 24, 25, 64, 6 and 4, respectively [79].

In this work, the step with the highest search point will be focused upon since reducing these search locations will yield lower computational load and thus lowering energy consumption. An uneven multi-hexagon-grid search has a very high search point of 64, constituting half of the total search points for UMHexagonS algorithm. Based on this observation, the uneven multi-hexagon-grid search will be focused upon here to reduce the overall search points.

The conventional UMHexagonS algorithm emphasizes the left and right sides of the search range to deal with horizontal video motion, especially when performing an uneven multi-hexagon-grid search. The result of this is further refined by performing an extended hexagon search and a small diamond search. Thus, the emphasis for horizontal video motion can be removed from the uneven multi-hexagon-grid search, since this task can be performed by an extended hexagon and a small diamond search.

Based on the above observation, to reduce the computational load for the conventional uneven multi-hexagon-grid search, this paper proposes to replace this search with a method called quadrant-based multi-octagon-grid search (QBMO). In this proposed method, the uneven multi-hexagon-grid search is first replaced by a multi-octagon-grid search as shown in Fig. 6. Furthermore, instead of performing the search on all the multi-octagon candidates, this method performs the search on only one quarter of the multi-octagon-grid candidates. This is done by dividing the multi-octagon area into four quadrants, as shown in Fig. 7. Only one quadrant will be used to look for the motion vector MV. The quadrant is chosen based on the location of the MV of the same block in the previous frame. For example, if the MV of the previous frame is located in the top left of the quadrant shown in Fig. 8a, that particular quadrant will be used for the search, as shown in Fig. 8b. By implementing the multi-octagon-grid search, the total candidates for the search are reduced from 64 to only eight during this step. This is equivalent to search point reduction of more than 80 % compared to the original uneven multi-hexagon-grid algorithm.

Fig. 6
figure 6

Multi-octagon-grid search

Fig. 7
figure 7

Four quadrants of the multi-octagon-grid search

Fig. 8
figure 8

Determining the multi-octagonal search quadrant: a the triangle mark is the MV of the block located in the previous frame, b the chosen quadrant for the current block

To compare the complexity of the proposed algorithm, Table 1 summarizes the total number of search points for the conventional UMHexagonS (UMHS) and the modified UMHexagonS with QBMO replacing the multi-hexagon search (QBMO). From Table 1, it can be seen that the conventional UMHexagonS shows the highest number of search points at 123. QBMO shows the total number of search points of 67. This 45.53 % reduction of search points is caused by implementation of QBMO at the fourth step of UMHexagonS.

Table 1 Summary of search points for each ME algorithm

The proposed hardware implementation of the UMHexagonS algorithm with QBMO is illustrated in Fig. 9. It consists of six main parts: a control unit, a current block buffer (CBB), a reference block buffer (RBB), a processing unit (PU), an adder tree (AT), a comparison unit (CU) and a control unit. The control unit plays a vital role in the architecture since it ensures the correct operation of the whole module in performing the proposed algorithm. This unit consists of a state machine as shown in Fig. 10. In total, the state machine contains 12 states with 5 states controlling the top-level operation (s1, s2, s3, s4 and s5) and another five states controlling the PU sub-operation (st2, st3_1, st3_2, st4_1 and st4_2). The other two states are used for initialization and reset (srset and start).

Fig. 9
figure 9

Top-level architecture of the proposed algorithm

Fig. 10
figure 10

State machine flow for control unit

For each MB, the process starts with a global initialize and reset state. Then, the buffer units (CBB and RBB) are initialized during states s1 and s2. Stages st2 to st4_2 are then executed. During the execution of these states, the state that performs PE, AT and CU is called up repeatedly until the final candidate has been evaluated. Then, it returns to the start state to be ready to process the next MB. This process is repeated until the last MB of the frame.

The RBB consists of a 2,401-byte data buffer to store a 48 × 48 search window. The CBB consists of a 256-byte data buffer to store a 16 × 16 current block. The data are stored in the respective buffer before it is sent to the PU. Thus, the CBB and RBB require 256 and 2,401 clock cycles, respectively, to store the data into the buffer. Both modules require 256 clock cycles to read out the current MB and reference MB data into PU.

The PU consists of 256 process elements (PE), as shown in Fig. 11. Each PE calculates one absolute difference between the current and reference MB pixel. The PU will produce 256 absolute differences in one clock cycle. Once the absolute difference has been calculated, all 256 outputs of the PU will then be sent to the adder tree to calculate the SAD.

Fig. 11
figure 11

Processing unit (PU) diagram

The adder tree will first calculate the 256 SAD before recombining the results to generate 41 final SAD, which represent the SAD of seven different block sizes, as shown in Fig. 12. In the adder tree, the calculation is performed at five levels, where each level takes its input from the previous level. Each level generates different SAD for different block sizes. The first level generates SAD for the 4 × 4 block. The second level generates SAD for the 4 × 8 and 8 × 4 blocks. The third level generates the SAD for the 8 × 8 block. The fourth level generates SAD for the 8 × 16 and 16 × 8 blocks. Finally, the fifth level generates SAD for the 16 × 16 block. This adder tree requires four clock cycles to calculate all the 41 SAD.

Fig. 12
figure 12

Adder tree diagram

The diagram of the comparison unit is shown in Fig. 13. In this unit, each of the 41 SAD outputs will then be compared with the previously calculated SAD. The minimum SAD value will be kept temporarily for the next comparison. At the same time, the lowest SAD block location will be kept, since it is used during the motion vector (MV) calculation. The comparison unit requires one clock cycle to perform the operation.

Fig. 13
figure 13

Comparison unit diagram

4 Results and discussion

To implement the proposed QBMO algorithm, the algorithm is included in the reference software version JM17.2 by replacing the conventional uneven multi-hexagon-grid search. The simulation is performed using the standard encoder baseline profile, with the parameters as listed in Table 2. Six test video sequences (CIF format) categorized into three motion types are used. These are: aggressive motion (Bus and Football video), medium motion (Foreman and Silent video) and low motion (News and Hall video). Since there is no standard ways to measure the picture quality metrics, PSNR value is used as the measure throughout this work. This helps us in comparing our result with the existing works. Furthermore, in cases where accuracy is important, subjective quality will be used as the final comparison.

Table 2 H.264 parameters for the algorithms’ simulation

As discussed in the previous section, the computational load reduction for QBMO is achieved by reducing the number of search locations. However, this reduction could cause a drop in the prediction quality. To evaluate the prediction quality for the proposed QBMO algorithm, the reference software is simulated and the output is monitored in terms of PSNR, motion estimation simulation time and bit-rate. The result is tabulated as shown in Table 3.

Table 3 Motion estimation execution time result

From Table 3, the QBMO is able to achieve ME encoding time (MET) reduction without significantly sacrificing the quality of the video and the bit-rate. The average decrease in PSNR is less than only 0.01 dB with a maximum PSNR drop of 0.02 dB. These small changes in PSNR, however, will not be visible to the human eye [14].

Similarly, the reduction of the search candidate could affect the bit-error-rate optimization. From Table 3, on average of only a 0.5 % bit-rate increase can be observed for the QBMO, as compared to the conventional UMHexagonS. The worst bit-rate increment for the QBMO is 1.18 % or 18.72 kb/s.

To further evaluate the algorithm performance in real-world implementation, Fig. 14a–c show PSNR vs. bit-rate for three video sequences that represent each type of video motion. From Fig. 14, the QBMO can be seen to closely match the conventional UMHexagonS algorithm. This shows that the QBMO is effective in reducing the computational load of the UMHexagonS without sacrificing the overall H.264 compression efficiency.

Fig. 14
figure 14

PSNR vs. bit-rate for various frame sequence a News, b Foreman, and c Bus

Based on the results shown in Table 3 and Fig. 14, it is clear that the proposed QBMO is able to reduce the computational load without degrading the video output quality significantly. One of the reasons for this is the implementation of the multi-octagon search with the selective quadrant method. Since neighbouring videos have high similarity in terms of video motion, using the previous frame as the guide to determine the selected quadrant results in good prediction accuracy. This proposed method therefore does not degrade the picture quality and compression efficiency significantly.

In order to compare the algorithm performance with the standard method, the proposed QBMO algorithm is implemented in the hardware. The architectures are coded using Verilog hardware description language and simulated using VCS from Synopsys Inc. During the behavioural simulations, the architectures’ operation and the timing of the simulated architecture are verified. Once synthesized, the gate-level simulation is performed to verify the timing and energy consumption of the architectures.

The motion estimation behaviour is simulated using a test bench that contains a sample video with known MV. The test bench is fed into the architecture and the simulation output is monitored. The calculated MV must match the known MV to ensure the architecture functions as intended.

Table 4 shows the clock cycle requirement for both the conventional and the proposed algorithm. In total, the conventional UMHexagonS algorithm requires 123 search points per MB. From the simulation results, the algorithm requires 34,632 clock cycles to calculate a single MB, and consumes 13,678,852 clock cycles to process one video frame.

Table 4 Clock cycle to process one macroblock

The proposed QBMO algorithm has similar clock cycle requirements to the conventional architecture, except that it only evaluates 67 search locations per MB. Due to the lower number of candidates, as the simulation results show, the proposed algorithm requires 19,960 clock cycles to calculate a single MB and 7,883,412 clock cycles to process one frame sample video. Thus, the proposed algorithm results in 42 % fewer clock cycles compared to the conventional UMHexagonS architecture. This proves that the proposed QBMO algorithm greatly saves on clock cycle requirements.

In order to evaluate the energy consumption of the proposed algorithm, the architectures are compiled and simulated using Power Compiler from Synopsys Inc. From the simulation result, the total energy consumption can be deduced. Table 5 shows the power result for each architecture module. From Table 5, it can be seen that the power consumption for both the conventional and the improved UMHexagonS are similar, with PU consuming the largest amount of power (40 %), followed by control unit (23 %) and RBB (19 %).

Table 5 Breakdown of power analysis for each architecture module

In order to determine the total energy for the proposed algorithm, the results obtained from the simulated power are used to calculate the energy needed to process one frame. Table 6 shows the result of the energy consumption for the proposed algorithm. From Table 6, it is clear that the proposed architecture uses less energy (16.8 mJ) compared to the conventional UMHexagonS algorithm (29.7 mJ). The results show that the proposed architecture is able to save on energy by 43 % compared to the conventional method. The reduction in energy is due to the proposed QBMO algorithm requiring fewer clock cycles to complete the same task as the conventional UMHexagonS.

Table 6 Result for energy analysis

5 Conclusion

This paper has discussed the design of low-energy motion estimation for H.264 video coding. An improved UMHexagonS algorithm has been presented to reduce the computational load. The algorithm implements quadrant-based multi-octagon-grid search (QBMO) in the fourth step of the UMHexagonS algorithm. Based on the experimental results, the proposed architecture is able to reduce energy consumption by 43 % compared to the conventional algorithm, without affecting the prediction quality.