Keywords

1 Introduction

The resolution of today’s video is much higher than before, and it brings huge challenges to the limited network. What’s more, the attractive 3D videos have increasingly come into the public sight, which give people even better quality of experience. How to capture and compress the videos becomes a big problem. High Efficiency Video coding (HEVC) [1], as the latest video coding scheme, has a very high compression efficiency by exploiting the spatial and temporal structure for the video sequences, but it is not suited for the inexpensive video recording devices such as cellphones, wireless video cameras, which have limited computing capability and battery capacity.

Compressed sensing (CS) [2], as a novel signal processing theory, can acquire a signal at a sampling rate much lower than Nyquist rate via linear projection onto a random basis, and the original signal can be reconstructed through optimization method with high probability from some random measurements under certain conditions.

Given a signal \( {\mathbf{x}}\, \in \,R^{N} \) with length \( N \), it can be called \( K \) -sparse in a domain \( {\varvec{\Psi}} \) when \( K \) entries in its transform coefficients \( {\varvec{\uptheta}}\,{\mathbf{ = }}\,{\varvec{\Psi}}^{T} {\mathbf{x}} \) are nonzero, and \( {\varvec{\Psi}} \) is an orthonormal basis here. CS attempts to reconstruct the \( K \) -sparse signal vector from a relatively small number of samples with linear projection \( {\mathbf{y}}\,{\mathbf{ = }}\,{\varvec{\Phi}}{\mathbf{x}} \), and the size of \( {\varvec{\Phi}} \) is \( M \) by \( N \). The dimension of measurement vector \( {\mathbf{y}} \) is \( M \), and the values of the nonzero coefficients can be well recovered if \( M\, \ll \,N \).

In CS, instead of encoding all the coefficients of a signal, we only encode the \( M \) measurements and the reconstruction problem can be solved with the following \( l_{0} \) minimization method

$$ \hbox{min} \left\| {{\varvec{\Psi}}{\mathbf{x}}} \right\|_{0} ,s.t.\quad {\mathbf{y}}\,{\mathbf{ = }}\,{\varvec{\Phi}}{\mathbf{x}} $$
(1)

where \( \left\| {\mathbf{x}} \right\|_{0} \) is a pseudo-norm (\( l_{0} \)-norm), which equals the number of nonzero elements in vector \( {\mathbf{x}} \). Minimizing the number of nonzero entries is difficult. Instead, the optimization problem can be solved with \( l_{1} \) minimization method

$$ \hbox{min} \left\| {{\varvec{\Psi}}{\mathbf{x}}} \right\|_{1} ,s.t.\quad {\mathbf{y}}\,{\mathbf{ = }}\,{\varvec{\Phi}}{\mathbf{x}} $$
(2)

This is a convex optimization problem and can be solved easily via subspace pursuit [3] or CVX toolbox [4] and so on.

CS has a great potential in image and video applications for its low complexity and low power consumption. In CS based image compression techniques, an image is divided into \( n \) small blocks, then each one is rearranged into a vector \( {\mathbf{x}} \). Next, the vector is sampled with the measurement matrix \( {\varvec{\Phi}} \). Finally, the original image can be reconstructed by many state-of-the-art algorithms such as BCS-SPL [5], SGSR [6], GSR [7] and so on.

As for the application of CS to video compression [8,9,10,11,12], the sparsity between the successive frames with the classic transform domain (e.g. DWT, DCT) is exploited based on the adjacent frames’ correlation in video sequences. In [8], authors proposed a reconstruction model based on the idea that the total variation (TV) norm of the residual between the frame and its prediction, TV norm of the frame, \( l_{0} \)-norm of the frame in a certain transform domain are all very small. The similar idea appeared again in [9] which introduced the forward and the backward motion-compensated residuals. The support (location of large valued entries) is estimated based on the idea that the large valued entries belonging to the adjacent frames are located in almost the same place [10]. A hierarchical frame structure was proposed to exploit the correlation between the current frames and the reference frames in [11].

Unlike above methods, Distributed Compressed Video Sensing (DISCOS) framework is introduced in [13], which present the idea that the sparsest representation of a block is a linear combination of a few temporal neighboring blocks of previous reconstructed frames or nearby key frames. The same method also can be found in [14]. However, if the blocks are non-rigid objects (dancer in Fig. 1) whose shapes change a lot. And their sparsest representation is not a linear combination of a few temporal neighboring blocks, which means that it is unable to be well recovered even at a very high sampling rate by using DISCOS algorithm. As the result illustrated in Fig. 1, most of the regions in the frame are decoded perfectly, except for the dander in the background. Because the regions near the dancer change a lot and cannot be represented by the reference ones no matter how high the sampling rate is. So these blocks should be recovered by the image CS algorithms (INTRA). Moreover in [14, 15], a feedback channel was used to allocate the different measurement rates to each block at the encoder. Although it improved the image quality, it may lower the efficiency of the encoder.

Fig. 1.
figure 1

The performance of DISCOS (the 5th frame and the sampling rate is 0.08)

Traditional video coding techniques, as we all know, can achieve high compression ratio by making complicated mode decision in the encoder. And can we perform the mode decision both in the encoder and decoder in the framework of CS? Therefore, we propose a new CS based video coding framework based on mode decision in this paper. We employ different measurement rates for different block modes at the encoder, and perform mode decisions at the decoder side.

The paper is organized as follows. Section 2 proposes CS video scheme based on adaptive coding. Experimental results are given in Sect. 3, and we conclude this paper in Sect. 4.

2 Adaptive Coding Based on Video CS

Motivated by the above analysis, we propose the adaptive coding for compressed video sensing scheme (ACCS). The architecture of our proposed ACCS framework is depicted in Fig. 2, and we will show the details in the next subsections.

Fig. 2.
figure 2

Block diagram of the proposed CS scheme.

2.1 Adaptive CS Video Encoder

At the encoder, firstly, video sequences are divided into group of pictures (GOP), and each GOP contains two categories: key frames (K-frames) and non-key frames (CS frames), as shown in Fig. 3 (K stands for key frames, and CS stands for non-key frames). Be different from [14], the K-frames and CS-frames in the paper are both coded using CS principles. K-frames are sampled at a high sampling rate, while CS-frames are sampled at adaptive sampling rate.

Fig. 3.
figure 3

GOP structure. Every CS-frame is recovered with two K-frames.

In this paper, we propose an adaptive sampling method for CS-frames by considering different characteristics of blocks in a frame. Firstly, we classify the blocks into three kinds of mode: SKIP, COMBINED and INTER based on the \( l_{1} \) -norm of the residual between the current frame and the reference ones. If the residual is very small, which means the block changes very little with the co-located blocks in the reference frames, we only transmit a flag indicating the SKIP mode, which does not need any measurements so the sampling rate here is zero. If a block has very large change with respect to its co-located block, we will assign the COMBINED mode to it, which needs a high sampling rate and would be decoded both by INTRA and INTER. While the remainder blocks with minor change will take a small sampling rate, we refer to this coding mode as INTER mode, which decoded by INTER. The decoder schemes for three modes (SKIP, INTRA, INTER) will be explained in the next section. Figure 4 shows three kinds of mode of blocks in image, which is marked by different colors. The dark-red blocks show the COMBINED blocks, the light-red ones choose the INTER mode, and the rest blocks select the SKIP mode. The above description can be represented as follows

Fig. 4.
figure 4

The mode decision: (a) foreman (b) news.

$$ MODE\, = \,\left\{ {\begin{array}{*{20}c} {SKIP,} & {difference\, < \,threshold1} \\ {COMBINED,} & {threshold1\, < \,difference\, < \,threshold2} \\ {INTER,} & {difference\, > \,threshold1} \\ \end{array} } \right. $$
(3)

where \( difference \) means the \( l_{0} \) -norm of the difference between the blocks in current frame and the references. \( thresholds1 \) and \( threshold2 \) are thresholds, respectively, which are set by experimental results.

2.2 Adaptive CS Video Decoder

At decoder, we decode the K-frame by GSR, which can achieve the excellent performance in CS without using the information of the reference frames. For CS-frames, there are three kinds of blocks, and each of them will adopt different decoding schemes. The SKIP blocks can be decoded by copying the co-located block in the reference frames. The INTER blocks can be decoded by solving Eq. (2), whose redundant dictionary comes from the blocks near the co-located blocks in previously reconstructed frames. COMBINED blocks, which are the most complicated blocks, will perform the mode decision at the decoder. It recovered both by GSR (INTRA) and Eq. (2) (INTER). INTER blocks can be represented by the redundant dictionary sparsely, which is suitable for the blocks with complex local details but simple motion, while GSR cannot reconstruct the details finely at a low sampling rate, although it is a state-of-the-art scheme. As the blocks with complicated motion cannot be represented by the reference ones no matter how high the sampling rate is, they should be recovered by GSR. Then, we will decide which recovered block to be chosen by comparing the residual of the measurements based on the idea that the residual between the original (\( {\mathbf{x}} \)) and the prediction (\( {\hat{\mathbf{x}}} \)) is proportion to the residual of the measurements, as shown in (4). When the residual between the original measurements and the recovered measurements (by INTER) is smaller than the residual between the original and the recovered (by INTRA), we will choose block recovered by INTER and vice versa.

$$ {\mathbf{y}}\,{\mathbf{ - }}\,{\hat{\mathbf{y}}}\,{\mathbf{ = }}\,{\varvec{\Phi}}{\mathbf{(x}}\,{\mathbf{ - }}\,{\mathbf{\hat{x})}} $$
(4)

3 Experiment Results and Analysis

To evaluate the performance of our proposed scheme, several CIF video sequences are used: news, foreman, football, hall, coastguard, mobile, which represent for small, moderation and large movement videos. In our experiments, the CS measurements are obtained by using random matrix with Gaussian i.i.d entries at the block level, and the size of block (BS) is set to 32 * 32. \( threshold1 \) and \( threshold2 \) determine the coding modes for blocks. In order to determine the \( threshold1 \) and \( threshold2 \), we design a lot of experiments as follows. Firstly, we do not use the \( threshold2 \), and let the \( threshold1 \) vary from 0.05 * BS * BS to 5 * BS * BS. We can find from Fig. 5 that the PSNR is high when \( threshold1 < 3 \), and we choose \( threshold1\,{ = }\,2\,*\,\text{BS}\,*\,\text{BS} \) by considering all sequences. Secondly, in order to determine \( threshold2 \), we set the \( threshold1 \) to 2 * BS * BS and \( threshold2 \) varies from 1 * BS * BS to 10 * BS * BS. As shown in Fig. 6, we choose \( threshold2\,{ = }\,8\,*\,\text{BS}\,*\,\text{BS} \) by considering all sequences. In order to improve the compression ratio, the GOP size is set to 8, the first frame and the eighth frame are K-frames and others are CS-frames, as shown in Fig. 3. We assumed K-frames are losslessly available at the decoder in our experiments, as used in [14]. The high sampling rate (used in COMBINED mode) is five times the low one (used in INTER mode). We compare our proposed algorithm with two state-of-the-art image/video CS methods including GSR [7] and DISCOS [13]. GSR is an excellent still-image CS approach but smooths the details of the image. DISCOS makes full use of the inter-frame sparsity but it will introduce the blocking artifacts. Coding efficiency is measured using PSNR and sampling rate.

Fig. 5.
figure 5

Recovery performance changes with \( threshold1 \) in different sequences: (a) coastguard (b) football.

Fig. 6.
figure 6

Recovery performance changes with \( threshold2 \) in different sequences: (a) coastguard (b) football.

Fig. 7.
figure 7

Visual quality comparison with different algorithm for sequences (the 5th frame): (a) origin, (b) DISCOS [13], (c) GSR [7], (d) Proposed.

Figure 8 shows the rate-distortion curves of the proposed and the other approaches of all the test sequences. We can see that the rate-distortion performance of our approach is superior to GSR and DISCOS in most cases. Especially, for news and hall sequences with non-rigid objects, the proposed method achieves better coding performance compared to other two methods. For football sequence with large/complicated movement, our scheme will choose the INTRA mode in most cases, so the performance is similar to GSR. For Mobile sequence with large movement, different to football sequence, DISCOS achieves the best performance when the sampling rate is very low. This is due to the fact that mobile sequence has large motion but little shape-changing. Our scheme may choose the wrong encoding mode when the sampling rate is very low. However with the increase of sampling rate, our method achieves better performance than DISCOS. Besides, PSNR gain for DISCOS is less with the sampling rate increased [14]. Because the blocks have simple motion can be recovered perfectly even at a very low sampling rate, while the blocks have complicated motion which are not sparse in redundant dictionary cannot be reconstructed well.

Fig. 8.
figure 8

Recovery performance comparison with different algorithms: (a) news, (b) foreman, (c) football, (d) hall, (e) coastguard, (f) mobile.

Figure 7 shows the visual quality comparison for the foreman and news at the average rate about 0.04 (the actual sampling rates of our proposed scheme are 0.044 and 0.043 in foreman and news respectively) and for football at the sampling rate about 0.2. From Fig. 7, we can see that the proposed algorithm provides better visual quality than others. Table 1 provides the corresponding average execution time of various algorithms for reconstructing a frame in different sequences. These data are obtained using Matlab on a computer with Intel i5-3230, 2.6G CPU and 4 GB memory. From Table 1, we can see that the complexity of the proposed algorithm can be the medium one in most cases. Therefore, the proposed method provides a good tradeoff between visual quality and complexity.

Table 1. Reconstruction time in minutes.

4 Conclusion

In this paper, a new ACCS method has been proposed, which makes full use of the intra-frame and inter-frame sparsity. Our algorithm also exploits the fact that when the sequences have large or non-rigid motion, the sparsest representation of a block in video sequences is not a linear combination of a few temporal neighboring blocks that are in nearby key frames. Experimental results show that the proposed algorithm can achieve excellent performance even at a very low sampling rate, and outperforms existing state-of-the-art image/video CS approaches.