
1 Introduction

The Internet is now an inseparable part of our daily life. Almost every day we read news on news sites, look up information through search engines, and watch pictures and videos online. In recent years the Internet, as a large-scale information carrier, has accumulated a rapidly growing image database. Facial images are an important part of this collection and are widely used in various applications. Effective facial image coding is therefore vital for the storage and transmission of facial images.

From the earliest standard, JPEG, to the recently proposed HEVC/H.265, the development of mainstream image coding methods has taken more than two decades, and several classical coding algorithms have emerged, including JPEG2000 [1], AVC/H.264 and HEVC (High Efficiency Video Coding) [2]. H.264 uses intra-frame prediction in the spatial domain to remove redundancy between adjacent image blocks. JPEG2000 is a representative transform-domain coding method, which transforms the image into the wavelet domain and then compresses it by quantizing the wavelet coefficients. Since cloud databases are huge, it is very likely that similar images can be found in them. For facial images, even of different people, the similarity can reach 70%. If database resources can be used to assist image coding, a higher compression ratio may be achieved.

Currently there are many visual feature extraction algorithms. For a facial image, we can extract local features around the sense organs and the hair area. These local features reflect the portrait characteristics well and are invariant in many cases. We therefore propose to encode facial images according to their local features, preserving the feature areas while appropriately neglecting the non-feature areas.

In this paper, we propose a feature-based coding method for facial images in the transform domain. First, the input image is decomposed by a multi-level wavelet transform, which yields one low frequency sub band (LF sub band) and several high frequency sub bands (HF sub bands). We use local features to describe the HF sub bands. The encoded bit stream consists of the encoded feature information and the encoded LF sub band. On the decoder side, the decoded feature information is matched against features in a database to find similar patches, which are used to restore the HF sub bands. The final reconstructed image is obtained by an inverse wavelet transform of the decoded LF sub band and the restored HF sub bands.

The rest of this paper is organized as follows: Sect. 2 introduces work related to the proposed algorithm; Sect. 3 describes our coding method in detail; Sect. 4 presents the experimental results; conclusions are given in Sect. 5.

2 Related Works

In 2004, Lowe proposed the SIFT algorithm [3]. SIFT features have the advantage of scale and rotation invariance. In the SIFT algorithm, a scale space is established first, and its extreme values are taken as feature key points. Each key point has three basic parameters: position, scale and orientation. The scale indicates the scale-space layer of the key point, and the orientation gives the dominant gradient orientation of its neighborhood. The scale invariance of SIFT features can be used to match images at different scales, and the rotation invariance allows them to adapt to rotation transformations.

In this paper, we use the SIFT algorithm to extract features from facial images. Each key point can be expressed as:

$$ K_{i} = \left( {Pt_{i} ,s_{i} ,o_{i} } \right) $$
(1)

where \( Pt_{i} \) is the coordinate of the key point, \( s_{i} \) and \( o_{i} \) are the scale and orientation of the key point respectively.

After feature extraction, the key points need to be described. In [3], Lowe used a 128-dimensional descriptor to describe each key point. According to the location of the key point, a neighborhood image block is selected whose size is related to the scale of the key point. To realize rotation invariance, the image block is rotated to its main orientation \( o_{i} \). The block is then divided into 4 × 4 subblocks, and an 8-dimensional vector is calculated for each subblock. Finally, a vector with 4 × 4 × 8 = 128 dimensions is obtained. After normalization we get the SIFT descriptor of each key point, which is denoted by \( D_{i} \).
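As an illustration, a minimal sketch of key point extraction and description with OpenCV's SIFT implementation is given below; the input file name and the use of opencv-python are assumptions, not part of the original method.

```python
# Minimal sketch of SIFT key point extraction and description with OpenCV
# (assumes opencv-python >= 4.4, where SIFT is in the main module).
import cv2

img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input file

sift = cv2.SIFT_create()
# Each cv2.KeyPoint carries the parameters of Eq. (1):
#   kp.pt -> Pt_i (x, y), kp.size -> scale s_i, kp.angle -> orientation o_i
keypoints, descriptors = sift.detectAndCompute(img, None)

for kp, d in zip(keypoints[:3], descriptors[:3]):
    print(kp.pt, kp.size, kp.angle, d.shape)  # d is the 128-dim descriptor D_i
```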

The SIFT algorithm attracted widespread attention as soon as it was proposed and has been widely used in image reconstruction [4], image classification [5], image detection and recognition [6, 7] and image search [8, 9]. Feature-based image reconstruction was first proposed in [4]: Weinzaepfel et al. extracted image features with the SIFT algorithm and then matched them against a large image database. However, the reconstructed images suffered from serious distortion. Building on this idea, Yue et al. proposed a cloud-based image coding algorithm in [10]. A down-sampled image was transmitted to the decoder side as a guide for image reconstruction; at the decoder side, matching patches were retrieved from the cloud according to the feature matches, processed, and pasted onto the up-sampled image. This method achieves a high compression ratio as well as good restoration quality. However, as stated in [10], when the highly similar images are removed from the database, the reconstructed image becomes blurred. In 2014, Song et al. proposed a thumbnail-based image reconstruction method [11], which achieved better compression ratio and PSNR than JPEG.

3 The Proposed Coding Method

3.1 General Analysis

On the encoder side, the image is first decomposed by a multi-level wavelet transform, which produces the LF sub band and the HF sub bands. The LF sub band is encoded and transmitted to the decoder side. For the HF sub bands, we retain only some features and transmit the feature key point information to the decoder side. The key points of the HF sub bands are obtained as the difference set between the key points of the input image and the key points of the HF-missed inverse wavelet transform image. The coded bit stream consists of the coded LF sub band \( L_{en} \) and the coded key point information \( K_{en} \).

In the decoder, \( L_{en} \) and \( K_{en} \) are first decoded. An HF-missed inverse wavelet transform is then performed to obtain an image of the same size as the input image. The decoded key point information is used to calculate the feature descriptors on this inverse wavelet transform image. With these descriptors, SIFT matching is performed against a large image database. Each matched key point in the database corresponds to a small image patch, and these patches are used to restore the HF sub bands. When all the HF sub bands are restored, an HF-restored inverse wavelet transform yields the reconstructed image (Fig. 1).

Fig. 1. System framework of the proposed image coding scheme

3.2 Encoder

STEP 1 Wavelet transform

After the wavelet transform, the input image is decomposed into 4 sub band images: the low frequency sub band LL and the high frequency sub bands HL, LH and HH. The structure and color information are contained in the LF sub band, while the HF sub bands contain the texture and detail information. In order to obtain a higher compression ratio, the low frequency sub band can be decomposed further; for instance, \( LL_{n - 1} \) can be divided into \( LL_{n} \), \( LH_{n} \), \( HL_{n} \) and \( HH_{n} \). In our image coding scheme, we use a 2-level wavelet transform, which produces 1 low frequency sub band \( LL_{2} \) and 6 high frequency sub bands (\( HL_{2} \), \( LH_{2} \), \( HH_{2} \), \( LH_{1} \), \( HL_{1} \), \( HH_{1} \)). For short, we use L and H to denote the low and high frequency parts respectively.
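This decomposition can be sketched with PyWavelets as follows; the Haar basis and the PyWavelets API are assumptions, since the paper does not specify which wavelet is used.

```python
# Sketch of the 2-level wavelet decomposition using PyWavelets.
# The wavelet basis ('haar') is an assumption; the paper does not name one.
import cv2
import pywt

img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE).astype(float)

# coeffs = [LL2, (detail sub bands, level 2), (detail sub bands, level 1)]
# In pywt each per-level tuple holds (horizontal, vertical, diagonal) details,
# i.e. the LH/HL/HH sub bands of the text.
coeffs = pywt.wavedec2(img, wavelet="haar", level=2)
LL2, details2, details1 = coeffs
print(LL2.shape, [d.shape for d in details2], [d.shape for d in details1])
```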

STEP 2 HF key point extraction

The goal of this step is to find the effective information in the HF sub bands. The SIFT algorithm detects edges, corners and some texture well, so SIFT key points can describe the effective information in an image, and the effective information in the HF sub bands can therefore be described by HF key points. To obtain the HF key point information, we first extract the SIFT key points of the input image, denoted by \( K_{I} \). Meanwhile, a 2-level inverse wavelet transform is applied to \( LL_{2} \) with all HF sub bands set to zero. Each inverse transform level can be expressed as:

$$ I_{iwt} = \frac{1}{2}\sum\nolimits_{j,k} {WT_{f} \left( {j,k} \right) \cdot \psi_{j,k} \left( t \right)} $$
(2)

where \( \psi_{j,k} \left( t \right) \) is the wavelet basis. The resulting image is denoted by \( I_{L} \). Key points are then extracted from \( I_{L} \) and denoted by \( K_{L} \). Obviously, \( K_{L} \) contains only low frequency information. The difference set of \( K_{I} \) and \( K_{L} \) gives the HF key points, whose information can be described as:

$$ K_{H} = \left\{ {\left( {pt_{1} ,s_{1} ,o_{1} } \right),\left( {pt_{2} ,s_{2} ,o_{2} } \right), \ldots ,\left( {pt_{n} ,s_{n} ,o_{n} } \right)} \right\} $$
(3)

where n is the number of key points.
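A possible sketch of this difference-set extraction is given below, assuming PyWavelets and OpenCV; matching key points purely by their rounded coordinates is a simplification of the description above.

```python
# Sketch of HF key point extraction: K_H = K_I \ K_L, where K_L comes from an
# HF-missed reconstruction. Comparing key points only by coordinates is a
# simplifying assumption.
import cv2
import numpy as np
import pywt

img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE).astype(float)

coeffs = pywt.wavedec2(img, "haar", level=2)
zeroed = [coeffs[0]] + [tuple(np.zeros_like(d) for d in lvl) for lvl in coeffs[1:]]
I_L = pywt.waverec2(zeroed, "haar")          # HF-missed inverse transform
I_L = np.clip(I_L, 0, 255).astype(np.uint8)

sift = cv2.SIFT_create()
K_I = sift.detect(img.astype(np.uint8), None)
K_L = sift.detect(I_L, None)

low_pts = {(round(k.pt[0]), round(k.pt[1])) for k in K_L}
K_H = [k for k in K_I if (round(k.pt[0]), round(k.pt[1])) not in low_pts]
print(len(K_I), len(K_L), len(K_H))
```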

STEP 3 LF sub band encoding

Since the low frequency image contains most of the color and intensity information of the input image, a low-distortion coding algorithm such as JPEG2000 or HEVC can be used to encode the LF sub band. In our coding scheme, we use HEVC to encode the low frequency image \( LL_{2} \). The encoded LF sub band is denoted by \( LL_{en} \).

STEP 4 Feature encoding

The locations of the key points are represented by a binary matrix M with the same size as the input image. If there is a feature key point at (x, y), \( M_{xy} \) is set to 1; otherwise \( M_{xy} \) is set to 0. The binary matrix M is encoded using binary run-length coding. Each SIFT key point is then encoded with 14 bits: 7 bits for the scale and 7 bits for the orientation. The encoded feature points are denoted by \( K_{en} = \left\{ {Pt_{en} ,S_{en} ,O_{en} } \right\} \). The encoded bit stream consists of \( LL_{en} \) and \( K_{en} \).
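A minimal sketch of this feature encoding is given below; the 7-bit quantization ranges for scale and orientation are assumptions, since the paper only fixes the bit budget.

```python
# Sketch of the feature encoding step. The 7-bit quantization ranges for scale
# and orientation are assumptions; the paper only states 7 + 7 = 14 bits.
import numpy as np

def encode_positions(mask):
    """Binary run-length coding of the flattened key point mask M."""
    flat = mask.ravel()
    change = np.flatnonzero(np.diff(flat)) + 1
    runs = np.diff(np.concatenate(([0], change, [flat.size])))
    return flat[0], runs          # first symbol + run lengths

def encode_keypoint(scale, orientation, max_scale=8.0):
    s = min(int(scale / max_scale * 127), 127)        # 7 bits for the scale
    o = int((orientation % 360.0) / 360.0 * 127)      # 7 bits for the orientation
    return (s << 7) | o                               # 14 bits per key point

mask = np.zeros((1200, 900), dtype=np.uint8)
mask[300, 450] = 1                                    # a hypothetical key point
first, runs = encode_positions(mask)
code = encode_keypoint(scale=2.7, orientation=213.5)
print(first, len(runs), format(code, "014b"))
```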

3.3 Decoder

STEP 1 Bit stream decoding

The coded bit stream is first split into two parts, \( LL_{en} \) and \( K_{en} \). The HEVC decoder is used for low frequency decoding to obtain \( LL_{\text{de}} \), and \( Pt_{de} \), \( S_{de} \) and \( O_{de} \) are obtained by the corresponding key point decoding.

STEP 2 Feature descriptor calculation

After obtaining the location, scale and orientation of all the key points, we use these data to calculate the SIFT descriptor of each key point for matching against the image database. We apply a 2-level HF-missed inverse wavelet transform to \( LL_{\text{de}} \) and use \( Pt_{de} \), \( S_{de} \) and \( O_{de} \) to calculate the SIFT descriptors on the IWT image \( I_{Lde} \). We thus obtain the descriptors \( D_{de} = \left\{ {d_{1} ,d_{2} , \ldots ,d_{n} } \right\} \).
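This descriptor calculation can be sketched with OpenCV, assuming the decoded triples are rebuilt into cv2.KeyPoint objects:

```python
# Sketch of STEP 2: rebuilding key points from the decoded (Pt, S, O) triples
# and computing their SIFT descriptors on the HF-missed IWT image I_Lde.
import cv2

def descriptors_from_decoded(I_Lde, pts, scales, orients):
    kps = [cv2.KeyPoint(float(x), float(y), float(s), float(o))
           for (x, y), s, o in zip(pts, scales, orients)]
    sift = cv2.SIFT_create()
    kps, D_de = sift.compute(I_Lde, kps)   # descriptors for the given key points
    return D_de
```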

STEP 3 SIFT Key point matching

A large image database is available at the decoder. The images in the database are processed in the same way as the input images. The database images provide a large number of features, forming a key point set \( K^{D} \) with a corresponding descriptor set \( D^{D} = \left\{ {d_{1}^{D} ,d_{2}^{D} , \ldots ,d_{m}^{D} } \right\} \). For a given HF key point \( k_{i} \), we use its descriptor \( d_{i} \) for matching: if the Euclidean distance between \( d_{i} \) and \( d_{j}^{D} \) is the minimum over all database key point descriptors, \( k_{i} \) and \( k_{j}^{D} \) form a matching pair, denoted by \( M_{i} \).
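A brute-force sketch of this nearest-neighbour matching, assuming the descriptors are stored as NumPy arrays, is:

```python
# Sketch of STEP 3: for each decoded descriptor d_i, find the database
# descriptor with the minimum Euclidean distance (brute-force search).
import numpy as np

def match_keypoints(D_de, D_db):
    """D_de: (n, 128) decoded descriptors, D_db: (m, 128) database descriptors.
    Returns, for each d_i, the index of its match and the distance."""
    dists = np.linalg.norm(D_de[:, None, :] - D_db[None, :, :], axis=2)
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(D_de)), idx]
```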

STEP 4 Optimizing and Clustering

In the encoder, the HF key point set \( K_{H} \) is the difference set of \( K_{I} \) and \( K_{L} \). In the decoder, the SIFT descriptors are calculated on \( I_{Lde} \), so errors may occur in the descriptor calculation and hence in the matching process. Therefore we do not use all the matching points to restore the HF sub bands. A threshold value

$$ Rs = 0.3 \times max\left\{ {dist\left( {k_{i} ,M_{i} } \right)} \right\} $$
(4)

is used to reduce mismatching points.

where \( \left\{ {dist\left( {k_{i} ,M_{i} } \right)} \right\} \) is the set of Euclidean distances between the descriptors of \( k_{i} \) and \( M_{i} \). For a key point \( k_{i} \), if the Euclidean distance between \( k_{i} \) and its matching point \( M_{i} \) is smaller than \( Rs \), we regard \( M_{i} \) as a good match and the matching pair is kept; otherwise we regard \( M_{i} \) as a mismatch and the pair is discarded, as sketched below.
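```python
# Sketch of the match-pruning rule of Eq. (4): keep a pair only if its
# descriptor distance is below Rs = 0.3 * max distance. Reuses the indices and
# distances returned by match_keypoints() above.
import numpy as np

def prune_matches(match_idx, match_dist):
    Rs = 0.3 * match_dist.max()
    keep = match_dist < Rs
    return match_idx[keep], np.flatnonzero(keep)
```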

After this pruning, the features need to be clustered. One reason is that clustering effectively prevents the image patches from overlapping each other too closely when they are pasted onto the HF sub bands. The other reason is that the transformation matrix can be calculated more precisely, since a separate transformation matrix can be computed for each cluster. We use the K-means algorithm for clustering, which can be expressed as:

$$ Cluster_{mk} = \left\{ {d_{kj} = \min \sum\nolimits_{j = 1}^{K} {\sum\nolimits_{i = 1}^{J} {\left\| {pt_{i} - \mu_{j} } \right\|^{2} } } } \right\} $$
(5)

where m is the index of a database image and \( \mu_{j} \) is the coordinate of the j-th cluster center.
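A sketch of this clustering with scikit-learn is given below; the number of clusters K is an assumption, as the paper does not state how it is chosen.

```python
# Sketch of the clustering step: K-means on the coordinates of the kept
# matches. The number of clusters K is an assumption.
from sklearn.cluster import KMeans

def cluster_keypoints(pts, K=4):
    """pts: (n, 2) array of key point coordinates. Returns cluster labels
    and the cluster centres mu_j of Eq. (5)."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pts)
    return km.labels_, km.cluster_centers_
```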

STEP 5 HF sub bands restoration

When the clustering is complete, for each cluster we use the RANSAC algorithm [12] to remove remaining mismatched points and calculate the transformation matrix from the remaining matching key points. The transformation matrix is then applied to all points in the cluster: the corresponding patches are perspective-transformed and pasted onto the HF restoration sub bands, giving the restored HF sub band images \( H_{r} \).
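For a single cluster, this step can be sketched with OpenCV as follows; the RANSAC reprojection threshold and the handling of the patch geometry are assumptions.

```python
# Sketch of STEP 5 for one cluster: RANSAC removes the remaining outliers and
# yields a perspective transformation, which is then used to warp a database
# patch before pasting it into the HF restoration sub band.
import cv2
import numpy as np

def restore_cluster(src_pts, dst_pts, db_patch, out_shape):
    """src_pts/dst_pts: (n, 2) matched coordinates (database -> sub band)."""
    H, inliers = cv2.findHomography(src_pts.astype(np.float32),
                                    dst_pts.astype(np.float32),
                                    cv2.RANSAC, 5.0)
    # Warp the database patch into the geometry of the HF sub band.
    return cv2.warpPerspective(db_patch, H, (out_shape[1], out_shape[0]))
```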

STEP 6 Wavelet inverse transform

A 2-level inverse wavelet transform is applied in this step. In the first level, \( LL_{1de} \) is obtained from \( LL_{de} \) and \( H_{r2} \). Similarly, the final reconstructed image is produced by the second transform level.

4 Experimental Results

The facial database Utrecht ECVP [13] is used as our experimental image database. It contains 131 images of 49 men and 20 women, whose expressions are mostly neutral or smiling. The resolution is 900 \( \times \) 1200 for all images, and each person has 1–3 pictures. The sense organs and haircuts differ between persons. We pick 6 images as input images, and each input image is removed from the database before coding. After obtaining the reconstructed images, we compare them with HEVC and JPEG2000. The input images are shown in Fig. 2.

Fig. 2. Input images, marked "a" to "f" from left to right.

Basic information about the input images is shown in Table 1.

Table 1. Basic information of input images

Taking image b as an example, the reconstructed image is shown in Fig. 3. The comparisons of compression ratio and image quality are shown in Tables 2 and 3.

Fig. 3. The reconstructed results of image b. In row (a), images from left to right: input image, proposed algorithm, HEVC, JPEG2000. Row (b) shows the corresponding details, from left to right: proposed algorithm, HEVC, JPEG2000.

Table 2. Comparison of compression ratio
Table 3. Comparison of image quality

From Fig. 3 we can see that our coding scheme gives a satisfactory subjective visual result. At high compression ratios, details of HEVC-coded images become blurred, while JPEG2000 shows blocking artifacts.

Table 2 compares the compression ratios of the proposed algorithm, HEVC and JPEG2000. In our experiment, when encoding the LF sub band we set the quantization parameter to 31; the compression ratio of the proposed algorithm is shown in the first line of the table. For HEVC, we set its compression ratio close to that of the proposed algorithm, and the corresponding PSNR and SSIM are shown in Table 3. Compared with HEVC, when the compression ratio of the proposed algorithm is close to or higher than that of HEVC, the proposed algorithm achieves higher PSNR and SSIM. For JPEG2000, we set it to the highest compression ratio it can reach. Even when the compression ratio of the proposed algorithm is much higher than that of JPEG2000, its PSNR and SSIM are equal to or better than those of JPEG2000.

The experiments show that the proposed coding method performs well in both compression ratio and the quality of the reconstructed image. Further optimization of the feature detection step will be carried out to improve the coding performance.

5 Conclusions

This paper proposed a novel facial image coding scheme in the transform domain based on SIFT. The input image is first decomposed into a low frequency sub band image and high frequency sub band images by a wavelet transform. At the encoder side, the high frequency part is encoded according to SIFT descriptors rather than pixel values. SIFT matching is used at the decoder side to restore the high frequency images, and the final reconstructed image is obtained by an inverse wavelet transform. Experimental results show the high compression ratio of our coding scheme. However, the processing of edge information still needs to be improved.