
1 Introduction

With the widespread use of networks and mobile devices such as mobile phones, the number of images is growing rapidly, and a large number of images are uploaded to social networks every day. To help people exhibit images of higher aesthetic quality and to explore the aesthetic cognitive ability of computers, image aesthetic quality evaluation is becoming increasingly important. Image aesthetic quality evaluation aims to classify images as having high or low aesthetic quality. As shown in Fig. 1, high aesthetic quality images provide a more comfortable visual experience than low aesthetic quality images.

Fig. 1. The first row shows high aesthetic quality images; the second row shows low aesthetic quality images

Over the past decade, enabling computers to distinguish image aesthetic quality automatically among massive numbers of images has become a major research direction. Approaches to image aesthetic quality evaluation can be divided into traditional hand-crafted feature approaches [3, 4, 15, 21, 22, 24, 26, 32, 33] and deep learning (CNN) approaches [5, 8, 20, 23, 30, 31, 38].

Traditional hand-crafted approaches target objective factors that affect image aesthetic quality. Researchers have extracted many visual features, including low-level image statistics, such as edge distributions and color histograms, and high-level photographic rules, such as the rule of thirds and the golden ratio. In [3], Datta et al. proposed 56 global features covering structure, color, light and so on, and used linear regression to quantify image aesthetics. In [15], Ke et al. proposed a principled approach to designing high-level global features: they used the perceptual factors that distinguish professional photos from snapshots to design high-level global semantic features and measure perceptual differences. In [22], Luo et al. began to use local features. They extracted salient regions from a photo based on professional photography techniques and formulated a number of high-level semantic features based on the quotient of the salient region and the background; the classification rate of their method was 93%. In [21], Luo et al. proposed three classes of local features and two classes of global features to automatically classify image aesthetic quality, and used receiver operating characteristic (ROC) curves to show that these features are effective for image aesthetic quality evaluation. Shao et al. [26] used Gabor wavelet transforms, class imbalance handling and total scene understanding to extract the subject and local features of different image categories, while using hue histograms and color pies to extract global color features. Dhar et al. [4] mainly analyzed high-level features that reflect image aesthetic quality, proposing three classes of features: structure, content and light; they also used this approach to predict the interestingness of images. In [24], Obrador et al. focused on structural features, proposing 55 structure features to evaluate image aesthetics, with classification accuracy close to the benchmark. In [32, 33], Wang et al. proposed 41 features covering structure, color, light, and global and local information to evaluate image aesthetic quality. Experimental results showed that these 41 features achieve high classification accuracy for image aesthetic quality evaluation.

In 2006, Hinton et al. [10] revived the research direction of deep learning, and in the following years CNNs achieved excellent performance. Motivated by their excellent performance in feature extraction and autonomous learning, CNNs are widely employed in computer vision and image processing, such as handwritten digit recognition [17], the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 16, 27, 29], object detection and semantic segmentation [7, 19, 25], face recognition [28, 37], emotion recognition [2, 18, 35], image style recognition and transfer [6, 13], and sentiment analysis [36]. Researchers are also applying CNNs to aesthetic computation and connotation exploration [5, 8, 20, 23, 30, 31, 38]. In [20], Lu et al. proposed a novel double-column CNN trained on a global view and a fine-grained view; their approach produced results roughly 10% better than those achieved by hand-crafted features on the AVA dataset [23]. Guo et al. [8] proposed PDCNN, a paralleled CNN architecture, whose results were better than those of [21] on Photo Quality; PDCNN can also mitigate over-fitting and under-fitting. In [38], Zhou et al. used N-grams to describe image text features and used an SVM to learn the weights of the N-grams; they then joined the text features extracted by the N-grams with the visual features extracted by a CNN to achieve excellent classification accuracy on the AVA dataset [23]. In addition, [5, 30, 31] also use CNNs to evaluate image aesthetic quality.

At present, the main problem of using CNNs in many vision tasks is that the available data are small in scale compared with the large-scale ImageNet dataset, which contains millions of images. This problem also affects image aesthetic quality evaluation. The most effective current remedies are data augmentation and fine-tuning. Fine-tuning from models pre-trained on ImageNet has been found to yield state-of-the-art performance on many vision tasks, such as visual tracking [34], action recognition [14], object recognition [25] and human pose estimation [19]. Therefore, this paper mainly uses fine-tuning to evaluate image aesthetic quality. The main innovations and contributions of this paper can be summarized as follows:

  1. Image contents are employed to further boost the classification performance. We use fine-tuning based on image content to train the aesthetic quality classification models and analyze the influence of image content on aesthetic quality classification.

  2. Embedded fine-tune is proposed to solve the problem of the training data becoming smaller due to the restriction to a single image content, and to further boost classification performance. Embedded fine-tune performs fine-tuning twice, with different training samples in each stage. The experimental results show that embedded fine-tune improves the image aesthetic quality classification accuracy.

  3. The classification probability output by the trained CNN models is used to give a specific score to image aesthetic quality, going beyond the setting in which images only carry a high or low aesthetic label.

2 Related Work

2.1 Alexnet and VGG_S Architectures

In a CNN, feature learning is unified with classifier training on RGB images. As shown in Fig. 2 and Table 1, Alexnet [16] was the champion of ILSVRC2012. It has five convolution layers, two pooling layers and two fully connected layers. The input image is split into R, G and B channels, and each channel is resized to \(256\times 256\). The Rectified Linear Unit (ReLU) is used as the activation function in Alexnet, and Local Response Normalization (LRN) following ReLU aids generalization.

Fig. 2. The parameters and architecture of Alexnet

Table 1. The comparison between Alexnet and VGG_S

Data augmentation and dropout are used to reduce over-fitting in Alexnet. The first data augmentation method crops each \(256\times 256\) input image into five \(227\times 227\) patches (the four corners and the center) and then mirrors them horizontally, giving ten patches in total; the network's softmax predictions over these ten patches are averaged. The second data augmentation method alters the intensities of the R, G and B channels of the training images. Specifically, PCA is performed on the set of RGB pixel values, and multiples of the found principal components, scaled by a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1, are added to each training image.
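For illustration, the following Python sketch shows the ten-crop test-time averaging described above, using PyTorch/torchvision rather than the Caffe setup used in this paper; `model` is assumed to be any trained classifier returning logits.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Ten-crop augmentation: four corners + center, plus horizontal mirrors.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.TenCrop(227),  # returns a tuple of 10 crops
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

def predict_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)            # shape: (10, 3, 227, 227)
    with torch.no_grad():
        logits = model(crops)              # shape: (10, num_classes)
        probs = F.softmax(logits, dim=1)
    return probs.mean(dim=0)               # average over the ten patches
```

The PCA color augmentation is implemented analogously by adding the scaled principal components of the RGB pixel values to each training image.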

In [11], Hinton et al. proposed dropout. With a dropout ratio of 0.5, the output of each hidden neuron is set to zero with probability 0.5. The neurons that are "dropped out" in this way contribute to neither the forward propagation nor the back-propagation.
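A minimal numpy sketch of this behavior, with the test-time rescaling used in [16]; this is an illustrative toy, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p=0.5, train=True):
    """Zero each activation with probability p during training;
    at test time, scale activations by (1 - p) instead."""
    if train:
        mask = rng.random(x.shape) >= p   # keep with probability 1 - p
        return x * mask                   # dropped units send no signal
    return x * (1.0 - p)                  # match expected training output

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_forward(h, train=True))     # some entries zeroed at random
print(dropout_forward(h, train=False))    # all entries halved
```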

In [1], Chatfield et al. proposed VGG_S, which also has five convolution layers, three pooling layers and two fully connected layers. The differences are that the extracted patch size is changed from \(227\times 227\) to \(224\times 224\), and the kernel size and stride of the first convolution layer are smaller. With these changes, the error rate of VGG_S on ImageNet is lower than that of Alexnet.

2.2 Fine-Tune from ImageNet

As Fig. 3 shows, fine-tuning takes a well pre-trained model, adopts the same architecture, and retrains starting from the pre-trained weights. It realizes a kind of semantic transfer from the general to the specific. When fine-tuning, the learning rate of the last layer of the CNN is set higher than that of the other layers, because this layer starts from random initialization while the others are already trained, and its number of outputs is set to match the target task. With the help of excellent existing models, fine-tuning saves considerable resources in new research.
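The following is a hedged PyTorch sketch of this setup (the paper's experiments use Caffe); the 10x learning rate ratio between the new layer and the pre-trained layers is a common convention and an assumption here, as the paper does not state its exact value.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from ImageNet-pretrained weights and replace the final layer
# so its output size matches the target task (here: 2 aesthetic classes).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 2)   # re-initialized at random

# Give the new, randomly initialized layer a higher learning rate than
# the already-trained layers.
new_params = list(model.classifier[6].parameters())
new_ids = {id(p) for p in new_params}
base_params = [p for p in model.parameters() if id(p) not in new_ids]

optimizer = optim.SGD([
    {"params": base_params, "lr": 1e-4},   # pre-trained layers
    {"params": new_params, "lr": 1e-3},    # new last layer: 10x higher
], momentum=0.9, weight_decay=5e-4)
```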

Fig. 3. Fine-tune from the large-scale dataset ImageNet

3 The Proposed Approaches

3.1 Comparing Alexnet with VGG_S

To determine whether Alexnet or VGG_S is more suitable for the image aesthetic quality evaluation task, and to boost classification performance, we proceed as follows. First, we mix the Animal, Architecture, Human, Landscape, Night, Plant and Static images of the dataset together as training samples, divided into high and low aesthetic quality. Second, we train Alexnet_All and VGG_S_All on all training samples. Third, we obtain Alexnet_FT_All and VGG_S_FT_All by fine-tuning from Alexnet_Model and VGG_S_Model, which are trained on ImageNet. Finally, we test these four models with the test samples of each content category and determine which architecture is more suitable for the image aesthetic quality classification task. The procedure for comparing Alexnet and VGG_S is described in Algorithm 1.

Algorithm 1. Comparing Alexnet with VGG_S
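Since Algorithm 1 is available only as a figure, the following hedged Python sketch outlines its evaluation step: each of the four trained models is tested on the per-content test sets. The names `models` and `test_loaders` are hypothetical placeholders.

```python
import torch

CONTENTS = ["Animal", "Architecture", "Human", "Landscape",
            "Night", "Plant", "Static"]

def accuracy(model, loader):
    """Fraction of correctly classified high/low aesthetic labels."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# models: {"Alexnet_All": ..., "VGG_S_All": ...,
#          "Alexnet_FT_All": ..., "VGG_S_FT_All": ...}
# test_loaders: one DataLoader per content category.
for name, model in models.items():
    for content in CONTENTS:
        print(f"{name} on {content}: {accuracy(model, test_loaders[content]):.4f}")
```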

3.2 Training Models Based on Image Contents

To further improve the classification performance of image aesthetic quality evaluation, as Fig. 4 shows, we fine-tune seven models from Alexnet_Model or VGG_S_Model, one on the training samples of each of the seven content categories. The inner box shows the training of Alexnet_Model or VGG_S_Model, and the outer box shows the training of the content-based models. The CNN architecture used in training is whichever of Alexnet and VGG_S performs better. We use CNN_i to denote the seven trained models, where i represents Animal, Architecture, Human, Landscape, Night, Plant or Static, and CNN represents Alexnet or VGG_S. We test each model with the testing samples of the same content, as sketched below.
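A hedged sketch of this per-content training loop; `load_pretrained`, `finetune` and `train_loaders` are hypothetical helpers standing in for the actual Caffe workflow.

```python
CONTENTS = ["Animal", "Architecture", "Human", "Landscape",
            "Night", "Plant", "Static"]

content_models = {}
for content in CONTENTS:
    # Start every content model from the same ImageNet-pretrained weights.
    model = load_pretrained("VGG_S_Model")          # hypothetical loader
    # Fine-tune only on the high/low aesthetic samples of this content.
    content_models[content] = finetune(model, train_loaders[content])

# Each model is then tested on the test split of its own content category.
```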

Fig. 4.
figure 4

Train models based on the image content

Fig. 5.
figure 5

Use embedded fine-tune to train models

3.3 Using Embedded Fine-Tune to Train Models

When we train the aesthetic quality classification models based on image content, the training set becomes smaller, so we propose embedded fine-tune. As Fig. 5 shows, the inner dashed box shows the training process of Alexnet_FT_All or VGG_S_FT_All. The outer dashed box shows fine-tuning from Alexnet_FT_All or VGG_S_FT_All with the training samples of each content category. We use EFCNN_i to denote the seven trained models, where i again represents Animal, Architecture, Human, Landscape, Night, Plant or Static, and EFCNN stands for Embedded Fine-tune Convolution Neural Network. The difference between the two fine-tuning stages is that the first fine-tune uses only the aesthetic quality labels, while the second fine-tune uses both the aesthetic quality labels and the content labels. The algorithm is as follows:

Algorithm 2. Embedded fine-tune
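Since Algorithm 2 is shown only as a figure, here is a hedged Python sketch of the two-stage structure; `load_pretrained`, `finetune` and the loaders are hypothetical names, and `CONTENTS` is the list of seven categories defined above.

```python
# Stage 1: fine-tune from ImageNet weights on ALL aesthetic training
# samples (binary high/low labels only) -> VGG_S_FT_All.
base = load_pretrained("VGG_S_Model")               # ImageNet weights
vggs_ft_all = finetune(base, all_aesthetic_loader)

# Stage 2: fine-tune AGAIN, starting from VGG_S_FT_All, on the samples
# of a single content category (aesthetic + content labels) -> EFCNN_i.
efcnn = {}
for content in CONTENTS:
    efcnn[content] = finetune(vggs_ft_all, train_loaders[content])
```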

3.4 Image Aesthetic Quality Evaluation

The traditional approach to image aesthetic quality evaluation only divides images into high or low aesthetic quality. We instead score images to make the evaluation more specific. The score is calculated from the output probability of the trained models: we compute the probability with the final softmax layer and use this probability to score the image aesthetic quality.
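A minimal sketch of this scoring step in Python, assuming a trained binary model that returns logits; the class ordering is an assumption.

```python
import torch
import torch.nn.functional as F

def aesthetic_score(model, image_tensor):
    """Return (high_prob, low_prob) as a soft aesthetic score in [0, 1]."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))   # shape: (1, 2)
        probs = F.softmax(logits, dim=1)[0]
    high, low = probs[1].item(), probs[0].item()    # class order assumed
    return high, low

# An image is labeled high aesthetic quality when high > low;
# the probabilities themselves serve as the fine-grained score.
```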

4 Experiment Results and Analysis

4.1 Photo Quality Dataset and Experiment Platform

The images used in our experiments come from an image dataset named Photo Quality. Photo Quality, built by [21], is a small-scale dataset for photo aesthetic quality evaluation. Its images come from a website whose photos are taken by professional and amateur photographers. A total of 17,613 photos are labeled as high or low aesthetic quality when at least eight of ten observers make the same judgment. According to their content, all photos are divided into seven categories, named Animal, Architecture, Human, Landscape, Night, Plant and Static. As Table 2 shows, we filtered the dataset again and finally chose 15,562 images.

Table 2. Number of high quality and low quality images of seven contents in Photo Quality

We use Caffe, a professional deep learning platform developed by Jia [12], as our experiment platform. Our GPU is a GTX 1070 with 8 GB of memory. To ensure the generality of the experimental results, we use three-fold cross-validation.
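For reference, a minimal sketch of a three-fold protocol, using scikit-learn's KFold; this tooling is an assumption, as the paper does not name its splitting procedure, and `train_model` and `evaluate` are hypothetical helpers.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(num_images)                 # one index per image
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in kfold.split(indices):
    model = train_model(indices[train_idx])     # hypothetical helpers
    fold_accuracies.append(evaluate(model, indices[test_idx]))

# The reported accuracy is the mean over the three folds.
print(np.mean(fold_accuracies))
```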

Table 3. The comparison results between Alexnet and VGG_S

4.2 The Comparison Results Between Alexnet and VGG_S

For each content category, we randomly select 300 low aesthetic quality images and 100 high aesthetic quality images from Photo Quality as testing samples; the remaining images are training samples. We expand the training samples with the data augmentation methods of Alexnet. As Table 3 shows, the classification results of Alexnet_FT_All and VGG_S_FT_All are better than those of Alexnet_All and VGG_S_All, and the classification accuracy of VGG_S_FT_All is on average 0.87% higher than that of Alexnet_FT_All. These comparison results show that fine-tuning improves the classification accuracy and that VGG_S is more suitable for image aesthetic quality evaluation, with or without fine-tuning. We therefore use VGG_S in the following experiments.

Table 4. The results of image aesthetic quality evaluation based on image contents

Fig. 6. Confusion matrix of image aesthetic quality evaluation based on image content

4.3 The Results Based on the Image Contents

We use the same training samples, testing samples and data augmentation as in the experiment comparing Alexnet with VGG_S. Table 4 compares the results with and without using image contents. VGG_S_FT_i is trained with the VGG_S architecture, where i represents Animal, Architecture, Human, Landscape, Night, Plant or Static. The results show that the classification accuracy for each image content is higher than that of VGG_S_FT_All, which proves that training models based on image content boosts classification performance. Although the training set size is reduced, the quality of the training samples is higher, so the classification accuracy is improved.

We also build a confusion matrix to show that image content is an important factor in image aesthetic quality evaluation. As Fig. 6 shows, the highest classification accuracies of the matrix lie on its diagonal, while the off-diagonal entries are much lower. Training models on one image content and testing them on images of the same content improves the classification accuracy. This proves both the importance of image content in the training process and the necessity of classifying images by content before applying the models.

4.4 The Results of Embedded Fine-Tune

Embedded fine-tune is proposed based on the conclusions and problems of the first two experiments, and its training samples, testing samples and data augmentation are the same as in those experiments. We compare the results of embedded fine-tune with the first two experiments, and then with existing traditional and CNN approaches.

As Fig. 7 and Table 5 show, the classification accuracy of EFCNN_i is on average 0.88% higher than that of VGG_S_FT_i; EFCNN_Plant shows the largest gain, with a classification accuracy 1.71% higher than VGG_S_FT_Plant. The classification accuracy of EFCNN_i is on average 2.08% higher than that of VGG_S_FT_All; EFCNN_Landscape shows the largest gain, 4.2% higher than VGG_S_FT_All. Embedded fine-tune improves classification performance because the second fine-tune is a binary classification problem of the same form as the image aesthetic quality classification task. Moreover, image contents are incorporated into embedded fine-tune, and embedded fine-tune makes full use of the small-scale dataset. In Fig. 8, we visualize the first convolution layer to illustrate why embedded fine-tune improves the classification accuracy: the textures of the features extracted with embedded fine-tune are clearer than those extracted without it.
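A hedged sketch of such a first-layer visualization, using PyTorch and matplotlib as stand-ins for the paper's Caffe tooling; `model` is assumed to expose its first convolution as `features[0]`.

```python
import matplotlib.pyplot as plt

# Grab the first convolution layer's kernels: (out_channels, 3, k, k).
weights = model.features[0].weight.data.cpu()

# Normalize each kernel to [0, 1] so it can be shown as an RGB patch.
w_min, w_max = weights.min(), weights.max()
weights = (weights - w_min) / (w_max - w_min)

fig, axes = plt.subplots(8, 12, figsize=(12, 8))  # grid for up to 96 kernels
for ax in axes.flat:
    ax.axis("off")
for ax, kernel in zip(axes.flat, weights):
    ax.imshow(kernel.permute(1, 2, 0))            # (k, k, 3) for imshow
plt.show()
```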

Fig. 7. The histogram of the comparison results

Table 5. The comparison results between embedded fine-tune and traditional and CNN approaches

Traditional approaches are more complex, and they depend on features designed for a particular dataset. The per-content classification accuracies of embedded fine-tune are much better than the results of [3, 15, 21, 22, 32, 33]. The gap between traditional approaches and CNN approaches is growing ever wider.

Fig. 8. The visualization of the first convolution layer

Fig. 9. The scores of image aesthetic quality. The two columns on the left are high aesthetic quality images and the two columns on the right are low aesthetic quality images (Color figure online)

We also compare embedded fine-tune with existing CNN approaches. DVGG_S_AD_i (Double VGG_S Adaption) doubles the VGG_S network and uses the parallel-network adaption method proposed by Lu [20], where i represents Animal, Architecture, Human, Landscape, Night, Plant or Static; PDCNN (Parallel Deep Convolution Neural Network) is proposed by Guo [8].

The classification accuracy of EFCNN_i is on average 2.33% higher than that of PDCNN; EFCNN_Static shows the largest gain, 3.62% higher than PDCNN. The classification accuracy of EFCNN_i is on average 3.35% higher than that of DVGG_S_AD_i; EFCNN_Plant shows the largest gain, 5.5% higher than DVGG_S_AD_Plant.

PDCNN and DVGG_S_AD_i train models with parallel network architectures. The features that a paralleled CNN architecture can extract from a small-scale dataset are also limited, because the data scale is too small to offer more features unless the dataset is expanded. Fine-tuning from a model pre-trained on a large-scale dataset can obtain more useful features and transfer them to the target task. Embedded fine-tune obtains features both from the large-scale dataset and from the target task dataset, so it can break through the limitation of small-scale data.

4.5 Image Aesthetic Quality Evaluation

We use the classification probability of the trained EFCNN_i models to score image aesthetic quality. The score represents the degree of high or low aesthetic quality. As Fig. 9 shows, blue represents the high aesthetic quality probability and orange the low. If the high aesthetic quality score is higher than the low one, the image is a high aesthetic quality image; otherwise it is a low aesthetic quality image.

5 Conclusions

This paper analyzes the effect of image content on image aesthetic quality evaluation. We propose embedded fine-tune to solve two problems that arise when training image aesthetic quality classification models: the training data become smaller, and a single fine-tune cannot make full use of all the data of a small-scale dataset. The experimental results show that embedded fine-tune addresses the small-scale data problem and boosts image aesthetic quality evaluation performance. Finally, the classification probability is used to evaluate image aesthetic quality, which makes the evaluation more specific.