1 Introduction

With the rapid development of social networks, people increasingly express themselves on the Internet through text accompanied by images or videos. The Internet has therefore become an important source for opinion mining, affective computing, and emotion analysis. Driven by this development, text-based emotion analysis has made great progress [5,6,7], while visual emotion analysis has lagged far behind. In recent years, visual content has become increasingly prevalent in social networks, human-computer interaction, and related areas, making visual emotion analysis one of the hottest research topics [8,9,10].

Human emotion is a complex phenomenon, and several emotion models have been proposed; their emotion categories are compared in Table 1. The most popular is Plutchik's Wheel of Emotions [11], in which emotions are organized into eight basic categories: joy, trust, anticipation, anger, sadness, fear, disgust, and surprise, each with three levels of emotional valence. Mikels et al. [12] also divide emotions into eight categories but replace joy, trust, anticipation, and surprise in Plutchik's model with amusement, contentment, excitement, and awe. Ekman conducted extensive cross-cultural comparative studies around the world and found that people with different cultural backgrounds largely agree on six emotions: happiness, anger, sadness, fear, disgust, and surprise. Based on these findings, he proposed Ekman's facial expression system [13], a more universal emotion model built on these six emotions. However, Ekman's model is heavily skewed toward negative sentiment, since only "happiness" is positive while the other five emotions are negative. To address this imbalance, Xu et al. [2] added "like" to Ekman's model to express positive emotion more exhaustively. These seven emotions are fundamentally consistent with the traditional Chinese notion of the "seven emotions". Furthermore, Xu et al. constructed an emotion ontology consisting of Chinese words corresponding to the seven emotions. Most existing research on Chinese text emotion analysis [14,15,16] uses this emotion model and ontology.

Table 1. Popular emotion models and their emotion categories
Fig. 1. Example images of the seven emotions in CH-EmoD.

Image datasets are of great importance for image emotion analysis, and several are already available. Lang [17] built the IAPS-Subset dataset based on Mikels's emotion model, and ArtPhoto consists of photos taken by professional artists [18]. These two datasets contain only hundreds of images each, which is rather small in the era of big data. Based on Plutchik's Wheel of Emotions, Borth et al. [19] built a large-scale dataset called SentiBank, in which more than 450,000 images were crawled from Flickr using Adjective Noun Pairs (ANPs), with emotion labels assigned by the ANPs. Using the same emotion model, Jou et al. [1] set up a large-scale multilingual visual sentiment ontology and released more than 7.36 million images together with their metadata; this dataset explicitly targets different cultures and covers 12 languages. You et al. [20] queried the Flickr and Instagram image search engines using Mikels's eight emotion categories as keywords and established a dataset of 23,308 images. These datasets are frequently used in emotion/sentiment classification, and they are built on Mikels's or Plutchik's models. As Jou et al. [1] argue, Western and Eastern emotion expressions differ considerably due to cultural differences. Xu's emotion model [2] is widely used in Chinese text emotion analysis, and its authors released a Chinese Emotion Ontology library for this purpose. An image dataset built on Xu's model would therefore benefit Chinese emotion analysis and the emotion matching of Chinese text and images.

In the era of big data, large-scale image data is required. Where can we obtain a large number of images with emotion labels? Social networks host a vast number of images that can serve as a resource for our dataset, but how can we obtain emotion labels for them? Inspired by SentiBank [19], we also collect images and labels from visual social networks. Flickr is a popular visual social network with the following characteristics: (1) it hosts a large number of images freely available to the public; (2) its images carry rich metadata, such as tags and description texts, which help attach each image to the corresponding emotion label; (3) a large number of images with Chinese tags and text are shared on Flickr every day.

Motivated by the above, we establish an image dataset for emotion analysis by collecting images from Flickr using the Chinese emotion ontology of Xu's model. Since Xu's model is mainly used for Chinese, the dataset is named CH-EmoD; some example images are shown in Fig. 1. First, the emotion keywords of the Chinese emotion ontology are used to crawl images, together with their tags and description texts, from Flickr. Second, using this metadata, a dataset refinement (de-noising) strategy is designed to remove images with noisy labels. Furthermore, we preserve a small set of images with multiple emotion labels, which arises because an image may be connected to more than one emotion keyword; this subset can be used for multi-label emotion classification. We therefore address not only single-label emotion classification but also multi-label emotion classification. Finally, we provide baselines for emotion classification on this dataset using the state-of-the-art sentiment/emotion classification frameworks AlexNet [3] and PCNN [4].

The contributions of this paper are as follows:

  • We build a large-scale dataset for image emotion analysis by crawling images from Flickr using the Chinese emotion ontology of Xu's model [2]. Because Xu's emotion model is widely used for Chinese text emotion analysis, the dataset is particularly suitable for analyzing Chinese emotion.

  • To address the problem of noisy labels, we design a strategy to refine the original dataset automatically, yielding the final dataset CH-EmoD. Comparative experimental results show that the refinement strategy is effective.

  • We run state-of-the-art sentiment/emotion classification algorithms on CH-EmoD and obtain baselines for both single-label and multi-label emotion classification.

2 Establishing the Image Emotion Dataset CH-EmoD

In this section, we build the dataset with Xu's emotion model [2], which defines seven emotions: happiness, like, anger, sadness, fear, disgust, and surprise. We also propose a dataset refinement (de-noising) strategy to improve the confidence of the emotion labels.

2.1 Crawling Images from Flickr by Emotion Keywords

The Chinese emotion ontology library [2] contains 26,453 emotion keywords in total, as shown in Table 2. Each keyword is labeled with an emotion category, an emotion intensity, and a sentiment polarity. The emotion intensity takes five levels: 1, 3, 5, 7, and 9; the larger the value, the stronger the emotion. Table 2 shows that the distribution of emotion keywords is imbalanced: "surprise" has the minimum of 228 keywords, while "disgust" has the maximum of 10,282. This keyword imbalance would likely carry over to the images, and crawling with all keywords would be prohibitively expensive, so we select a subset of keywords to represent each emotion. Taking the minimum of 228 as a reference, we cap the number of keywords per emotion category at 300. We first remove network slang, idioms, and prepositional phrases. Then, on the understanding that keywords with low emotion intensity may represent the emotion unreliably, we select keywords by emotion intensity. In total, 1,935 keywords are selected, as shown in Table 2. Using these keywords, we obtain a raw dataset of 546,472 images whose labels are assigned by the corresponding emotion keywords.
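A minimal sketch of this selection step is given below. The record fields ("word", "category", "word_type", "intensity") are hypothetical names, since the paper does not specify the ontology's file format:

```python
# Sketch of the keyword selection described above; field names are assumptions.
EXCLUDED_TYPES = {"network word", "idiom", "prepositional phrase"}
MAX_PER_CATEGORY = 300  # cap derived from the smallest category ("surprise", 228)

def select_keywords(ontology_records):
    by_category = {}
    for rec in ontology_records:
        if rec["word_type"] in EXCLUDED_TYPES:
            continue  # drop network slang, idioms, and prepositional phrases
        by_category.setdefault(rec["category"], []).append(rec)
    selected = {}
    for category, records in by_category.items():
        # Low-intensity keywords (levels 1, 3, 5, 7, 9) are assumed to
        # represent the emotion less reliably, so keep the strongest ones.
        records.sort(key=lambda r: r["intensity"], reverse=True)
        selected[category] = [r["word"] for r in records[:MAX_PER_CATEGORY]]
    return selected
```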

Table 2. The number of emotion keywords used to query images for each category
Fig. 2. Example images with problems in the raw dataset ((a) an example image with a noisy label; (b) an example image crawled by keywords of different emotion categories).

2.2 Dataset Refinement

As is well known, raw datasets crawled from social networks are noisy, so the raw dataset must be refined. Inspecting the raw dataset reveals the following problems:

Problem 1:

The emotion presented in some images differs from the label assigned by the corresponding emotion keyword. In Fig. 2(a), the image was crawled with the keyword "impatient" and thus labeled with the emotion "disgust", yet it clearly presents "happiness", which is quite different from the assigned label.

Problem 2:

Some images are crawled by several different keywords. In Fig. 2(b), the image was crawled with "affable", "beautiful", "courteous", and "desolate", and is therefore assigned the emotions "happiness", "like", and "sadness".

For Problem 1, we need to find such images and remove them from the raw dataset. Wu et al. [21] detected such images in SentiBank [19] through sentiment polarity conflicts between the ANPs and the tags. Inspired by this idea, we refine the raw dataset using sentiment polarity conflicts among different text contents. In the raw dataset, most images carry emotion keywords, tags, and description texts, which we treat as three parties. If there is an emotion conflict among the three parties for an image, the emotion label is considered unreliable and the image is removed from the dataset. Otherwise, the three parties give consistent (or at least non-conflicting) sentiment polarities, and the label derived from the keywords is considered highly confident. In Fig. 3, the image from Fig. 2(a) is labeled "disgust", i.e., a negative sentiment, whereas the sentiment polarities of both its tags and its description text are positive. Because of this sentiment polarity conflict among the emotion keyword, description text, and tags, the image is removed from the raw dataset.

Fig. 3. Examples of some contentious images. Red words carry positive sentiment; green words carry negative sentiment. (Color figure online)

In the Chinese emotion ontology library, the sentiment polarity of an emotion keyword is one of neutral, positive, negative, or both (positive and negative). For convenience, we encode positive as 1, negative as −1, and neutral and both as 0. We first use the TextRank algorithm [22] to extract keywords from the description text; the polarity of each text keyword is looked up in the Chinese emotion ontology library, and keywords not in the library are assigned 0. The sentiment polarity of the description text is then the sum of the polarities of its text keywords. The polarity of the image tags is determined in the same way. Finally, whether an image is preserved in or removed from the dataset is decided from the sentiment polarities of the emotion keyword, the description text, and the tags. The detailed judgment rules are given in Table 3: for example, if the keyword polarity is 1 and the polarities of the description text and tags are each 1 or 0, there is no polarity contradiction and the image is preserved; otherwise it is removed.

Table 3. The rule of de-noising strategy.
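The sketch below gives one reading of these rules. Reducing the summed polarity to its sign and preserving images whose keyword polarity is 0 are our assumptions, since Table 3 does not spell out those cases:

```python
def sign(x):
    # Reduce a summed polarity to -1, 0, or +1.
    return (x > 0) - (x < 0)

def polarity(words, ontology_polarity):
    # Sum per-word polarities (+1 positive, -1 negative, 0 for neutral,
    # "both", or out-of-vocabulary words), then take the sign of the sum.
    return sign(sum(ontology_polarity.get(w, 0) for w in words))

def keep_image(keyword_polarity, text_keywords, tags, ontology_polarity):
    # Preserve the image only if neither the description text (keywords
    # extracted by TextRank) nor the tags has a polarity opposite to the
    # emotion keyword's; a product of -1 signals a conflict (cf. Table 3).
    text_pol = polarity(text_keywords, ontology_polarity)
    tag_pol = polarity(tags, ontology_polarity)
    return keyword_polarity * text_pol >= 0 and keyword_polarity * tag_pol >= 0
```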

After this step, 120,429 images were removed, leaving a dataset of 426,043 images. To address Problem 2, we collect the images with multiple emotion labels into a multi-label subset. It contains 29,022 images in total, with about six keywords per image on average. For each image, we count the keywords belonging to each emotion category and divide by the total number of keywords, yielding the probability of each emotion category, as shown in Fig. 4.

Fig. 4. Example multi-label images. The elements of the seven-dimensional vector represent happiness, like, anger, sadness, fear, disgust, and surprise, in that order.
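The construction of this probability vector can be sketched as follows, assuming a mapping `keyword_to_emotion` from each crawl keyword to its emotion category (an assumed data structure, not part of the released dataset):

```python
from collections import Counter

EMOTIONS = ["happiness", "like", "anger", "sadness", "fear", "disgust", "surprise"]

def emotion_distribution(image_keywords, keyword_to_emotion):
    # Count how many of the image's crawl keywords fall into each emotion
    # category and normalize by the total keyword count, yielding the
    # seven-dimensional probability vector of Fig. 4.
    counts = Counter(keyword_to_emotion[kw] for kw in image_keywords)
    return [counts.get(e, 0) / len(image_keywords) for e in EMOTIONS]
```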

Finally, CH-EmoD is composed of two parts: a single-label dataset and a multi-label dataset. In the multi-label dataset, the label of an image is represented as a probability distribution. Table 4 shows the number of images in each emotion category of CH-EmoD; for the multi-label dataset, only the total number of images is given.

Table 4. The distribution of images over the emotion categories

3 Image Emotion Analysis Using Convolutional Neural Network

In recent years, convolutional neural networks have achieved great success in many image processing tasks, such as handwritten digit recognition and image classification. Fine-tuning an AlexNet model pre-trained on ImageNet [3] has also proved effective, and we follow the same approach here. We keep the structure of the ImageNet reference network [23], which consists of five convolutional layers and three fully connected layers, and only change the output dimension of the last fully connected layer from 1000 to 7 for the emotion task. Additionally, for multi-label classification we replace the softmax loss with a sigmoid cross-entropy loss.
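A minimal sketch of this setup in PyTorch (the original work presumably used a Caffe implementation of AlexNet, so the framework choice here is ours):

```python
import torch.nn as nn
from torchvision import models

# Start from AlexNet pre-trained on ImageNet and replace the last fully
# connected layer: 1000 ImageNet classes -> 7 emotion categories.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 7)

# Single-label classification: softmax cross-entropy over the 7 classes.
single_label_loss = nn.CrossEntropyLoss()

# Multi-label classification: sigmoid cross-entropy, one sigmoid per class.
multi_label_loss = nn.BCEWithLogitsLoss()
```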

In particular, because the labels of the multi-label data are probability distributions, we first convert them into binary labels as follows:

$$\begin{aligned} label_i=\begin{cases} 0, &{} prob_i < C_{th} \\ 1, &{} prob_i \ge C_{th} \end{cases} \qquad i = 1,2,\ldots,7 \end{aligned}$$
(1)

where \(label_i\) denotes the binary indicator of the \(i\)-th emotion category, \(prob_i\) is the probability of that category for the image, and \(C_{th}\) is a threshold between 0 and 1 that is determined experimentally. The label of each image is thus a vector of seven binary values.
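Applied to a multi-label image, Eq. (1) works as in the following sketch (the concrete numbers are illustrative):

```python
def binarize(probs, c_th):
    # Eq. (1): one binary indicator per emotion category.
    return [1 if p >= c_th else 0 for p in probs]

# An image whose keywords are 50% "happiness", 33% "like", 17% "sadness":
binarize([0.50, 0.33, 0.0, 0.17, 0.0, 0.0, 0.0], c_th=0.05)
# -> [1, 1, 0, 1, 0, 0, 0]
```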

4 Experimental Results

Considering the two kinds of labels in the dataset, we evaluate it from two aspects: single-label emotion classification and multi-label emotion classification.

Fig. 5. Confusion matrices of the three models on the testing data.

4.1 Emotion Classification

Our dataset CH-EmoD contains 275,687 single-label images in total. We randomly select 2,000 images for testing and 1,000 for validation, and use the remainder for training. We fine-tune the pre-trained AlexNet [3] on the training data and evaluate the trained model on the testing data. To measure the effect of the dataset refinement, we also fine-tune the same model on the raw dataset, and we additionally run the PCNN framework [4] on CH-EmoD; all models use the same testing set. The experimental results are shown in Table 5. AlexNet reaches an accuracy of 46.32% on the refined dataset, 14.37% higher than the same model trained on the raw dataset, which shows that the refinement strategy is effective. PCNN outperforms AlexNet on the raw dataset but underperforms it on CH-EmoD.
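The random split can be sketched as follows; the fixed seed is our addition for repeatability, as the paper does not state one:

```python
import random

def split_single_label(image_ids, n_test=2000, n_val=1000, seed=0):
    # Random split following the protocol above: 2,000 test images,
    # 1,000 validation images, and the rest (272,687) for training.
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    return {"test": ids[:n_test],
            "val": ids[n_test:n_test + n_val],
            "train": ids[n_test + n_val:]}
```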

Table 5. Multi-class emotion classification accuracy of the different models.

We further compare the confusion matrices of the three models, shown in Fig. 5. In general, the false positive rates of "happiness" and "like" are high for all three models, and especially for the model trained on the raw data, which is consistent with these two categories having the most images in the dataset. Meanwhile, the model trained on the refined data achieves the best true positive rate in every emotion category except "happiness", where PCNN leads with 0.75. These results again demonstrate that the proposed dataset refinement (de-noising) strategy works well.
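The per-class rates discussed here can be read off a row-normalized confusion matrix, for instance computed with scikit-learn (one possible tooling choice, not necessarily the authors'):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred, n_classes=7):
    # Row-normalized confusion matrix as in Fig. 5: entry (i, j) is the
    # fraction of class-i test images predicted as class j, so the diagonal
    # holds the per-class true positive rates.
    return confusion_matrix(y_true, y_pred, labels=np.arange(n_classes),
                            normalize="true")
```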

4.2 Multi-label Classification of Image Emotion

There are 29,022 images with multiple emotion labels, which we randomly split into training (80%), validation (5%), and testing (15%) sets. To find the best setting for multi-label classification, we run experiments with different values of the threshold \(C_{th}\) (Sect. 3) and evaluate performance with Mean Average Precision (MAP), a standard metric for multi-label classification; the results are shown in Fig. 6. MAP reaches its best value of 36.63% at \(C_{th}=0.05\) and decreases steadily as \(C_{th}\) grows, which is reasonable because the image labels become increasingly sparse with larger \(C_{th}\).

Fig. 6. Multi-label emotion classification performance for different values of \(C_{th}\).
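MAP as used here can be sketched as follows; averaging the per-class average precisions is our assumed definition, since MAP can also be computed per sample:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    # y_true: (n_samples, 7) binary labels from Eq. (1);
    # y_score: (n_samples, 7) per-class sigmoid outputs of the network.
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))
```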

5 Conclusion

In this work, we address the challenging task of visual emotion classification, motivated by the fact that coarse sentiment analysis cannot represent human emotions adequately. Given the differences between Chinese and Western cultures, we use the Chinese Emotion Ontology published by Dalian University of Technology to establish an image dataset for emotion analysis. In addition, we design a refinement (de-noising) strategy to improve the confidence of the image labels, and we derive a subset with multiple emotion labels. Finally, we provide baselines for single-label and multi-label emotion classification using the state-of-the-art emotion/sentiment classification algorithms AlexNet and PCNN. The provided dataset is the first image emotion dataset covering the seven emotion categories of Xu's model, which is popular in Chinese text emotion analysis; it is therefore well suited to analyzing Chinese users' emotions from the images they upload or generate, and the baselines can serve as references for future research. In the future, we will continue to improve the credibility of the image labels, transform this weakly labeled dataset into a strongly labeled one, and pay more attention to multi-label emotion classification.