
1 Introduction

Nowadays, more and more social media users tend to use images to express emotions or opinions. Compared with text, images convey personal emotions more intuitively. Therefore, visual sentiment analysis of images has attracted considerable research attention [1, 2]. Visual sentiment analysis studies the emotional response of humans to visual stimuli such as images and videos [3]. Its key challenge is the large gap between the sentiment space and the visual feature space.

Feature engineering methods based on colors, textures, and shapes have been studied extensively in the previous literature to construct image sentiment features [4,5,6]. Recently, deep neural networks have achieved great success in computer vision because of their ability to learn abstract and robust features [7,8,9]. In particular, Convolutional Neural Networks (CNNs) can automatically learn robust features from large-scale image data and demonstrate excellent performance. CNNs are widely used in image-related tasks such as image classification and object detection; accordingly, CNN-based methods have also been proposed for predicting image sentiment [10].

However, visual sentiment analysis is more challenging than conventional recognition tasks because of the higher level of subjectivity in the emotion recognition process. Moreover, almost all existing approaches try to infer sentiment from the global perspective of the whole image, and little attention has been paid to the sentimental response evoked by local object regions, which may limit the accuracy of sentiment prediction.

To make prediction more accurate, we propose a novel visual sentiment analysis method that integrates local object region features with global features to enhance sentiment classification performance. The main contribution is that the proposed method can selectively focus on sentimentally important object regions through an object detection model and a transfer learning strategy, thereby learning a more discriminative representation for visual sentiment classification.

2 Related Work

Feature engineering methods for visual sentiment analysis fall mainly into two types: feature selection and feature extraction. Lv et al. introduced color features for expressing emotion based on SIFT features of the three RGB color channels and combined them into a 384-dimensional C-SIFT feature for predicting image sentiment [11]. Roth et al. analyzed image sentiment by extracting texture features of an image and using a support vector machine (SVM) to classify its sentiment polarity [12]. Borth et al. and Yuan et al. proposed extracting visual entities and attributes as semantic features to bridge the sentiment gap between low-level visual features and high-level emotional semantics. Borth et al. constructed a large visual sentiment ontology library composed of 1,200 adjective noun pairs (ANPs) [13]. Using this ontology library, the authors proposed the SentiBank and MVSO emotion detectors to extract a middle-level representation of the input images, which is treated as image features to train sentiment classifiers. Yuan et al. applied a similar strategy, but used 102 pre-defined scene attributes instead of ANPs as the middle-level representation [2].

More recently, researchers began to use deep models to automatically learn sentiment representations from large-scale image data and obtained better results. For example, Chen et al. studied the classification of visual sentiment concepts and trained models on the large dataset given in [13] to obtain an upgraded version of SentiBank, called DeepSentiBank [14]. You et al. defined a CNN architecture for visual sentiment analysis to address training on large-scale and noisy datasets, using a progressive training strategy to fine-tune the network, called PCNN [15]. Campos et al. used a transfer learning strategy to fine-tune an image classification network on the Flickr dataset and applied it to image sentiment analysis [16, 17].

Although the above methods achieve encouraging performance, they basically extract features only from whole images and pay little attention to local object regions that express prominent emotion. Sun et al. used a deep model to automatically discover local regions that contain objects and used them for visual sentiment analysis [18]. Li et al. proposed a context-aware classification model that considers both the local context and the local-global context [19]. Different from existing research, this paper mines sentiment information from whole images and local object regions simultaneously. The paper mainly focuses on the following points: first, obtaining localized regions with accurate positions that carry emotional objects; second, fusing features of both the whole image and the local regions in the COIS architecture.

3 COIS Framework Description

The overall framework of the proposed COIS method is illustrated in Fig. 1. The goal is to learn discriminative sentiment representations from full images and from local regions containing salient objects that may also express emotion. The framework consists of four parts: global feature extraction from the whole image, local object region detection, local feature extraction from the detected object regions, and the integration of global and local features for visual sentiment classification. The global feature representation of the whole image is extracted by VGGNet-16 [20], as shown in Fig. 1(a). Faster R-CNN [21] is a popular and effective object detection model; to extract local object region features, the COIS framework fine-tunes a pre-trained Faster R-CNN model on an affective image dataset (Fig. 1(b)) to detect object regions, as shown in Fig. 1(c). The global features and the local object region features are then combined and used to train a sentiment classifier, as shown in Fig. 1(d).

Fig. 1. Overview of the proposed COIS method.
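
To make the data flow of Fig. 1 concrete, the following schematic Python sketch outlines the four components with hypothetical placeholder functions; the 4096-d global feature, 512-d local features, and top-5 regions are illustrative assumptions, not the authors' released code.

import numpy as np

def extract_global_feature(image):
    # (a) Whole-image feature from VGGNet-16 fc7 (4096-d); dummy output here.
    return np.zeros(4096, dtype=np.float32)

def detect_object_regions(image, top_k=5):
    # (b)-(c) A fine-tuned Faster R-CNN would return the top-k object boxes
    # (center x, center y, width, height); fixed dummy boxes here.
    return [(112.0, 112.0, 64.0, 48.0)] * top_k

def extract_local_feature(image, box):
    # Multi-granularity max pooling of the box on the shared feature map (Sect. 3.2).
    return np.zeros(512, dtype=np.float32)

def cois_representation(image):
    # (d) Concatenate the global feature with the local region features.
    g = extract_global_feature(image)
    locals_ = [extract_local_feature(image, b) for b in detect_object_regions(image)]
    return np.concatenate([g] + locals_)

print(cois_representation(np.zeros((224, 224, 3))).shape)  # (6656,) = 4096 + 5 * 512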

3.1 Local Objects’ Regions Generation

Local object regions usually contain fine-grained information about the objects in an image. We use Faster R-CNN to generate local regions that provide localized information. The input image is first fed into Faster R-CNN to generate a multi-channel feature map, from which a series of candidate boxes is obtained. We obtain local object regions by comparing the overlap ratio of each candidate box with the ground-truth label of the object detection image. A transfer learning strategy is applied to overcome the difference between the object detection dataset and the affective image dataset: the Faster R-CNN model is pre-trained on the object detection dataset PASCAL VOC 2007, and the parameters of the learned object region detection model are transferred and further tuned on the affective image dataset. The following is a detailed description of how Faster R-CNN is used to generate object region candidate boxes.

The input of the candidate box generation network is an image, and the output is a set of candidate boxes. Suppose the input image size is \( M \times N \). The image is fed into the convolutional layers to obtain a response convolutional feature map \( F \in R^{w \times h \times n} \), where \( w \) and \( h \) are the width and height of \( F \), and \( n \) is the number of channels. The size of the convolutional feature map is \( (M/16) \times (N/16) \), which means the width and height of the input image are both scaled by 1/16 in the output feature map. To generate the candidate boxes, Faster R-CNN applies a two-layer CNN to the convolutional feature map: the first layer contains \( c \) filters \( g \) of size \( a \times a \), and the filter \( g \) slides over the input convolutional feature map \( F \) to generate a lower-dimensional feature \( F^{\prime} \in R^{{w^{\prime} \times h^{\prime} \times l}} \), which is calculated as Eq. (1):

$$ F^{\prime} = \delta \left( {g * F + b} \right) $$
(1)

where \( * \) denotes the convolution operation, \( b \in R \) is the bias, \( R \) is the set of real numbers, and \( \delta ( \cdot ) \) is a nonlinear activation function.
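
As a minimal NumPy sketch of Eq. (1), assuming stride 1, no padding, ReLU as \( \delta ( \cdot ) \), and filters of shape \( a \times a \times n \) (the exact filter shape is not specified in the text):

import numpy as np

def rpn_conv(F, filters, b):
    # Eq. (1): F' = delta(g * F + b), with ReLU as delta(.).
    # F: (w, h, n) feature map; filters: (c, a, a, n); b: scalar bias.
    # Stride 1, no padding; written with explicit loops for clarity, not speed.
    w, h, n = F.shape
    c, a, _, _ = filters.shape
    w_out, h_out = w - a + 1, h - a + 1
    F_prime = np.zeros((w_out, h_out, c))
    for i in range(w_out):
        for j in range(h_out):
            patch = F[i:i + a, j:j + a, :]                 # a x a x n window
            for k in range(c):
                F_prime[i, j, k] = np.sum(patch * filters[k]) + b
    return np.maximum(F_prime, 0.0)

F = np.random.randn(8, 8, 64)          # small stand-in for the shared feature map
g = np.random.randn(32, 3, 3, 64)      # c = 32 filters of size a = 3
print(rpn_conv(F, g, 0.1).shape)       # (6, 6, 32)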

For each position on \( F^{\prime} \), we consider \( k \) possible candidate box sizes to better detect objects of different sizes. Supposing \( W \) and \( H \) are the spatial width and height of \( F^{\prime} \), respectively, \( WHk \) candidate boxes can be obtained. The feature map \( F^{\prime} \) is delivered to two parallel fully connected layers: one decides whether there is an object in the candidate box; the other predicts the coordinates of the center point of the candidate box and its size, as shown by the two rightmost branches in Fig. 1(c). Therefore, for the \( k \) candidate boxes at each position, the classification layer outputs \( 2k \) probability scores evaluating whether an object is present, and the regression layer outputs the center point coordinates and the width and height of each candidate box. The weighted combination of the classification and regression loss functions is given in Eq. (2):

$$ L\left( {\left\{ {p_{i} } \right\},\left\{ {t_{i} } \right\}} \right) = \frac{1}{{N_{cls} }}\sum\limits_{i}^{k} {L_{cls} \left( {p_{i} ,p_{i}^{*} } \right)} + \lambda \frac{1}{{N_{reg} }}\sum\limits_{i}^{k} {p_{i}^{*} L_{reg} \left( {t_{i} ,t_{i}^{*} } \right)} $$
(2)

where \( p_{i} \) denotes the prediction result for candidate box \( i \); \( p_{i}^{*} = 1 \) if candidate box \( i \) is a positive sample, that is, an object exists in the box, and \( p_{i}^{*} = 0 \) otherwise, that is, the box is background. \( N_{cls} \) is the number of candidate boxes generated in a mini-batch, \( t_{i} \) is the size of candidate box \( i \), \( t_{i}^{*} \) is the corresponding ground-truth label, and \( \lambda \) is a hyper-parameter.

Since determining whether an object is in the candidate box is a binary classification problem, \( L_{cls} \) adopts the log loss commonly used for binary classification, given in Eq. (3):

$$ L_{cls} \left( {p_{i} ,p_{i}^{*} } \right) = - \left[ {p_{i}^{*} \log p_{i} + \left( {1 - p_{i}^{*} } \right)\log \left( {1 - p_{i} } \right)} \right] $$
(3)

\( L_{reg} \) uses the Smooth \( L_{1} \) loss, which measures the deviation between the predicted value and the ground-truth label; it is given in Eqs. (4) and (5):

$$ L_{reg} \left( {t_{i} ,t_{i}^{ * } } \right) = \sum {smooth_{L1} \left( {t_{i} - t_{i}^{ * } } \right)} $$
(4)
$$ smooth_{L1} \left( x \right) = \left\{ {\begin{array}{*{20}l} {0.5x^{2} } \hfill & {if\;\left| x \right| < 1} \hfill \\ {\left| x \right| - 0.5} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(5)
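
The following NumPy sketch puts Eqs. (2)-(5) together; the value of \( \lambda \) and the normalization of the regression term by the number of positive boxes are illustrative assumptions.

import numpy as np

def log_loss(p, p_star):
    # Eq. (3): binary cross-entropy of the objectness prediction.
    eps = 1e-12
    return -(p_star * np.log(p + eps) + (1.0 - p_star) * np.log(1.0 - p + eps))

def smooth_l1(x):
    # Eq. (5): 0.5 * x^2 if |x| < 1, |x| - 0.5 otherwise (elementwise).
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0):
    # Eq. (2): classification term averaged over all boxes plus the regression
    # term, which only counts positive boxes (p_star == 1), weighted by lambda.
    n_cls = len(p)
    n_reg = max(int(np.sum(p_star)), 1)
    l_cls = np.sum(log_loss(p, p_star)) / n_cls
    l_reg = np.sum(p_star[:, None] * smooth_l1(t - t_star)) / n_reg
    return l_cls + lam * l_reg

p = np.array([0.9, 0.2, 0.8, 0.1])        # predicted objectness scores
p_star = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth labels (1 = object)
t = np.random.randn(4, 4)                 # predicted box parameters
t_star = np.random.randn(4, 4)            # ground-truth box parameters
print(rpn_loss(p, p_star, t, t_star))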

3.2 Local Objects’ Region Features Extraction

Let \( L = \left\{ {r_{1} , \cdots ,r_{n} } \right\} \) be the set of generated candidate boxes containing objects. The candidate box set \( L \) is projected onto the convolutional feature map \( F \in R^{w \times h \times n} \), and local object region features are extracted directly from it, thereby avoiding the loss of image information caused by cropping or scaling each candidate box and reducing the time spent on repeated convolution operations [22]. Each candidate box \( r_{i} = \left( {x_{i} ,y_{i} } \right) \) is treated as a sample generated from the affective image, as shown by the orange rectangular box in Fig. 2(a), where \( x_{i} \) is a four-dimensional vector representing the center point coordinates, width, and height of the candidate box, and \( y_{i} \in \left\{ {0,1} \right\} \) is the sentiment label of the object in the box. For each sample, we divide the generated box into \( m \) different granularities to obtain semantic information at multiple levels, as shown in Fig. 2(b), where three granularities are applied. Then, max pooling is performed on each of the resulting sub-blocks to obtain a series of discriminative feature maps \( \left\{ {f_{1} ,f_{2} , \cdots ,f_{d} } \right\} \), where \( d \) is the number of sub-blocks; the calculation is shown in Eq. (6):

Fig. 2. Overview of local object region feature extraction.

$$ f_{j} = G_{max} \left( {b_{j} } \right) $$
(6)

where \( b_{j} \) denotes a sub-block after division, \( f_{j} \) denotes the feature map corresponding to sub-block \( b_{j} \), and \( G_{max} \left( \cdot \right) \) denotes the max pooling operation. Finally, the feature maps of all sub-blocks are summed to obtain a local feature vector \( L_{{f_{i} }} \) of fixed dimension, as expressed in Eq. (7):

$$ L_{{f_{i} }} = \sum\limits_{j = 1}^{d} {f_{j} } $$
(7)

Here \( L_{{f_{i} }} \) is the feature vector of one local region detected in an image; thus all detected local regions can be represented as a feature vector set \( \left\{ {L_{{f_{1} }} , \cdots ,L_{{f_{n} }} } \right\} \), where \( n \) is the number of detected local regions.
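
One plausible NumPy reading of Eqs. (6)-(7) is sketched below, assuming the box is already mapped to feature-map coordinates (given here as corner coordinates for simplicity), three granularities (1×1, 2×2, and 3×3 divisions), and channel-wise max pooling, so that summing the pooled sub-block features yields a fixed-length vector.

import numpy as np

def region_feature(F, box, grids=(1, 2, 3)):
    # Project a candidate box onto the shared feature map F (w, h, n), divide it
    # at several granularities, max-pool each sub-block (Eq. (6)) and sum the
    # pooled features (Eq. (7)) into one fixed-length vector L_f.
    x0, y0, x1, y1 = box                       # corner coords in feature-map units
    roi = F[x0:x1, y0:y1, :]
    w, h, n = roi.shape
    feature = np.zeros(n)
    for m in grids:                            # 1x1, 2x2 and 3x3 divisions
        xs = np.linspace(0, w, m + 1).astype(int)
        ys = np.linspace(0, h, m + 1).astype(int)
        for i in range(m):
            for j in range(m):
                block = roi[xs[i]:xs[i + 1], ys[j]:ys[j + 1], :]
                if block.size:
                    feature += block.max(axis=(0, 1))   # G_max over the sub-block
    return feature

F = np.random.randn(14, 14, 512)               # shared convolutional feature map
print(region_feature(F, (2, 3, 10, 12)).shape) # (512,)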

3.3 Global Feature Extraction

The global feature is an important component of the sentiment representation of an image: it typically captures the overall appearance of the image and the contextual information surrounding the objects in it. We use the VGGNet-16 shown in Fig. 3 to extract global features. VGGNet-16 consists of 5 convolutional blocks and 3 fully connected layers. This deep neural network, jointly developed by Oxford University and DeepMind, has a deeper network structure and a more unified configuration than ordinary convolutional neural networks; it allows more nonlinear transformations while reducing parameters, resulting in better feature extraction capability.

Fig. 3. Overview of VGGNet-16.

Specifically, we extract the global feature of an image from the fully connected layer \( fc7 \) of VGGNet-16, obtaining a 4096-dimensional feature vector denoted \( G_{f} \), as shown at the rightmost side of Fig. 3.
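
A minimal tf.keras sketch of this step (not the authors' original TensorFlow 1.3 implementation; in Keras the two 4096-d layers of VGG16 are named "fc1" and "fc2", with "fc2" corresponding to \( fc7 \)):

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

base = VGG16(weights="imagenet", include_top=True)
fc7_extractor = tf.keras.Model(inputs=base.input,
                               outputs=base.get_layer("fc2").output)

image = np.random.rand(1, 224, 224, 3) * 255.0   # stand-in for a real 224 x 224 image
G_f = fc7_extractor(preprocess_input(image))
print(G_f.shape)                                 # (1, 4096)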

3.4 Image Sentiment Classification

The global feature representation and the local object region features are \( G_{f} \) and \( \left\{ {L_{{f_{1} }} ,L_{{f_{2} }} , \cdots ,L_{{f_{n} }} } \right\} \), respectively. We select the top \( 5 \) detected objects to represent the important local region information. Each image can therefore be represented as a set of feature vectors \( U = \left\{ {G_{f} ,L_{{f_{1} }} , \cdots ,L_{{f_{n} }} } \right\} \). We fuse the two types of features as in Eq. (8):

$$ \varphi \left( U \right) = G_{f} \oplus L_{{f_{1} }} \oplus \cdots \oplus L_{{f_{n} }} $$
(8)

where \( \oplus \) denotes the concatenation of the two types of features, which are represented as tensors.

For supervised image sentiment classification, the role of the sentiment label in the training process cannot be ignored. We therefore assign each local region the same sentiment polarity label as its corresponding image. After the joint feature vector \( \varphi \left( U \right) \) is obtained, it is delivered to a fully connected layer and classified into the output categories by softmax. To measure the model loss, we use the cross-entropy loss function. The softmax layer maps the joint feature vector \( \varphi \left( U \right) \) to the output categories by assigning each a probability score \( q_{i} \). If the number of sentiment categories is \( s \), then \( q_{i} \) is given in Eq. (9) and the loss in Eq. (10):

$$ q_{i} = \frac{{\exp (\varphi \left( U \right)_{i} )}}{{\sum\nolimits_{j = 1}^{s} {\exp (\varphi \left( U \right)_{j} )} }},\quad i = 1, \ldots ,s $$
(9)
$$ \ell = - \sum\limits_{i} {h_{i} \log (q_{i} )} $$
(10)

where \( \ell \) is the cross-entropy loss of the network and \( h_{i} \) is the ground-truth sentiment label of the image.
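
The fusion and classification step of Eqs. (8)-(10) can be sketched in NumPy as follows; the 4096-d global and 512-d local feature dimensions and the fully connected weights are illustrative assumptions.

import numpy as np

def softmax(z):
    # Eq. (9): probability scores over the s sentiment categories.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(q, h):
    # Eq. (10): loss against the one-hot ground-truth label h.
    return -np.sum(h * np.log(q + 1e-12))

# Joint feature phi(U): the 4096-d global feature concatenated with five 512-d
# local features (Eq. (8)); a fully connected layer then maps it to s = 2 logits.
phi_U = np.concatenate([np.random.randn(4096)] +
                       [np.random.randn(512) for _ in range(5)])
W = np.random.randn(phi_U.size, 2) * 0.01        # hypothetical FC weights
q = softmax(phi_U @ W)
h = np.array([1.0, 0.0])                         # e.g. positive sentiment
print(q, cross_entropy(q, h))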

4 Experiments

In this section, we compare our method with methods that classify sentiment using only the global features of an image, to demonstrate the performance of COIS for visual sentiment analysis.

4.1 Datasets

We evaluate our method on two public datasets, TwitterI [13] and TwitterII [15]. TwitterI consists of 881 images collected from the social media platform Twitter, labeled with two sentiment polarity categories (positive and negative) through a crowd intelligence strategy. TwitterII, provided by You et al., contains 1269 images from Twitter labeled with sentiment polarity categories by AMT participants. Both datasets are split randomly into 80% training and 20% testing sets.
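
For reference, a hedged sketch of the 80/20 random split; the file names and label assignment below are hypothetical, and only the TwitterI dataset size is taken from the paper.

from sklearn.model_selection import train_test_split

image_paths = ["img_%04d.jpg" % i for i in range(881)]   # hypothetical file names
labels = [i % 2 for i in range(881)]                     # 0 = negative, 1 = positive (placeholder)

train_x, test_x, train_y, test_y = train_test_split(
    image_paths, labels, test_size=0.2, random_state=42, stratify=labels)
print(len(train_x), len(test_x))                         # about 704 / 177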

4.2 Experiment Settings

Our framework is implemented on Linux Ubuntu 14.04 with Python 2.7 and TensorFlow 1.3.0, using PyCharm as the development tool. All experiments are performed on an NVIDIA Tesla P100-PCIE GPU with 64 GB of memory. To conduct a fair comparison, we keep most of the settings the same as in previous studies [13, 14, 20, 23]. The input image size is 224 × 224, and each image is padded with 2 pixels of zeros on each side. To supervise the training of the local object region detector, we manually annotated the affective images with five object categories, such as people and vehicles, using the annotation tool ImageLab. After annotation, the dataset contains both sentiment labels and object detection labels (the center point coordinates, width, and height of each object's candidate box).

We train the model using a momentum optimizer with momentum 0.9. The learning rate of the convolutional layers is initialized to 0.001. We use the Dropout strategy [23] with a rate of 0.5 and L2 regularization to reduce over-fitting, and cross entropy as the loss function. The model is trained for 100 epochs. In addition, for the local object region features, we adjust the pooling layer of the detection branch: the pooling kernels adopt \( 3 \times 3 \), \( 2 \times 2 \), and \( 1 \times 1 \) granularities to fit the datasets used in this paper.
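
A sketch of this training configuration in tf.keras (not the authors' TensorFlow 1.3 code; the joint feature dimension, the L2 weight, and the batch size are assumptions) might look like:

import tensorflow as tf

inputs = tf.keras.Input(shape=(6656,))                       # joint feature phi(U), dim assumed
x = tf.keras.layers.Dropout(0.5)(inputs)                     # Dropout rate from Sect. 4.2
outputs = tf.keras.layers.Dense(
    2, activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)    # L2 weight assumed
classifier = tf.keras.Model(inputs, outputs)

classifier.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"])
# classifier.fit(train_features, train_labels, epochs=100, batch_size=32)  # batch size assumed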

4.3 Comparison Methods

For traditional methods, we compare with the SentiBank [13] model. For CNN-based methods, we compare with DeepSentiBank [14] and two classical deep learning models pre-trained on ImageNet and fine-tuned on the affective datasets: AlexNet [23] and VGGNet [20]. The compared methods are described as follows:

  • SentiBank: A visual concept detector library that detects the presence of 1,200 adjective noun pairs (ANPs) in an image, which are treated as a middle-level image representation for visual sentiment analysis [13].

  • DeepSentiBank: A visual sentiment concept classifier trained on large datasets using deep convolutional neural networks, an improved version of SentiBank [14].

  • ImageNet-AlexNet: The AlexNet architecture pre-trained on ImageNet and fine-tuned on the affective image dataset for visual sentiment analysis [23].

  • ImageNet-VGGNet16: The same idea as ImageNet-AlexNet, with the network replaced by VGGNet-16 [20].

  • Local regions-Net: A method that considers only local object region features, using Faster R-CNN and fully connected layers to learn sentiment representations.

4.4 Experiment Result

Table 1 shows the classification accuracy (%) of the proposed method and the comparison methods on the TwitterI and TwitterII datasets. Our proposed COIS method achieves 75.81% accuracy on TwitterI and 78.90% on TwitterII, compared with 66.63% and 65.93% for SentiBank, and it outperforms DeepSentiBank by about 5% and 8% on TwitterI and TwitterII, respectively. This suggests that the proposed COIS method can learn a more discriminative representation for visual sentiment analysis.

Table 1. Classification accuracy of different methods on TwitterI and TwitterII.

In addition, compared with ImageNet-AlexNet and ImageNet-VGGNet16, our method improves accuracy by about 10% on TwitterI and TwitterII with a similar parameter size. This result suggests the effectiveness of incorporating local object region representations.

5 Conclusions

Visual sentiment analysis is gaining more and more attention. Considering that the emotions evoked by an image come not only from the whole image but also from the local object regions within it, this paper presents a novel COIS method for visual sentiment analysis. In COIS, a Faster R-CNN model is used to detect object regions in an image, a fully connected network learns the sentiment representation of these local object regions, and this representation is integrated with the global feature of the image to obtain a more discriminative sentiment representation. The performance of the proposed method is evaluated on two real datasets, and the experimental results show that COIS outperforms methods that learn sentiment representations only from the whole image.

However, this study uses only the local regions of objects contained in images to enhance visual sentiment analysis, while other regions that do not contain objects are ignored. In future work, we will consider weakly supervised learning to discover more emotional regions in an image and better feature fusion strategies to further improve the performance of COIS.