
1 Introduction

Nowadays, more and more social media users tend to use images to express emotions or opinions. Compared with text, images convey personal emotions more intuitively. Therefore, visual sentiment analysis of images has attracted considerable research attention [1, 2]. Visual sentiment analysis studies the emotional response of humans to visual stimuli such as images and videos [3]. Its key challenge is the large gap between the sentiment space and the visual feature space.

Feature engineering methods based on colors, textures, and shapes have been studied extensively in the previous literature to construct image sentiment features [4,5,6]. Recently, deep neural networks have achieved great success in computer vision because of their ability to learn abstract and robust features [7,8,9]. In particular, Convolutional Neural Networks (CNNs) can automatically learn robust features from large-scale image data and demonstrate excellent performance. CNNs are widely used in image-related tasks such as image classification and object detection; accordingly, CNN-based methods have also been proposed for predicting image sentiment [10].

However, visual sentiment analysis is more challenging than conventional recognition tasks because of the higher level of subjectivity in the emotion recognition process. Moreover, almost all existing approaches try to infer sentiment from the global perspective of the whole image, and little attention has been paid to the sentimental response evoked by local object regions, which may limit the accuracy of sentiment prediction.

To make prediction more accurate, we propose a novel visual sentiment analysis method that integrates local object region features with global features to enhance sentiment classification performance. The main contribution is that the proposed method can selectively focus on sentimentally important object regions through an object detection model and a transfer learning strategy, thereby learning a more discriminative representation for visual sentiment classification.

2 Related Work

Feature engineering methods for visual sentiment analysis fall mainly into two types: feature selection and feature extraction. Lv et al. introduced color features for expressing emotion based on SIFT features of the three RGB color channels and combined them into a 384-dimensional C-SIFT feature for predicting image sentiment [11]. Roth et al. analyzed image sentiment by extracting texture features of an image and using a support vector machine (SVM) to classify its sentiment polarity [12]. Borth et al. and Yuan et al. proposed extracting visual entities and attributes as semantic features to bridge the sentiment gap between low-level visual features and high-level emotional semantics. Borth et al. constructed a large visual sentiment ontology library composed of 1,200 adjective noun pairs (ANPs) [13]. Using this ontology library, the authors proposed the SentiBank and MVSO emotion detectors to extract a middle-level representation of the input images, which is treated as image features to train sentiment classifiers. Yuan et al. applied a similar strategy, but used 102 pre-defined scene attributes instead of ANPs as the middle-level representation [2].

More recently, researchers began to use deep models to automatically learn sentiment representations from large-scale image data and obtained better results. For example, Chen et al. studied the classification of visual sentiment concepts and trained models on the large dataset given in [13] to obtain an upgraded version of SentiBank, called DeepSentiBank [14]. You et al. defined a CNN architecture for visual sentiment analysis to address training on large-scale and noisy datasets, using a progressive training strategy to fine-tune the network, called PCNN [15]. Campos et al. used a transfer learning strategy to fine-tune an image classification network on the Flickr dataset and applied it to image sentiment analysis [16, 17].

Although the above methods achieve encouraging performance, they basically extract features only from whole images and pay little attention to local object regions that express prominent emotion. Sun et al. used a deep model to automatically discover local regions that contain objects and used them for visual sentiment analysis [18]. Li et al. proposed a context-aware classification model that considers both the local context and the local-global context [19]. Different from existing research, this paper mines sentiment information from whole images and local object regions simultaneously. The paper mainly focuses on the following points: first, obtaining localized regions with accurate positions that carry emotional objects; second, fusing features of both the whole image and the local regions in the COIS architecture.

3 COIS Framework Description

The overall framework of the proposed COIS method is illustrated in Fig. 1. The goal is to learn discriminative sentiment representations from full images and from local regions containing salient objects that may also express emotion. The framework consists of four parts: global feature extraction from the whole image, local object region detection, local feature extraction from the detected object regions, and the integration of global and local features for visual sentiment classification. The global feature representation of the whole image is extracted by VGGNet-16 [20], as shown in Fig. 1(a). Faster R-CNN [21] is a popular and effective object detection model; to extract local object region features, the COIS framework fine-tunes a pre-trained Faster R-CNN model on an affective image dataset (Fig. 1(b)) to detect object regions, as shown in Fig. 1(c). The global features and the local object region features are then combined and used to train a sentiment classifier, as shown in Fig. 1(d).

Fig. 1. Overview of the proposed COIS method.
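
To make the data flow of Fig. 1 concrete, the following schematic Python sketch outlines the four components with hypothetical placeholder functions; the 4096-d global feature, 512-d local features, and top-5 regions are illustrative assumptions, not the authors' released code.

import numpy as np

def extract_global_feature(image):
    # (a) Whole-image feature from VGGNet-16 fc7 (4096-d); dummy output here.
    return np.zeros(4096, dtype=np.float32)

def detect_object_regions(image, top_k=5):
    # (b)-(c) A fine-tuned Faster R-CNN would return the top-k object boxes
    # (center x, center y, width, height); fixed dummy boxes here.
    return [(112.0, 112.0, 64.0, 48.0)] * top_k

def extract_local_feature(image, box):
    # Multi-granularity max pooling of the box on the shared feature map (Sect. 3.2).
    return np.zeros(512, dtype=np.float32)

def cois_representation(image):
    # (d) Concatenate the global feature with the local region features.
    g = extract_global_feature(image)
    locals_ = [extract_local_feature(image, b) for b in detect_object_regions(image)]
    return np.concatenate([g] + locals_)

print(cois_representation(np.zeros((224, 224, 3))).shape)  # (6656,) = 4096 + 5 * 512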

3.1 Local Objects’ Regions Generation

Local object regions usually contain fine-grained information about the objects in an image. We use Faster R-CNN to generate local regions that provide localized information. The input image is first fed into Faster R-CNN to generate a multi-channel feature map, from which a series of candidate boxes is obtained. We obtain local object regions by comparing the overlap ratio of each candidate box with the ground-truth label of the object detection image. A transfer learning strategy is applied to overcome the difference between the object detection dataset and the affective image dataset: the Faster R-CNN model is pre-trained on the object detection dataset PASCAL VOC 2007, and the parameters of the learned object region detection model are transferred and further tuned on the affective image dataset. The following is a detailed description of how Faster R-CNN is used to generate object region candidate boxes.

The input of the candidate box generation network is an image, and the output is a set of candidate boxes. Suppose the input image size is \( M \times N \). The image is fed into the convolutional layers to obtain a response convolutional feature map \( F \in R^{w \times h \times n} \), where \( w \) and \( h \) are the width and height of \( F \), and \( n \) is the number of channels. The size of the convolutional feature map is \( (M/16) \times (N/16) \), which means the width and height of the input image are both scaled by 1/16 in the output feature map. To generate the candidate boxes, Faster R-CNN applies a two-layer CNN to the convolutional feature map: the first layer contains \( c \) filters \( g \) of size \( a \times a \), and the filter \( g \) slides over the input convolutional feature map \( F \) to generate a lower-dimensional feature \( F^{\prime} \in R^{{w^{\prime} \times h^{\prime} \times l}} \), which is calculated as Eq. (1):

$$ F^{\prime} = \delta \left( {g * F + b} \right) $$
(1)

where \( * \) denotes the convolution operation, \( b \in R \) is the bias, \( R \) is the set of real numbers, and \( \delta ( \cdot ) \) is a nonlinear activation function.
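
As a minimal NumPy sketch of Eq. (1), assuming stride 1, no padding, ReLU as \( \delta ( \cdot ) \), and filters of shape \( a \times a \times n \) (the exact filter shape is not specified in the text):

import numpy as np

def rpn_conv(F, filters, b):
    # Eq. (1): F' = delta(g * F + b), with ReLU as delta(.).
    # F: (w, h, n) feature map; filters: (c, a, a, n); b: scalar bias.
    # Stride 1, no padding; written with explicit loops for clarity, not speed.
    w, h, n = F.shape
    c, a, _, _ = filters.shape
    w_out, h_out = w - a + 1, h - a + 1
    F_prime = np.zeros((w_out, h_out, c))
    for i in range(w_out):
        for j in range(h_out):
            patch = F[i:i + a, j:j + a, :]                 # a x a x n window
            for k in range(c):
                F_prime[i, j, k] = np.sum(patch * filters[k]) + b
    return np.maximum(F_prime, 0.0)

F = np.random.randn(8, 8, 64)          # small stand-in for the shared feature map
g = np.random.randn(32, 3, 3, 64)      # c = 32 filters of size a = 3
print(rpn_conv(F, g, 0.1).shape)       # (6, 6, 32)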

For each position on \( F^{\prime} \), we consider \( k \) possible candidate box sizes to better detect objects of different sizes. Supposing \( W \) and \( H \) are the spatial width and height of \( F^{\prime} \), respectively, \( WHk \) candidate boxes can be obtained. The feature map \( F^{\prime} \) is delivered to two parallel fully connected layers: one decides whether there is an object in the candidate box; the other predicts the coordinates of the center point of the candidate box and its size, as shown by the two rightmost branches in Fig. 1(c). Therefore, for the \( k \) candidate boxes at each position, the classification layer outputs \( 2k \) probability scores evaluating whether an object is present, and the regression layer outputs the center point coordinates and the width and height of each candidate box. The weighted combination of the classification and regression loss functions is given in Eq. (2):

$$ L\left( {\left\{ {p_{i} } \right\},\left\{ {t_{i} } \right\}} \right) = \frac{1}{{N_{cls} }}\sum\limits_{i}^{k} {L_{cls} \left( {p_{i} ,p_{i}^{*} } \right)} + \lambda \frac{1}{{N_{reg} }}\sum\limits_{i}^{k} {p_{i}^{*} L_{reg} \left( {t_{i} ,t_{i}^{*} } \right)} $$
(2)

where \( p_{i} \) denotes the prediction result for candidate box \( i \); \( p_{i}^{*} = 1 \) if candidate box \( i \) is a positive sample, that is, an object exists in the box, and \( p_{i}^{*} = 0 \) otherwise, that is, the box is background. \( N_{cls} \) is the number of candidate boxes generated in a mini-batch, \( t_{i} \) is the size of candidate box \( i \), \( t_{i}^{*} \) is the corresponding ground-truth label, and \( \lambda \) is a hyper-parameter.

Since determining whether an object is in the candidate box is a binary classification problem, \( L_{cls} \) adopts the log loss commonly used for binary classification, given in Eq. (3):

$$ L_{cls} \left( {p_{i} ,p_{i}^{*} } \right) = - \left[ {p_{i}^{*} \log p_{i} + \left( {1 - p_{i}^{*} } \right)\log \left( {1 - p_{i} } \right)} \right] $$
(3)

\( L_{reg} \) uses the Smooth \( L_{1} \) loss, which measures the deviation between the predicted value and the ground-truth label; it is given in Eqs. (4) and (5):

$$ L_{reg} \left( {t_{i} ,t_{i}^{ * } } \right) = \sum {smooth_{L1} \left( {t_{i} - t_{i}^{ * } } \right)} $$
(4)
$$ smooth_{L1} \left( x \right) = \left\{ {\begin{array}{*{20}l} {0.5x^{2} } \hfill & {if\;\left| x \right| < 1} \hfill \\ {\left| x \right| - 0.5} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(5)
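
The following NumPy sketch puts Eqs. (2)-(5) together; the value of \( \lambda \) and the normalization of the regression term by the number of positive boxes are illustrative assumptions.

import numpy as np

def log_loss(p, p_star):
    # Eq. (3): binary cross-entropy of the objectness prediction.
    eps = 1e-12
    return -(p_star * np.log(p + eps) + (1.0 - p_star) * np.log(1.0 - p + eps))

def smooth_l1(x):
    # Eq. (5): 0.5 * x^2 if |x| < 1, |x| - 0.5 otherwise (elementwise).
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0):
    # Eq. (2): classification term averaged over all boxes plus the regression
    # term, which only counts positive boxes (p_star == 1), weighted by lambda.
    n_cls = len(p)
    n_reg = max(int(np.sum(p_star)), 1)
    l_cls = np.sum(log_loss(p, p_star)) / n_cls
    l_reg = np.sum(p_star[:, None] * smooth_l1(t - t_star)) / n_reg
    return l_cls + lam * l_reg

p = np.array([0.9, 0.2, 0.8, 0.1])        # predicted objectness scores
p_star = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth labels (1 = object)
t = np.random.randn(4, 4)                 # predicted box parameters
t_star = np.random.randn(4, 4)            # ground-truth box parameters
print(rpn_loss(p, p_star, t, t_star))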

3.2 Local Objects’ Region Features Extraction

Let \( L = \left\{ {r_{1} , \cdots ,r_{n} } \right\} \) be the set of generated candidate boxes containing objects. The candidate box set \( L \) is projected onto the convolutional feature map \( F \in R^{w \times h \times n} \), and local object region features are extracted directly from it, thereby avoiding the loss of image information caused by cropping or scaling each candidate box and reducing the time spent on repeated convolution operations [22]. Each candidate box \( r_{i} = \left( {x_{i} ,y_{i} } \right) \) is treated as a sample generated from the affective image, as shown by the orange rectangular box in Fig. 2(a), where \( x_{i} \) is a four-dimensional vector representing the center point coordinates, width, and height of the candidate box, and \( y_{i} \in \left\{ {0,1} \right\} \) is the sentiment label of the object in the box. For each sample, we divide the generated box into \( m \) different granularities to obtain semantic information at multiple levels, as shown in Fig. 2(b), where three granularities are applied. Then, max pooling is performed on each of the resulting sub-blocks to obtain a series of discriminative feature maps \( \left\{ {f_{1} ,f_{2} , \cdots ,f_{d} } \right\} \), where \( d \) is the number of sub-blocks; the calculation is shown in Eq. (6):

Fig. 2. Overview of local object region feature extraction.

$$ f_{j} = G_{max} \left( {b_{j} } \right) $$
(6)

where \( b_{j} \) denotes a sub-block after division, \( f_{j} \) denotes the feature map corresponding to sub-block \( b_{j} \), and \( G_{max} \left( \cdot \right) \) denotes the max pooling operation. Finally, the feature maps of all sub-blocks are summed to obtain a local feature vector \( L_{{f_{i} }} \) of fixed dimension, as expressed in Eq. (7):

$$ L_{{f_{i} }} = \sum\limits_{j = 1}^{d} {f_{j} } $$
(7)

Here \( L_{{f_{i} }} \) is the feature vector of one local region detected in an image; thus all detected local regions can be represented as a feature vector set \( \left\{ {L_{{f_{1} }} , \cdots ,L_{{f_{n} }} } \right\} \), where \( n \) is the number of detected local regions.
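
One plausible NumPy reading of Eqs. (6)-(7) is sketched below, assuming the box is already mapped to feature-map coordinates (given here as corner coordinates for simplicity), three granularities (1×1, 2×2, and 3×3 divisions), and channel-wise max pooling, so that summing the pooled sub-block features yields a fixed-length vector.

import numpy as np

def region_feature(F, box, grids=(1, 2, 3)):
    # Project a candidate box onto the shared feature map F (w, h, n), divide it
    # at several granularities, max-pool each sub-block (Eq. (6)) and sum the
    # pooled features (Eq. (7)) into one fixed-length vector L_f.
    x0, y0, x1, y1 = box                       # corner coords in feature-map units
    roi = F[x0:x1, y0:y1, :]
    w, h, n = roi.shape
    feature = np.zeros(n)
    for m in grids:                            # 1x1, 2x2 and 3x3 divisions
        xs = np.linspace(0, w, m + 1).astype(int)
        ys = np.linspace(0, h, m + 1).astype(int)
        for i in range(m):
            for j in range(m):
                block = roi[xs[i]:xs[i + 1], ys[j]:ys[j + 1], :]
                if block.size:
                    feature += block.max(axis=(0, 1))   # G_max over the sub-block
    return feature

F = np.random.randn(14, 14, 512)               # shared convolutional feature map
print(region_feature(F, (2, 3, 10, 12)).shape) # (512,)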

3.3 Global Feature Extraction

The global feature is an important component of the sentiment representation of an image: it typically captures the overall appearance of the image and the contextual information surrounding the objects in it. We use the VGGNet-16 shown in Fig. 3 to extract global features. VGGNet-16 consists of 5 convolutional blocks and 3 fully connected layers. This deep neural network, jointly developed by Oxford University and DeepMind, has a deeper network structure and a more unified configuration than ordinary convolutional neural networks; it allows more nonlinear transformations while reducing parameters, resulting in better feature extraction capability.

Fig. 3. Overview of VGGNet-16.

Specifically, we extract the global feature of an image from the fully connected layer \( fc7 \) of VGGNet-16, obtaining a 4096-dimensional feature vector denoted \( G_{f} \), as shown at the rightmost side of Fig. 3.
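
A minimal tf.keras sketch of this step (not the authors' original TensorFlow 1.3 implementation; in Keras the two 4096-d layers of VGG16 are named "fc1" and "fc2", with "fc2" corresponding to \( fc7 \)):

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

base = VGG16(weights="imagenet", include_top=True)
fc7_extractor = tf.keras.Model(inputs=base.input,
                               outputs=base.get_layer("fc2").output)

image = np.random.rand(1, 224, 224, 3) * 255.0   # stand-in for a real 224 x 224 image
G_f = fc7_extractor(preprocess_input(image))
print(G_f.shape)                                 # (1, 4096)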

3.4 Image Sentiment Classification

The global feature representation and the local object region features are \( G_{f} \) and \( \left\{ {L_{{f_{1} }} ,L_{{f_{2} }} , \cdots ,L_{{f_{n} }} } \right\} \), respectively. We select the top \( 5 \) detected objects to represent the important local region information. Each image can therefore be represented as a set of feature vectors \( U = \left\{ {G_{f} ,L_{{f_{1} }} , \cdots ,L_{{f_{n} }} } \right\} \). We fuse the two types of features as in Eq. (8):

$$ \varphi \left( U \right) = G_{f} \oplus L_{{f_{1} }} \oplus \cdots \oplus L_{{f_{n} }} $$
(8)

where \( \oplus \) denotes the concatenation of the two types of features, which are represented as tensors.

For supervised image sentiment classification, the role of the sentiment label in the training process cannot be ignored. We therefore assign each local region the same sentiment polarity label as its corresponding image. After the joint feature vector \( \varphi \left( U \right) \) is obtained, it is delivered to a fully connected layer and classified into the output categories by softmax. To measure the model loss, we use the cross-entropy loss function. The softmax layer maps the joint feature vector \( \varphi \left( U \right) \) to the output categories by assigning each a probability score \( q_{i} \). If the number of sentiment categories is \( s \), then \( q_{i} \) is given in Eq. (9) and the loss in Eq. (10):

$$ q_{i} = \frac{{\exp (\varphi \left( U \right)_{i} )}}{{\sum\nolimits_{j = 1}^{s} {\exp (\varphi \left( U \right)_{j} )} }},\quad i = 1, \ldots ,s $$
(9)
$$ \ell = - \sum\limits_{i} {h_{i} \log (q_{i} )} $$
(10)

where \( \ell \) is the cross-entropy loss of the network and \( h_{i} \) is the ground-truth sentiment label of the image.
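
The fusion and classification step of Eqs. (8)-(10) can be sketched in NumPy as follows; the 4096-d global and 512-d local feature dimensions and the fully connected weights are illustrative assumptions.

import numpy as np

def softmax(z):
    # Eq. (9): probability scores over the s sentiment categories.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(q, h):
    # Eq. (10): loss against the one-hot ground-truth label h.
    return -np.sum(h * np.log(q + 1e-12))

# Joint feature phi(U): the 4096-d global feature concatenated with five 512-d
# local features (Eq. (8)); a fully connected layer then maps it to s = 2 logits.
phi_U = np.concatenate([np.random.randn(4096)] +
                       [np.random.randn(512) for _ in range(5)])
W = np.random.randn(phi_U.size, 2) * 0.01        # hypothetical FC weights
q = softmax(phi_U @ W)
h = np.array([1.0, 0.0])                         # e.g. positive sentiment
print(q, cross_entropy(q, h))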

4 Experiments

In this section, we compare our method with methods that classify sentiment using only the global features of an image, to demonstrate the performance of COIS for visual sentiment analysis.

4.1 Datasets

We evaluate our method on two public datasets, TwitterI [13] and TwitterII [15]. TwitterI consists of 881 images collected from the social media platform Twitter, labeled with two sentiment polarity categories (positive and negative) through a crowd intelligence strategy. TwitterII, provided by You et al., contains 1269 images from Twitter labeled with sentiment polarity categories by AMT participants. Both datasets are split randomly into 80% training and 20% testing sets.
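
For reference, a hedged sketch of the 80/20 random split; the file names and label assignment below are hypothetical, and only the TwitterI dataset size is taken from the paper.

from sklearn.model_selection import train_test_split

image_paths = ["img_%04d.jpg" % i for i in range(881)]   # hypothetical file names
labels = [i % 2 for i in range(881)]                     # 0 = negative, 1 = positive (placeholder)

train_x, test_x, train_y, test_y = train_test_split(
    image_paths, labels, test_size=0.2, random_state=42, stratify=labels)
print(len(train_x), len(test_x))                         # about 704 / 177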

4.2 Experiment Settings

Our framework is implemented on Linux Ubuntu 14.04 with Python 2.7 and TensorFlow 1.3.0, using PyCharm as the development tool. All experiments are performed on an NVIDIA Tesla P100-PCIE GPU with 64 GB of memory. To conduct a fair comparison, we keep most of the settings the same as in previous studies [13, 14, 20, 23]. The input image size is 224 × 224, and each image is padded with 2 pixels of zeros on each side. To supervise the training of the local object region detector, we manually annotated the affective images with five object categories, such as people and vehicles, using the annotation tool ImageLab. After annotation, the dataset contains both sentiment labels and object detection labels (the center point coordinates, width, and height of each object's candidate box).

We train the model using a momentum optimizer with momentum 0.9. The learning rate of the convolutional layers is initialized to 0.001. We use the Dropout strategy [23] with a rate of 0.5 and L2 regularization to reduce over-fitting, and cross entropy as the loss function. The model is trained for 100 epochs. In addition, for the local object region features, we adjust the pooling layer of the detection branch: the pooling kernels adopt \( 3 \times 3 \), \( 2 \times 2 \), and \( 1 \times 1 \) granularities to fit the datasets used in this paper.
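
A sketch of this training configuration in tf.keras (not the authors' TensorFlow 1.3 code; the joint feature dimension, the L2 weight, and the batch size are assumptions) might look like:

import tensorflow as tf

inputs = tf.keras.Input(shape=(6656,))                       # joint feature phi(U), dim assumed
x = tf.keras.layers.Dropout(0.5)(inputs)                     # Dropout rate from Sect. 4.2
outputs = tf.keras.layers.Dense(
    2, activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)    # L2 weight assumed
classifier = tf.keras.Model(inputs, outputs)

classifier.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"])
# classifier.fit(train_features, train_labels, epochs=100, batch_size=32)  # batch size assumed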

4.3 Comparison Methods

For traditional methods, we compare with the SentiBank [13] model. For CNN-based methods, we compare with DeepSentiBank [14] and two classical deep learning models pre-trained on ImageNet and fine-tuned on the affective datasets: AlexNet [23] and VGGNet [20]. The compared methods are described as follows:

  • SentiBank: A visual concept detector library that detects the presence of 1,200 adjective noun pairs (ANPs) in an image, which are treated as a middle-level image representation for visual sentiment analysis [13].

  • DeepSentiBank: A visual sentiment concept classifier trained on large datasets using deep convolutional neural networks, an improved version of SentiBank [14].

  • ImageNet-AlexNet: The AlexNet architecture pre-trained on ImageNet and fine-tuned on the affective image dataset for visual sentiment analysis [23].

  • ImageNet-VGGNet16: The same idea as ImageNet-AlexNet, with the network replaced by VGGNet-16 [20].

  • Local regions-Net: A method that considers only local object region features, using Faster R-CNN and fully connected layers to learn sentiment representations.

4.4 Experiment Result

Table 1 shows the classification accuracy (%) of the proposed method and the comparison methods on the TwitterI and TwitterII datasets. Our proposed COIS method achieves 75.81% accuracy on TwitterI and 78.90% on TwitterII, compared with 66.63% and 65.93% for SentiBank, and it outperforms DeepSentiBank by about 5% and 8% on TwitterI and TwitterII, respectively. This suggests that the proposed COIS method can learn a more discriminative representation for visual sentiment analysis.

Table 1. Classification accuracy of different methods on TwitterI and TwitterII.

In addition, compared with ImageNet-AlexNet and ImageNet-VGGNet16, our method improves accuracy by about 10% on TwitterI and TwitterII with a similar parameter size. This result suggests the effectiveness of incorporating local object region representations.

5 Conclusions

Visual sentiment analysis is gaining more and more attention. Considering that the emotions evoked by an image come not only from the whole image but also from the local object regions within it, this paper presents a novel COIS method for visual sentiment analysis. In COIS, a Faster R-CNN model is used to detect object regions in an image, a fully connected network learns the sentiment representation of these local object regions, and this representation is integrated with the global feature of the image to obtain a more discriminative sentiment representation. The performance of the proposed method is evaluated on two real datasets, and the experimental results show that COIS outperforms methods that learn sentiment representations only from the whole image.

However, this study uses only the local regions of objects contained in images to enhance visual sentiment analysis, while other regions that do not contain objects are ignored. In future work, we will consider weakly supervised learning to discover more emotional regions in an image and better feature fusion strategies to further improve the performance of COIS.