Learning Deep Feature Fusion for Group Images Classification

Zhao, Wenting; Wang, Yunhong; Chen, Xunxun; Tang, Yuanyan; Liu, Qingjie

doi:10.1007/978-981-10-7302-1_47

Wenting Zhao¹⁶,
Yunhong Wang¹⁶,
Xunxun Chen¹⁷,
Yuanyan Tang¹⁸ &
…
Qingjie Liu¹⁶

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 772))

Included in the following conference series:

CCF Chinese Conference on Computer Vision

2695 Accesses

Abstract

With the rapid development of social media, people tend to post multiple images under the same message. These images, we call it group images, may have very different contents, however are highly correlated in semantic space, which refers to the same theme that can be understood by a reader, easily. Understanding images present in one group has potential applications such as recommendation, user analysis, etc. In this paper, we propose a new research topic beyond the traditional image classification that aims at classifying a group of images in social media into corresponding classes. To this end, we design an end-to-end network which accepts variable number of images as input and fuses features extracted from them for classification. The method are tested on two newly collected datasets from Microblog and compared with a baseline method. The experiment demonstrates the effectiveness of our method.

Wenting Zhao is the first author of this paper and is a graduate student who is pursuing her master degree in School of Computer Science and Engineering, Beihang University.

You have full access to this open access chapter, Download conference paper PDF

Exploiting the Relationship Between Visual and Textual Features in Social Networks for Image Classification with Zero-Shot Deep Learning

Sentiment Analysis for Images on Microblogging by Integrating Textual Information with Multiple Kernel Learning

Text-Image Sentiment Analysis

Keywords

1 Introduction

Image classification is a fundamental task in computer vision which aims at classifying images into corresponding categories in term of its scene or objects. Recent researches on this topic mainly focus on classifying a single image into its scene or object categories, such as works in [2, 8, 19, 21]. Recently, with the rapid development of social media and mobile devices, more and more people are willing to share ideas and experiences on the social media platforms, e.g. Twitter, Flicker, Microblog. Most of these messages posted by users contain not only text information but also some images which express emotions, opinions and social activities, etc. Obtaining information from these data is quite important for a wide range of applications, including advertisements, recommendation, etc. Most researches about these social messages now focus on analyzing text or extracting information from a single image. However, the fact is that one post or message usually consists of multiple images and sometimes there is no text messages. Therefore, lots of important information will be lost if only text or single image is used. We believe that images within one message or post should be processed in their entirety as they show similar semantic information and most of these have the same theme. Therefore, analyzing images in a group manner may help to better understand the message. Thus, how to simply use the relationship between a group of images to analyze the user’s behavior state or the semantic theme of the group images is a very interesting and challenging subject. To solve this, we do a lot of researches on multiple images processing.

Aside from the widely studied image or scene classification, one similar topic to ours is action recognition from video sequences [5, 7, 16], in which adjacent frames referring to the same action in a video clip can be assigned with a label indicating a specific action category. Another similar research topics are image co-segmentation and co-saliency detection, which aim at detecting foreground objects or salient regions from multiple images containing similar of same objects over simple background [3, 6, 9, 12, 14, 15]. All these researches as well as our’s take multiple images as input. However, our research is fundamentally different from that of the aforementioned researches in that multiple images posted under the same message show very complex contents. For example, one user post a group of images when he/she is attending a soccer game. The images may comprise of playground, audiences and soccer players, etc. They hardly contain the same objects. An example is shown in Fig. 1, from which we can see the images in the soccer games including playground, scoreboard and players, etc. It is unlikely that the same object appears in different images. Even if the same object appears in more than one images, it is still hard to understand the relationship between these images due to scale variations caused by the different angle or the distance while taking photos. Furthermore, the shooting time span is extensive and discontinuous. From Fig. 1 we can see that both the space and times of photoing these images within the same group are discrete.

In this paper, we put forward a two-step framework to classify the group images in the social media platforms. Firstly, a VGG-16 model [17] pre-trained on ImageNet is employed to extract high level semantic features from images. Then, we propose a feature map fusion strategy to combine all features extracted from different images together to perform classification. We elevate our method on two newly collected datasets GUD-5 and GUD-12, and compare it with a baseline method. Experimental results demonstrate that a group of images under the same message show strong correlation in semantic space and it is feasible to use deep learning method to classify them.

2 Group Image Classification

One image can show very rich information. People may tell different stories from the same image. However, we believe that, although having very different content, images under the same post or message in social media platforms are highly relevant in semantic space. They narrow down the semantic space so that people can easily understand what is the meaning of this message or post. Here, we demonstrate that, it is possible to classify multiple images in the same message into different categories. In this section, we describe the details of our method to classify the group images. The brief architecture of our method is shown in Fig. 2.

2.1 Feature Extraction

We use a pre-trained CNN on ImageNet to extract features from each image. In this work, VGG-16 [17] is employed. Let $\mathcal {I} = \{I_i\}_{i=1}^{n}$ be a set of images contained in one group, n is the number of images ranging from 2 to 9. After feature extraction, we have a set of high level features to represent semantic information of each image $f_{i} = \mathrm {VGG}(I_{i})$. It should be noted the number of images are not fixed in each group, thus the size of features varies with the number of images in each group. Because we need fixed-sized features to be classified, these features for each group should be further processed to have the same size. One way is using bag of visual words method [13] to encode features extracted from different images. Inspired by the spatial pooling method that used within the same feature map, in this paper, we proposed to use feature map pooling that can fuse feature maps of different images into a fixed-sized feature map.

2.2 Feature Map Fusion

One straightforward way to tackle varying size features is using bag of visual words method to fuse features extracted from multiple images into one single fix-sized vector [13]. Another method is using recurrent neural networks to encode information across frames [11]. Inspired by the spatial pooling used in CNN and the observation that strong activations in high level feature maps corresponding to object level information, we proposed to use feature map pooling to aggregate features extracted from different images.

After through VGG-16, feature maps for each image can be obtained. Similar to spatial pooling, there are three ways to aggregate these features, max pooling, mean pooling and subtraction pooling.

$$\begin{aligned} v{(k, h, w)}=\max \limits _{i = 1,...,n} f_{i}(k, h, w) \end{aligned}$$

(1)

where v(k, h, w) is the value of kth fused feature map at position (h, w), and $f_{i}(k, h, w)$ is value of kth feature map of image i at (h, w). Finally, arbitrary number feature maps are merged into one.

Mean pooling is also a commonly used aggregation strategy. It is used here to produce a single feature maps averaged over all the extracted feature maps, as follows:

$$\begin{aligned} v{(k, h, w)}=\frac{1}{n}\sum \limits _{i = 1,...,n} f_{i}(k, h, w) \end{aligned}$$

(2)

Subtraction pooling can be achieved similarly as follows:

$$\begin{aligned} v{(k, h, w)}= | f_{i}(k, h, w) - f_{i-1}(k,h,w)| \qquad (1<= i <=n) \end{aligned}$$

(3)

In this paper, we test all pooling methods.

2.3 Group Images Classification

After extraction of high level semantic information using a VGG-16 network, we adopt three different fusion methods as described in Sect. 2.2 to encode the features from multiple images into one representation. After fusing these feature maps, a following five-layer convolution and three-layer fully connected layers with softmax classifier is applied to perform the classification.

3 Experiments

3.1 Dataset

The existing image classification methods mainly focus on classifying a single image into different categories in term of the scene or objects it contains. The popular datasets for image classification are ImageNet [4], Places [20, 21], etc. To the best of our knowledge, there is no public dataset that can be used for testing our method. To evaluate our method and facilitate future researches on this topic, we build two datasets named GUD-5 and GUD-12 (GUD is a acronym for Group image Understanding Dataset), respectively. GUD-5 contains 5 classes, and GUD-12 is an extension of GUD-5 to have 12 classes. Each class in these two datasets is comprised of 100 groups and each group contains 2 to 9 images. The images are downloaded from Sina Microblog^{Footnote 1} by searching the social activity key words such as skiing, birthday party, etc. We only select the results which contain multiple images. Then, 9 students are asked to filter out the results by answering the question “Does this group of images show skiing?”. Some examples are showned in Fig. 1. 10% of these samples are used as testing set, and the remaining 90% are as training set.

3.2 Training

We use the VGG-16 network [17] pre-trained on ImageNet to extract features from each image in the group. Then we select high level feature maps as representation of this image and fuse them using three different strategies as described in Sect. 2.2. The following is a network with 5 convolutional and 3 fully-connected layers that used for group images classification. The network can be trained in an end-to-end manner. It was implemented in Tensorflow [1] and trained on a Nvidia GeoForce GTX 1080 GPU with a batch size of 128 and a momentum of 0.9 using SGD optimization method. The learning rate was set starting from 0.01 and decreased by every 50 epochs. The training will converge in 300 epochs.

3.3 Impacts of Fusion Layers

There are 13 convolutional layers and 3 full-connected layers in VGG-16. Which layer should be selected to fuse is undermined. We test performances of different layers to select the suitable layer to fuse. The simplest method is choosing the last fully-connected layer of the VGG-16 to represent the image. However, the experiments demonstrate a poor performance, as shown in Figs. 3 and 4. And compared with lower level features, we get the best results when the feature map sized $28 \times 28$ that is the 10th or 11th layer features. We can see the results are very poor when we fuse the low level features such as 5th and 6th layer features. This is because it is hard to get the global information when only extract texture features or color features from the group images. However, it is also difficult to fuse the sematic features if we extract too high level features. Therefore, good classification results can be obtained if we select the 10th or 11th layer features.

3.4 Impacts of Fusion Strategies

We evaluate performances of three fusion strategies, as shown in Figs. 3 and 4. It can be seen that, compared with the maximum fusion method, average fusion works most stable. Results obtained by subtracting fusion in GUD-5 and GUD-12 are very different. The reason behind this may be that average fusion can better balance the global information, subtracting can better capture the major difference information between group images. Global information can get better performance when the category is small. However, due to the interference of similar categories, average fusion will be difficult to obtain better result for large category. In this work, the difference comparison between images will be more important. Therefore, subtracting fusion can achieve better results for 12 classes.

3.5 Comparison with Baseline Method

To further demonstrate the effectiveness of the proposed method, we compare it with a baseline method i.e. bag of visual word (BOW). We use SIFT descriptor [10] to extract feature from each image and cluster SIFT features of all the training images into 1000 clusters. Using BOW model, each group can be represented as a 1000 dimensional vector. A linear SVM [18] is adopted to classify groups into corresponding categories. The experimental settings are the same to that of the proposed method, i.e. 10% of the dataset are used for testing and 90% are used as training set. We compare it with our method on the both datasets. The results are shown in Table 1. We can see that, comparing with the baseline, our method obtains a much higher results both on the two datasets. This proves that compared with the traditional processing method, CNN network has more powerful feature extraction ability and can better learn more high level semantic information. In this paper, fusing the high level features can make better use of the relationship between images than simply extracting low-level features from each image for classification.

Table 1. Comparisons between the proposed and BOW method on the two datasets

Full size table

3.6 Results Analysis

There may exist noise images in one group that do not agree with other images to have the same or similar theme. How to select the most relevant images to the theme of the group images and filter out irrelevant parts from the group images is a very important problem. In order to solve this, we train a classification network to remove these images and compare the results with that of the original one on the GDU-5 dataset. We use five-layer convolution and three-layer fully connected layers with softmax classifier to classify these images. We give each image in the same group the same label which is the theme of the group images and try to classify them. If it is difficult for the classified network to judge which topic the image should belong to, we will think this image has nothing to do with it’s group images’ theme and delete it from this group. Through the filter processing, we can get the images making largest contribution to the theme of the group and filter out irrelevant ones from the group. Figure 5 shows the results. As it can be seen, we can get better results after the filter processing. Even with these noisy images, our method can still have promising performance (Fig. 6).

Compared with the results on GUD-5 dataset, our method have poor results on GUD-12 dataset. This maybe because some classes in GUD-12 are very hard to be discriminated. Figure 7 shows the confusion matrix on GUD-12. From which we can see the classification results are very good on “soccer games” or “amusement park”. This may because these two classes have distinct theme and salient objects such as football, ferris wheel and so on. However for activities such as “barbecue” and “visiting aquarium”, it is hard to classify them because these activities are all with dense population and have few salient objects. At the same time, these subjects may contain more kinds of objects which are not very similar such as varieties of foods and water animals. These could also lead to the bad experimental result. Increasing training data may help to alleviate this problem.

4 Conclusion

We have proposed a new architecture for group images classification that may be useful for understanding images in social media platforms. We believe considering images under the same message in a entirety manner is better than processing them one by one, for images may show a strong correlated theme even they have very different contents. To this end, we propose a fusion method to fuse different features extracted from each image within the same group, and design a deep fusion network to perform classification. We test our method on two datasets collected from Chinese biggest microblog platform Sina Microblog. Experiments demonstrate that it is challenging yet possible to classify a group of images. In the future work, we shall work on generating the theme of a group images in an unsupervised manner.

Notes

1.
weibo.com.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
Antonio Torralba, R.F., Freeman, W.T.: 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1958–1970 (2008)
Article Google Scholar
Chang, K.Y., Liu, T.L., Lai, S.H.: From co-saliency to co-segmentation: an efficient and fully unsupervised energy minimization model. In: Computer Vision and Pattern Recognition, pp. 2129–2136 (2011)
Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Google Scholar
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)
Google Scholar
Joulin, A., Bach, F., Ponce, J.: Discriminative clustering for image co-segmentation. In: Computer Vision and Pattern Recognition, pp. 1943–1950 (2010)
Google Scholar
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Google Scholar
Li, H., Ngan, K.N.: A co-saliency model of image pairs. IEEE Trans. Image Process. 20, 3365–3375 (2011)
Article MathSciNet MATH Google Scholar
Liao, K., Liu, G., Hui, Y.: An improvement to the sift descriptor for image representation and matching. Pattern Recogn. Lett. 34, 1211–1220 (2013)
Article Google Scholar
McLaughlin, N., Martinez del Rincon, J., Miller, P.: Recurrent convolutional network for video-based person re-identification. In: Computer Vision and Pattern Recognition, pp. 1325–1334 (2016)
Google Scholar
Mukherjee, L., Singh, V., Dyer, C.R.: Half-integrality based algorithms for cosegmentation of images. In: Computer Vision and Pattern Recognition, pp. 2028–2035 (2009)
Google Scholar
Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. Comput. Vis. Image Underst. 150, 109–125 (2016)
Article Google Scholar
Rother, C., Kolmogorov, V., Minka, T., Blake, A.: Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs. In: Computer Vision and Pattern Recognition, pp. 993–1000 (2006)
Google Scholar
Rubinstein, M., Joulin, A., Kopf, J., Liu, C.: Unsupervised joint object discovery and segmentation in internet images. In: Computer Vision and Pattern Recognition, pp. 1939–1946 (2013)
Google Scholar
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Tsang, I.W., Kwok, J.T., Cheung, P.M.: Core vector machines: fast SVM training on very large data sets. J. Mach. Learn. Res. 6, 363–392 (2005)
MathSciNet MATH Google Scholar
Xiao, J., Ehinger, K.A., Hays, J., Torralba, A., Oliva, A.: Sun database: exploring a large collection of scene categories. Int. J. Comput. Vision 119(1), 3–22 (2016)
Article MathSciNet Google Scholar
Zhou, B., Khosla, A., Lapedriza, A., Torralba, A., Oliva, A.: Places: an image database for deep scene understanding. arXiv preprint arXiv:1610.02055 (2016)
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495 (2014)
Google Scholar

Download references

Acknowledgement

This work was supported in part by the Hong Kong, Macao, and Taiwan Science and Technology Cooperation Program of China under Grant L2015TGA9004. This work is also supported by research grant 008/2014/AMJ from Maco.

Author information

Authors and Affiliations

The State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Wenting Zhao, Yunhong Wang & Qingjie Liu
National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, 100029, China
Xunxun Chen
The Department of Computer and information Science, Faculty of Science and Technology, University of Macau, Taipa, 853, Macau
Yuanyan Tang

Authors

Wenting Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Yunhong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xunxun Chen
View author publications
You can also search for this author in PubMed Google Scholar
Yuanyan Tang
View author publications
You can also search for this author in PubMed Google Scholar
Qingjie Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qingjie Liu .

Editor information

Editors and Affiliations

Civil Aviation University of China, Tianjin, China
Jinfeng Yang
Tianjin University, Tianjin, China
Qinghua Hu
Nankai University, Tianjin, China
Ming-Ming Cheng
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Liang Wang
Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Huazhong University of Science and Technology, Wuhan, China
Xiang Bai
Xi’an Jiaotong University, Xi’an, China
Deyu Meng

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhao, W., Wang, Y., Chen, X., Tang, Y., Liu, Q. (2017). Learning Deep Feature Fusion for Group Images Classification. In: Yang, J., et al. Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol 772. Springer, Singapore. https://doi.org/10.1007/978-981-10-7302-1_47

Download citation

DOI: https://doi.org/10.1007/978-981-10-7302-1_47
Published: 30 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7301-4
Online ISBN: 978-981-10-7302-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics