Abstract
With the rapid development of social media, people tend to post multiple images under the same message. These images, which we call group images, may have very different contents, yet they are highly correlated in semantic space and refer to the same theme that a reader can easily understand. Understanding the images in one group has potential applications such as recommendation and user analysis. In this paper, we propose a new research topic beyond traditional image classification that aims at classifying a group of images in social media into corresponding classes. To this end, we design an end-to-end network which accepts a variable number of images as input and fuses the features extracted from them for classification. The method is tested on two newly collected datasets from Microblog and compared with a baseline method. The experiments demonstrate the effectiveness of our method.
Wenting Zhao is the first author of this paper and is a graduate student pursuing her master's degree in the School of Computer Science and Engineering, Beihang University.
1 Introduction
Image classification is a fundamental task in computer vision which aims at classifying images into corresponding categories in terms of their scenes or objects. Recent research on this topic mainly focuses on classifying a single image into its scene or object category, such as the works in [2, 8, 19, 21]. With the rapid development of social media and mobile devices, more and more people are willing to share ideas and experiences on social media platforms, e.g. Twitter, Flickr and Microblog. Most messages posted by users contain not only text but also images which express emotions, opinions, social activities, etc. Obtaining information from these data is important for a wide range of applications, including advertisement and recommendation. Most research on such social messages focuses on analyzing text or extracting information from a single image. However, one post or message usually consists of multiple images, and sometimes there is no text at all, so much important information is lost if only the text or a single image is used. We believe that the images within one message or post should be processed in their entirety, as they show similar semantic information and most of them share the same theme; analyzing images in a group manner may therefore help to better understand the message. Thus, how to use the relationship among a group of images to analyze a user's behavior or the semantic theme of the group is an interesting and challenging problem, and it is the focus of this work.
Aside from the widely studied image or scene classification, one topic similar to ours is action recognition from video sequences [5, 7, 16], in which adjacent frames referring to the same action in a video clip are assigned a label indicating a specific action category. Other related topics are image co-segmentation and co-saliency detection, which aim at detecting foreground objects or salient regions from multiple images containing similar or the same objects over simple backgrounds [3, 6, 9, 12, 14, 15]. All of these works, as well as ours, take multiple images as input. However, our problem is fundamentally different in that the multiple images posted under one message show very complex contents. For example, a user posts a group of images while attending a soccer game; the images may show the playground, the audience, the soccer players, etc., and they hardly contain the same objects. An example is shown in Fig. 1, in which the images of a soccer game include the playground, the scoreboard and the players. It is unlikely that the same object appears in different images, and even if it does appear in more than one image, it is still hard to relate the images due to scale variations caused by different shooting angles or distances. Furthermore, the shooting time span is long and discontinuous. From Fig. 1 we can see that both the locations and the times at which the images in one group were taken are discrete.
In this paper, we put forward a two-step framework to classify group images on social media platforms. First, a VGG-16 model [17] pre-trained on ImageNet is employed to extract high level semantic features from the images. Then, we propose a feature map fusion strategy that combines the features extracted from different images to perform classification. We evaluate our method on two newly collected datasets, GUD-5 and GUD-12, and compare it with a baseline method. Experimental results demonstrate that a group of images under the same message shows strong correlation in semantic space and that it is feasible to classify such groups with a deep learning method.
Examples of group images collected from Microblog. The images are retrieved from Sina Microblog using keywords such as "birthday", "cycling", etc. We collect two datasets: one contains 5 classes (we name it GUD-5, an abbreviation of Group image Understanding Dataset), and the other is an extension of GUD-5 with 12 classes, each of which consists of 100 groups. Because the maximum number of images under one message is 9, each group in our datasets contains 2 to 9 images. From these examples we can see that the contents of the images vary greatly and the backgrounds are complex, making the groups very difficult to classify.
2 Group Image Classification
One image can show very rich information, and people may tell different stories from the same image. However, we believe that, although having very different contents, images under the same post or message on social media platforms are highly relevant in semantic space. Together they narrow down the semantic space so that people can easily understand the meaning of the message or post. Here we demonstrate that it is possible to classify groups of images into their corresponding categories. In this section, we describe the details of our method for classifying group images; the overall architecture is shown in Fig. 2.
2.1 Feature Extraction
We use a CNN pre-trained on ImageNet to extract features from each image; in this work, VGG-16 [17] is employed. Let \(\mathcal {I} = \{I_i\}_{i=1}^{n}\) be the set of images contained in one group, where n is the number of images, ranging from 2 to 9. After feature extraction, we have a set of high level features representing the semantic information of each image, \(f_{i} = \mathrm {VGG}(I_{i})\). It should be noted that the number of images is not fixed across groups, so the number of extracted features varies from group to group. Because the classifier requires a fixed-sized input, these features must be further processed to have the same size. One way is to use the bag of visual words method [13] to encode the features extracted from different images. Inspired by the spatial pooling used within a single feature map, in this paper we propose feature map pooling, which fuses the feature maps of different images into one fixed-sized feature map.
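As a concrete illustration, the following Python/TensorFlow sketch shows one way this extraction step could be implemented for a whole group. The specific layer (block4_conv3, the 10th convolutional layer of VGG-16, which yields 28 × 28 feature maps for a 224 × 224 input) and the input size are assumptions for illustration, not details taken from the paper; the layer choice is discussed in Sect. 3.3.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sketch: extract 28x28x512 feature maps from a mid-level VGG-16
# layer (block4_conv3) for every image in a group.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block4_conv3").output)

def extract_group_features(image_paths):
    """Return an (n, 28, 28, 512) array of feature maps for one group."""
    images = []
    for path in image_paths:
        img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
        x = tf.keras.preprocessing.image.img_to_array(img)
        images.append(tf.keras.applications.vgg16.preprocess_input(x))
    batch = np.stack(images)                    # (n, 224, 224, 3)
    return extractor.predict(batch, verbose=0)  # (n, 28, 28, 512)
```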
2.2 Feature Map Fusion
One straightforward way to tackle a varying number of features is to use the bag of visual words method to fuse the features extracted from multiple images into a single fixed-sized vector [13]. Another is to use recurrent neural networks to encode information across frames [11]. Inspired by the spatial pooling used in CNNs and by the observation that strong activations in high level feature maps correspond to object level information, we propose feature map pooling to aggregate the features extracted from different images.
After passing through VGG-16, the feature maps of each image are obtained. Similar to spatial pooling, there are three ways to aggregate these feature maps: max pooling, mean pooling and subtraction pooling. Max pooling keeps, at every position, the strongest activation across all images in the group:

$$v(k, h, w) = \max_{i = 1, \dots, n} f_{i}(k, h, w),$$

where v(k, h, w) is the value of the kth fused feature map at position (h, w), and \(f_{i}(k, h, w)\) is the value of the kth feature map of image i at (h, w). In this way, an arbitrary number of feature maps is merged into one.
Mean pooling is another commonly used aggregation strategy. It produces a single set of feature maps averaged over all the extracted feature maps:

$$v(k, h, w) = \frac{1}{n}\sum_{i=1}^{n} f_{i}(k, h, w).$$
Subtraction pooling is defined in a similar per-position manner, using differences between the feature maps of the images in a group instead of their maximum or mean.
In this paper, we test all pooling methods.
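For concreteness, a minimal NumPy sketch of the fusion step is given below. Max and mean pooling follow the equations above; since the exact form of subtraction pooling is not spelled out here, the sketch uses the element-wise range (max minus min) across the group purely as an illustrative assumption.

```python
import numpy as np

def fuse_feature_maps(features, mode="mean"):
    """Fuse per-image feature maps of shape (n, H, W, K) into one (H, W, K) map.

    `features` stacks the VGG-16 feature maps of all n images in a group.
    """
    if mode == "max":          # strongest activation per position
        return features.max(axis=0)
    if mode == "mean":         # average activation per position
        return features.mean(axis=0)
    if mode == "subtraction":  # assumed variant: element-wise range across images
        return features.max(axis=0) - features.min(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: fuse the feature maps of a group of 4 images with mean pooling.
group = np.random.rand(4, 28, 28, 512).astype(np.float32)
fused = fuse_feature_maps(group, mode="mean")   # shape (28, 28, 512)
```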
2.3 Group Images Classification
After extracting high level semantic information with the VGG-16 network, we adopt the three fusion methods described in Sect. 2.2 to encode the features from multiple images into one representation. The fused feature maps are then fed to a classification network consisting of five convolutional layers and three fully connected layers with a softmax classifier.
The network for classifying group images. First, the VGG-16 network pre-trained on ImageNet extracts feature maps from each image. Then, one of three fusion strategies is applied to encode all the feature maps into a fixed-sized representation. Finally, a classification network with 5 convolutional layers and 3 fully connected layers performs the classification.
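The paper does not list the exact filter sizes or channel widths of this classification head, so the Keras sketch below should be read as one plausible instantiation; all kernel sizes and channel counts are assumptions, and only the overall 5-conv / 3-FC / softmax structure comes from the text.

```python
import tensorflow as tf

def build_classifier(num_classes, fused_shape=(28, 28, 512)):
    """5 conv layers + 3 fully connected layers + softmax over the fused map."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=fused_shape),
        tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_classifier(num_classes=12)  # e.g. for GUD-12
```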
3 Experiments
3.1 Dataset
Existing image classification methods mainly focus on classifying a single image into categories defined by the scene or objects it contains, and popular datasets such as ImageNet [4] and Places [20, 21] follow this setting. To the best of our knowledge, there is no public dataset suitable for testing our method. To evaluate our method and facilitate future research on this topic, we build two datasets named GUD-5 and GUD-12 (GUD is an acronym for Group image Understanding Dataset). GUD-5 contains 5 classes, and GUD-12 extends it to 12 classes. Each class in the two datasets comprises 100 groups, and each group contains 2 to 9 images. The images are downloaded from Sina Microblog by searching for social activity keywords such as skiing, birthday party, etc., and only results containing multiple images are kept. Then, 9 students filter the results by answering questions of the form "Does this group of images show skiing?". Some examples are shown in Fig. 1. 10% of the samples are used as the testing set, and the remaining 90% as the training set.
3.2 Training
We use the VGG-16 network [17] pre-trained on ImageNet to extract features from each image in a group, select high level feature maps as the representation of each image, and fuse them using the three strategies described in Sect. 2.2. The fused maps are classified by a network with 5 convolutional and 3 fully connected layers, and the whole model can be trained end-to-end. It was implemented in TensorFlow [1] and trained on an NVIDIA GeForce GTX 1080 GPU with a batch size of 128, using SGD with a momentum of 0.9. The learning rate starts at 0.01 and is decreased every 50 epochs; training converges within 300 epochs.
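A hedged TensorFlow/Keras sketch of this training setup is shown below. The decay factor of 0.1 per 50 epochs and the use of cross-entropy loss are assumptions, since the text only states that the learning rate is decreased every 50 epochs; `train_fused` and the other data names are placeholders.

```python
import tensorflow as tf

# Assumed step decay: multiply the learning rate by 0.1 every 50 epochs.
def lr_schedule(epoch, lr, initial_lr=0.01, decay=0.1, step=50):
    return initial_lr * (decay ** (epoch // step))

model = build_classifier(num_classes=12)          # from the Sect. 2.3 sketch
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# `train_fused` / `train_labels` stand for the fused feature maps and group
# labels produced by the extraction and fusion steps above.
# model.fit(train_fused, train_labels, batch_size=128, epochs=300,
#           validation_data=(test_fused, test_labels),
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```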
3.3 Impacts of Fusion Layers
There are 13 convolutional layers and 3 fully connected layers in VGG-16, and it is not obvious which layer should be used for fusion. We therefore test the performance obtained when fusing at different layers. The simplest choice is to represent each image by the last fully connected layer of VGG-16; however, this gives poor performance, as shown in Figs. 3 and 4. The best results are obtained when the fused feature maps are of size \(28 \times 28\), i.e. the features of the 10th or 11th convolutional layer. The results are also very poor when we fuse low level features such as those of the 5th and 6th layers, because texture or color features alone cannot capture the global information of a group. On the other hand, features that are too high level are difficult to fuse semantically. Therefore, good classification results are obtained when the 10th or 11th layer features are selected.
3.4 Impacts of Fusion Strategies
We evaluate the three fusion strategies, as shown in Figs. 3 and 4. Compared with max fusion, mean fusion is the most stable, while the results of subtraction fusion on GUD-5 and GUD-12 differ greatly. A possible explanation is that mean fusion better balances the global information of a group, whereas subtraction fusion better captures the major differences between the images. Global information gives better performance when the number of categories is small; with more categories, interference between similar classes makes it harder for mean fusion to do well, and the differences between images become more important. Therefore, subtraction fusion achieves better results on the 12-class dataset.
3.5 Comparison with Baseline Method
To further demonstrate the effectiveness of the proposed method, we compare it with a baseline, the bag of visual words (BoW) model. We use the SIFT descriptor [10] to extract features from each image and cluster the SIFT features of all training images into 1000 clusters. With the BoW model, each group is then represented as a 1000-dimensional vector, and a linear SVM [18] is used to classify the groups into their categories. The experimental settings are the same as those of the proposed method, i.e. 10% of the dataset is used for testing and 90% for training. The results on both datasets are shown in Table 1. Our method obtains much higher accuracy than the baseline on both datasets, which suggests that, compared with this traditional pipeline, the CNN has a stronger feature extraction ability and learns higher level semantic information. Fusing high level features exploits the relationship between images better than simply extracting low level features from each image for classification.
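The baseline could be reproduced roughly as in the sketch below, using OpenCV SIFT, k-means and a linear SVM; the keypoint-based sampling, the histogram normalization and the variable names are implementation choices not specified in the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

def sift_descriptors(image_path):
    """Extract SIFT descriptors from one image (keypoint-based sampling assumed)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.zeros((0, 128), np.float32)

def group_histogram(image_paths, kmeans, k=1000):
    """Pool the visual words of all images in a group into one BoW histogram."""
    hist = np.zeros(k, np.float32)
    for path in image_paths:
        desc = sift_descriptors(path)
        if len(desc):
            words = kmeans.predict(desc)
            hist += np.bincount(words, minlength=k)
    return hist / max(hist.sum(), 1.0)   # L1-normalized (assumption)

# Training outline (paths and labels are placeholders for the GUD datasets):
# all_desc = np.vstack([sift_descriptors(p) for p in all_training_image_paths])
# kmeans = MiniBatchKMeans(n_clusters=1000).fit(all_desc)
# X = np.stack([group_histogram(g, kmeans) for g in train_groups])
# clf = LinearSVC().fit(X, train_labels)
```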
3.6 Results Analysis
A group may contain noise images that do not share the theme of the other images. How to select the images most relevant to the theme of a group and filter out the irrelevant ones is an important problem. To address it, we train a classification network to remove such images and compare the results with those of the original groups on the GUD-5 dataset. The network uses five convolutional layers and three fully connected layers with a softmax classifier and classifies the images individually, where each image is given the label of its group's theme. If the classifier cannot confidently decide which topic an image belongs to, we regard the image as unrelated to the theme of its group and delete it. Through this filtering we keep the images that contribute most to the group's theme and remove the irrelevant ones. Figure 5 shows the results: filtering improves performance, and even with the noisy images our method still achieves promising results (Fig. 6).
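The filtering rule described above can be phrased as a simple confidence threshold on the per-image classifier. The sketch below assumes a softmax-confidence criterion and a threshold of 0.5, neither of which is specified in the paper.

```python
import numpy as np

def filter_group(image_features, per_image_model, threshold=0.5):
    """Drop images whose per-image prediction is too uncertain (assumed rule).

    `per_image_model` is a classifier trained on single images, each labelled
    with its group's theme; `image_features` has shape (n, H, W, K).
    """
    probs = per_image_model.predict(image_features, verbose=0)  # (n, num_classes)
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    # Never empty the group entirely: keep at least the most confident image.
    if not keep.any():
        keep[np.argmax(confidence)] = True
    return image_features[keep]
```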
Compared with the results on GUD-5, our method performs worse on GUD-12. This may be because some classes in GUD-12 are very hard to discriminate. Figure 7 shows the confusion matrix on GUD-12, from which we can see that classification works well for "soccer games" and "amusement park"; these two classes have distinct themes and salient objects such as a football or a ferris wheel. For activities such as "barbecue" and "visiting aquarium", however, classification is harder because such scenes all contain dense crowds and few salient objects. At the same time, these subjects may contain many dissimilar kinds of objects, such as varieties of food or water animals, which also degrades the results. Increasing the training data may help to alleviate this problem.
4 Conclusion
We have proposed a new architecture for group image classification that may be useful for understanding images on social media platforms. We believe that considering the images under the same message as a whole is better than processing them one by one, because the images often share a strongly correlated theme even when their contents differ greatly. To this end, we propose a fusion method that combines the features extracted from each image within the same group and design a deep fusion network to perform classification. We test our method on two datasets collected from Sina Microblog, the largest Chinese microblog platform. Experiments demonstrate that classifying a group of images is challenging yet possible. In future work, we shall investigate generating the theme of a group of images in an unsupervised manner.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1958–1970 (2008)
Chang, K.Y., Liu, T.L., Lai, S.H.: From co-saliency to co-segmentation: an efficient and fully unsupervised energy minimization model. In: Computer Vision and Pattern Recognition, pp. 2129–2136 (2011)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)
Joulin, A., Bach, F., Ponce, J.: Discriminative clustering for image co-segmentation. In: Computer Vision and Pattern Recognition, pp. 1943–1950 (2010)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Li, H., Ngan, K.N.: A co-saliency model of image pairs. IEEE Trans. Image Process. 20, 3365–3375 (2011)
Liao, K., Liu, G., Hui, Y.: An improvement to the SIFT descriptor for image representation and matching. Pattern Recogn. Lett. 34, 1211–1220 (2013)
McLaughlin, N., Martinez del Rincon, J., Miller, P.: Recurrent convolutional network for video-based person re-identification. In: Computer Vision and Pattern Recognition, pp. 1325–1334 (2016)
Mukherjee, L., Singh, V., Dyer, C.R.: Half-integrality based algorithms for cosegmentation of images. In: Computer Vision and Pattern Recognition, pp. 2028–2035 (2009)
Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. Comput. Vis. Image Underst. 150, 109–125 (2016)
Rother, C., Kolmogorov, V., Minka, T., Blake, A.: Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs. In: Computer Vision and Pattern Recognition, pp. 993–1000 (2006)
Rubinstein, M., Joulin, A., Kopf, J., Liu, C.: Unsupervised joint object discovery and segmentation in internet images. In: Computer Vision and Pattern Recognition, pp. 1939–1946 (2013)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Tsang, I.W., Kwok, J.T., Cheung, P.M.: Core vector machines: fast SVM training on very large data sets. J. Mach. Learn. Res. 6, 363–392 (2005)
Xiao, J., Ehinger, K.A., Hays, J., Torralba, A., Oliva, A.: Sun database: exploring a large collection of scene categories. Int. J. Comput. Vision 119(1), 3–22 (2016)
Zhou, B., Khosla, A., Lapedriza, A., Torralba, A., Oliva, A.: Places: an image database for deep scene understanding. arXiv preprint arXiv:1610.02055 (2016)
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495 (2014)
Acknowledgement
This work was supported in part by the Hong Kong, Macao, and Taiwan Science and Technology Cooperation Program of China under Grant L2015TGA9004. This work was also supported by research grant 008/2014/AMJ from Macao.