1 Introduction

Image classification is a fundamental task in computer vision that aims to assign images to categories according to their scenes or objects. Most recent research on this topic focuses on classifying a single image into scene or object categories, as in [2, 8, 19, 21]. Meanwhile, with the rapid development of social media and mobile devices, more and more people share ideas and experiences on social media platforms, e.g. Twitter, Flickr, Microblog. Most messages posted by users contain not only text but also images expressing emotions, opinions, social activities, etc. Extracting information from these data is important for a wide range of applications, including advertising and recommendation. Existing research on such social messages mostly analyzes the text or extracts information from a single image. However, a post or message usually contains multiple images, and sometimes no text at all, so much important information is lost if only the text or a single image is used. We believe that the images within one message or post should be processed as a whole, since they convey similar semantic information and usually share the same theme; analyzing them as a group may therefore lead to a better understanding of the message. How to exploit the relationship among a group of images to infer the user's activity or the semantic theme of the group is thus an interesting and challenging problem, which we study in this paper.

Aside from the widely studied image and scene classification, a related topic is action recognition from video sequences [5, 7, 16], in which adjacent frames depicting the same action in a video clip are assigned a label indicating a specific action category. Other related topics are image co-segmentation and co-saliency detection, which aim to detect foreground objects or salient regions from multiple images containing similar or identical objects over simple backgrounds [3, 6, 9, 12, 14, 15]. All of these tasks, like ours, take multiple images as input. However, our problem is fundamentally different in that the multiple images posted under the same message show very complex contents. For example, a user attending a soccer game may post a group of images comprising the playground, the audience, the players, etc., which hardly contain the same objects. An example is shown in Fig. 1: the images of a soccer game include the playground, the scoreboard, and the players, and it is unlikely that the same object appears in different images. Even when the same object does appear in more than one image, the relationship between the images is still hard to establish owing to scale variations caused by different shooting angles and distances. Furthermore, the shooting time span is long and discontinuous: as Fig. 1 shows, both the locations and the times at which the images of one group were taken are discrete.

In this paper, we put forward a two-step framework to classify group images on social media platforms. First, a VGG-16 model [17] pre-trained on ImageNet is employed to extract high-level semantic features from the images. Then, we propose a feature map fusion strategy that combines the features extracted from different images to perform classification. We evaluate our method on two newly collected datasets, GUD-5 and GUD-12, and compare it with a baseline method. Experimental results demonstrate that the images grouped under one message show strong correlation in semantic space and that it is feasible to classify them with a deep learning method.

Fig. 1.

Examples of group images collected from microblog. The images were retrieved from Sina Microblog using keywords such as "birthday" and "cycling". We collect two datasets: one contains 5 classes (we name it GUD-5, an abbreviation of Group image Understanding Dataset); the other, GUD-12, extends GUD-5 to 12 classes, each consisting of 100 groups. Because the maximum number of images under one message is 9, each group in our datasets contains 2 to 9 images. As these examples show, the contents of the images vary greatly and the backgrounds are complex, which makes the groups very difficult to classify.

2 Group Image Classification

A single image can convey very rich information, and different people may tell different stories from the same image. We believe, however, that images under the same post or message on a social media platform, although very different in content, are highly relevant in semantic space. Together they narrow down the semantic space, so that people can easily understand the meaning of the message or post. Here we demonstrate that it is possible to classify the multiple images of one message into a category. In this section, we describe the details of our method for classifying group images. The overall architecture is shown in Fig. 2.

2.1 Feature Extraction

We use a CNN pre-trained on ImageNet to extract features from each image; in this work, VGG-16 [17] is employed. Let \(\mathcal {I} = \{I_i\}_{i=1}^{n}\) be the set of images contained in one group, where n is the number of images and ranges from 2 to 9. After feature extraction, we obtain a set of high-level features representing the semantic information of each image, \(f_{i} = \mathrm {VGG}(I_{i})\). Note that the number of images is not fixed across groups, so the size of the extracted features varies from group to group. Because the classifier requires fixed-size input, the features of each group must be further processed into a common size. One option is the bag of visual words method [13], which encodes the features extracted from the different images. Inspired by the spatial pooling applied within a single feature map, in this paper we propose feature map pooling, which fuses the feature maps of different images into one fixed-size feature map.
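
As a minimal illustration (not necessarily the authors' exact pipeline), the high-level feature maps can be obtained from a pre-trained VGG-16 with Keras-style TensorFlow. The choice of `block4_conv3` here is an assumption; it yields \(28 \times 28\) maps for \(224 \times 224\) inputs, matching the fusion layer discussed in Sect. 3.3.

```python
import numpy as np
import tensorflow as tf

# VGG-16 pre-trained on ImageNet, without the fully-connected head.
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False)

# Expose an intermediate layer; "block4_conv3" gives 28x28x512 maps for
# 224x224 inputs (the exact layer index used in the paper is assumed).
extractor = tf.keras.Model(inputs=vgg.input,
                           outputs=vgg.get_layer("block4_conv3").output)

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3), with n in [2, 9]."""
    x = tf.keras.applications.vgg16.preprocess_input(np.asarray(images))
    return extractor(x).numpy()  # shape (n, 28, 28, 512)
```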

2.2 Feature Map Fusion

One straightforward way to handle features of varying size is the bag of visual words method, which fuses the features extracted from multiple images into a single fixed-size vector [13]. Another is to use recurrent neural networks to encode information across frames [11]. Inspired by the spatial pooling used in CNNs, and by the observation that strong activations in high-level feature maps correspond to object-level information, we propose feature map pooling to aggregate the features extracted from different images.

After passing through VGG-16, the feature maps of each image are obtained. Analogous to spatial pooling, there are three ways to aggregate them: max pooling, mean pooling, and subtraction pooling. Max pooling is defined as

$$\begin{aligned} v(k, h, w)=\max \limits _{i = 1,\ldots ,n} f_{i}(k, h, w) \end{aligned}$$
(1)

where \(v(k, h, w)\) is the value of the kth fused feature map at position (h, w), and \(f_{i}(k, h, w)\) is the value of the kth feature map of image i at (h, w). In this way, an arbitrary number of feature maps is merged into one.

Mean pooling is another commonly used aggregation strategy. Here it produces a single feature map averaged over all the extracted feature maps, as follows:

$$\begin{aligned} v(k, h, w)=\frac{1}{n}\sum \limits _{i = 1}^{n} f_{i}(k, h, w) \end{aligned}$$
(2)

Subtraction pooling likewise produces a single map, by accumulating the absolute differences between consecutive feature maps:

$$\begin{aligned} v(k, h, w)=\sum \limits _{i = 2}^{n} \left| f_{i}(k, h, w) - f_{i-1}(k, h, w) \right| \end{aligned}$$
(3)

In this paper, we evaluate all three pooling methods.
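
For concreteness, here is a NumPy sketch of the three fusion strategies, assuming the per-image feature maps are stacked into an array of shape (n, H, W, K). Reducing the consecutive absolute differences by summation follows the reading of Eq. (3) above.

```python
import numpy as np

def fuse(feature_maps, method="max"):
    """feature_maps: array of shape (n, H, W, K), one map stack per image.
    Returns a single fused map of shape (H, W, K)."""
    if method == "max":       # Eq. (1): element-wise maximum over images
        return feature_maps.max(axis=0)
    if method == "mean":      # Eq. (2): element-wise average over images
        return feature_maps.mean(axis=0)
    if method == "subtract":  # Eq. (3): sum of absolute differences of
        diffs = np.abs(np.diff(feature_maps, axis=0))  # consecutive images
        return diffs.sum(axis=0)
    raise ValueError(f"unknown fusion method: {method}")
```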

2.3 Group Image Classification

After extracting high-level semantic information with the VGG-16 network, we adopt the three fusion methods described in Sect. 2.2 to encode the features from multiple images into one representation. The fused feature maps are then fed into a classification network of five convolutional layers and three fully connected layers with a softmax classifier.
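
The text specifies only the layer counts, so the following Keras-style sketch shows one plausible instantiation of the 5-conv + 3-FC classification head; the filter counts, kernel sizes, and fully connected widths are assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 12  # 5 for GUD-5

# Classification head applied to the fused 28x28x512 feature map.
# The paper fixes only 5 conv layers, 3 FC layers and a softmax;
# every width and kernel size below is an assumption.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 512)),
    tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```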

Fig. 2.

The network for classifying group images. First, we use the VGG-16 network pre-trained on ImageNet to extract the feature maps. Then, one of three fusion strategies is applied to encode all the feature maps into a fixed-size representation. Finally, a classification network with 5 convolutional layers and 3 fully connected layers performs the classification.

3 Experiments

3.1 Dataset

Existing image classification methods mainly classify a single image into categories according to the scene or objects it contains; popular datasets include ImageNet [4] and Places [20, 21]. To the best of our knowledge, no public dataset is suitable for testing our method. To evaluate it and to facilitate future research on this topic, we build two datasets named GUD-5 and GUD-12 (GUD is an acronym for Group image Understanding Dataset). GUD-5 contains 5 classes, and GUD-12 extends it to 12 classes. Each class comprises 100 groups, and each group contains 2 to 9 images. The images were downloaded from Sina Microblog by searching for social activity keywords such as "skiing" and "birthday party", keeping only the results that contain multiple images. Nine students then filtered the results by answering questions such as "Does this group of images show skiing?". Some examples are shown in Fig. 1. We use 10% of the samples as the testing set and the remaining 90% as the training set.

3.2 Training

We use the VGG-16 network [17] pre-trained on ImageNet to extract features from each image in a group, select high-level feature maps as the image representation, and fuse them using one of the three strategies described in Sect. 2.2. The fused maps are fed into the network with 5 convolutional and 3 fully connected layers used for group image classification. The whole network can be trained end-to-end. It was implemented in TensorFlow [1] and trained on an NVIDIA GeForce GTX 1080 GPU with a batch size of 128, using SGD with a momentum of 0.9. The learning rate started at 0.01 and was decreased every 50 epochs. Training converges within 300 epochs.
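
A hedged sketch of this training setup is given below, continuing from the `classifier` model sketched in Sect. 2.3; the learning rate decay factor of 0.1 and the dummy data are assumptions, since the text does not specify them.

```python
import numpy as np
import tensorflow as tf

# Dummy arrays stand in for the fused feature maps and group labels.
fused_maps = np.random.rand(90, 28, 28, 512).astype("float32")
labels = np.random.randint(0, 12, size=90)

# Learning rate starts at 0.01 and is decreased every 50 epochs;
# the decay factor of 0.1 is an assumption (the paper gives none).
def lr_schedule(epoch):
    return 0.01 * (0.1 ** (epoch // 50))

# `classifier` is the 5-conv + 3-FC head sketched in Sect. 2.3.
classifier.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
classifier.fit(fused_maps, labels, batch_size=128, epochs=300,
               callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```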

3.3 Impacts of Fusion Layers

VGG-16 has 13 convolutional layers and 3 fully connected layers, and which layer to fuse at is not obvious a priori. We therefore test the performance of fusing at different layers. The simplest choice is to represent each image by the last fully connected layer of VGG-16; however, as Figs. 3 and 4 show, this performs poorly. The best results are obtained when fusing the \(28 \times 28\) feature maps, i.e., the 10th or 11th layer features. Performance is also very poor when fusing low-level features such as those of the 5th and 6th layers, because texture and color features alone carry little of the global information of the group. Conversely, if the features are too high-level, their semantics become difficult to fuse. Good classification results are therefore obtained by selecting the 10th or 11th layer features.

3.4 Impacts of Fusion Strategies

We evaluate the three fusion strategies, with results shown in Figs. 3 and 4. Compared with max fusion, mean fusion is the most stable. The results of subtraction fusion differ considerably between GUD-5 and GUD-12. A possible explanation is that mean fusion better balances the global information, whereas subtraction fusion better captures the main differences between the images of a group. Global information yields better performance when the number of categories is small, but with more categories the interference of similar classes makes it hard for mean fusion to do well, and comparing differences between images becomes more important. Accordingly, subtraction fusion achieves better results on the 12-class dataset.

Fig. 3.

Performance of our method with different fusion strategies and fusion layers on the GUD-5 dataset.

3.5 Comparison with Baseline Method

To further demonstrate the effectiveness of the proposed method, we compare it with a bag of visual words (BoW) baseline. We use the SIFT descriptor [10] to extract features from each image and cluster the SIFT features of all training images into 1000 visual words. With the BoW model, each group is then represented as a 1000-dimensional vector, and a linear SVM [18] is adopted to classify the groups into the corresponding categories. The experimental settings are the same as for the proposed method, i.e., 10% of each dataset is used for testing and 90% for training. The results on both datasets are shown in Table 1. Our method achieves much higher accuracy than the baseline on both datasets, which shows that, compared with the traditional pipeline, the CNN has a stronger feature extraction ability and learns higher-level semantic information. Fusing high-level features exploits the relationship between the images of a group better than simply extracting low-level features from each image for classification.
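
One way to implement this baseline, assuming OpenCV's SIFT and scikit-learn (the authors' exact implementation is not specified), is sketched below; `train_paths`, `train_groups`, and `train_labels` are placeholders for the dataset.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def sift_descriptors(image_paths):
    """Stack the 128-d SIFT descriptors of a list of image files."""
    descs = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, d = sift.detectAndCompute(gray, None)
        if d is not None:
            descs.append(d)
    return np.vstack(descs)

# 1) Cluster descriptors of all training images into 1000 visual words.
codebook = MiniBatchKMeans(n_clusters=1000).fit(sift_descriptors(train_paths))

def bow_histogram(group_paths):
    """Represent a whole image group as a 1000-d visual-word histogram."""
    words = codebook.predict(sift_descriptors(group_paths))
    hist = np.bincount(words, minlength=1000).astype("float32")
    return hist / max(hist.sum(), 1.0)

# 2) Train a linear SVM on the group-level histograms.
X = np.stack([bow_histogram(g) for g in train_groups])
svm = LinearSVC().fit(X, train_labels)
```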

Fig. 4.

Performance of our method with different fusion strategies and fusion layers on the GUD-12 dataset.

Fig. 5.

Classification performance with and without the noise images. Even with much noise, our method still classifies group images with high accuracy.

Table 1. Comparison between the proposed method and the BoW baseline on the two datasets

3.6 Results Analysis

A group may contain noise images that do not share the theme of the other images. Selecting the images most relevant to the theme of a group and filtering out the irrelevant ones is therefore an important problem. To address it, we train a classification network to remove such images and compare the results with those on the original GUD-5 dataset. The network consists of five convolutional layers and three fully connected layers with a softmax classifier, as shown in Fig. 6. Each image in a group is given the same label, namely the theme of the group, and the network is trained to classify the individual images. If the network cannot confidently assign an image to its group's topic, we regard the image as unrelated to the group's theme and delete it from the group. Through this filtering, we keep the images that contribute most to the theme of the group and discard the irrelevant ones. Figure 5 shows the results: filtering improves performance, while even with the noise images our method still performs promisingly.
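
A minimal sketch of this filtering step is given below, assuming a per-image classifier that outputs softmax probabilities; the confidence threshold of 0.5 is an assumption, since the paper does not give one.

```python
import numpy as np

def filter_group(images, image_classifier, group_label, threshold=0.5):
    """Drop images the per-image classifier cannot confidently assign
    to the group's theme. `image_classifier` returns softmax probabilities
    of shape (n, num_classes); the 0.5 threshold is an assumption."""
    probs = image_classifier(np.asarray(images))
    keep = probs[:, group_label] >= threshold
    return [img for img, k in zip(images, keep) if k]
```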

Fig. 6.

The network for filtering irrelevant images out of a group. A classification network with 5 convolutional layers and 3 fully connected layers identifies the images that contribute most to the theme of the group.

Compared with GUD-5, our method performs worse on GUD-12. This may be because some classes in GUD-12 are very hard to discriminate. Figure 7 shows the confusion matrix on GUD-12. Classification is very accurate for "soccer games" and "amusement park", probably because these two classes have a distinct theme and salient objects such as the football or the Ferris wheel. Activities such as "barbecue" and "visiting aquarium", however, are hard to classify: they all involve dense crowds, have few salient objects, and may contain many dissimilar kinds of objects, such as varieties of food or of aquatic animals. Increasing the training data may help to alleviate this problem.

Fig. 7.

Confusion matrix on GUD-12 dataset.

4 Conclusion

We have proposed a new architecture for group image classification that may be useful for understanding images on social media platforms. We believe that considering the images under one message as a whole is better than processing them one by one, because the images can share a strongly correlated theme even when their contents differ greatly. To this end, we propose a fusion method that merges the features extracted from the images of a group, and we design a deep fusion network to perform the classification. We test our method on two datasets collected from Sina Microblog, the largest microblog platform in China. Experiments demonstrate that classifying a group of images is challenging yet feasible. In future work, we shall investigate generating the theme of a group of images in an unsupervised manner.