
1 Introduction

Accurately classifying food categories is an important step toward healthy diet management [11, 23]. Although consulting services from experts can offer professional dietary analysis and advice, their high price and time cost largely limit their popularization to the public [20]. As an alternative, it is much more desirable to classify and manage a personal diet using easily accessible food images, which can be captured by various smart devices such as cell phones and tablets [6, 8].

In recent years, deep convolutional neural network (CNN) based algorithms have achieved great success in image classification [15, 21, 22]. He et al. [10] presented the deep residual network and made a breakthrough in generic image classification. Deep CNNs were also applied to food image recognition in [24]. Zhang et al. [28] proposed picking deep filter responses for fine-grained image classification. The methods mentioned above use the whole image as input, which easily involves background noise and interferes with the recognition of foreground objects. To address this issue, some region-based algorithms have been proposed to classify images from their discriminative areas. Huang et al. [12] proposed a polygon-based classifier to detect discriminative regions for fine-grained classification. Yang et al. [25] utilized statistics of pairwise local features for food recognition. In [2, 12, 13, 25], manually annotated landmarks, such as the beaks and eyes of birds, are also employed to locate the discriminative areas, which efficiently improves the classification accuracy for bird species.

Specific to food images, discriminative regions can also guide us to focus on category-related subtle details. More specifically, Chinese food categories vary widely and their ingredients change a lot. With high inter-category similarity and large intra-category variation, recognizing food images is very challenging in computer vision [3]. As discussed in much of the fine-grained image classification literature [2, 12, 13], discriminative regions play an important role in distinguishing object categories with subtle differences. Similar observations can be made for food images. As shown in Fig. 1, it is difficult to distinguish the two dishes Hongshaorou and Hongshaoniurou. By removing the uninformative round plate and focusing only on the details in the center of the dishes, we can differentiate them by whether or not they contain fat.

Fig. 1.

Examples of food images with high inter-category similarity and large intra-category variation. The first row shows Hongshaorou images and the second row shows the similar dish Hongshaoniurou. It is difficult to distinguish these two kinds of dishes without paying attention to the pork fat marked by the green window in the first row and the fat-free beef marked by the red window in the second row. (Color figure online)

Inspired by the aforementioned observations, we propose a discriminative region guided deep neural network, which follows a two-step scheme for food image classification. First, we utilize the normalized average saliency map to extract the most discriminative regions; this map is applied to all input images. The proposed method exploits the common property of multiple saliency maps and significantly reduces the computational complexity and hardware overhead in the training phase. Second, a multi-scale strategy is applied to both the preprocessed input image and the feature maps, describing the discriminative regions at different resolutions. More specifically, we resize the input image to 224 \(\times \) 224 and 448 \(\times \) 448 and train a VGG-16 base model at each resolution. In addition, inspired by [26], we fuse the feature maps generated from lower and intermediate layers to capture multi-scale information, where lower layers describe color and edge features, and intermediate layers describe texture and contextual information. Experimental results on a large-scale database confirm that the proposed method can efficiently improve food image classification accuracy.

In comparison with previous works, the contributions of this paper are threefold. First, we utilize the normalized average saliency map to derive the most discriminative region of each image. Second, we propose a multi-scale strategy that combines multi-scale input images and multi-scale feature fusion. Third, we build CF90, a challenging large-scale Chinese food image dataset containing 135,000 images.

The remainder of the paper is organized as follows. We describe the proposed normalized average saliency map strategy and multi-scale method in Sect. 2. The dataset and experimental results are presented in Sect. 3. We conclude in Sect. 4.

2 Our Approach

Our approach follows a two-step scheme. First, we apply a normalized average saliency map based pooling strategy to the input image to preserve the category-aware discriminative regions. Second, we apply a multi-scale strategy to VGG-16, which includes multi-scale input images and multi-scale feature fusion.

Fig. 2.

Overview of the network architecture. The first row shows the VGG-16 framework. The mask data is generated from the normalized average saliency map to locate the discriminative regions in the image. When the multi-scale input image is fed to VGG-16 at a resolution of 448 \(\times \) 448, we use a dilation operation in the convolution layers to match the input dimension of fc6. We further fuse multi-scale feature maps in the feature extraction stage by cascading the output of conv2_2 to conv3_3 and conv4_3 to conv5_3.

2.1 Network Architecture

The network architecture is shown in Fig. 2. We implement our experiments on AlexNet [15] and VGG-16 [21], and select VGG-16 as our base network because of its better performance. We compute the saliency map of each image and generate a normalized average saliency map. The mask layer applies the normalized average saliency map to the input image and extracts the most discriminative regions. We resize the image to 448 \(\times \) 448, which changes the default VGG-16 input resolution of 224 \(\times \) 224. In order to match the input dimension of the fc6 layer and reuse the pre-trained model parameters, we adopt a dilation operation, which will be discussed later.

2.2 Normalized Average Saliency Map

Visual saliency denotes the most attention-catching parts of an image [9]. The information contained in a saliency map reveals the discriminative region and the core message that the image conveys. Deep CNNs were adopted to estimate saliency maps in [16, 29], and Zhang et al. [27] proposed an effective method running at 80 FPS based on minimum barrier salient object detection. Since deep network based saliency methods are generally less efficient than traditional ones, we choose Zhang's method to obtain saliency maps in this paper.

Instead of extracting a separate saliency map for every image to drive the classification task, we use the normalized average saliency map, which preserves the useful information in the food image while admitting only a limited amount of noise. Figure 3 shows some original saliency maps, average saliency maps, and normalized average saliency maps from our dataset. We convert the normalized average saliency map into a mask and apply it to the input image to preserve the most critical information of the image.

Fig. 3.

The first column shows two categories of original images, followed by their saliency maps, average saliency maps, the normalized average saliency map, and the images after the masking operation. We choose the normalized average saliency map as a universal mask and apply it to the input image.
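As a concrete illustration of this masking step, the following minimal Python sketch averages per-image saliency maps into a single normalized mask and applies it to an input image. Here compute_saliency is a hypothetical stand-in for the minimum barrier saliency detector of [27]; any per-image saliency method with output in [0, 1] could be substituted.

```python
# A minimal sketch of the masking step described above, assuming images and
# saliency maps are handled with OpenCV/numpy. compute_saliency() is a
# hypothetical stand-in for the minimum barrier saliency detector of [27].
import cv2
import numpy as np

def normalized_average_saliency(image_paths, size=(224, 224)):
    """Average the per-image saliency maps and rescale the result to [0, 1]."""
    acc = np.zeros(size, dtype=np.float64)
    for path in image_paths:
        img = cv2.resize(cv2.imread(path), size, interpolation=cv2.INTER_LANCZOS4)
        sal = compute_saliency(img)  # hypothetical helper, output in [0, 1]
        acc += cv2.resize(sal.astype(np.float64), size)
    avg = acc / len(image_paths)
    return (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)

def apply_mask(img, mask):
    """Weight each pixel by the universal mask to suppress background regions."""
    m = cv2.resize(mask, (img.shape[1], img.shape[0]))
    return (img.astype(np.float32) * m[..., None]).astype(np.uint8)
```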

2.3 Multi-scale Input Image

A multi-scale input image contains different details because of its varying resolution [18]. In this paper, we choose VGG-16, which has a default input resolution of 224 \(\times \) 224, as our base model and obtain an average accuracy of 89%. Since the images in the CF90 dataset have resolutions of more than 280 \(\times \) 280, we adopt the multi-scale input image method and resize the image to 448 \(\times \) 448. With the increased input resolution, the extracted features become more precise. There are a number of resizing methods available in OpenCV, such as NEAREST and LINEAR; we choose LANCZOS to obtain better resizing results. Figure 4 shows an example of different resolutions; the details are well preserved in the image after the resizing operation. Fully connected layers in a deep network play a significant role in capturing high-level features [17]. In the pre-trained caffemodel of VGG-16, the parameters of the fc6 and fc7 layers constitute the most important part of the model [19], so we must match the input dimension if we want to reuse these parameters. With a 448 \(\times \) 448 input image, the output of the conv5_3 layer changes to 14 \(\times \) 14. In order to match the fc6 layer's expected 7 \(\times \) 7 input, we utilize the dilation operation shown in Fig. 5. Dilation means inserting zeros into the kernels during the convolutional computation, which expands the kernel size. The relation between kernel_size \(\alpha \), dilation \(\beta \), and kernel_extend \(\gamma \) is shown in Eq. 1 (a code sketch follows the equation). We set the dilation parameter \(\beta \) to 2 in layers conv4_3, conv5_1, conv5_2, and conv5_3. After these four dilated layers, we down-sample the feature map to a size of 7 \(\times \) 7, which matches the fc6 layer.

$$\begin{aligned} \gamma = \beta * (\alpha -1)+1 \end{aligned}$$
(1)
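The sketch below, assuming OpenCV for resizing and a standard VGG-16 layout for the dimension bookkeeping, illustrates the two operations in this subsection: Lanczos interpolation for the 448 \(\times \) 448 input and the kernel-extension relation of Eq. (1) used by the dilated convolution layers.

```python
# A sketch of the resizing and dilation bookkeeping described in Sect. 2.3,
# assuming OpenCV and a standard VGG-16 layout.
import cv2

def resize_lanczos(img, target=448):
    # LANCZOS preserves fine detail better than NEAREST or LINEAR interpolation.
    return cv2.resize(img, (target, target), interpolation=cv2.INTER_LANCZOS4)

def kernel_extend(kernel_size, dilation):
    # Eq. (1): gamma = beta * (alpha - 1) + 1
    return dilation * (kernel_size - 1) + 1

# With a 448 x 448 input, conv5_3 of VGG-16 outputs 14 x 14 feature maps.
# Setting dilation beta = 2 in conv4_3 and conv5_1--conv5_3 extends the
# 3 x 3 kernels to kernel_extend(3, 2) = 5; the feature map is then
# down-sampled to the 7 x 7 grid expected by the pre-trained fc6 weights.
print(kernel_extend(3, 2))  # -> 5
```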
Fig. 4.

Examples of different resolutions after resizing the image. The first row shows the image at a resolution of 224 \(\times \) 224, and the second row shows the image at a resolution of 448 \(\times \) 448.

Fig. 5.

The first column shows the original convolutional kernel, and the second column shows the dilated convolutional kernel.

2.4 Multi-scale Feature Fusion

Deep CNNs show good performance on feature extraction and provide a reference for subsequent classification or detection tasks [3]. As the network deepens, high-level features are learned while details are lost at the same time. For food image classification, low-level features such as edges, color, and texture are important for the final classification stage [7]. In this paper, we fuse the feature maps generated from lower and intermediate layers to cascade multi-scale information.

As shown in Fig. 2, we combine the output of conv2_2 with conv3_3, and conv4_3 with conv5_3. After pooling the auxiliary feature map to a suitable size, we add a scale layer to automatically learn its weight, where \(\lambda \) is a learnable parameter. Denoting the auxiliary feature by \(\mu \) and the original feature by \(\nu \), the fused feature \(\xi \) is

$$\begin{aligned} \xi =\nu +\lambda *(0.0001\mu ) \end{aligned}$$
(2)

where \(\lambda \) is the learnable parameter in the network. We set the initial weight of \(\mu \) to 0.0001, which is small enough to balance the most important original feature against the auxiliary information.
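A minimal numpy sketch of the fusion rule in Eq. (2) follows. It assumes the auxiliary feature map has already been pooled to the spatial size of the original one; lam mirrors the learnable scale-layer weight, which the real network updates by back-propagation, and the toy shapes are illustrative only.

```python
# A sketch of the fusion rule in Eq. (2): xi = nu + lam * (0.0001 * mu).
# The shapes below are toy values for illustration only.
import numpy as np

def fuse_features(nu, mu, lam=1.0, init_weight=1e-4):
    """xi = nu + lam * (init_weight * mu), Eq. (2)."""
    return nu + lam * (init_weight * mu)

nu = np.random.rand(256, 28, 28).astype(np.float32)  # original feature (conv3_3-like)
mu = np.random.rand(256, 28, 28).astype(np.float32)  # pooled auxiliary feature (conv2_2-like)
xi = fuse_features(nu, mu)
```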

3 Experiment

This section introduces the CF90 dataset and evaluates our approaches to food image classification on CF90. For each category, we randomly choose 1,200 images for training and 300 for testing, and run all experiments on a single NVIDIA Titan GPU. The evaluation metric is average accuracy, i.e., the proportion of test images that are correctly classified. The experiments are discussed in two parts: the food image classification baseline and its improvement. In the improvement stage, we first apply the normalized average saliency map to the input image, and then apply the multi-scale method on top of the best model from the experiments. With this two-step training scheme, we expect the approaches to improve performance on our dataset.

3.1 CF90 Dataset

In this section, we introduce the CF90 dataset. To the best of our knowledge, the publicly available food image datasets to date are Recipe1M [1], UNIMIB2016 [5], and FOOD-101 [4]. The Recipe1M dataset consists of 1 million cooking recipes and about 800k related food images. UNIMIB2016 contains 73 categories of dishes. The images of the FOOD-101 dataset were collected from foodspotting.com and uploaded by people in real life. All of the above are Western food images; hence we collect images from www.xiachufang.com, image.baidu.com, and images.google.com to build the largest Chinese food image dataset: CF90.

Our dataset contains 90 categories of common Chinese food and the most popular dishes on menu websites. Each category contains 1,500 images, from which we randomly select 1,200 for training and the rest for testing. The resolution of the images is more than 280 \(\times \) 280. Because the images are taken by people under different conditions without artificial data clean-up, the dataset retains good redundancy for training. With 135,000 images in total, CF90 contains popular food classes such as Yuxiangrousi, Qingjiaorousi, Chaotudousi, Hongshaoqiezi, Fanqiechaodan, Hongshaorou, Pidanshourouzhou, Mantou, Shuizhuyu, Qingzhengyu, Paiguluobotang, Paiguyumitang, Nanguazhou, Baozi, etc., as shown in Fig. 6. The images vary in lighting and shooting angle, and satisfy semantic and visual richness. Email scharlie92@outlook.com to obtain the dataset.

Fig. 6.

Examples of CF90 dataset.

3.2 Food Image Classification Baseline

In the first stage, we implement our experiments on AlexNet and VGG-16 based on the public Caffe platform [14]. Each model is fine-tuned from the corresponding ImageNet pre-trained caffemodel because of its good results. We resize the images to a fixed 256 \(\times \) 256 resolution, from which we randomly crop 224 \(\times \) 224 patches when training VGG-16 and 227 \(\times \) 227 patches when training AlexNet. We evaluate the average accuracy during the testing stage.
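For reference, a sketch of this baseline preprocessing under a typical Caffe-style pipeline is shown below; the crop size of 227 would be used for AlexNet.

```python
# A sketch of the baseline preprocessing: resize to 256 x 256, then take a
# random crop at the network's native input size (224 for VGG-16, 227 for AlexNet).
import random
import cv2

def random_crop(img, crop):
    h, w = img.shape[:2]
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    return img[top:top + crop, left:left + crop]

def preprocess(path, crop=224):
    img = cv2.resize(cv2.imread(path), (256, 256))
    return random_crop(img, crop)
```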

We compare the average accuracy of AlexNet and VGG-16. As shown in Fig. 7, VGG-16 performs better, reaching an accuracy of 89%, owing to its deeper network. We also exhibit the best 5 and worst 5 classification results of VGG-16, together with the categories they are confused with. The top accuracy reaches 98% and the worst is 32%. We choose VGG-16 as the base model and implement the following experiments on it.

Fig. 7.

The accuracy curves when training VGG-16 and AlexNet. VGG-16 obtains the best accuracy of 89%, and AlexNet obtains the best accuracy of 67%.

3.3 Food Image Classification Improvement

In the second stage, we test our methods one by one: first the normalized average saliency map strategy, and then the multi-scale network, including the multi-scale input and the multi-scale feature fusion.

Normalized Average Saliency Map. We apply the normalized average saliency map method to the base model and use a batch size of 5, a learning rate of 0.0001, a weight decay of 0.0005, and a momentum of 0.9. We terminate training at 100k iterations, which is determined on a 108k\(\backslash \)27k train\(\backslash \)val split. Figure 8 shows the accuracy curve during training, and Table 1 shows the specific accuracy.
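For reproducibility, the hyper-parameters above can be written out as the corresponding Caffe SGD solver fields; the sketch below is only an assumed mapping, and the batch size of 5 belongs to the data layer of the training prototxt rather than the solver.

```python
# Hyper-parameters above as Caffe SGD solver fields (a sketch; lr_policy is an
# assumption, since the decay schedule is not stated in the text).
solver_settings = {
    "base_lr": 0.0001,       # learning rate
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "lr_policy": "fixed",    # assumption
    "max_iter": 100000,      # training stops at 100k iterations
}
```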

Multi-scale Network. We then apply the multi-scale input image and the multi-scale feature fusion to improve the classification performance. The multi-scale input image strategy means we train the model with 448 \(\times \) 448 images, unlike in Sect. 3.2 where we use 224 \(\times \) 224 inputs. As mentioned in Fig. 5, we adopt the dilation operation to match the dimension requirement of the fc6 layer. This is followed by the multi-scale feature fusion strategy, i.e., the combination of low/mid/high-level features in the network, as shown in Fig. 2. Owing to this multi-scale strategy, we improve the accuracy by 1%. Figure 8 shows the final accuracy curve; our proposed approach achieves a 2.58% improvement over the VGG-16 baseline overall.

Fig. 8.

Accuracy curves of proposed methods in this paper.

Figure 9 gives examples of the top-10 correct and top-10 erroneous results on the CF90 dataset. After 4.6 epochs, the top accuracy is 99.667% and the top error is 1.667%. The text below the first-row images shows their label names and classification accuracies. The text below the second-row images shows their label names, their classification accuracies, and the wrong categories they are classified into. The best 10 categories have discriminative appearances, while the worst 10 have high similarity to the wrongly predicted categories; some share the same ingredients or similar colors and shapes.

Table 1. Top-1 accuracy and top-5 accuracy of our approaches. With the two-step training scheme, we improve the accuracy over the baseline by 2.585%.
Fig. 9.

The best 10 classification results and the worst 10 classification results.

4 Conclusion

This paper focuses on food image classification. We propose a normalized average saliency map for discriminative regions and a multi-scale method to improve performance. With a two-step training scheme and fine-tuning of the ImageNet pre-trained VGG-16 network, we improve the accuracy over the baseline by 2.58%. Our method does not rely on a separate saliency map for each image and is more universal for images in which the object is located in the center. At the same time, we contribute a novel Chinese food image dataset: CF90. In the future, we will further optimize the dataset and improve the efficiency of the proposed method on CF90.