
1 Introduction

Accurately classifying food categories is an important step toward healthy diet management [11, 23]. Although consulting services from experts can offer professional dietary analysis and advice, their high price and time cost largely limit their popularization to the public [20]. As an alternative, it is much more desirable to classify and manage a personal diet using easily accessible food images, which can be captured by various smart devices such as cell phones and tablets [6, 8].

In recent years, deep convolutional neural network (CNN) based algorithms have achieved great success in image classification [15, 21, 22]. He et al. [10] presented the deep residual network and made a breakthrough in generic image classification. Deep CNNs were also applied to food image recognition in [24]. Zhang et al. [28] proposed picking deep filter responses for fine-grained image classification. The methods mentioned above use the whole image as input, which easily involves background noise and interferes with the recognition of foreground objects. To address this issue, some region-based algorithms have been proposed to classify images from their discriminative areas. Huang et al. [12] proposed a polygon-based classifier to detect discriminative regions for fine-grained classification. Yang et al. [25] utilized statistics of pairwise local features for food recognition. In [2, 12, 13, 25], manually annotated landmarks, such as the beaks and eyes of birds, are also employed to locate the discriminative areas, which efficiently improves the classification accuracy for bird species.

Specific to food images, discriminative regions can also guide us to focus on category-related subtle details. More specifically, Chinese food categories vary widely and their ingredients change a lot. With high inter-category similarity and large intra-category variation, recognizing food images is very challenging in computer vision [3]. As discussed in much of the fine-grained image classification literature [2, 12, 13], discriminative regions play an important role in distinguishing object categories with subtle differences. Similar observations can be made for food images. As shown in Fig. 1, it is difficult to distinguish the two dishes Hongshaorou and Hongshaoniurou. By removing the uninformative round plate and focusing only on the details in the center of the dishes, we can differentiate them by whether or not they contain fat.

Fig. 1.

Examples of food images with high inter-category similarity and large intra-category variation. The first row shows Hongshaorou images and the second row shows the similar dish Hongshaoniurou. It is difficult to distinguish these two kinds of dishes without paying attention to the pork fat marked by the green window in the first row and the fat-free beef marked by the red window in the second row. (Color figure online)

Inspired by the aforementioned observations, we propose a discriminative region guided deep neural network, which follows a two-step scheme for food image classification. First, we utilize the normalized average saliency map to extract the most discriminative regions; this map is applied to all input images. The proposed method exploits the common property of multiple saliency maps and significantly reduces the computational complexity and hardware overhead in the training phase. Second, a multi-scale strategy is applied to both the preprocessed input image and the feature maps, describing the discriminative regions at different resolutions. More specifically, we resize the input image to 224 \(\times \) 224 and 448 \(\times \) 448 and train a VGG-16 base model at each resolution. In addition, inspired by [26], we fuse the feature maps generated from lower and intermediate layers to capture multi-scale information, where lower layers describe color and edge features, and intermediate layers describe texture and contextual information. Experimental results on a large-scale database confirm that the proposed method can efficiently improve food image classification accuracy.

In comparison with previous works, the contributions of this paper are threefold. First, we utilize the normalized average saliency map to derive the most discriminative region of each image. Second, we propose a multi-scale strategy that combines multi-scale input images and multi-scale feature fusion. Third, we build CF90, a challenging large-scale Chinese food image dataset containing 135,000 images.

The remainder of the paper is organized as follows. We describe the proposed normalized average saliency map strategy and multi-scale method in Sect. 2. The dataset and experimental results are presented in Sect. 3. We conclude in Sect. 4.

2 Our Approach

Our approach follows a two-step scheme. First, we apply a normalized average saliency map based pooling strategy to the input image to preserve the category-aware discriminative regions. Second, we apply a multi-scale strategy to VGG-16, which includes multi-scale input images and multi-scale feature fusion.

Fig. 2.

Overview of the network architecture. The first row shows the VGG-16 framework. The mask data is generated from the normalized average saliency map to locate the discriminative regions in the image. When the multi-scale input image is fed to VGG-16 at a resolution of 448 \(\times \) 448, we use a dilation operation in the convolution layers to match the input dimension of fc6. We further fuse multi-scale feature maps in the feature extraction stage by cascading the output of conv2_2 to conv3_3 and conv4_3 to conv5_3.

2.1 Network Architecture

The network architecture is shown in Fig. 2. We implement our experiments on AlexNet [15] and VGG-16 [21], and select VGG-16 as our base network because of its better performance. We compute the saliency map of each image and generate a normalized average saliency map. The mask layer applies the normalized average saliency map to the input image and extracts the most discriminative regions. We resize the image to 448 \(\times \) 448, which changes the default VGG-16 input resolution of 224 \(\times \) 224. In order to match the input dimension of the fc6 layer and reuse the pre-trained model parameters, we adopt a dilation operation, which will be discussed later.

2.2 Normalized Average Saliency Map

Visual saliency denotes the most attention-catching parts of an image [9]. The information contained in a saliency map reveals the discriminative region and the core message that the image conveys. Deep CNNs were adopted to estimate saliency maps in [16, 29], and Zhang et al. [27] proposed an effective method running at 80 FPS based on minimum barrier salient object detection. Since deep network based saliency methods are generally less efficient than traditional ones, we choose Zhang's method to obtain saliency maps in this paper.

Instead of extracting a separate saliency map for every image to drive the classification task, we use the normalized average saliency map, which preserves the useful information in the food image while admitting only a limited amount of noise. Figure 3 shows some original saliency maps, average saliency maps, and normalized average saliency maps from our dataset. We convert the normalized average saliency map into a mask and apply it to the input image to preserve the most critical information of the image.

Fig. 3.

The first column shows two categories of original images, followed by their saliency maps, average saliency maps, the normalized average saliency map, and the images after the masking operation. We choose the normalized average saliency map as a universal mask and apply it to the input image.
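As a concrete illustration of this masking step, the following minimal Python sketch averages per-image saliency maps into a single normalized mask and applies it to an input image. Here compute_saliency is a hypothetical stand-in for the minimum barrier saliency detector of [27]; any per-image saliency method with output in [0, 1] could be substituted.

```python
# A minimal sketch of the masking step described above, assuming images and
# saliency maps are handled with OpenCV/numpy. compute_saliency() is a
# hypothetical stand-in for the minimum barrier saliency detector of [27].
import cv2
import numpy as np

def normalized_average_saliency(image_paths, size=(224, 224)):
    """Average the per-image saliency maps and rescale the result to [0, 1]."""
    acc = np.zeros(size, dtype=np.float64)
    for path in image_paths:
        img = cv2.resize(cv2.imread(path), size, interpolation=cv2.INTER_LANCZOS4)
        sal = compute_saliency(img)  # hypothetical helper, output in [0, 1]
        acc += cv2.resize(sal.astype(np.float64), size)
    avg = acc / len(image_paths)
    return (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)

def apply_mask(img, mask):
    """Weight each pixel by the universal mask to suppress background regions."""
    m = cv2.resize(mask, (img.shape[1], img.shape[0]))
    return (img.astype(np.float32) * m[..., None]).astype(np.uint8)
```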

2.3 Multi-scale Input Image

A multi-scale input image contains different details because of its varying resolution [18]. In this paper, we choose VGG-16, which has a default input resolution of 224 \(\times \) 224, as our base model and obtain an average accuracy of 89%. Since the images in the CF90 dataset have resolutions of more than 280 \(\times \) 280, we adopt the multi-scale input image method and resize the image to 448 \(\times \) 448. With the increased input resolution, the extracted features become more precise. There are a number of resizing methods available in OpenCV, such as NEAREST and LINEAR; we choose LANCZOS to obtain better resizing results. Figure 4 shows an example of different resolutions; the details are well preserved in the image after the resizing operation. Fully connected layers in a deep network play a significant role in capturing high-level features [17]. In the pre-trained caffemodel of VGG-16, the parameters of the fc6 and fc7 layers constitute the most important part of the model [19], so we must match the input dimension if we want to reuse these parameters. With a 448 \(\times \) 448 input image, the output of the conv5_3 layer changes to 14 \(\times \) 14. In order to match the fc6 layer's expected 7 \(\times \) 7 input, we utilize the dilation operation shown in Fig. 5. Dilation means inserting zeros into the kernels during the convolutional computation, which expands the kernel size. The relation between kernel_size \(\alpha \), dilation \(\beta \), and kernel_extend \(\gamma \) is shown in Eq. 1 (a code sketch follows the equation). We set the dilation parameter \(\beta \) to 2 in layers conv4_3, conv5_1, conv5_2, and conv5_3. After these four dilated layers, we down-sample the feature map to a size of 7 \(\times \) 7, which matches the fc6 layer.

$$\begin{aligned} \gamma = \beta * (\alpha -1)+1 \end{aligned}$$
(1)
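The sketch below, assuming OpenCV for resizing and a standard VGG-16 layout for the dimension bookkeeping, illustrates the two operations in this subsection: Lanczos interpolation for the 448 \(\times \) 448 input and the kernel-extension relation of Eq. (1) used by the dilated convolution layers.

```python
# A sketch of the resizing and dilation bookkeeping described in Sect. 2.3,
# assuming OpenCV and a standard VGG-16 layout.
import cv2

def resize_lanczos(img, target=448):
    # LANCZOS preserves fine detail better than NEAREST or LINEAR interpolation.
    return cv2.resize(img, (target, target), interpolation=cv2.INTER_LANCZOS4)

def kernel_extend(kernel_size, dilation):
    # Eq. (1): gamma = beta * (alpha - 1) + 1
    return dilation * (kernel_size - 1) + 1

# With a 448 x 448 input, conv5_3 of VGG-16 outputs 14 x 14 feature maps.
# Setting dilation beta = 2 in conv4_3 and conv5_1--conv5_3 extends the
# 3 x 3 kernels to kernel_extend(3, 2) = 5; the feature map is then
# down-sampled to the 7 x 7 grid expected by the pre-trained fc6 weights.
print(kernel_extend(3, 2))  # -> 5
```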
Fig. 4.

Examples of different resolutions after resizing the image. The first row shows the image at a resolution of 224 \(\times \) 224, and the second row shows the image at a resolution of 448 \(\times \) 448.

Fig. 5.

The first column shows the original convolutional kernel, and the second column shows the dilated convolutional kernel.

2.4 Multi-scale Feature Fusion

Deep CNNs show good performance on feature extraction and provide a reference for subsequent classification or detection tasks [3]. As the network deepens, high-level features are learned while details are lost at the same time. For food image classification, low-level features such as edges, color, and texture are important for the final classification stage [7]. In this paper, we fuse the feature maps generated from lower and intermediate layers to cascade multi-scale information.

As shown in Fig. 2, we combine the output of conv2_2 with conv3_3, and conv4_3 with conv5_3. After pooling the auxiliary feature map to a suitable size, we add a scale layer to automatically learn its weight, where \(\lambda \) is a learnable parameter. Denoting the auxiliary feature by \(\mu \) and the original feature by \(\nu \), the fused feature \(\xi \) is

$$\begin{aligned} \xi =\nu +\lambda *(0.0001\mu ) \end{aligned}$$
(2)

where \(\lambda \) is the learnable parameter in the network. We set the initial weight of \(\mu \) to 0.0001, which is small enough to balance the most important original feature against the auxiliary information.
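A minimal numpy sketch of the fusion rule in Eq. (2) follows. It assumes the auxiliary feature map has already been pooled to the spatial size of the original one; lam mirrors the learnable scale-layer weight, which the real network updates by back-propagation, and the toy shapes are illustrative only.

```python
# A sketch of the fusion rule in Eq. (2): xi = nu + lam * (0.0001 * mu).
# The shapes below are toy values for illustration only.
import numpy as np

def fuse_features(nu, mu, lam=1.0, init_weight=1e-4):
    """xi = nu + lam * (init_weight * mu), Eq. (2)."""
    return nu + lam * (init_weight * mu)

nu = np.random.rand(256, 28, 28).astype(np.float32)  # original feature (conv3_3-like)
mu = np.random.rand(256, 28, 28).astype(np.float32)  # pooled auxiliary feature (conv2_2-like)
xi = fuse_features(nu, mu)
```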

3 Experiment

This section introduces the CF90 dataset and evaluates our approaches to food image classification on CF90. For each category, we randomly choose 1,200 images for training and 300 for testing, and run all experiments on a single NVIDIA Titan GPU. The evaluation metric is average accuracy, i.e., the proportion of test images that are correctly classified. The experiments are discussed in two parts: the food image classification baseline and its improvement. In the improvement stage, we first apply the normalized average saliency map to the input image, and then apply the multi-scale method on top of the best model from the experiments. With this two-step training scheme, we expect the approaches to improve performance on our dataset.

3.1 CF90 Dataset

In this section, we introduce the CF90 dataset. To the best of our knowledge, the publicly available food image datasets to date are Recipe1M [1], UNIMIB2016 [5], and FOOD-101 [4]. The Recipe1M dataset consists of 1 million cooking recipes and about 800k related food images. UNIMIB2016 contains 73 categories of dishes. The images of the FOOD-101 dataset were collected from foodspotting.com and uploaded by people in real life. All of the above are Western food images; hence we collect images from www.xiachufang.com, image.baidu.com, and images.google.com to build the largest Chinese food image dataset: CF90.

Our dataset contains 90 categories of common Chinese food and the most popular dishes on menu websites. Each category contains 1,500 images, from which we randomly select 1,200 for training and the rest for testing. The resolution of the images is more than 280 \(\times \) 280. Because the images are taken by people under different conditions without artificial data clean-up, the dataset retains good redundancy for training. With 135,000 images in total, CF90 contains popular food classes such as Yuxiangrousi, Qingjiaorousi, Chaotudousi, Hongshaoqiezi, Fanqiechaodan, Hongshaorou, Pidanshourouzhou, Mantou, Shuizhuyu, Qingzhengyu, Paiguluobotang, Paiguyumitang, Nanguazhou, Baozi, etc., as shown in Fig. 6. The images vary in lighting and shooting angle, and satisfy semantic and visual richness. Email scharlie92@outlook.com to obtain the dataset.

Fig. 6.

Examples of CF90 dataset.

3.2 Food Image Classification Baseline

In the first stage, we implement our experiments on AlexNet and VGG-16 based on the public Caffe platform [14]. Each model is fine-tuned from the corresponding ImageNet pre-trained caffemodel because of its good results. We resize the images to a fixed 256 \(\times \) 256 resolution, from which we randomly crop 224 \(\times \) 224 patches when training VGG-16 and 227 \(\times \) 227 patches when training AlexNet. We evaluate the average accuracy during the testing stage.
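For reference, a sketch of this baseline preprocessing under a typical Caffe-style pipeline is shown below; the crop size of 227 would be used for AlexNet.

```python
# A sketch of the baseline preprocessing: resize to 256 x 256, then take a
# random crop at the network's native input size (224 for VGG-16, 227 for AlexNet).
import random
import cv2

def random_crop(img, crop):
    h, w = img.shape[:2]
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    return img[top:top + crop, left:left + crop]

def preprocess(path, crop=224):
    img = cv2.resize(cv2.imread(path), (256, 256))
    return random_crop(img, crop)
```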

We compare the average accuracy of AlexNet and VGG-16. As shown in Fig. 7, VGG-16 performs better, reaching an accuracy of 89%, owing to its deeper network. We also exhibit the best 5 and worst 5 classification results of VGG-16, together with the categories they are confused with. The top accuracy reaches 98% and the worst is 32%. We choose VGG-16 as the base model and implement the following experiments on it.

Fig. 7.

The accuracy curves when training VGG-16 and AlexNet. VGG-16 obtains the best accuracy of 89%, and AlexNet obtains the best accuracy of 67%.

3.3 Food Image Classification Improvement

In the second stage, we test our methods one by one: first the normalized average saliency map strategy, and then the multi-scale network, including the multi-scale input and the multi-scale feature fusion.

Normalized Average Saliency Map. We apply the normalized average saliency map method to the base model and use a batch size of 5, a learning rate of 0.0001, a weight decay of 0.0005, and a momentum of 0.9. We terminate training at 100k iterations, which is determined on a 108k\(\backslash \)27k train\(\backslash \)val split. Figure 8 shows the accuracy curve during training, and Table 1 shows the specific accuracy.
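For reproducibility, the hyper-parameters above can be written out as the corresponding Caffe SGD solver fields; the sketch below is only an assumed mapping, and the batch size of 5 belongs to the data layer of the training prototxt rather than the solver.

```python
# Hyper-parameters above as Caffe SGD solver fields (a sketch; lr_policy is an
# assumption, since the decay schedule is not stated in the text).
solver_settings = {
    "base_lr": 0.0001,       # learning rate
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "lr_policy": "fixed",    # assumption
    "max_iter": 100000,      # training stops at 100k iterations
}
```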

Multi-scale Network. We then apply the multi-scale input image and the multi-scale feature fusion to improve the classification performance. The multi-scale input image strategy means we train the model with 448 \(\times \) 448 images, unlike in Sect. 3.2 where we use 224 \(\times \) 224 inputs. As mentioned in Fig. 5, we adopt the dilation operation to match the dimension requirement of the fc6 layer. This is followed by the multi-scale feature fusion strategy, i.e., the combination of low/mid/high-level features in the network, as shown in Fig. 2. Owing to this multi-scale strategy, we improve the accuracy by 1%. Figure 8 shows the final accuracy curve; our proposed approach achieves a 2.58% improvement over the VGG-16 baseline overall.

Fig. 8.

Accuracy curves of proposed methods in this paper.

Figure 9 gives examples of the top-10 correct and top-10 erroneous results on the CF90 dataset. After 4.6 epochs, the top accuracy is 99.667% and the top error is 1.667%. The text below the first-row images shows their label names and classification accuracies. The text below the second-row images shows their label names, their classification accuracies, and the wrong categories they are classified into. The best 10 categories have discriminative appearances, while the worst 10 have high similarity to the wrongly predicted categories; some share the same ingredients or similar colors and shapes.

Table 1. Top-1 accuracy and top-5 accuracy of our approaches. With the two-step training scheme, we improve the accuracy over the baseline by 2.585%.
Fig. 9.

The best 10 classification results and the worst 10 classification results.

4 Conclusion

This paper focuses on food image classification. We propose a normalized average saliency map for discriminative regions and a multi-scale method to improve performance. With a two-step training scheme and fine-tuning of the ImageNet pre-trained VGG-16 network, we improve the accuracy over the baseline by 2.58%. Our method does not rely on a separate saliency map for each image and is more universal for images in which the object is located in the center. At the same time, we contribute a novel Chinese food image dataset: CF90. In the future, we will further optimize the dataset and improve the efficiency of the proposed method on CF90.