1 Introduction

The problem of healthy and balanced meals is being seriously tackled by different health agencies with the aim of reducing obesity and unbalanced nutrition. For example, the Department of Health of the Italian Government promoted an extensive campaign for food and nutrition education. The Department of Health of the Australian Government compiled a very detailed report with guidelines for healthy foods in school canteens. In Japan, the Ministry of Education, Science and Culture, the Ministry of Health and Welfare, and the Ministry of Agriculture, Forestry and Fisheries developed dietary guidelines with the aim of promoting better dietary patterns. Similar actions can be found in many other countries (e.g., the UK and the USA).

The significantly increasing rates of obesity and diabetes among both adults and children make it extremely important to devise ways to accurately track nutritional intake. Conventionally, food intake has been tracked by inspecting manually recorded logs on a daily basis, which is error prone due to delayed reporting and the inability to estimate the food type and quantity. More accurate and user-friendly solutions can be achieved by taking advantage of technology: for example, one can simply take a picture of a plate of food with a smartphone, and the corresponding calories can be computed automatically by employing computer vision techniques. In recent years, many research works have demonstrated that machine learning and computer vision techniques can help to build systems that automatically recognize diverse foods and estimate food quantity and calories [2, 10, 15, 19, 20, 24, 29, 30].

Fig. 1. A general workflow of automatic dietary monitoring.

As shown in Fig. 1, in a general workflow of automatic dietary monitoring a number of computational tasks need to be accomplished. Initially, a given image is segmented in order to locate the boundaries of the food regions (food segmentation). Then, each food region is processed to identify the depicted food (food recognition). Finally, the quantity of each recognized food is estimated (quantity estimation), which paves the way towards calorie measurement. Accurately accomplishing each of these tasks is clearly very important to achieve a well-working dietary monitoring system. In addition, food segmentation has to be used as a pre-processing step for the recognition of images containing multiple food items.

It is clear that any mistake, e.g., under-segmentation or over-segmentation, may affect the performance of the other tasks and would inevitably lead to an incorrect calorie calculation. Moreover, if the dietary monitoring system is aimed at supporting users in logging their food consumption in real scenarios, the food segmentation task must be robust to different image acquisition conditions such as changes in illumination, different points of view, or different cameras.

Food analysis is a particularly challenging task since food is characterized by an intrinsically high intra-class variability. For instance, the same food can have a very diverse visual appearance in images due to slightly different preparations, presentations, and backgrounds. At the same time, food appearance can show a low inter-class variability: different foods may have extremely similar visual characteristics that make them difficult to disambiguate. Figure 2 shows some examples of these two problems.

Fig. 2. Top row: examples of high intra-class variability in visual appearance in food images (“Curry rice”). Bottom row: examples of low inter-class variability (“Fried noodles” vs. “Spaghetti”, “Donuts” vs. “Bagel”, and “French fries” vs. “Poutine”).

Due to their robustness with respect to data variability and their ability to extract and process relevant information within images, Deep Convolutional Neural Networks (DCNNs) have been effectively used in different computer vision tasks. For these reasons, in this paper we aim to explore the potential of DCNNs for food and non-food image segmentation, where the semantic segmentation paradigm is used to separate food regions from non-food regions. The results of this task could be leveraged both for food recognition and for quantity and calorie estimation. Here we are interested in evaluating the performance of an efficient DCNN (GUNet [27, 28]) with respect to the variability in illumination conditions that can be found in food images taken in real scenarios. We want to investigate if and how we can cope with this variability at training time. To this end, we have designed an experimental setup to evaluate a fast DCNN trained on datasets of images rendered using different illuminants, in a similar way as in [7]. We chose GUNet since, unlike other more complex and deeper networks, it can also be used in mobile applications.

The paper is organized as follows. In Sect. 2 we briefly review the state of the art in food segmentation. In Sect. 3 we describe our proposed approach for food segmentation. In Sect. 4 we present the segmentation results on a newly annotated food dataset. Finally, Sect. 5 concludes the paper.

2 Related Works

In this section, we describe recent works dealing with the problem of food segmentation, both those using conventional approaches and those based on DCNNs.

Among the approaches applying conventional image segmentation techniques to food images, we can find the work by He et al. [21], which evaluates two segmentation algorithms exploiting normalized cut (NCut) and local variation. Based on the experiments, local variation is found to be the more stable of the two algorithms. In [15, 16] the JSEG segmentation algorithm [18] is used to locate food regions in images acquired in a canteen environment. Coupled with ad-hoc post-processing of the resulting segmented images (given the image domain), it is able to achieve very good performance. In [3], the JSEG-based food segmentation approach is further analyzed with extensive experiments evaluating the different color spaces and algorithm parameters used for the segmentation. Results show that the segmentation can be improved with respect to the base algorithm.

In order to avoid the high computational cost of the normalized cut-based segmentation algorithm, in particular when applied per pixel to large images, Wang et al. [32] combine NCut with image superpixels. Superpixels are computed using the SLIC algorithm [1] and rely on color and texture cues. Results obtained on food and non-food datasets show that the proposed method exhibits fast computation and an efficient use of memory.

Precise automatic food segmentation is a difficult task, in particular if it needs to be performed on images acquired in the wild. For this reason, Inunganbi et al. [22] propose an interactive segmentation approach instead. They approach the segmentation task as a two-class classification problem (i.e., food vs. non-food). Segmentation is performed using Random Forest ensemble learning, and boundary detection and filling and Gappy Principal Component Analysis methods are applied to refine the initial segmentation result.

The stunning success of DCNNs for image classification encouraged the use of these techniques for other computer vision tasks as well. In particular, they have been successfully employed for the semantic segmentation task, where they are able to predict the category label of each image pixel from a given set of labels. However, the properties that make DCNNs advantageous for classification tasks, i.e., robustness to spatial transformations and the ability to learn increasingly abstract representations of the data, impede the accuracy of the system for segmentation tasks, where precise localization is desired rather than abstraction of spatial details [13]. For this reason, DCNNs that must perform image segmentation need to be designed with operations (i.e., layers) specifically tailored for the segmentation task, as in the FCN [25], DeepLab [13], SegNet [5], and GUNet [28] networks.

In the context of food segmentation, DCNNs have been used by Dehais et al. [17] to segment the food in already detected dishes in an image. The method combines region growing/merging techniques with a deep CNN-based food border detection to detect the outlines of food regions.

The work in [30] is the first to accomplish semantic segmentation specifically for food images by employing DCNNs. Specifically, the network architecture used is based on the earliest version of DeepLab [12]. This model uses a CNN to provide the unary potentials of a CRF, and a fully connected graph to perform edge-sensitive label smoothing (as in bilateral filtering). The network is initialized on ImageNet and fine-tuned on a newly annotated dataset of more than 12,000 images: Food201-segmented.

Bolanos et al. [9] employed DCNNs to simultaneously perform food localization and recognition. First, the method produces a food activation map on the input image to generate bounding box proposals of food regions. Then, each proposal is processed to recognize the food types or food-related objects present in each bounding box. A similar approach is employed by Wang et al. [33]. Their segmentation method uses class activation maps and an adapted VGG-16 architecture to perform weakly supervised learning. The network is trained on food datasets as a top-down saliency model, and a Global Average Max Pooling (GAMP) layer is introduced into the architecture. The activation maps are used as constraints in a biased normalized cut, and the final segmentation masks are obtained by binarizing the biased normalized cut result.

In order to be effective in the context of dietary monitoring in real scenarios (i.e., in the wild), food segmentation algorithms must be robust to changes in illumination conditions. Since the image acquisition environment is mostly uncontrolled, we can have large variations in illuminants, and a network trained on pristine image datasets could perform poorly in real usage. For this reason, food image datasets that are large, heterogeneous, and representative of the task at hand are crucial for the design of effective methods for food recognition, segmentation, and quantity estimation. One of the largest datasets for food segmentation is Food201-segmented of [30]. Unfortunately, this dataset is not publicly available. Smaller but available segmented datasets are, for example, the UNIMIB2015 and UNIMIB2016 datasets [15, 16], created by collecting tray images in a canteen environment. Finally, if the application requires that the processing (i.e., acquisition and analysis) be performed on mobile devices in order to reduce bandwidth consumption, it is essential that the DCNN used for segmentation is lightweight and thus requires few operations (see [6] for a benchmark of different DCNN architectures).

3 Experimental Setup

For our experiments we adopted the Food50M dataset described in Sect. 3.1. We performed our experiments using an efficient network named GUNet [28], which is presented in more detail in Sect. 3.2. Every model is evaluated by means of the pixel-based IoU measure, computed as:

$$\begin{aligned} IoU = \frac{|groundtruth \cap prediction|}{|groundtruth \cup prediction|} \end{aligned}$$
(1)

This formula can alternatively be written as:

$$\begin{aligned} IoU = \frac{TP}{FP + TP + FN} \end{aligned}$$
(2)

where TP, FP, and FN represent True Positive, False Positive, and False Negative pixels, respectively. For every experiment we report the IoU measure computed independently for each class. We also report the mean IoU of the food and non-food classes as a more synthetic indicator of the performance of the whole segmentation method.
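As an illustration of how the measure in Eq. (2) is obtained from the predicted and ground-truth masks, the following Python snippet is a minimal sketch (not the code used in our experiments) that computes the per-class and mean IoU from integer label maps.

```python
import numpy as np

def per_class_iou(ground_truth: np.ndarray, prediction: np.ndarray, class_id: int) -> float:
    """Pixel-based IoU = TP / (TP + FP + FN) for a single class."""
    gt = ground_truth == class_id
    pred = prediction == class_id
    tp = np.logical_and(gt, pred).sum()
    fp = np.logical_and(~gt, pred).sum()
    fn = np.logical_and(gt, ~pred).sum()
    denom = tp + fp + fn
    return float(tp) / denom if denom > 0 else float("nan")

def mean_iou(ground_truth: np.ndarray, prediction: np.ndarray) -> float:
    """Mean of the per-class IoU over the two classes (0 = non-food, 1 = food)."""
    return float(np.nanmean([per_class_iou(ground_truth, prediction, c) for c in (0, 1)]))
```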

3.1 Food Image Dataset

Available large food image datasets with segmentation information are scarce. The largest dataset in the literature is the one introduced in [30], but unfortunately it is not publicly available. Other image datasets are either small or do not have pixel-based segmentation annotations. For these reasons, we decided to use the Food-50 dataset [23] as a base to create our own segmentation dataset. With respect to the available UNIMIB2015 [15] and UNIMIB2016 [16] datasets, Food-50 is large enough to be used for training and validation and is more diverse. Food-50 contains 5,000 images divided into 50 food classes. We manually annotated the regions of all the food items present in the images. This means that each food item belonging to one of the 50 classes, as well as any other food found, such as vegetables in side dishes or used for presentation, was annotated as “food”. We call this food image dataset Food50Seg. Figure 3 shows some examples of images in the dataset with the corresponding pixel-based annotations.

Fig. 3. Some segmentation examples from our Food50Seg food image dataset. Top row: “Arepas”, “Lobster”, and “Popcorn”. Bottom row: “Corn”, “Fried rice”, and “Chocolate”.

The images in the Food-50 dataset were collected from different sources on the Internet and appear to have been acquired using different cameras and mobile devices. They exhibit no evident illuminant cast, so we can assume that a form of white balancing has been applied by the imaging devices. Moreover, we can assume the images use the sRGB color space and thus are rendered under the CIE D65 illuminant. To understand the effect of lighting variations in the context of semantic food segmentation, we modified the Food-50 dataset by artificially changing the illumination conditions of the images. In particular, we follow a procedure similar to [7] by using nine blackbody radiators: with color temperatures from 3000 K to 8000 K in steps of 1000 K, and with color temperatures of 10000 K, 12000 K, and 14000 K. We call this dataset Food50M. Figure 4 shows some images rendered using the nine illuminants considered.
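As a rough illustration of this rendering step, the Python sketch below applies an illuminant cast to a white-balanced sRGB image via a von Kries-style diagonal transform in linear RGB. It is only an assumption of how such a cast can be simulated, not necessarily the exact pipeline of [7]; the linear-RGB white point of the target blackbody illuminant is taken as an input (e.g., computed separately from the Planckian locus).

```python
import numpy as np

def srgb_to_linear(x: np.ndarray) -> np.ndarray:
    """Standard sRGB decoding (values assumed in [0, 1])."""
    return np.where(x <= 0.04045, x / 12.92, ((x + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(x: np.ndarray) -> np.ndarray:
    """Inverse transfer function, with clipping to the valid range."""
    x = np.clip(x, 0.0, 1.0)
    return np.where(x <= 0.0031308, 12.92 * x, 1.055 * x ** (1.0 / 2.4) - 0.055)

def apply_illuminant_cast(image_srgb: np.ndarray, target_white_rgb) -> np.ndarray:
    """image_srgb: HxWx3 float array in [0, 1], assumed white-balanced (D65).
    target_white_rgb: linear-RGB white point of the desired blackbody illuminant."""
    linear = srgb_to_linear(image_srgb)
    gains = np.asarray(target_white_rgb, dtype=np.float64)
    gains = gains / gains.max()            # normalize to avoid clipping highlights
    return linear_to_srgb(linear * gains)  # per-channel (diagonal) scaling
```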

Fig. 4. Examples of food images in the Food50M dataset rendered using the nine illuminants considered. From top to bottom: “Dumplings”, “Bread”, “Sushi”, and “Salad”.

3.2 Network Architecture

For our experiments we adopted the GUNet architecture presented for the first time by Mazzini in [27]. In particular, we rely on the improved version with the Guided Upsampling Module [28]. This network is designed for real-time applications and thus its main characteristic is a low inference time. The reasons that led us to employ this architecture are twofold: first, this model can be employed on mobile devices for dietary monitoring applications in real conditions; second, the architecture is also very fast at training time, allowing us to train a relatively large number of models for our experiments on a single Titan Xp GPU.

The network follows an encoder-decoder design. The encoder has a multi-branch structure to encode features at multiple resolutions: the first branch is the deepest one, encoding the most abstract features, whereas the second branch is shallower by design in order to encode fine details without being too computationally heavy. The first part of the decoder is a Fusion Module, composed of a first stage where the signals are pre-processed independently, followed by a second stage where the signals are jointly processed and the information coming from the multiple resolution branches is fused together. The decoder ends with a layer named Guided Upsampling Module that efficiently upsamples the feature map. In this way the network is able to produce a semantic segmentation map at the same resolution as the network input.

The network has a total of 19 million parameters and requires 58.7 GFLOPs to perform a single inference on a 512\(\times \)1024 image. Most of the parameters are in the encoder part.
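To make the overall structure more concrete, the following PyTorch snippet is a highly simplified sketch of a two-branch encoder-decoder of this kind. It is not the GUNet implementation: the layer widths and depths are placeholders, and the Guided Upsampling Module of [28] is replaced here by plain bilinear upsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TwoBranchSegNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Deep branch: low resolution, most abstract features (1/16 of input size).
        self.deep = nn.Sequential(conv_bn_relu(3, 32, 2), conv_bn_relu(32, 64, 2),
                                  conv_bn_relu(64, 128, 2), conv_bn_relu(128, 128, 2))
        # Shallow branch: higher resolution, fine details (1/4 of input size).
        self.shallow = nn.Sequential(conv_bn_relu(3, 32, 2), conv_bn_relu(32, 64, 2))
        # Fusion Module: independent pre-processing, then joint processing.
        self.pre_deep = conv_bn_relu(128, 64)
        self.pre_shallow = conv_bn_relu(64, 64)
        self.joint = conv_bn_relu(128, 64)
        self.classifier = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        d = self.pre_deep(self.deep(x))
        s = self.pre_shallow(self.shallow(x))
        d = F.interpolate(d, size=s.shape[2:], mode="bilinear", align_corners=False)
        fused = self.joint(torch.cat([d, s], dim=1))
        logits = self.classifier(fused)
        # Final upsampling to the input resolution (GUM replaced by bilinear here).
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear", align_corners=False)
```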

3.3 Training Recipe

All the network configurations in this paper have been trained with Stochastic Gradient Descent (SGD) with momentum. Following [28], we set the base learning rate to 0.001 and trained for 150 epochs. We adopted a fixed-step learning rate policy: the initial value is decreased twice by an order of magnitude (at 100 and 200 epochs).
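A minimal PyTorch sketch of this training recipe is given below. The momentum value (0.9) and the per-pixel cross-entropy loss are assumptions, since they are not specified above; `train_loader` is a hypothetical loader yielding image batches and the corresponding food/non-food label maps, and `TwoBranchSegNet` is the simplified placeholder defined earlier rather than the actual GUNet model.

```python
import torch

model = TwoBranchSegNet(num_classes=2)    # placeholder for the actual GUNet model [28]
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # momentum assumed
# Fixed-step policy: the learning rate is divided by 10 at the reported milestones.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 200], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()   # assumed per-pixel loss (food vs. non-food)

for epoch in range(150):                  # number of epochs as reported above
    for images, labels in train_loader:   # hypothetical data loader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```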

4 Experimental Results

4.1 Assessing Robustness to Illuminants

With a first cluster of experiments, we want to assess whether our DCNN model trained on the original Food50Seg training set is robust to changes of the illuminant in the scene. The same network, i.e., GUNet [28] described in Sect. 3.2, is trained only once on the original dataset and tested on the nine modified versions of the test set, one per illuminant. Numerical results for this experiment are reported in Table 1 in terms of the IoU measure versus the illuminant color. The first column reports the network performance on the original test set. A performance degradation is clearly noticeable on all the modified test sets with respect to the baseline. In particular, the IoU levels decay more markedly for extreme color casts. The highest IoU values are obtained between 7000 K and 8000 K, suggesting that the distribution of illuminants in the original image set is centered on such values (consider that the daylight temperature is 6500 K).

Table 1. Results of our DCNN model tested on different illuminants. The first column no-cast reports the results on the original test set.

By looking at the detailed results (i.e., the first two rows) in Table 1 we can observe an interesting detail: the IoU values for the non-food class remain quite high for every color cast, whereas the performance decrease is substantial only for the food class. This might be related to the fact that the non-food regions in our test set images have been collected “in the wild” and thus exhibit a wider range of illuminants; the trained model is therefore more robust to illuminant changes on these regions. In the next experiments we augment the training set with images rendered under the different illuminants to measure whether the model can acquire a certain degree of robustness to these types of image transformations.

4.2 Augmented Training Set

With this set of experiments, we want to assess the behaviour of the DCNN when trained on an augmented training set and observe the difference with respect to the same model trained on the original dataset. In detail, we want to test whether the model benefits from being trained on a dataset with a specific illuminant, and compare it with a model trained on the union of all the illuminants. Table 2 reports the results for three clusters of experiments: the first three rows represent our baseline from Table 1 (i.e., a single model trained on the original dataset).

Table 2. Results of three different training setups tested against various illuminant casts. First 3 rows: the model is trained on the original training set. Rows 4 to 6: the model is trained on the specific illuminant only. Last 3 rows: the training set is augmented with all the illuminants.

The second cluster of experiments is shown in rows 4 to 6. Every column represents the result obtained by training and testing a model on that specific illuminant: for each illuminant, we augmented the original training set with images rendered under that illuminant, while the test set, as in all the other experiments, consists of images with a single modified illuminant. Notice that the first and the last column are missing because they would represent redundant data: the first column corresponds to the model trained and tested on the no-cast dataset, which is shown in the first three rows of the table, while the last column corresponds to the model trained on all illuminants, which is reported in the last three rows. The performance rises dramatically with respect to the baseline, with most of the benefit coming from the food class.

The last three rows of Table 2 report the results of a single model trained on the union of all the illuminants. Surprisingly, for some illuminants this is the best model, outperforming even the models trained on the specific illuminant. This suggests that the DCNN model benefits from training on a higher number of images with higher variability. We suppose that the model trained on different illuminants generalizes better because it becomes able to discriminate food from non-food areas while abstracting from the type of illuminant in the scene.
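The three training-set configurations compared in Table 2 can be summarized as in the sketch below, assuming PyTorch-style datasets. `Food50SegDataset` and its `illuminant` argument are hypothetical helpers used only for illustration; they are not part of our released code.

```python
from torch.utils.data import ConcatDataset

ILLUMINANTS_K = [3000, 4000, 5000, 6000, 7000, 8000, 10000, 12000, 14000]

# Hypothetical dataset class returning (image, mask) pairs, optionally rendered
# under a given blackbody illuminant (None = original, no-cast images).
original_train = Food50SegDataset(split="train", illuminant=None)

# Setup 2 (rows 4-6): original training set augmented with one specific illuminant.
def single_illuminant_train(temp_k):
    return ConcatDataset([original_train, Food50SegDataset(split="train", illuminant=temp_k)])

# Setup 3 (last 3 rows): original training set augmented with all nine illuminants.
all_illuminants_train = ConcatDataset(
    [original_train] + [Food50SegDataset(split="train", illuminant=k) for k in ILLUMINANTS_K])
```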

In Table 2, the first column indicates tests made on the original test set without any illuminant cast, while the last column indicates a test set composed of the union of all the test sets with illuminant casts. The model trained on all the illuminants exhibits an interesting behaviour in these two cases as well. With respect to the no-cast set, it behaves slightly better than the baseline model, suggesting that this form of data augmentation does not hurt the performance on the original data, i.e., the transformations are balanced and do not introduce a bias in the data. Concerning the all-illuminant test set, it shows a performance similar to that on the no-cast set. Notice that, in contrast, the performance of the model trained only on the original data is visibly degraded.

Fig. 5. Some segmentation results obtained using the different training strategies. no-cast denotes that the training is performed on the original data; “all ill.” denotes that all the illuminants have been used during training. Images have been resized for display.

Table 3. Results of our DCNN model trained on different illuminants and tested on the original test set. The first column no-cast represents the results of the model trained on the original data.

In the last cluster of experiments we test the networks trained on specific illuminants against the original test set. With these experiments we want to verify whether models trained on those augmented datasets maintain an acceptable segmentation performance even on the original test set, which is not affected by any illuminant cast. Results are shown in Table 3. For the model trained on the 3000 K illuminant, the mean IoU is noticeably lower than the no-cast baseline. This is probably due to the fact that 3000 K is an extreme illuminant color cast and such a cast is not present in any of the original test set images. Surprisingly, for some other illuminants, such as 14000 K or 8000 K, training on the specific illuminant is beneficial even when testing on the original set. However, these improvements are not significant and could be due to the intrinsic randomness in the initialization of CNN models. As a general consideration, we can state that the models do not exhibit evident performance degradation.

Figure 5 shows some visual results obtained using the different training strategies. The no-cast column represents the output of the model trained only on the original data, while “all ill.” denotes the output of the model trained on the union of all the illuminant casts. The no-cast model clearly produces the worst visual results: in some images (i.e., rows 4 and 8) it does not detect anything at all, and the bottom three images are clearly under-segmented. All the other configurations achieve overall better segmentation results. This is in line with what emerges from Table 3.

5 Conclusions

In this work we explored the potential of Deep Convolutional Neural Networks for food image segmentation, i.e., for discriminating food regions from the background. In particular, we evaluated the performance of an efficient DCNN with respect to the variability in illumination conditions found in real-scene images. We built a new dataset, named Food50M, where the images of the training and test sets have been modified with nine different illuminants. Results show that the network trained on pristine data is not able to cope with strong illuminant shifts. By training and testing the model on a specific illuminant, performance improves with respect to the baseline, as expected. Using all the illuminants in training, we are able to further improve the IoU measure. This demonstrates that we can effectively improve the robustness of the model by exploiting an augmented training set composed of images with simulated illuminants. As future work, we plan to evaluate color constancy algorithms or filter removal methods as pre-processing steps during the training phase, both traditional and CNN-based, such as [8, 11, 14, 31]. Moreover, our food/non-food segmentation approach can be coupled with a semantic food segmentation technique such as [4, 13, 26] in order to recognize food-specific regions. This will allow the implementation of a system able to recognize foods and estimate their properties, such as quantity and calories.