
1 Introduction

Nowadays, people tend to ignore the impact that food may have on their lives. Unfortunately, inadequate nutrition is one of the main causes of many chronic diseases such as obesity, diabetes, cancer, osteoporosis, dental diseases and cardiovascular problems [1, 26]. Moreover, people who neglect a healthy diet can incur malnutrition problems.

Malnutrition can be defined as “any nutritional imbalance” that comprises both over- and under-nutrition and mainly involves elderly people, even in developed countries. Inadequate nutrition has been documented both in institutionalised and in free-living elderly, and exerts negative effects on health outcomes. The prevalence of malnutrition has been estimated at \(14\%\) in the older population. Moreover, 28–45% of older people recently moved to care homes or hospitals were malnourished on admission [10]. This situation has serious consequences for individuals as well as for society, including an increased risk of morbidity and mortality and, consequently, higher social costs. However, nutrition screening of older adults is extremely difficult: some screening methods are self-reported, with possible misreports, others can only be administered by trained clinicians, while biochemical markers are time consuming and expensive to use.

In recent years, these facts have led Computer Vision researchers to develop new solutions for automatically collecting information during food intake in the context of diet monitoring [3, 20, 23, 30]. Nevertheless, the intrinsic variability of food in colour and shape, as well as the great assortment of ingredients, makes the development of an efficient and effective food understanding engine very challenging. Ideally, a comprehensive system should be able to detect the food dish, recognise the ingredients, estimate the volume and finally provide nutritional values. In the last decade, the spread of annotated datasets of food images [5, 11,12,13, 16, 19, 22, 24, 29], coupled with the massive use of learned features, has led to promising performance on food detection and recognition tasks. On the contrary, quantity estimation studies have suffered from the lack of proper datasets. Since 3D information has to be inferred for a correct volume estimation, in a totally unsupervised environment with no spatial references, this results in an extremely challenging problem [2, 6,7,8, 18, 23, 25].

However, in order to simplify patients' diet monitoring, the current practice in healthcare facilities is to serve a set of standard food portions (e.g., small, medium, large). In this context, it is not necessary to estimate the exact amount of food in the dish, and the problem can be addressed using ranking strategies. In other words, given two images \(I_1\) and \(I_2\), one looks for a function \(rank(\cdot )\) such that \(rank(I_1)<rank(I_2)\) if the food amount depicted in \(I_1\) is lower than the food amount depicted in \(I_2\). With this idea in mind, and inspired by [21], we investigate the use of Learning to Rank algorithms [17] to sort food images according to the portion size.

To this aim, we introduce a new annotated dataset of 99 images and provide a baseline by using Ranking SVM [14] with three different kinds of visual features, i.e., Bag of Words, GoogLeNet features [27, 28] and MobileNet features [15]. Although Learning to Rank has been successfully used in Information Retrieval, Natural Language Processing and Data Mining [17], to the best of our knowledge this is the first time it is employed for food amount ranking. Hence, the novel contribution of this work is two-fold:

  • a new annotated dataset to address the food ranking problem with respect to the portion size;

  • the assessment of Ranking SVM [21] to learn a ranking function with different visual features.

The paper is organized as follows. In Sect. 2 we report the related works in the field. In Sect. 3 we focus on the contribution of this work by detailing the proposed dataset and the considered Learning to Rank approach. Section 4 describes the experimental settings and reports the results. Section 5 summarises the conclusions.

2 Related Works

Several works related to food understanding are available in the literature. Most of them focus on food detection, recognition and classification, and are motivated by the increasing demand for assistive technologies for diet monitoring [3, 20, 23, 30].

In 2009, Puri et al. [23] proposed a system for food recognition and volume estimation. The classifier was obtained as a linear combination of multiple weak SVM classifiers trained on texture features. The volume was inferred by using RANSAC and dense stereo matching for depth map construction. In [8], Dehais et al. employed stereo pairs to compute a disparity map and build a dense point cloud which is aligned to the table plane. This framework was designed to work with a reference card placed on the table. Assuming the different food items in the plate are already segmented, each segment is projected into 3D space for volume computation. In 2013, Chen et al. [6] presented a method for volume estimation from a single view image. The approach requires a specific shape model for each food category and a calibrated reference marker. In 2017, Dehais et al. [7] proposed to estimate the volume of multi-food meals with unconstrained 3D shape using stereo vision. The approach requires two images of the meal placed inside an elliptical plate, a credit-card-sized reference next to the plate and a segmentation map. Allegra et al. [2] proposed to use RGB-D images and supervised learning to perform depth estimation. The authors performed semantic segmentation by using U-Net and then used a modified version of the CNN in [9] for depth inference from a single RGB input. In 2018, Lu et al. [18] presented a Multi-Task Learning approach to estimate the volume of food items from a single RGB image. The proposed CNN architecture is composed of multiple modules. The first one is a feature extraction module, composed of ResNet50 and a Feature Pyramid Network (FPN). The second module is the depth prediction net, which is mainly based on an encoder-decoder architecture with skip connections and multi-scale side predictions. Semantic segmentation is performed by a Region Proposal Network (RPN) and a recognition net. The output of the RPN is used to provide a food candidate mask, and the final volume is obtained through a CNN regressor.

Unlike the previous works, in this paper we propose to address food amount estimation as a ranking problem. We perform neither depth estimation nor other 3D reconstruction steps, since we operate in a context with a set of fixed food portions (small, medium and large portions served at the canteen of the University). Ranking is one of the fundamental problems in information retrieval. Given a query q and a collection of documents D that match the query, ranking consists of sorting the documents according to some criterion. Learning to Rank (LTR) refers to machine learning strategies for training a model for a ranking task [17]. Specifically, we use LTR to train a model for a ranking task which lies in sorting RGB food images according to the relative visual attribute “portion size”. The concept of relative attribute was introduced in [21] by Parikh and Grauman. Their aim was to provide a way to estimate the degree of an attribute given an image. Hence, rather than predicting the presence of an attribute, a relative attribute indicates the strength of an attribute in an image with respect to other images.

3 Materials and Methods

The main contribution of this paper lies in using Ranking SVM to train a model for ranking food images according to the portion size. Additionally, for validation purposes, we introduce a new dataset whose details are described in the following section. The dataset is publicly available for research purposes (Footnote 1). In order to provide a baseline, we perform a comparison between three different kinds of image descriptors which we exploit with Ranking SVM: Bag of Words [4], GoogLeNet features [28] and MobileNet features [15].

3.1 Proposed Dataset

The proposed dataset consists of 99 RGB images belonging to 11 different classes (e.g., “insalata orzo e verdura”, “cordon bleu”, “fusilli alla crudaiola”, etc.). Each image is associated with one of three possible portion sizes: small, medium and large. For each class we collected 9 images, corresponding to 3 images for each portion size. Moreover, in order to introduce variability, during acquisition we used plates with two different diameters: 18 cm (small plate) and 22.8 cm (large plate). All the acquisitions were performed with a standard RGB camera fixed on a support, with a centered top view with respect to the plate. Fig. 1 shows some examples of images belonging to the proposed dataset. To the best of our knowledge, state-of-the-art datasets are not suitable for testing LTR approaches. Hence, differently from them, the proposed dataset includes multiple portion sizes for each dish in order to properly test LTR methods. To promote the new task of ranking food images and to make our experiments repeatable, the proposed dataset is publicly available.

Fig. 1. Examples of three different food classes within the dataset (rows) and related portion sizes: small (first column), medium (second column) and large (third column).

3.2 Ranking SVM

Ranking SVM is a popular ranking method proposed by Herbrich et al. [14]. The idea behind Ranking SVM is to transform ranking into pairwise classification and employ the standard SVM strategy to perform the learning task according to one specific attribute. Given the image descriptor \(\mathbf{x}_i \in \mathbb{R}^n\) of the image \(I_i\), the final aim is to find a ranking function \(r_m(\cdot )\):

$$\begin{aligned} r_m(\mathbf{x}_i)=\mathbf{w}_m^T\mathbf{x}_i \end{aligned}$$
(1)

such that \(r_m(\mathbf{x}_i) \ge r_m(\mathbf{x}_j)\) if and only if the strength of the attribute in \(I_i\) is higher than the strength of the attribute in \(I_j\).

Hence, more formally, Ranking SVM is defined as the following constrained optimization problem:

$$\begin{aligned} \text {minimize}\quad \frac{1}{2}\left\| \mathbf{w}_m\right\| _2^2+C\left( \sum \xi _{ij}^2+\sum \gamma _{ij}^2\right) \end{aligned}$$
(2)

subject to:

$$\begin{aligned} \mathbf{w}_m^T(\mathbf{x}_i-\mathbf{x}_j) \ge 1-\xi _{ij} \quad \forall (i,j) \in O_m \end{aligned}$$
(3)
$$\begin{aligned} \left| \mathbf{w}_m^T(\mathbf{x}_i-\mathbf{x}_j)\right| \le \gamma _{ij} \quad \forall (i,j) \in S_m \end{aligned}$$
(4)
$$\begin{aligned} \xi _{ij} \ge 0 \quad \gamma _{ij} \ge 0 \end{aligned}$$
(5)

where \(\xi _{ij}\) and \(\gamma _{ij}\) are slack variables to relax the constraints and control the SVM margins; \(C > 0\) is a regularization parameter that limits the growth of the slack variables; \(O_m\) is the set of ordered pairs (i, j) such that image \(I_i\) has a higher presence of the attribute than image \(I_j\); \(S_m\) is the set of ordered pairs (i, j) such that images \(I_i\) and \(I_j\) have about the same presence of the attribute.

Hence, during the training phase, one has to provide the descriptors of the training images together with the sets \(O_m\) and \(S_m\).
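A common way to put this into practice is to turn each ordered pair into a classification sample and train a standard linear SVM on descriptor differences. The following is a minimal sketch of this idea using scikit-learn's LinearSVC; it uses only the ordered pairs \(O_m\) with the usual hinge loss, whereas the formulation above also penalises the similar pairs \(S_m\) through squared slack variables, so it is an approximation rather than the exact optimization in Eq. (2).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranking_svm(X, O_m, C=1.0):
    """Learn a linear ranking function r_m(x) = w_m^T x from ordered pairs.

    X   : (n_images, n_features) array of image descriptors x_i.
    O_m : iterable of (i, j) pairs where image i shows a larger
          portion than image j.
    """
    # Pairwise transform: the constraint w^T (x_i - x_j) >= 1 becomes
    # classifying the difference vector x_i - x_j as positive; the
    # mirrored pair is added so the two classes are balanced.
    diffs, labels = [], []
    for i, j in O_m:
        diffs.append(X[i] - X[j]); labels.append(+1)
        diffs.append(X[j] - X[i]); labels.append(-1)
    # No intercept: adding a constant to r_m does not change the ranking.
    clf = LinearSVC(C=C, fit_intercept=False)
    clf.fit(np.asarray(diffs), np.asarray(labels))
    return clf.coef_.ravel()  # the weight vector w_m

# Ranking scores for test images: higher score means larger portion.
# w_m = train_ranking_svm(X_train, O_m)
# scores = X_test @ w_m
```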

3.3 Image Representation

In order to apply Ranking SVM, each RGB image \(I_i\) has to be described by a feature vector \(\mathbf{x}_i\). In this study we employ three different strategies to build the descriptor \(\mathbf{x}_i\), namely the Bag of Words paradigm with SURF, a pre-trained GoogLeNet Inception v3 and a pre-trained MobileNet. In the following, we report some details about the employed representation models.

Bag of Visual Words. All BoW approaches are based on frequency statistics over primitive units (pixels, lexical units, etc.) from a finite set. This means that a vocabulary of “words” must be built.

The general idea behind this method is to represent an image as a histogram of visual word frequencies. As a first step, key-point descriptors are extracted from the images. Subsequently, a clustering algorithm is employed to quantize the key-point feature space by identifying K centroids to be used as a vocabulary V composed of K visual words. Then, the final representation of an image consists of a normalised histogram H, where the bin \(H_i\) is related to the frequency of the visual word \(v_i \in V\). For experimental purposes, we used Speeded Up Robust Features (SURF) [4] for key-point extraction and the K-Means algorithm to create a vocabulary of a specific size K. In our experiments we adopt a 64-dimensional key-point descriptor and a vocabulary size of \(K=500\); a minimal sketch of the pipeline is given below.
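The following sketch illustrates the pipeline under the settings above (64-dimensional SURF descriptors, \(K=500\)). It relies on OpenCV's contrib module for SURF, which may require a build with non-free algorithms enabled; it is an illustrative implementation, not necessarily the exact one used in our experiments.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

# SURF lives in opencv-contrib; extended=False keeps the descriptors
# 64-dimensional, matching the setting stated above.
surf = cv2.xfeatures2d.SURF_create(extended=False)

def extract_surf(image_path):
    """Return the SURF key-point descriptors of one grayscale image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = surf.detectAndCompute(img, None)
    return descriptors

def build_vocabulary(descriptor_sets, K=500):
    """Quantize the key-point space into K visual words via K-Means."""
    kmeans = KMeans(n_clusters=K, n_init=4, random_state=0)
    kmeans.fit(np.vstack(descriptor_sets))
    return kmeans

def bow_histogram(descriptors, kmeans, K=500):
    """Represent one image as a normalised histogram of visual words."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()
```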

GoogLeNet Inception V3. GoogLeNet Inception v3 [28] is a Convolutional Neural Network (CNN) pre-trained on ImageNet. Its architecture consists of 42 layers. Differently from the previous version [27], in v3 the computational burden is reduced while the accuracy is unaltered. This is achieved by replacing \(5 \times 5\) convolutions with \(3 \times 3\) convolutions and by employing factorization methods. In this work, we used the pre-trained GoogLeNet Inception v3 to extract visual features from food images: the last layer is removed and a Global Average Pooling layer is placed to get a 2048-dimensional feature vector.

MobileNet. MobileNet [15] is based on a streamlined architecture that employs depthwise separable convolutions to build light and efficient deep neural networks. A depthwise separable convolution is a form of factorized convolution which replaces a standard convolution with a depthwise convolution followed by a \(1 \times 1\) convolution called pointwise convolution. This makes MobileNet suitable for low-power devices. To extract visual features for our experiments, we used a MobileNet pre-trained on ImageNet. The architecture consists of 28 layers, but we remove the last one and use Global Average Pooling to get a 256-dimensional descriptor. A sketch of both feature-extraction steps is given below.
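Both extraction steps can be reproduced with the pre-trained backbones shipped with Keras, as in the following sketch. The width multiplier alpha=0.25 is our assumption, chosen because it makes the pooled MobileNet descriptor 256-dimensional as stated above.

```python
import numpy as np
import tensorflow as tf

# Pre-trained backbones without the classification head; global average
# pooling turns the last feature map into a fixed-length vector.
inception = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")  # 2048-d output
mobilenet = tf.keras.applications.MobileNet(                # 256-d output
    weights="imagenet", include_top=False, pooling="avg",
    alpha=0.25)  # alpha=0.25 is an assumption yielding 256 channels

def describe(image_path, model, size, preprocess):
    """Load an image, apply the network's own preprocessing and return
    the pooled feature vector x_i."""
    img = tf.keras.utils.load_img(image_path, target_size=(size, size))
    x = preprocess(np.expand_dims(tf.keras.utils.img_to_array(img), 0))
    return model.predict(x, verbose=0)[0]

# x_goog = describe("dish.jpg", inception, 299,
#                   tf.keras.applications.inception_v3.preprocess_input)
# x_mob  = describe("dish.jpg", mobilenet, 224,
#                   tf.keras.applications.mobilenet.preprocess_input)
```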

4 Experiments and Results

To assess the proposed approach, we employ a three-fold cross-validation. Hence, we run the experiments three times, using 66 images for training and 33 for testing, and finally average the results. Note that the test set always includes 1 small, 1 medium and 1 large portion image for each of the 11 classes. Moreover, all the images are resized to \(250 \times 250\) pixels. In the training phase we consider all the possible pairs in order to define \(O_m\) and \(S_m\) (see Sect. 3.2), whereas in the test phase we validate the method by considering the ordering inferred for image triplets. For a proper quantitative evaluation we use two different evaluation measures, i.e., accuracy and Hamming distance, which are detailed in the next sections.
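One simple way to build folds with the stated property is sketched below, assuming each image carries an acquisition index in {0, 1, 2} within its (class, portion size) combination; the indexing scheme is our assumption, since any split placing one acquisition of every combination per fold would do.

```python
import numpy as np

def make_folds(instance_ids):
    """instance_ids[i] in {0, 1, 2}: which of the three acquisitions of
    its (class, portion size) combination image i is.  Fold f collects
    acquisition f of every combination, so each 33-image test fold holds
    one small, one medium and one large image per class."""
    ids = np.asarray(instance_ids)
    return [np.flatnonzero(ids == f) for f in range(3)]

# folds = make_folds(instance_ids)
# for f in range(3):
#     test_idx  = folds[f]
#     train_idx = np.concatenate([folds[g] for g in range(3) if g != f])
```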

4.1 Training Phase

Training the Ranking SVM model requires building the sets \(O_m\) and \(S_m\) described in Sect. 3.2. The number of all possible pairs (i, j) can be easily computed as \(\binom{66}{2}\), hence \(|O_m| + |S_m| = 2145\). Moreover, since \(S_m\) includes all the pairs of images with the same degree of the considered attribute (i.e., portion size), we have \(|S_m| = 3\times \binom{22}{2} = 693\). The remaining 1452 pairs belong to \(O_m\). We would like to highlight that, although 66 images could be considered a limited amount of data for training purposes, the actual training data for Ranking SVM are the pairs in \(O_m \cup S_m\), i.e., 2145 samples. A sketch of this pair construction follows.
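A minimal sketch of building \(O_m\) and \(S_m\) from per-image portion labels (encoded as 1 = small, 2 = medium, 3 = large, as in Sect. 4.3):

```python
from itertools import combinations

def build_pair_sets(portion_labels):
    """Split all image pairs into O_m (different portion sizes, larger
    portion first) and S_m (same portion size)."""
    O_m, S_m = [], []
    for i, j in combinations(range(len(portion_labels)), 2):
        if portion_labels[i] == portion_labels[j]:
            S_m.append((i, j))
        elif portion_labels[i] > portion_labels[j]:
            O_m.append((i, j))
        else:
            O_m.append((j, i))
    return O_m, S_m

# With 66 training images (22 per portion size):
# len(O_m) == 1452, len(S_m) == 693, totalling C(66, 2) == 2145 pairs.
```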

4.2 Testing Phase

Since in this work we want to distinguish between three different portion sizes, we evaluate the proposed method by considering image triplets. Hence, we use the \(\binom{33}{3} = 5456\) possible triplets as testing samples. For each image triplet \(\{I_s,I_p,I_q\}\) and the related descriptors \(\{\mathbf{x}_s, \mathbf{x}_p, \mathbf{x}_q\}\), we compare the ground truth ranking \(r^*(\cdot )\) (which we can compute using the labels small, medium and large) with the one inferred through the Ranking SVM approach, i.e., \(r_m(\cdot )\).

4.3 Evaluation Measures

To quantitatively evaluate the performance of a learned ranking function \(r_m(\cdot )\), we used two different measures: accuracy and Hamming distance. In this context, the accuracy is the number of correctly sorted triplets over the total number of tested triplets. More formally, we define the accuracy of the ranking function \(r_m(\cdot )\) as:

$$\begin{aligned} acc=\frac{1}{M}\sum _{k=1}^{M}\delta _k \end{aligned}$$
(6)

where M is the number of testing triplets. Given the ground truth ranking values \(\{r^*(\mathbf{x}^k_s), r^*(\mathbf{x}^k_p), r^*(\mathbf{x}^k_q)\}\) related to the k-th triplet \(\{I^k_s,I^k_p,I^k_q\}\), and assuming a ranking order \(r^*(\mathbf{x}_s) \le r^*(\mathbf{x}_p) \le r^*(\mathbf{x}_q)\), the value of \(\delta _k\) is defined as follows:

$$\begin{aligned} \delta _k = {\left\{ \begin{array}{ll} 1 &{} \text {if}\quad r_m(\mathbf x _s) \le r_m(\mathbf x _p) \le r_m(\mathbf x _q) \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)

In a nutshell, \(\delta _k=1\) if the ground truth ranking order and the inferred ranking order agree on a triplet.

Differently, the Hamming distance evaluates the specific mismatches between the elements of the ground truth sorted triplet and the elements of the predicted one. Moreover, we assign a higher penalty if the mismatch occurs between a small portion and a large portion. For the sake of formalism, we define an ordered triplet as \(\mathbf{T}=(I_s,I_p,I_q)\) and a function \(label: \mathbb{I} \rightarrow \{1,2,3\}\):

$$\begin{aligned} label(I) = {\left\{ \begin{array}{ll} 1 &{} \text {if image } I \text { depicts a small portion} \\ 2 &{} \text {if image } I \text { depicts a medium portion}\\ 3 &{} \text {if image } I \text { depicts a large portion} \end{array}\right. } \end{aligned}$$
(8)

Then, if \(\mathbf{T}^*\) is a correctly ordered triplet and \(\mathbf{T}_m\) is the predicted one, the Hamming distance between \(\mathbf{T}^*\) and \(\mathbf{T}_m\) is defined as:

$$\begin{aligned} hd(\mathbf{T}^*,\mathbf{T}_m)=\sum _{k=1}^{3}\left| label(T^*_k)-label(T_{m,k})\right| \end{aligned}$$
(9)

where \(T_k\) denotes the k-th image in an ordered triplet. In our experimental settings, the Hamming distance is computed for all the testing triplets and then the average is considered. A sketch of both evaluation measures is reported below.
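The following is a minimal sketch of both measures, consistent with the definitions above; scores holds the learned ranking values \(r_m(\mathbf{x})\) and labels the ground truth portion labels in {1, 2, 3}.

```python
from itertools import combinations

def evaluate_triplets(scores, labels):
    """Compute triplet accuracy and mean Hamming distance.

    scores : predicted ranking values r_m(x) for the test images.
    labels : ground truth portion labels in {1, 2, 3}.
    """
    correct, hd_total, M = 0, 0, 0
    for trip in combinations(range(len(labels)), 3):
        gt   = sorted(trip, key=lambda k: labels[k])  # ground truth order
        pred = sorted(trip, key=lambda k: scores[k])  # inferred order
        # delta_k = 1 when the inferred order is non-decreasing in the
        # ground truth labels (this also handles repeated sizes).
        correct += int(all(labels[a] <= labels[b]
                           for a, b in zip(pred, pred[1:])))
        # Eq. (9): element-wise label mismatch between the two orderings.
        hd_total += sum(abs(labels[a] - labels[b])
                        for a, b in zip(gt, pred))
        M += 1
    return correct / M, hd_total / M

# acc, mean_hd = evaluate_triplets(X_test @ w_m, test_labels)
```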

4.4 Results

To benchmark the dataset and validate the proposed approach, we perform tests with BoW, GoogLeNet and MobileNet features on the 5456 triplets. Experimental results in terms of both accuracy and Hamming distance are reported in Table 1.

Considering the evaluation in terms of accuracy, Ranking SVM on food images achieves the best results when deep learning based features are used to represent images. This result is not surprising, since deep learning features tend to be more descriptive than classic hand-crafted ones, even without fine-tuning. Specifically, the MobileNet representation performed better than GoogLeNet in our experiments (\(94.05\%\) vs \(87.84\%\)).

The performance in terms of Hamming distance follows the same trend. Since it measures the error degree, MobileNet presents the lowest average Hamming distance. For a better understanding of the values reported in Table 1, we remark that, according to the description given in Sect. 4.3, the highest value (i.e., the maximum error) for the Hamming distance is 4. For example, it is obtained when comparing two ordered triplets with labels (1, 2, 3) and (3, 2, 1), respectively.

Table 1. Experimental results

5 Conclusion

The work presented in this paper is motivated by the massive demand for automatic systems for food intake monitoring. In a context where a predefined set of portion sizes is available, we proposed a novel approach for food portion sorting based on Learning to Rank algorithms. This allows the problem to be addressed without the depth and 3D estimation required by literature works operating in less-constrained environments. To validate the proposed method, we introduced a new dataset of 99 images, which includes 11 categories and three distinct portion sizes, namely small, medium and large. Experiments have been performed with three different image descriptors based on Bag of Words, GoogLeNet and MobileNet, and with the LTR algorithm Ranking SVM. Results show that the evaluated method opens the way to future extensions and industrial applications. Currently, we are planning to extend the dataset by performing new acquisitions in real healthcare facilities under the supervision of nutritionists.