Abstract
Food image segmentation, which aims to distinguish the various ingredients in an image, is crucial for food safety, as estimating calories and other nutrients is important for human health and sustainable development. However, current image segmentation methods perform poorly on food image datasets, both because ingredients and everyday props differ widely in appearance and imaging conditions and because these methods extract features from food images inadequately. In addition, using attention mechanisms to capture detailed contextual information and long-range dependencies incurs quadratic computational complexity. In this paper, we propose a Cross Spatial Attention (CSA) module that extracts richer spatial features from food images with lower time and space complexity. Specifically, the CSA module aggregates contextual information by cross-calculation along the horizontal and vertical dimensions. Experiments demonstrate that, with a two-step cross-calculation, each pixel can eventually capture global long-range dependencies. Furthermore, our method integrates a Channel Attention (CA) module that selectively highlights interdependent channel information by integrating relevant features across all feature maps. The outputs of the two attention modules are then aggregated to enhance the feature representation of the image. The method achieves convincing performance improvements on FoodSeg103, UECFoodPix and ADE20K. Moreover, the proposed network achieves a better trade-off between accuracy and efficiency.
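The idea of reaching global context through a horizontal and a vertical pass can be illustrated with a minimal NumPy sketch. This is not the paper's CSA module: the function names, the use of unprojected features as queries, keys and values, the scaling by the square root of the channel count, and the row-then-column ordering are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def row_attention(feat):
    """Horizontal pass: each pixel attends to the W pixels in its own row.

    feat has shape (H, W, C). Assumption: the raw features serve as
    queries, keys and values (no learned projections).
    """
    channels = feat.shape[-1]
    scores = feat @ feat.transpose(0, 2, 1) / np.sqrt(channels)  # (H, W, W)
    return softmax(scores, axis=-1) @ feat                       # (H, W, C)


def column_attention(feat):
    """Vertical pass: swap the spatial axes, reuse the row pass, swap back."""
    return row_attention(feat.transpose(1, 0, 2)).transpose(1, 0, 2)


def two_step_cross_attention(feat):
    """Horizontal pass followed by vertical pass (hypothetical composition)."""
    return column_attention(row_attention(feat))


def pairwise_scores(height, width):
    """Count attention scores for full attention vs. the two-pass variant."""
    pixels = height * width
    full = pixels * pixels                # every pixel against every pixel
    two_pass = pixels * (width + height)  # one row plus one column per pixel
    return full, two_pass


rng = np.random.default_rng(0)
feat = rng.normal(scale=0.1, size=(4, 5, 8))  # toy (H, W, C) feature map
out = two_step_cross_attention(feat)
```

For a 64 by 64 feature map, `pairwise_scores(64, 64)` counts 16,777,216 scores for full attention against 524,288 for the two passes. In this sketch, perturbing a single corner pixel leaves the row-pass output of every other row unchanged, yet it changes the output at the opposite corner once the column pass has also run, which is the sense in which two steps can carry information between any pair of pixels.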
Acknowledgements
This work was supported by the Scientific Research Program of Beijing Municipal Education Commission (No. KZ202110011017), the National Natural Science Foundation of China (No. 62277001) and the Beijing Natural Science Foundation (No. L233026).
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dong, X., Li, H., Wang, X. et al. CANet: cross attention network for food image segmentation. Multimed Tools Appl 83, 60987–61006 (2024). https://doi.org/10.1007/s11042-023-17916-z
DOI: https://doi.org/10.1007/s11042-023-17916-z