Abstract
Because many retail products look alike, identifying products with high accuracy and low computational cost is a major challenge in smart retail scenes. In this paper, we propose a lightweight retail product identification and localization method based on an improved convolutional neural network. First, we use group convolution and depthwise separable convolution to optimize the structure of the backbone network and reduce the amount of computation. Second, we adjust the multiscale structure to the optimal scales and use the k-means clustering algorithm to re-cluster six anchors of different sizes. Third, we introduce spatial pyramid pooling (SPP) to replace pooling by convolution, which effectively improves robustness against image distortions such as cropping and scaling. Finally, we use the mosaic data augmentation method to improve the robustness of the network. Experiments on the RPC dataset show that, compared with YOLOv5, the number of parameters is reduced by a factor of 6.4 and FLOPs by a factor of 9. Experiments on the DeepBlue Retail Dataset show that, compared with YOLOv5, the number of parameters is reduced by a factor of 7.8 and FLOPs by a factor of 9.3. Real-time evaluation on the same hardware shows that the proposed model reaches 123 FPS in the forward inference test, whereas YOLOv5 reaches 58 FPS under the same conditions.
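To illustrate why group convolution and depthwise (the abstract's "deep") separable convolution reduce the amount of computation, the following minimal sketch counts the weights of a single layer under each scheme. The layer sizes (128 input channels, 256 output channels, 3 x 3 kernel, 4 groups) and the function names are hypothetical choices for illustration, not values or code from the paper, and bias terms are ignored.

```python
# Illustrative weight counts for one convolution layer (bias terms ignored).
# All layer sizes used here are hypothetical; they are not taken from the paper.


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def group_conv_params(c_in: int, c_out: int, k: int, groups: int) -> int:
    """Weights in a grouped convolution.

    Each group maps c_in/groups input channels to c_out/groups output channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return k * k * (c_in // groups) * (c_out // groups) * groups


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out, k = 128, 256, 3  # hypothetical layer shape
    std = standard_conv_params(c_in, c_out, k)
    grp = group_conv_params(c_in, c_out, k, groups=4)
    dws = depthwise_separable_params(c_in, c_out, k)
    print(f"standard:            {std}")
    print(f"grouped (4 groups):  {grp}")
    print(f"depthwise separable: {dws}")
    print(f"standard / depthwise separable: {std / dws:.2f}")
```

For a fixed output resolution and unit stride, the multiply-accumulate count of each layer is its weight count times the number of output positions, so the same ratios apply to per-layer FLOPs. The depthwise separable layer costs a fraction 1/c_out + 1/k^2 of the standard layer, which is why the saving grows with the number of output channels. These per-layer ratios are not expected to match the whole-network reductions reported in the abstract, which depend on the full architecture.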










Funding
No funding was received for conducting this study.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, J., Huang, C., Zhao, L. et al. Lightweight identification of retail products based on improved convolutional neural network. Multimed Tools Appl 81, 31313–31328 (2022). https://doi.org/10.1007/s11042-022-12872-6
DOI: https://doi.org/10.1007/s11042-022-12872-6