Robotic retail surveying by deep learning visual and textual data
Introduction
In the retail industry, Shelf Out of Stock (SOOS) situations are a significant problem. SOOS events are often strongly related to planogram design, where a planogram represents the way stock keeping units (SKUs) are organised on the shelves [1]. The global average out-of-stock rate is about 8%, which translates into roughly 4% of lost sales for retailers. Out-of-stock situations happen for several reasons; the main one is defective shelf replenishment practice (surveying and re-stocking), which accounts for 70%–90% of SOOS cases. Another 10%–30% result from problems in the supply chain, leading to store-OOS [2]. Promotional activities also strongly influence shopper behaviour and can cause SOOS, with a strong impact on overall retail turnover.
The retail industry is undergoing tremendous changes: from bar-codes to RFID technology, from experience-driven decisions to data-driven processes, and from off-line versus online to omni-channel strategies [3]. Consequently, a large number of benefits are available to both retailers and customers, but only with frequent and accurate inventory, shelf assortment, and promo optimisation. Data-driven processes are reliable and useful for dealing with SOOS situations, as well as for marketing activities and supply chain management automation [4], [5].
This paper proposes ROCKy, which stands for “Retail Out of Stock”, a low-cost mobile robot for detecting SOOS events both in real-time and on-demand, already described in [6]. In addition to identifying SOOS and misplaced items, ROCKy can survey promotions and discounts, model changes in the shelf planogram (i.e., vertical product displacement) and store layouts, or monitor the warehouse at night. Promotional materials play a crucial role in store management: in a grocery store, new promo materials (e.g., stickers, special price tags, new product tags) typically arrive every week and have to be placed around the store and removed accordingly. Promo materials strongly affect consumer behaviour and the sell-out performance of a product category or of the overall store. The system navigates the store using a modified potential field approach based on data coming from shopping cart trackers to find the most visited areas. The proposed approach also enables retailers to analyse store performance by comparing different shelf layouts and to address issues such as ease of selection, trading up, and the overall shopping experience. Typical examples are: a sell-out increase due to different packaging or promo material on the shelf; cross-merchandising between two or more different products to increase basket size; and a more efficient selection process that proves a more effective product displacement and reduces the customer’s selection time in front of the shelf.
ROCKy consists of a TurtleBot robotic platform, an on-board RGBD camera for navigation, a low-power netbook that runs the fundamental algorithms, and six top-mounted GoPro HD cameras for shelf image collection. The six cameras capture images and videos of the shelves on either side of the robot, taking 12 MP photos (with a horizontal FOV of 122.6 degrees) every five seconds. The TurtleBot platform and its cameras are shown in Fig. 1, together with a standard grocery retail environment used for testing and some shelf pictures collected by the robot during real-world experiments.
ROCKy relies on a real-time locating system (RTLS) based on ultra-wideband (UWB) technology. The same localisation system tracks customers throughout the day to build a store heat map. ROCKy starts from this grid-based heat map (Fig. 3) to move around, giving priority to hot zones (red areas) and using a potential field approach for navigation. During business hours, ROCKy captures images of the store’s shelves with a multiple-view camera setup to record consumer behaviour. These images are classified into three categories and mapped onto the grid-based store map for retail surveying:
- SOOS: Images of SOOS (incorrect scenario with high priority).
- PA: Images of the shelf with products and PA (scenario to be checked or updated against the store’s promotional plan).
- Normal: Images of the planogram in a standard layout (correct scenario).
To classify these pictures as SOOS, Normal, or PA, it is essential to judge both the visual elements and the included text simultaneously. While a picture showing cookies with the phrase “Special Offer 50% off” is considered PA, the same picture containing the words “Gluten Free” would be considered Normal. These categories are important indicators of shelf availability: they allow monitoring the daily situation, measuring SOOS at store level effectively and accurately, and controlling and managing the total impact of promotions and offers. SOOS leads to disappointed customers, and the disappointment is even greater when a customer comes to the store for an advertised promoted product and does not find the offer on the shelf. The approach introduced in [7], which estimates the overall content of images from both visual and textual information, is applied in this paper to the images acquired by ROCKy.
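The distinction above can be illustrated with a toy hand-written rule. Note that the paper learns this mapping with a trained fusion classifier; the `PROMO_WORDS` set and `overall_category` function below are purely hypothetical illustrations of why the same visual scene changes category depending on the detected text:

```python
# A hypothetical keyword set; the real system learns promotional cues from data.
PROMO_WORDS = {"offer", "off", "discount", "promo", "%"}

def overall_category(visual_label, shelf_text):
    """Toy rule: a 'Normal'-looking shelf becomes 'PA' when promotional
    wording is detected in the shelf text; SOOS always wins."""
    # Tokenise crudely, treating '%' as its own token.
    words = set(shelf_text.lower().replace("%", " % ").split())
    if visual_label == "SOOS":
        return "SOOS"
    if words & PROMO_WORDS:
        return "PA"
    return "Normal"
```

The same cookie shelf yields PA with “Special Offer 50% off” and Normal with “Gluten Free”, matching the example in the text.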
To classify the pictures, a machine learning classifier based on visual and textual features extracted from two specially trained Deep Convolutional Neural Networks (DCNNs) is implemented.
For the visual feature extractor, VGG-16 [8], AlexNet [9], CaffeNet [10], GoogLeNet [11], and ResNet [12] with 50 and with 101 layers were applied to the whole image, each trained by fine-tuning a model pre-trained on the ImageNet dataset. For the textual feature extractor, the DCNN architecture proposed in [13] was used, fine-tuned from a model previously trained on synthesised planogram images; the model first had to detect and recognise text before extracting features. Moreover, the DCNN’s performance was compared with that of long short-term memory (LSTM) recurrent neural networks. Based on these features, six state-of-the-art classifiers, i.e., k-Nearest Neighbours (kNN) [14], [15], Support Vector Machine (SVM) [16], Decision Tree (DT) [17], Random Forest (RF) [18], Naïve Bayes (NB) [19], and Artificial Neural Network (ANN) [20], were evaluated to classify the overall planogram image content.
The classifiers described above were applied to the Shelf Management Assortment (SMART) Dataset, which contains pictures acquired by ROCKy in a real retail environment during business hours. The dataset covers both the visual and the textual elements of the shelves in the targeted store, for a total of 14,244 images. The ground truth was manually annotated by three human annotators to make it more reliable. The SMART Dataset is publicly available1 for research purposes. Applied to this dataset, our approach showed good results in terms of precision, recall, and F1-score, demonstrating its effectiveness.
The main contributions of this paper, aside from extending the system and the analysis presented in [6], are (i) the ROCKy navigation platform based on real-time consumer behaviour heat maps, which can reduce the overall retail survey time by 45%, (ii) the collection and analysis of a real retail dataset of more than 14,000 shelf images for deep learning purposes, public to all researchers, (iii) the proposal of a novel method that evaluates the visual and textual content of an image simultaneously, and (iv) a real environment test with results that support the goals mentioned above.
The paper is organised as follows: Section 2 is an overview of the CNNs employed in the retail field; Section 3 describes the UWB technology used to monitor the trajectories in store; Section 5 introduces our approach, which consists of a visual model (Section 6), a textual model (Section 7), and a fusion model (Section 8), and gives details on the SMART Dataset (Section 9); Section 10 presents the results; and Section 11 discusses the conclusions and future works.
Section snippets
Related works
This section provides an overview of various works where robots have been used in the retail environment and of works related to image classification using DCNNs. With the goal of changing people’s lifestyles for the better, robots are being deployed in various fields, such as construction, transportation, services, cleaning, surveillance, welfare, etc. [21], [22], [23]. Robots are also being increasingly deployed in the retail environment, for both indoor and outdoor services. In [24], a virtual
Robot navigation and RTLS technology
See Fig. 2.
The ROCKy navigation framework
ROCKy’s robotic system detects SOOS and PA in real-time and on-demand. On the basis of the authors’ previous experience with robot localisation techniques [45], [46], the system leverages structured movement trajectories inside the store, offering higher accuracy and minimised surveying time. The robot knows a representation of the environment map that includes obstacles, cashiers, and the main entrance. Navigation is done through a shopper cart tracking system based on UWB technology, which is
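The heat-map-driven potential field navigation described above can be sketched in a few lines. This is a minimal numpy sketch, not the authors' implementation: the `potential_field` and `next_cell` functions, the weights, and the greedy descent step are all illustrative assumptions; the paper's modified potential field operates on UWB cart-tracking heat maps with a full robot motion stack.

```python
import numpy as np

def potential_field(heat, obstacles, w_attr=1.0, w_rep=5.0):
    """Combine an attractive potential (low where the heat map is hot,
    so hot zones pull the robot in) with a repulsive potential that
    grows near obstacle cells."""
    attract = w_attr * (heat.max() - heat)       # hot zones -> low potential
    repulse = np.zeros_like(heat, dtype=float)
    ys, xs = np.indices(heat.shape)
    for oy, ox in np.argwhere(obstacles):
        d = np.hypot(ys - oy, xs - ox)
        repulse += w_rep / (1.0 + d)             # decays with distance
    return attract + repulse

def next_cell(field, pos):
    """Greedy descent: move to the 8-neighbour with the lowest potential
    (stays put at a local minimum)."""
    y, x = pos
    best, best_val = pos, field[y, x]
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < field.shape[0] and 0 <= nx < field.shape[1]:
                if field[ny, nx] < best_val:
                    best, best_val = (ny, nx), field[ny, nx]
    return best
```

On a toy 5×5 grid whose hottest cell sits in one corner, repeated calls to `next_cell` walk the robot toward that hot zone, which is the qualitative behaviour the navigation framework relies on.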
Materials and method
The approach presented in [7], i.e., the combination of visual and textual features, has been used and extended for the development of the proposed system, together with a novel retail dataset (the SMART Dataset) used for evaluation. The framework for joint visual and textual analysis comprises three main components: the visual feature extractor, the textual feature extractor, and the fusion classifier (see Fig. 4). Two trained DCNNs were used for visual and textual feature extraction.
Visual feature extractor
The visual feature extractor provides information about the visual part of the picture. For this purpose, it is trained with image labels that indicate the visual category of the images. The training is performed by fine-tuning a DCNN. Different DCNNs were tested to choose the one with the best performance: VGG-16 [8], AlexNet [9], CaffeNet [10], GoogLeNet [11], and ResNet [12] with 50 and with 101 layers. The DCNNs have been pre-trained on the ImageNet dataset [9]
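Fine-tuning a full DCNN requires a deep learning framework; as a minimal stand-in for the idea of retraining the final classification layer on fixed pre-trained features, one can train a softmax head with plain gradient descent. The `train_softmax_head` and `predict` functions below are an illustrative simplification, not the paper's training procedure:

```python
import numpy as np

def train_softmax_head(X, y, n_classes, lr=0.5, epochs=200):
    """Softmax (multinomial logistic) classifier trained on fixed
    feature vectors X with integer labels y, via cross-entropy
    gradient descent -- a toy analogue of the final fine-tuned layer."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(X)                   # cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Class with the highest logit for each row of X."""
    return np.argmax(X @ W + b, axis=1)
```

In the real system, the features fed to such a head would come from the penultimate layer of the ImageNet-pre-trained network being fine-tuned.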
Textual features extractor
The textual feature extractor provides information about the textual category of a picture. It is trained with image labels that indicate the textual category of the images. Multiple components make up the textual feature extractor. The central component is a character-level CNN [13], extended for this analysis by one additional convolution layer. This extra layer, inserted before the last pooling layer, has a kernel size of three and produces 256 features. Two training phases have been applied
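A character-level CNN consumes text as a matrix of one-hot character columns rather than word embeddings. The sketch below shows only this input quantisation step; the `ALPHABET` and `quantize` names are illustrative assumptions, and the actual alphabet and frame length follow [13]:

```python
import numpy as np

# Hypothetical alphabet: lowercase letters, digits, and a few symbols
# common on price tags and promo stickers.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 %-.,!?"

def quantize(text, max_len=64, alphabet=ALPHABET):
    """One-hot encode a string character by character, as in
    character-level CNNs; characters outside the alphabet map to
    all-zero columns, and text beyond max_len is truncated."""
    idx = {c: i for i, c in enumerate(alphabet)}
    mat = np.zeros((len(alphabet), max_len))
    for pos, ch in enumerate(text.lower()[:max_len]):
        if ch in idx:
            mat[idx[ch], pos] = 1.0
    return mat
```

The resulting matrix is what the convolution layers (including the extra kernel-size-3 layer added in this work) slide over to produce the 256 textual features.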
Fusion classifier
The fusion classifier estimates the overall content of an image on the basis of the visual and textual features. The features extracted from the two DCNNs were pooled into a predictor vector, and a machine learning model trained on this vector indicates the overall category of the images. Based on all features, six state-of-the-art classifiers—k-Nearest Neighbour (kNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), and Artificial Neural
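The fusion step itself is simple: concatenate each image's visual and textual feature vectors and hand the result to a classic classifier. As a self-contained sketch, here is the concatenation plus a toy kNN (one of the six evaluated classifiers); `fuse` and `knn_predict` are illustrative names, and a real evaluation would use a tuned library implementation:

```python
import numpy as np

def fuse(visual_feats, textual_feats):
    """Concatenate per-image visual and textual feature vectors
    (one row per image) into a single predictor vector."""
    return np.concatenate([visual_feats, textual_feats], axis=1)

def knn_predict(train_X, train_y, query, k=3):
    """Toy k-NN over fused features: majority vote among the k
    nearest training vectors under Euclidean distance."""
    d = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

With this layout, swapping kNN for SVM, DT, RF, NB, or an ANN only changes the final call, which is how the six classifiers can be compared on identical fused features.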
Shelf management assortment — SMART dataset
The framework is comprehensively evaluated on the SMART Dataset, a visual and textual retail dataset made of pictures that ROCKy acquired during different experiments in different stores. The SMART Dataset is composed of 14,244 shelf images. SMART is the first dataset in this field built for these purposes and is publicly available. As previously described, the images are divided into three categories:
- 4748 SOOS images: Images of SOOS (incorrect scenario with high priority).
Results and discussion
Here we report the results of the experiments conducted on the SMART Dataset. The performance of the fusion classifier is presented, with the performance of the visual and textual category classifiers (based on the visual and textual feature extractors) as key indicators of the overall classification. For the experimental analysis, the labelled dataset was split into a training set and a test set. Each classifier was trained solely on the training set, while the test set was used
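The reported metrics follow their standard per-class definitions from true-positive, false-positive, and false-negative counts. A minimal helper (the `prf1` name is illustrative):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 for one class from true-positive,
    false-positive, and false-negative counts; degenerate cases
    (empty denominators) return 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For a three-class problem such as SOOS/PA/Normal, these are computed per class and then averaged to summarise overall performance.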
Conclusion and future works
In this paper, we extended and tested ROCKy, a mobile robot system able to detect and map SOOS and PA in a grocery retail environment. In addition to identifying SOOS, PA, and misplaced items, ROCKy can provide information on promotions and discounts to customers. The proposed system could also be used to monitor warehouses and run surveillance at night. ROCKy does not aim to replace workers, but to transform their work into something more interactive and helpful for customers.
We
Acknowledgements
This work was funded by Grottini Lab, Italy (www.grottinilab.com). The authors would like to thank Andrea Felicetti and Michele Bernardini for their precious support for this work.
Declaration of competing interest
The authors declare no conflict of interest.
Marina Paolanti received the Ph.D. degree in computer science in the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2018. Her Ph.D. thesis was on “Pattern Recognition for challenging Computer Vision Applications”. She is currently a PostDoc Researcher with DII. Her research focuses on Pattern Recognition Methods applied to several fields ranging from biology, retail, social media, geomatics to video surveillance. She is a member of the IEEE and CVPL.
References (50)
- et al., Shelf space re-allocation for out of stock reduction, Comput. Ind. Eng. (2017)
- et al., Mobile robot for retail inventory using RFID
- et al., Public entities driven robotic innovation in urban areas, Robot. Auton. Syst. (2017)
- et al., Remote retail monitoring and stock assessment using mobile robots
- et al., Robotic delivery service in combined outdoor–indoor environments: Technical analysis and user evaluation, Robot. Auton. Syst. (2018)
- et al., A Comprehensive Guide to Retail Out-of-stock Reduction in the Fast-moving Consumer Goods Industry (2007)
- Increasing efficiency in the supply chain for short shelf life goods using RFID tagging, Int. J. Retail Distrib. Manage. (2003)
- et al., Desperately seeking shelf availability: An examination of the extent, the causes, and the efforts to address retail out-of-stocks, Int. J. Retail Distrib. Manage. (2003)
- et al., Mobile robot for retail surveying and inventory using visual and textual analysis of monocular pictures based on deep learning
- M. Paolanti, C. Kaiser, R. Schallner, E. Frontoni, P. Zingaretti, Visual and textual sentiment analysis of...
- Very deep convolutional networks for large-scale image recognition
- Imagenet classification with deep convolutional neural networks
- Caffe: Convolutional architecture for fast feature embedding
- Character-level convolutional networks for text classification
- LSimpute: Accurate estimation of missing values in microarray data with least squares methods, Nucleic Acids Res.
- Missing value estimation methods for DNA microarrays, Bioinformatics
- Support-vector networks, Mach. Learn.
- Induction of decision trees, Mach. Learn.
- Random forests, Mach. Learn.
- An empirical study of the naive Bayes classifier
- An introduction to computing with neural nets, ASSP Mag.
- Deep convolutional neural network for automatic detection of damaged photovoltaic cells
Luca Romeo received the Ph.D. degree in computer science in the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2018. His Ph.D. thesis was on “applied machine learning for human motion analysis and affective computing”. He is currently a PostDoc Researcher with DII and he is affiliated with the Unit of Cognition, Motion and Neuroscience and Computational Statistics and Machine Learning, Fondazione Istituto Italiano di Tecnologia Genova. His research topic includes Machine learning applied to biomedical applications, affective computing and motion analysis.
Massimo Martini received the M.Sc. degree in Computer Science Engineering at Università Politecnica delle Marche with a thesis entitled “Deep Learning Techniques for Visual And Text Analysis of Images in the Retail Environment”. He is currently a Ph.D. student in the Department of Information Engineering (DII), Università Politecnica delle Marche. His research focuses on Deep Learning applied to several fields.
Adriano Mancini received the Ph.D. degree in intelligent artificial systems from the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2010. His Ph.D. thesis was on “A new methodological framework for land use/land cover mapping and change detection”. He currently holds an assistant professor position with DII. His research focuses on mobile robotics, also for assisted living, machine learning, image processing, and geographic information systems.
Emanuele Frontoni received the Ph.D. degree in intelligent artificial systems from the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2006. His Ph.D. thesis was on “vision-based robotics”. He is Associate Professor at DII. His research focuses on artificial intelligence and computer vision techniques applied to robotics, internet of things, e-health, and ambient assisted living. He is a member of the ASME MESA TC, GIRPR, and AI*IA.
Primo Zingaretti is currently a Professor of computer science with the Università Politecnica delle Marche, Italy. His main research interests are in artificial intelligence, robotics, intelligent mechatronic systems, computer vision, pattern recognition, image understanding and retrieval, information systems, and e-government. Robotics vision and geographic information systems have been the main application areas, with great attention directed to the technological transfer of research results. He has authored over 150 scientific research papers in English. He is a member of the ASME, a Co-Founder of AI*IA, and Member of GIRPR-IAPR.