Robotics and Autonomous Systems

Volume 118, August 2019, Pages 179-188

Robotic retail surveying by deep learning visual and textual data

https://doi.org/10.1016/j.robot.2019.01.021

Highlights

  • ROCKy is a mobile robot for data collection and surveying in a retail store.

  • ROCKy detects Shelf Out of Stock and Promotional Activities based on Deep Learning.

  • The deep learning approach evaluates visual and textual content of an image.

  • The approach was applied on a new public dataset of annotated shelf images.

  • Experimental results confirmed the effectiveness of the approach.

Abstract

Autonomous systems for monitoring and surveying are increasingly used in retail stores, since they improve the overall performance of the store and reduce manpower costs. Moreover, an automated system improves the accuracy of collected data by avoiding human-related factors. This paper presents ROCKy, a mobile robot for data collection and surveying in a retail store that autonomously navigates and monitors store shelves based on real-time store heat maps; ROCKy is designed to automatically detect Shelf Out of Stock (SOOS) and Promotional Activities (PA) based on Deep Convolutional Neural Networks (DCNNs). The deep learning approach evaluates the visual and textual content of an image simultaneously to classify and map SOOS and PA events during working hours. The proposed approach was applied and tested on several real scenarios, presenting a new public dataset with more than 14,000 annotated shelf images. Experimental results confirmed the effectiveness of the approach, showing high accuracy (up to 87%) in comparison with existing state-of-the-art SOOS and PA monitoring solutions, and a significant reduction of retail surveying time (45%).

Introduction

In the retail industry, the occurrence of Shelf Out of Stock (SOOS) situations is a significant problem. SOOS events are often strongly related to planogram design, where a planogram represents the way stock keeping units (SKUs) are organised among the shelves [1]. The global average out-of-stock rate is about 8%, meaning that retailers lose about 4% of sales. Out-of-stock situations happen for several reasons; the main one is defective shelf replenishment practices (surveying and re-stocking), which account for 70%–90% of SOOS cases. Another 10%–30% result from problems in the supply chain, leading to store-OOS [2]. Promotional activities also strongly influence shoppers’ behaviour and result in SOOS with a strong impact on the overall retail turnover.

The retail industry is undergoing tremendous changes: from bar-code to RFID technology, from experience-driven decisions to data-driven processes, and from off-line versus online to omni-channel strategies [3]. Consequently, a large number of benefits are available for both retailers and customers, but these are only feasible with frequent and accurate inventory, shelf assortment, and promo optimisation. All data-driven processes are reliable and useful in dealing with SOOS situations along with marketing activities and supply chain management automation [4], [5].

This paper proposes ROCKy, which stands for “Retail Out of Stock”, a low-cost mobile robot for detecting SOOS events both in real-time and on-demand, which has already been described in [6]. In addition to identifying SOOS and misplaced items, ROCKy can survey promotions and discounts, model changes in shelf planograms (i.e., vertical product displacement) and store layouts, or monitor the warehouse during night time. Promotional materials have a crucial role in store management; in a grocery store there are usually new promo materials every week (e.g., stickers, special price tags, new product tags) to be placed and removed accordingly. Promo materials strongly affect consumer behaviour and the sell-out performance of a product category or of the overall store. The system navigates the store using a modified potential field approach based on data coming from shopping cart trackers to find the most visited areas. The proposed approach also enables retailers to analyse store performance by comparing different shelf layouts and to address issues such as ease of selection, trading up, and overall shopping experience. Typical examples are: a sell-out increase due to different packaging or promo material on the shelf; cross-merchandising between two or more different products to increase basket size; and a more efficient selection process, proving a more effective product displacement and a reduction of the customer's selection time in front of the shelf.

ROCKy consists of a TurtleBot, an on-board RGBD camera for navigation, a low-power netbook for running the fundamental algorithms, and six top-mounted GoPro HD cameras for shelf image collection. The cameras capture images and videos of the shelves on either side of the robot, taking 12 MP photos (4000×3000 resolution, 122.6-degree horizontal FOV) every five seconds. The TurtleBot robotic platform and its cameras are shown in Fig. 1, together with a standard grocery retail environment used for testing and some shelf pictures that the robot collected during real-world experiments.

ROCKy relies on a real-time locating system (RTLS) based on ultra-wideband (UWB) technology. The same localisation system is used to track customers all day and to build a store heat map. ROCKy starts from this grid-based heat map (Fig. 3) to move around, giving priority to hot zones (red areas) and using a potential field approach for navigation (a minimal sketch of this idea follows the list below). ROCKy captures images of the store’s shelves during business hours with a multiple-view camera to record consumer behaviour. These images are classified into three categories and mapped onto the grid-based store map for retail surveying:

  • SOOS: Images of SOOS (incorrect scenario with high priority).

  • PA: Images of the shelf with products and PA (scenario to be checked or updated with the store’s promotional plan).

  • Normal: Images of planogram in a standard layout (correct scenario).
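
The paper does not include pseudocode for this navigation step. The sketch below is a minimal Python illustration, under our own assumptions, of a grid potential field driven by the customer heat map: the hottest cell acts as an attractive goal and shelf or wall cells are repulsive. The gains k_att and k_rep, the influence distance d0, and the greedy neighbour update are illustrative choices, not the authors' implementation.

    import numpy as np

    def potential_field_step(pos, heat_map, obstacles, k_att=1.0, k_rep=50.0, d0=3.0):
        # pos: current (row, col) grid cell; heat_map: 2D array where higher
        # values mark zones more visited by shoppers; obstacles: boolean 2D
        # array, True on shelf/wall cells.
        rows, cols = heat_map.shape
        goal = np.unravel_index(np.argmax(heat_map), heat_map.shape)  # hottest cell
        obs = np.argwhere(obstacles)

        def potential(cell):
            r, c = cell
            # Attractive term pulls towards the hottest zone.
            u = 0.5 * k_att * ((r - goal[0]) ** 2 + (c - goal[1]) ** 2)
            # Repulsive term pushes away from obstacles closer than d0.
            if obs.size:
                d = max(np.sqrt(((obs - cell) ** 2).sum(axis=1)).min(), 1e-3)
                if d < d0:
                    u += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
            return u

        # Greedy descent: move to the free 8-neighbour with lowest potential.
        neighbours = [(pos[0] + dr, pos[1] + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if (dr, dc) != (0, 0)
                      and 0 <= pos[0] + dr < rows and 0 <= pos[1] + dc < cols
                      and not obstacles[pos[0] + dr, pos[1] + dc]]
        return min(neighbours, key=potential)

In a complete system, the heat value of a zone would be decayed once it has been surveyed, so that the next hottest area becomes the goal on the following step.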

To classify these pictures as SOOS, Normal, or PA, it is essential to judge both the visual elements and the embedded text at once. While a picture showing cookies with the phrase “Special Offer 50% off” is considered PA, the same picture containing the words “Gluten Free” might be considered Normal. These categories are important indicators of shelf availability: to monitor the daily situation, to measure SOOS at store level effectively and accurately, and to control and manage the total impact of promotions and offers. SOOS leads to disappointed customers, and the disappointment is even greater when a customer comes to the store for an advertised promotional product and does not find the offer on the shelf. The approach introduced in [7] to estimate the overall content of images based on both visual and textual information is applied in this paper to the images acquired by ROCKy.

To classify the pictures, a machine learning classifier based on visual and textual features extracted from two specially trained Deep Convolutional Neural Networks (DCNNs) is implemented.

For the visual feature extractor, VGG-16 [8], AlexNet [9], CaffeNet [10], GoogLeNet [11], and ResNet [12] with 50 and 101 layers were applied to the whole image, each trained by fine-tuning a model pre-trained on the ImageNet dataset. For the textual feature extractor, the DCNN architecture proposed in [13] was used, fine-tuned from a model previously trained on synthesised planogram images; this model first detects and recognises text before extracting features. Moreover, the DCNN's performance was compared with that of long short-term memory (LSTM) recurrent neural networks. Using these features, six state-of-the-art classifiers, i.e., k-Nearest Neighbours (kNN) [14], [15], Support Vector Machine (SVM) [16], Decision Tree (DT) [17], Random Forest (RF) [18], Naïve Bayes (NB) [19], and Artificial Neural Network (ANN) [20], were evaluated to classify the overall planogram image content.

The previously described classifiers were applied to the Shelf Management Assortment (SMART) Dataset, containing pictures acquired by ROCKy in a real retail environment during business hours. Both visual and textual elements concerning shelves in the targeted store are present in the dataset, for a total of 14,244 images. Ground truth was manually annotated by three human annotators to make it more reliable. The SMART Dataset is publicly available1 for research purposes. The application of our approach to this dataset showed good results in terms of precision, recall, and F1-score, demonstrating the effectiveness of the proposed approach.

The main contributions of this paper, aside from extending the system and the analysis presented in [6], are (i) the ROCKy navigation platform based on real-time consumer behaviour heat maps, which reduces the overall retail survey time by 45%; (ii) the collection and analysis of a real retail dataset of more than 14,000 shelf images for deep learning purposes, publicly available to all researchers; (iii) a novel method that evaluates the visual and textual content of an image simultaneously; and (iv) tests in a real environment with results that support the goals mentioned above.

The paper is organised as follows: Section 2 is an overview of the CNNs employed in the retail field; Section 3 describes the UWB technology used to monitor in-store trajectories; Section 5 introduces our approach, which consists of a visual model (Section 6), a textual model (Section 7), and a fusion model (Section 8), and gives details on the SMART Dataset (Section 9); Section 10 presents the results; and Section 11 discusses conclusions and future works.

Section snippets

Related works

This section provides an overview of various works where robots have been used in the retail environment and of works related to image classification using DCNNs. With the goal of changing people’s lifestyles for the better, robots are being deployed in various fields, such as construction, transportation, services, cleaning, surveillance, welfare, etc. [21], [22], [23]. Robots are also being increasingly deployed in the retail environment, for both indoor and outdoor services. In [24], a virtual

Robot navigation and RTLS technology

See Fig. 2.

The ROCKy navigation framework

ROCKy’s robotic system detects SOOS and PA in real-time and on-demand. On the basis of the authors’ previous experience with robot localisation techniques [45], [46], the system leverages structured movement trajectories inside the store, offering higher accuracy and minimised surveying time. The robot is given a representation of the environment map that includes obstacles, cashiers, and the main entrance. Navigation relies on a shopping cart tracking system based on UWB technology, which is

Materials and method

The approach presented in [7], i.e., the combination of visual and textual features, has been used and extended for the development of the proposed system. The framework for joint visual and textual analysis comprises three main components: the visual feature extractor, the textual feature extractor, and the fusion classifier (see Fig. 4); a novel retail dataset (the SMART Dataset) was used for evaluation. Two trained DCNNs were used for visual and textual feature extraction.

Visual feature extractor

The visual feature extractor provides information about the visual part of the picture. For this purpose, it is trained with image labels that indicate the visual category of the images. The training is performed by fine-tuning a DCNN. Different DCNNs were tested to choose the one with the best performance: VGG-16 [8], AlexNet [9], CaffeNet [10], GoogLeNet [11], and ResNet [12] with 50 and 101 layers. The DCNNs have been pre-trained on the ImageNet dataset [9]
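
The paper reports fine-tuning ImageNet-pre-trained models but does not specify a framework; as an illustration only, the following minimal sketch fine-tunes VGG-16 for the three shelf categories using PyTorch/torchvision (a choice of ours), with illustrative hyper-parameters.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load VGG-16 pre-trained on ImageNet and replace its final layer so
    # that it predicts the three shelf categories: SOOS, PA, Normal.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, 3)

    # Fine-tune with a small learning rate so the pre-trained weights are
    # adapted to shelf imagery rather than overwritten.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

    # After fine-tuning, the 4096-dimensional activations of the
    # penultimate layer can serve as the visual feature vector that is
    # later passed to the fusion classifier.
    feature_extractor = nn.Sequential(
        model.features, model.avgpool, nn.Flatten(),
        *list(model.classifier.children())[:-1])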

Textual feature extractor

The textual feature extractor provides information about the textual category of a picture. It is trained with image labels that indicate the textual category of the images. Multiple components make up the textual feature extractor. The central component is a character-level CNN [13], extended for this analysis with one additional convolutional layer. This extra layer, inserted before the last pooling layer, has a kernel size of three and produces 256 features. Two training phases have been applied
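
As a sketch of such an architecture, the following PyTorch module follows the spirit of the character-level CNN of [13]; only the extra kernel-size-3 convolution producing 256 features, inserted before the last pooling layer, comes from the description above, while the remaining layer sizes, alphabet size, and maximum text length are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CharCNN(nn.Module):
        # Character-level CNN in the spirit of [13]. Input is a one-hot
        # encoding of the recognised shelf text, one channel per symbol
        # of a fixed alphabet.
        def __init__(self, n_chars=70, n_classes=3, max_len=256):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv1d(n_chars, 256, kernel_size=7), nn.ReLU(),
                nn.MaxPool1d(3),
                nn.Conv1d(256, 256, kernel_size=7), nn.ReLU(),
                nn.MaxPool1d(3),
                # Extra layer described in the text: kernel size 3,
                # 256 features, inserted before the last pooling layer.
                nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
                nn.MaxPool1d(3),
            )
            with torch.no_grad():
                out_len = self.convs(torch.zeros(1, n_chars, max_len)).shape[-1]
            self.classifier = nn.Linear(256 * out_len, n_classes)

        def forward(self, x):  # x: (batch, n_chars, max_len) one-hot text
            return self.classifier(self.convs(x).flatten(1))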

Fusion classifier

The fusion classifier estimates the overall content of an image on the basis of the visual and textual features. Hence, the visual and textual features extracted from the two DCNNs are concatenated into a predictor vector, and the machine learning model trained on it indicates the overall category of the images. Based on all features, six state-of-the-art classifiers, i.e., k-Nearest Neighbour (kNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), and Artificial Neural
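
A minimal sketch of this fusion step follows, assuming the two extractors yield one fixed-length feature vector per image; the scikit-learn estimators stand in for the six classifiers evaluated in the paper, with hyper-parameters that are our assumptions rather than the reported configuration.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier

    def fuse(visual_feats, textual_feats):
        # Concatenate the per-image visual and textual feature vectors
        # into a single predictor vector (rows = images).
        return np.concatenate([visual_feats, textual_feats], axis=1)

    # The six classifier families compared in the paper; the specific
    # hyper-parameters here are illustrative.
    CLASSIFIERS = {
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "SVM": SVC(kernel="rbf"),
        "DT":  DecisionTreeClassifier(),
        "RF":  RandomForestClassifier(n_estimators=100),
        "NB":  GaussianNB(),
        "ANN": MLPClassifier(hidden_layer_sizes=(128,)),
    }

    def evaluate(X_train, y_train, X_test, y_test):
        # Train each classifier on the fused features and score it.
        return {name: clf.fit(X_train, y_train).score(X_test, y_test)
                for name, clf in CLASSIFIERS.items()}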

Shelf management assortment — SMART dataset

The framework is comprehensively evaluated on the SMART Dataset, a visual and textual retail dataset made of pictures that ROCKy acquired during different experiments in different stores. The SMART Dataset is composed of 14,244 shelf images. SMART is the first dataset in this field built for these purposes and is publicly available. As previously described, the images are divided into three categories:

  • 4748 SOOS images: Images of SOOS (incorrect scenario with high priority).

Results and discussion

Here we report the results of the experiments conducted on the SMART Dataset. The performance of the fusion classifier is presented, with the performance of the visual and textual category classifiers (based on the visual and textual feature extractors) as key indicators of the overall classification. For the experimental analysis, the labelled dataset was split into a training set and a test set. Each classifier was trained solely on the training set, while the test set was used
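
As a sketch of such a protocol, the snippet below reuses the hypothetical CLASSIFIERS mapping from the fusion sketch above, with synthetic placeholder data in place of the SMART features, and reports the per-class precision, recall, and F1-score used as metrics in the paper.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Synthetic placeholders; in the real pipeline these are the fused
    # visual+textual feature vectors and the SMART ground-truth labels.
    X = np.random.rand(1000, 4096 + 256)    # feature dims are illustrative
    y = np.random.randint(0, 3, size=1000)  # 0=SOOS, 1=PA, 2=Normal

    # Hold out a stratified test set, train on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = CLASSIFIERS["RF"].fit(X_train, y_train)  # from the fusion sketch
    print(classification_report(y_test, clf.predict(X_test),
                                target_names=["SOOS", "PA", "Normal"]))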

Conclusion and future works

In this paper, we extended and tested ROCKy, a mobile robot system able to detect and map SOOS and PA in a grocery retail environment. In addition to the identification of SOOS, PA, and misplaced items, ROCKy can provide information on promotions and discounts to customers. The proposed system could also be used to monitor warehouses and run surveillance at night. ROCKy is not meant to replace workers, but to transform their work into something more interactive and helpful for the customers.

We

Acknowledgements

This work was funded by Grottini Lab, Italy (www.grottinilab.com). The authors would like to thank Andrea Felicetti and Michele Bernardini for their valuable support of this work.

Declaration of competing interest

The authors declare no conflict of interest.


References (50)

  • Simonyan, K., et al., Very deep convolutional networks for large-scale image recognition (2014)
  • Krizhevsky, A., et al., ImageNet classification with deep convolutional neural networks
  • Jia, Y., et al., Caffe: Convolutional architecture for fast feature embedding
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al., Going...
  • He, K., Zhang, X., Ren, S., Sun, J., Deep residual learning for image recognition, in: IEEE Conference on Computer Vision...
  • Zhang, X., et al., Character-level convolutional networks for text classification
  • Bø, T.H., et al., LSimpute: Accurate estimation of missing values in microarray data with least squares methods, Nucleic Acids Res. (2004)
  • Troyanskaya, O., et al., Missing value estimation methods for DNA microarrays, Bioinformatics (2001)
  • Cortes, C., et al., Support-vector networks, Mach. Learn. (1995)
  • Quinlan, J.R., Induction of decision trees, Mach. Learn. (1986)
  • Breiman, L., Random forests, Mach. Learn. (2001)
  • Rish, I., An empirical study of the naive Bayes classifier
  • Lippmann, R., An introduction to computing with neural nets, ASSP Mag. (1987)
  • Sturari, M., Paolanti, M., Frontoni, E., Mancini, A., Zingaretti, P., Robotic platform for deep change detection for rail...
  • Pierdicca, R., et al., Deep convolutional neural network for automatic detection of damaged photovoltaic cells


Marina Paolanti received the Ph.D. degree in computer science from the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2018. Her Ph.D. thesis was on “Pattern Recognition for challenging Computer Vision Applications”. She is currently a PostDoc Researcher with DII. Her research focuses on pattern recognition methods applied to several fields ranging from biology, retail, social media, and geomatics to video surveillance. She is a member of the IEEE and CVPL.

Luca Romeo received the Ph.D. degree in computer science from the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2018. His Ph.D. thesis was on “applied machine learning for human motion analysis and affective computing”. He is currently a PostDoc Researcher with DII and is affiliated with the Unit of Cognition, Motion and Neuroscience and Computational Statistics and Machine Learning, Fondazione Istituto Italiano di Tecnologia, Genova. His research topics include machine learning applied to biomedical applications, affective computing, and motion analysis.

Massimo Martini received the M.Sc. degree in Computer Science Engineering at Università Politecnica delle Marche with a thesis entitled “Deep Learning Techniques for Visual And Text Analysis of Images in the Retail Environment”. He is currently a Ph.D. student in the Department of Information Engineering (DII), Università Politecnica delle Marche. His research focuses on deep learning applied to several fields.

Adriano Mancini received the Ph.D. degree in intelligent artificial systems from the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2010. His Ph.D. thesis was on “A new methodological framework for land use/land cover mapping and change detection”. He currently holds an assistant professor position with DII. His research focuses on mobile robotics, also for assisted living, machine learning, image processing, and geographic information systems.

Emanuele Frontoni received the Ph.D. degree in intelligent artificial systems from the Department of Information Engineering (DII), Università Politecnica delle Marche, in 2006. His Ph.D. thesis was on “vision-based robotics”. He is Associate Professor at DII. His research focuses on artificial intelligence and computer vision techniques applied to robotics, the internet of things, e-health, and ambient assisted living. He is a member of the ASME MESA TC, GIRPR, and AI*IA.

Primo Zingaretti is currently a Professor of computer science with the Università Politecnica delle Marche, Italy. His main research interests are in artificial intelligence, robotics, intelligent mechatronic systems, computer vision, pattern recognition, image understanding and retrieval, information systems, and e-government. Robotics vision and geographic information systems have been the main application areas, with great attention directed to the technological transfer of research results. He has authored over 150 scientific research papers in English. He is a member of the ASME, a co-founder of AI*IA, and a member of GIRPR-IAPR.
