
1 Introduction

The paradigm of Industry 4.0 is prominently changing warehousing activities as well as logistics procedures for medium and large vendors. The use of autonomous mobile robots for transportation and delivery [4], or of aerial vehicles for autonomous warehouse inventory [13, 18], has gained increasing popularity over recent years. Advances in technology, together with the evolution of machine learning and deep learning in computer vision, are demonstrating that this field of research is feasible and fruitful. The explosion of e-commerce has made warehouse management a very critical task. Signalling logistics interventions on time, thus avoiding wasted time or inefficiencies in the delivery of goods, becomes a priority for every vendor that wants to enter the electronic markets.

In this work, an autonomous warehouse inventorying approach is proposed, which exploits unmanned aerial vehicles designed to perform a continuous check of the packages in stock. The proposed solution is meant to be computationally efficient while providing a level of accuracy compliant with effective warehouse management. It uses light Convolutional Neural Network (CNN) models to detect and recognise, in real time, the labels of packages during an aerial scan of the environment. The aerial vehicle, with onboard environment sensing consisting of four common RGB cameras and sufficient computational power, is able to autonomously inspect warehouse environments and perform the computer vision tasks needed to track the labels/barcodes of the packages. The experimental results obtained on video acquisitions of a real warehouse environment demonstrate the feasibility of the proposed solution and its reliable use for autonomous logistics intervention. It is worth noting that the proposed approach does not aim to also recognise the text or codes on the packaging labels, since well-established and reliable procedures are available for that task.

After discussing the related work in the next section, Sect. 3 details the proposed solution and the neural network models used, while Sect. 4 presents the experimental results achieved in a real warehouse scenario. Section 5 draws the conclusions and outlines the future directions of this work. Figure 1 shows examples of acquisitions carried out in real warehouse environments and highlights what the proposed solution is meant to recognise as tags and what may represent false positives.

Fig. 1.

Frames extracted from video acquisitions of an aerial vehicle during flight. The pictures show an example of a case study of tag detection/recognition in warehouse inventorying. The yellow bounding boxes highlight the real tags, while the red ones mark all other possible false positive labels. (Color figure online)

2 Related Work

Micro aerial vehicles (MAVs) have recently gained considerable attention, both in research and in commercial applications, e.g., surveillance and tracking, aerial photography, inspection, rescue missions and so on [3]. Despite the wide range of sensors available nowadays, aerial vehicles (also often referred to as drones) are still considered to benefit significantly from the use of cameras for the tasks of autonomous driving and obstacle avoidance [6, 21]. Drone designs have recently evolved from monocular camera vision systems [5] to stereo-vision cameras [17, 20] combined with tiny laser scanners [11, 12], which enable a rich acquisition of the scene and its 3D reconstruction. Fossel et al. [7] combined Hector SLAM [14] and OctoMap [10] to build an accurate three-dimensional occupancy model of the environment, thus demonstrating the high efficacy of obstacle avoidance solutions. The works mentioned above, together with the vast literature on autonomous drones, demonstrate the high reliability of indoor autonomous driving of aerial vehicles, which is considered established in this work and not explored further. Rather, the solution discussed in this paper focuses on the task of inventorying the packs in stock through real-time localisation and recognition of the packs and pallets by computer vision techniques.

In warehousing activities, the use of wooden pallets for pack transportation and logistics is still considered of utmost importance. Mohamed et al. [16] proposed a light solution for an autonomous forklift that is able to recognise and fork the pallets for transportation, which, however, relies on a rangefinder-based system. Following their idea, we built a solution completely based on RGB cameras that is able to detect and track the labels of packages (consisting of alphanumeric characters and barcodes) to enable autonomous warehouse inventorying. A similar approach has also been presented by Beul et al. [2], which is based on the AprilTag detector [22] to locate the packages. Despite the high performance of such a solution, it assumes that the operational environment, i.e., a warehouse that may be a very wide indoor space, is seamlessly covered by specific tags. These consist of special black-and-white patterns, similar to QR codes, by which the aerial vehicle can locate itself and recognise the packs. Such a solution is of course difficult to scale to warehouses of increasing size. Along the same line, Guérin et al. [9] and Bae et al. [1] presented warehouse inventory systems based on unmanned aerial vehicles that make the inventory process completely autonomous. In both cases, the drones scan the goods during flight, achieving very good results; however, the former relies on a barcode scanner and the latter on RFID technology. Other similar approaches in the literature making use of RFID technology can be found. Even though they ensure high precision and accuracy, they cannot easily generalise to all the diverse factors affecting how goods and packs are labelled during warehousing logistics procedures. What characterises the solution proposed in this paper is that it is entirely based on video acquisitions from RGB cameras and therefore, given a proper training stage, can adapt to a huge variety of pack tags and labelling schemes without requiring special constraints.

Such a choice also makes a fair comparison with other approaches in the literature hard since, as said above, they often benefit from dedicated sensors. Further, specialised sensors ease the detection/recognition of tags and labels, but they also provide a representation of the acquired data that differs from simple RGB images.

3 Tag Detection with R-CNN

As said in the previous section, the work proposed in this paper is inspired by the work of Mohamed et al. [16]. However, rather than using a laser scanner, it is entirely based on RGB cameras mounted on the aerial vehicle.

Image processing applications have been demonstrated to benefit from Convolutional Neural Network (CNN) models [15]. The main component of a CNN is the convolutional layer, which features local connectivity patterns that force the network to operate on limited receptive fields. Together with other layers, such as pooling, ReLU and fully connected layers, CNN models have become increasingly complex in size and in the number of parameters to learn. On the one hand, this has made CNNs more and more effective; on the other hand, the required computing power has grown accordingly, making it infeasible to run such neural models on hardware with limited resources. Over recent years, advancements in CNN models led to an alternative version called Region-based Convolutional Neural Networks (R-CNNs), used for object detection and image classification [19, 23]. An R-CNN is organised in two steps: (i) a collection of possible bounding boxes containing an object is extracted from the input layer, and (ii) the regions of interest (ROIs) are then submitted to a classifier to determine whether they contain one of the known objects to recognise (i.e., the classes of the supervised problem). Within the R-CNN family, the model called Faster R-CNN [8] has gained significant attention and is the one exploited in this work. A Faster R-CNN is composed of a few convolutional layers followed by a fully connected layer called the Region Proposal Network (RPN). The RPN receives as input the convolutional feature maps generated by the previous layers and operates by passing an \(n\times n\) sliding window over them. It proposes bounding box candidates (called anchors) according to predefined aspect ratios and scales. For each anchor, a value of Intersection over Union (IoU) ranging in [0, 1] is computed; it represents the overlap ratio between the anchor and the ground truth bounding box (see Definition 1). The ROIs extracted from the input according to the anchors whose IoU is above an empirical threshold are then provided to a classifier that is in charge of classifying the detected objects. This implies that Faster R-CNNs achieve efficient and fully end-to-end training, as a single CNN is used for both region proposal and classification [16].

Definition 1

Let \(A\) be the anchor, \(B\) the ground truth bounding box and area(\(\cdot \)) a function computing the area in pixels of a region; the Intersection over Union (IoU) is defined as:

$$\begin{aligned} IoU=\frac{area(A\cap B)}{area(A\cup B)} \end{aligned}$$
(1)
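For concreteness, a minimal sketch of Eq. (1) for two axis-aligned bounding boxes follows; the (x1, y1, x2, y2) pixel-coordinate convention is an assumption for illustration, not taken from the paper:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) tuples in pixels, with x1 < x2 and y1 < y2.
    """
    # Corners of the intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```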
Fig. 2.

The overall proposed model, consisting of the combination of a Faster R-CNN and a shallow CNN.

Table 1. The configurations of the Faster R-CNN network and of the shallow CNN-based classifier.

The model used in this work is composed of a Faster R-CNN divided into three stages: the input layer, the intermediate convolutional stage, and the final fully connected stage. The input layer receives the input image, corresponding to a frame of the aerial acquisition videos. The convolutional part of the network is composed of two convolutional layers, each followed by a ReLU layer, and a final max-pooling layer. The final stage, i.e., the RPN, is composed of two fully connected layers ending in a softmax classification layer, which determines whether the proposed ROIs belong to the class of tags or not. In order to further assess the reliability of the detected tags, a CNN-based classifier is trained to classify the most promising ROIs detected by the Faster R-CNN as tags. It consists of a shallow CNN model that takes as input a crop of the full-size original image containing only the ROIs detected by the Faster R-CNN. Its middle stage consists of a single convolutional layer followed by ReLU and max-pooling layers, while a final fully connected layer and a softmax layer are in charge of classifying the image. An overall view of the network model is depicted in Fig. 2. It can be noticed that the network is rather shallow, since the objective of the proposed architecture is both to perform online recognition on UAVs with limited computational resources and to avoid overfitting due to the limited number of samples in the dataset.
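As an illustration only, a minimal PyTorch sketch of such a shallow classifier is given below. The number of filters and the kernel size are assumptions made for the example (the actual values belong to the configuration in Table 1); the 250 \(\times \) 250 grayscale input matches the ROI crops described in Sect. 4:

```python
import torch
import torch.nn as nn

class ShallowTagClassifier(nn.Module):
    """Shallow CNN as described in the text: one convolutional layer
    followed by ReLU and max-pooling, then a fully connected layer with
    softmax (applied here by the loss function). Filter count and kernel
    size are illustrative assumptions, not the values of Table 1."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # Grayscale 250x250 ROI crop in, 16 feature maps out (assumed sizes).
            nn.Conv2d(1, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 250x250 -> 125x125
        )
        self.classifier = nn.Linear(16 * 125 * 125, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        # Raw logits; nn.CrossEntropyLoss applies the softmax during training.
        return self.classifier(x)
```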

4 Experimental Results

Before introducing the achieved results and the data used for the experimentation, Table 1 provides further details of the network model configurations.

The experimentation has been carried out on ad-hoc RGB videos acquired at 720p resolution in real warehouse environments, using a drone performing a vertical scan of the packs on the shelves. From those videos, 292 frames have been extracted. The frames have been selected among those containing tags as well as other pieces of paper that do not represent tags. Even though training and testing were performed offline on commodity hardware, the feasibility of real-time processing of the proposed solution is supported by the results reported in [16]. The dataset has then been manually labelled to build the ground truth for the evaluation of classification performance. The experiments have been performed both on the original RGB colour images and on grayscale-converted video frames. Precision/Recall (PR) curves have been used to present the achieved results, together with the average precision (AP) metric. We recall that precision is the ratio of true positive detections to all positive detections returned by the detector, based on the ground truth, while recall is the ratio of true positive detections to the sum of true positives and false negatives, based on the ground truth. Average precision is computed over all the detection results and returned as a numeric scalar in the range [0, 1].
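As a minimal sketch (not the paper's evaluation code), precision and recall can be computed from the detection counts as follows:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    denom_p = true_positives + false_positives
    denom_r = true_positives + false_negatives
    precision = true_positives / denom_p if denom_p else 0.0
    recall = true_positives / denom_r if denom_r else 0.0
    return precision, recall
```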

According to the designed model, we first trained the Faster R-CNN on the training set, corresponding to 70% of the dataset, with the number of epochs set to 10. The PR curves in Fig. 3 show that working with grayscale images introduces a slight improvement in performance, even though the two experiments are roughly equivalent in terms of average precision (0.87 and 0.85 for the grayscale and RGB training, respectively).
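A minimal sketch of the 70/30 split; the fixed seed and the shuffling policy are assumptions for reproducibility, not stated in the paper:

```python
import random

def split_dataset(frames, train_fraction=0.7, seed=0):
    """Shuffle the labelled frames and split them into training and
    test sets; the 70/30 ratio follows the text."""
    rng = random.Random(seed)
    frames = frames[:]  # copy before shuffling
    rng.shuffle(frames)
    cut = int(len(frames) * train_fraction)
    return frames[:cut], frames[cut:]
```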

Fig. 3.

Training performance of the Faster R-CNN on the RGB training set and on the grayscale-converted training set.

Once the Faster R-CNN training was complete and the ROIs generated, the shallow CNN-based classifier has been trained on all the extracted ROIs as input, again with the number of epochs set to 10. In total, 89 ROIs of tags and 131 of non-tags have been manually labelled. According to the CNN model constraints, the images have been scaled to a 250 \(\times \) 250 resolution and used to train different configurations of the classifier. First, the classifier has been used in its designed configuration, as described in Table 1 above (ConvNet1), and also with an extra convolutional layer and ReLU layer added (ConvNet2Conv). The starting learning rate has been set to \(10^{-3}\) and a mini-batch size of 50 samples per iteration has been used. Moreover, we also performed data augmentation of the dataset by introducing slight variations in translation, rotation and scaling of the tags. Reducing the learning rate to \(10^{-5}\) and the mini-batch size to 32 samples, the training has been performed on both RGB images (NewConvNet) and grayscale ones (ConvNetGray), resulting in the PR curves shown in Fig. 4. On each plot, the PR curves of the different configurations mentioned above are presented, while the average precision is reported in the figure legend beside each variant of the CNN-based classifier. It is clear that the data augmentation and the lower learning rate significantly improved both the recall and the precision of the classification.
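A possible sketch of such an augmentation with torchvision; the exact variation ranges are assumptions, since the text only mentions slight translations, rotations and scalings:

```python
from torchvision import transforms

# Slight, label-preserving variations of the tag crops; the ranges
# below are illustrative assumptions.
augment = transforms.Compose([
    transforms.RandomAffine(
        degrees=5,               # small rotations
        translate=(0.05, 0.05),  # up to 5% shift in x and y
        scale=(0.9, 1.1),        # mild zoom in/out
    ),
    transforms.Resize((250, 250)),  # match the classifier input size
    transforms.ToTensor(),
])
```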

In an attempt to further improve the average precision of the proposed architecture, we used the weights of the layers of the CNN-based classifier trained on grayscale images of ROIs containing the tags (ConvNetGray). Such weights have been used to initialise the convolutional layers of the Faster R-CNN, which has then been trained on the RGB input images of the training set. Figure 5 shows how such a solution significantly improved the classification performance.
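A minimal sketch of this kind of weight transfer between two PyTorch modules, matching parameters by name and shape; the matching policy is an assumption, since the paper does not detail how the layers of the two models correspond:

```python
import torch.nn as nn

def transfer_matching_weights(src: nn.Module, dst: nn.Module) -> int:
    """Initialise dst's layers with the weights of src's layers that
    match by parameter name and tensor shape; returns how many tensors
    were copied."""
    src_state = src.state_dict()
    dst_state = dst.state_dict()
    copied = 0
    for name, tensor in src_state.items():
        if name in dst_state and dst_state[name].shape == tensor.shape:
            dst_state[name] = tensor.clone()
            copied += 1
    dst.load_state_dict(dst_state)
    return copied
```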

Fig. 4.

The PR curves of classification achieved in the different configurations of the CNN-based classifier. (Top) The PR curves of classification when training the CNN with RGB images. (Bottom) The classification curves using the CNN trained on grayscale images.

Fig. 5.

A comparison of Faster R-CNN performance under different training conditions.

Fig. 6.

PR curves obtained by using the Faster R-CNN pre-trained on grayscale images for tag detection, combined with the different configurations of the CNN-based classifier for classification. (Color figure online)

Following this intuition, we tested the performance of the Faster R-CNN with layers pre-trained on grayscale images, and we again provided the resulting ROIs to the CNN-based classifier in all the configurations discussed above. Figure 6 summarises the level of performance achieved in all the treatments. We can observe that the pre-training of the Faster R-CNN introduced a significant increase in the PR curves for all the conditions considered. Even though a higher level of performance is achieved using RGB images, working with grayscale images leads to a desirable reduction in computing demand. Since the algorithm has to run on board the aerial vehicle, we chose to test the grayscale configuration (Faster R-CNN pre-trained with grayscale images + CNN on grayscale ROIs) to assess the reliability of the proposed solution. Inspecting the responses of the CNN models, we observed that the Faster R-CNN model reaches a detection confidence that is most of the time above the threshold of 0.98, while introducing a negligible rate of false positives. Moreover, when the Faster R-CNN is wrong, i.e., it detects a ROI as containing a tag when it does not or vice versa, the CNN-based classifier also fails most of the time in correctly classifying the tag. This often happens when occlusions or the limited field of view of the camera do not allow the tags to be unambiguously acquired. This suggests a feasible solution in which the two-stage approach, i.e., Faster R-CNN + shallow CNN, is used only for training, while the Faster R-CNN alone is used to detect and recognise the tags at inference time.
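A minimal sketch of this inference-time simplification, keeping only the detections above the observed 0.98 confidence threshold; the layout of the boxes and scores is an assumption for illustration:

```python
def keep_confident_detections(boxes, scores, threshold=0.98):
    """Keep only the Faster R-CNN detections whose confidence exceeds
    the threshold, dropping the shallow-CNN verification stage at
    inference time."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]
```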

5 Conclusions

Warehouse management is commonly associated with complex and dynamic processes that present critical problems for warehouse managers across industries. Accurate inventorying procedures become increasingly challenging in wider and wider warehouse environments. Inaccurate inventory causes problems such as maintaining improper stock levels and buildups of obsolete inventory. Seasonal fluctuations in demand, as well as the speed of purchases in electronic markets, represent critical issues to face, and proper logistics operations can determine either success or major failures.

In this paper, a simple solution based on Convolutional Neural Networks for warehouse inventorying has been presented. Using unmanned aerial vehicles, the objective is to enable an autonomous warehouse inventorying system in which drones are able to localise and recognise the packs in stock and, when needed, to signal missing products or a decreasing supply of some products. Even though the development of micro aerial vehicles has strongly accelerated over recent years, they still suffer from high battery consumption and limited processing power. At the same time, machine learning and deep learning techniques for computer vision tasks have achieved considerably high performance at the cost of a proportional computational demand, which is far from being compatible with portable computing devices. To this purpose, the solution proposed in this work is meant to require limited computing power while still meeting the requirements of a precise and accurate autonomous scan of a warehouse environment. The solution is based on a two-stage learning process, where a Faster R-CNN is used to detect the tags during the aerial check of the packs on the shelves. Once possible tags have been localised, the regions of interest extracted from the video acquisition are promptly sent to a shallow CNN-based classifier, which is in charge of classifying the right tags and differentiating them from other pieces of paper or residual notes on the packs that do not contribute to the inventory procedures.

The experimental analysis, carried out on real video acquisitions in a warehouse facility during a vertical aerial scan of the stocks, has demonstrated the feasibility of the proposed approach. The results achieved with different configurations of the CNN models show that a high precision can be reached, with an average precision above 80%. Observations of the CNN behaviour during training led to a round of experiments that showed how to benefit from the exclusive use of the Faster R-CNN model at inference time, after training the shallow classifier and reusing its pre-trained layers in the Faster R-CNN. Precision aside, such an attempt demonstrated a computational lightening of the proposed solution, which further confirms the feasibility of autonomous warehouse inventorying by unmanned aerial vehicles and which can be used for future improvements of the approach proposed in this study.