1 Introduction

3D object detection and tracking play a pivotal role in many areas, such as augmented reality, robotic manipulation, and autonomous driving. Most current 3D object detection and tracking methods depend on the existence of 3D digital models [1]. Detection and tracking are implemented by matching local features such as feature points, contours, or normals between the current 2D image and projected images of the 3D model. However, it is difficult to obtain an accurate textured 3D model or CAD model of a 3D object with only a monocular camera. Besides, traditional methods show limited capabilities, especially in cluttered scenes. Therefore, it is essential to develop convenient and effective 3D object detection and tracking strategies.

This paper presents a novel method for 3D object detection and tracking in monocular images, which combines visual Simultaneous Localization and Mapping (vSLAM) and deep learning-based object detection. The method contains two stages: a 3D object reconstruction stage, and a recognition and tracking stage. During the first stage, only the 3D feature points reconstructed from the target object area are retained, with the help of deep learning-based object detection. The resulting 3D object map contains no points of the environment and therefore does not interfere with tracking. During the second stage, candidate objects are first detected in the observed image to assist recognition. The aim of this paper is to put forward a method to interact with a 3D object effectively and quickly, which eliminates the need for an existing 3D model and improves performance in cluttered environments. Deep learning-based detection is utilized to recognize and detect the object, which improves robustness in cluttered environments. Besides, an effective and efficient outlier removal algorithm is proposed, tailored to maps reconstructed by vSLAM.

2 Related Work

Methods of 3D object detection and tracking can be roughly divided into two categories according to whether they require existing 3D models.

Most current methods rely on existing 3D models. Multiple images are generated by projecting an existing 3D model from various angles, and these images are treated as templates. The problem is then transformed into comparing the current image with the templates. Lepetit et al. [1] presented a survey of monocular model-based 3D tracking methods for rigid objects. Prisacariu et al. [2] proposed a probabilistic framework for simultaneous region-based 2D segmentation and 2D-to-3D pose tracking using a known 3D model. Hinterstoisser et al. [3] proposed a novel image representation based on spread image gradient orientations for template matching and represented a 3D object with a limited set of templates. Their method can be extended by taking 3D surface normal orientations into account if a dense depth sensor is available. With the development of deep learning, convolutional neural network-based methods have been proposed. Crivellaro et al. [4] predicted the 3D pose of each part of the object in the form of the 2D projections of a few control points with the help of a Convolutional Neural Network (CNN). The 6D pose can then be calculated with an efficient PnP algorithm [5]. However, these algorithms need large amounts of training data and are difficult to extend.

Some works do without 3D models by utilizing SLAM methods. Feng et al. [6] proposed an online object reconstruction and tracking system, which segments the object from the background, then reconstructs and tracks it using visual simultaneous localization and mapping (SLAM) techniques. However, their method shows weak resistance to cluttered environments, and the reconstruction error accumulates when dealing with the full object. Besides, Jason et al. [7] proposed a system that first scans the object and then starts tracking, but their 3D scanning is based on the structured light principle. Compared with these methods, we propose to reconstruct the sparse points of the object ahead of detection and tracking by utilizing a real-time SLAM system. Besides, in order to increase resistance to cluttered environments, we utilize deep learning-based object detection.

Fig. 1. Architecture of the proposed method.

3 Method

An overview of our method is illustrated in Fig. 1, and the detailed processing steps are elaborated in two subsections: 3D object mapping, and 3D object recognition and tracking.

3.1 3D Object Mapping

The object mapping stage consists of three steps: 3D object detection, 3D object mapping, and outlier removal. In the first step, the 2D bounding box of the target object is detected in the observed frame. In the second step, the target object is mapped using vSLAM. In the third step, a novel filtering step removes noisy points.

3D Object Detection. The 2D bounding box of the target object in the observed frame must be detected first, since we need to reconstruct the 3D map of the target object without points of the environment.

A deep learning-based object detection method is used to accomplish this task. Current deep learning-based detection networks, such as Faster R-CNN [8] and PVANET [9], show satisfactory results on object detection tasks. We select PVANET [9] to obtain the 2D bounding box of the object, considering the tradeoff between speed and accuracy. PVANET predicts results in two steps, similar to Faster R-CNN, but adopts concatenated ReLU, Inception, and HyperNet modules to reduce the expense of multi-scale feature extraction, and trains the network with batch normalization, residual connections, and learning rate scheduling based on plateau detection. These strategies let it achieve state-of-the-art performance in real time. In this work, the object detector is used to acquire bounding boxes of common commodities rather than their specific product names, so we classify these commodities into four categories according to their 3D shape: carton, bottle, can, and barrel. Samples of the different shapes are shown in Table 1. In order to reduce the impact of color on training, half of the collected images are converted to grayscale and added to the training set.
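As a concrete illustration, the grayscale augmentation could be implemented along the following lines (a minimal sketch assuming OpenCV; the file-naming scheme is hypothetical, as the paper does not describe its data pipeline):

```python
import random
import cv2

def add_grayscale_copies(image_paths, ratio=0.5, seed=0):
    """Convert a random subset of the training images to 3-channel
    grayscale and save them as additional training samples, so the
    detector relies less on color cues."""
    rng = random.Random(seed)
    for path in rng.sample(image_paths, int(len(image_paths) * ratio)):
        img = cv2.imread(path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Replicate to three channels so the network input is unchanged.
        gray3 = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
        cv2.imwrite(path.replace(".jpg", "_gray.jpg"), gray3)
```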

Table 1. Categories of the commodities used for training.

Since all regions of potential objects are detected in the observed frame, the target object is selected manually. During the following frames, a 2D tracking algorithm obtains the region of the target object consistently. Among existing 2D trackers, TLD [10] is selected for its high accuracy. Note that we enlarge each tracking result by 10%, which ensures that the target object lies inside the detected region and improves stability.
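For illustration, enlarging a tracker box might look like the sketch below; the paper does not state whether the 10% is applied per side or overall, so we assume an overall 10% enlargement split evenly between the sides:

```python
def enlarge_box(x, y, w, h, img_w, img_h, margin=0.10):
    """Grow an (x, y, w, h) box by `margin` overall, clamped to the image."""
    dx, dy = w * margin / 2.0, h * margin / 2.0
    x0, y0 = max(0.0, x - dx), max(0.0, y - dy)
    x1 = min(float(img_w), x + w + dx)
    y1 = min(float(img_h), y + h + dy)
    return x0, y0, x1 - x0, y1 - y0
```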

Visual SLAM-Based 3D Object Mapping. In order to conduct object reconstruction in real time, we utilize a visual SLAM algorithm, selecting ORB-SLAM [11] since it achieves stable and accurate results. The input of ORB-SLAM is a video stream. During mapping, the object is kept still on a desk. ORB-SLAM automatically finds two frames to initialize and computes the initial map from the matched ORB features. During mapping, the poses of the camera with respect to the object are estimated, new keyframes are added, and new 3D points are added to the map. Meanwhile, the object region is tracked by the 2D tracker, and the bounding box is stored for each keyframe of ORB-SLAM. After scanning the object for a full round, the 3D points of the map that were reconstructed from the bounding-box regions of the object are segmented from the original map, as shown in Fig. 2(b).
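Conceptually, the segmentation keeps the map points whose projections fall inside the stored object boxes. A minimal sketch is given below; requiring a point to be observed inside the box in several keyframes (`min_views`) is our own assumption for robustness, not a detail stated in the paper:

```python
import numpy as np

def segment_object_points(points, keyframes, min_views=2):
    """Keep 3D map points (Nx3, world frame) whose projections fall
    inside the stored 2D object box in at least `min_views` keyframes.
    Each keyframe is (R, t, K, box) with box = (x0, y0, x1, y1)."""
    kept = []
    for X in points:
        hits = 0
        for R, t, K, (x0, y0, x1, y1) in keyframes:
            Xc = R @ X + t                 # world -> camera
            if Xc[2] <= 0:                 # behind the camera
                continue
            u, v, w = K @ Xc               # pinhole projection
            u, v = u / w, v / w
            if x0 <= u <= x1 and y0 <= v <= y1:
                hits += 1
        if hits >= min_views:
            kept.append(X)
    return np.asarray(kept)
```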

Outliers Removal. Since we use detection rather than segmentation, noisy points are inevitably brought in. Detection is considerably faster than segmentation, and the background actually improves the accuracy and robustness of pose estimation during mapping by providing more feature points. However, the bounding-box region in Fig. 2(a) contains areas that belong to the background, and these areas are reconstructed into the green points in Fig. 2(b). Aiming at these noisy points, an outlier removal algorithm is proposed, tailored to maps built by visual SLAM.

The outlier removal procedure is illustrated in Algorithm 1. The initial map is reconstructed from the first two keyframes, and the \(z\)-axis points toward the center of the object; this information is utilized for the removal of outliers. Since many outliers lie far away from the object, the resolution is computed on a subset of the map; otherwise, the resolution would be too large for filtering. Since some areas of the object region can be quite sparse, it is also not suitable to remove outliers according to point cloud density alone. \({S_I}\) in Algorithm 1 is regarded as the final map of the target object and is stored in the object map database. Results of outlier removal are shown in Fig. 2(c).

Algorithm 1. Outlier removal for the object map built by visual SLAM.
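Since Algorithm 1 appears only as an image in the original, the sketch below is our hedged reading of the description above: the map resolution is estimated on the subset of points nearest the object center, so that distant outliers cannot inflate it, and points lying far outside that subset are discarded. The parameters `subset_ratio` and `k` are illustrative, not the paper's values:

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, subset_ratio=0.5, k=3.0):
    """Drop map points far from the object, using a resolution
    estimated on the inner portion of the cloud only."""
    center = np.median(points, axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    order = np.argsort(dist)
    inner = points[order[: int(len(points) * subset_ratio)]]
    # Average nearest-neighbour spacing of the inner subset.
    nn_dist, _ = cKDTree(inner).query(inner, k=2)
    resolution = nn_dist[:, 1].mean()
    radius = dist[order[len(inner) - 1]]   # radius of the inner subset
    return points[dist <= radius + k * resolution]
```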
Fig. 2. The reconstructed maps. (a) One of the keyframes. (b) The segmented object points; since the object area contains background regions, outliers are introduced. (c) The result of outlier removal.

3.2 3D Object Recognition and Tracking

The 3D object recognition and tracking stage consists of two steps: 3D object recognition and 3D object tracking. In the first step, the target object is recognized and the corresponding map is loaded for relocalization. In the second step, the target object is tracked using a template-based method.

3D Object Recognition. At this point, many commodity maps have already been stored in the object map database. The object in the current frame must first be recognized, and the corresponding map is then loaded for relocalization and tracking.

In order to improve recognition robustness in cluttered environments, candidate objects are first detected using the deep learning-based detection algorithm of Sect. 3.1, since the background may introduce severe interference. Within the candidate object regions, the target object is recognized by comparing the current image with the keyframes of the object maps in the object map database, as shown in Fig. 3.

Fig. 3. Each candidate object region is compared with the keyframes of each map.

We propose a fast method for comparing the current frame with keyframes, which utilizes DBoW2 [12]. During the reconstruction of each object map, DBoW2 features are extracted for all of its keyframes. Firstly, the inverted index files, which record the relationship between each word and the related keyframes, are built for each object map. Secondly, the DBoW2 words are extracted for the input frame. Thirdly, for the keyframes of one object map, the keyframe sharing the largest number of words with the input frame is found via the inverted index files. Fourthly, all maps are traversed, and the number of shared words of each map's most similar keyframe is recorded. Finally, the keyframe with the most shared words is selected, the corresponding map is loaded, and relocalization is conducted. A successful relocalization indicates a correct recognition; when relocalization fails, another frame is acquired and the recognition step is repeated.
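The lookup can be pictured as a word-voting scheme over an inverted index, sketched below in Python for clarity; the real system uses DBoW2's C++ API, and the class and method names here are illustrative:

```python
from collections import defaultdict

class ObjectRecognizer:
    def __init__(self):
        # word id -> list of (map_id, keyframe_id) observations
        self.inverted_index = defaultdict(list)

    def add_keyframe(self, map_id, kf_id, words):
        """Register a keyframe's bag-of-words during map building."""
        for w in set(words):
            self.inverted_index[w].append((map_id, kf_id))

    def recognize(self, query_words):
        """Return the (map_id, keyframe_id) sharing the most words
        with the query frame, or None if nothing matches."""
        votes = defaultdict(int)
        for w in set(query_words):
            for entry in self.inverted_index.get(w, []):
                votes[entry] += 1
        return max(votes, key=votes.get) if votes else None
```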

3D Object Tracking. After a successful recognition, the map is loaded and an initial pose is obtained by relocalization. A template matching-based tracking method is then adopted for tracking.

The tracking follows the tracking thread of ORB-SLAM. Firstly, a new frame is acquired from the camera, and a prior pose is estimated according to a constant-velocity motion model. Secondly, the reconstructed object map points are projected into the image according to the frame's prior pose estimate, and correspondences between the 2D feature points in the current frame and the projected 2D points are obtained by comparing ORB features. A motion-only bundle adjustment is then performed to optimize the camera pose. This motion-only BA optimizes the camera orientation \(R \in SO(3)\) and position \(t \in {R^3}\), minimizing the reprojection error between matched 3D points \({X^i} \in {R^3}\) in world coordinates and keypoints \({x^i} \in {R^2}\), with \(i \in \chi \) the set of all matches, as shown in Eq. 1.

$$\begin{aligned} \{ R,t\} = \mathop {\arg \min }\limits _{R,t} \sum \limits _{i \in \chi } {\rho ({{\left\| {{x^i} - \pi (R{X^i} + t)} \right\| }^2})} \end{aligned}$$
(1)

where \(\rho \) is the robust Huber cost function and \({\pi ()}\) is the projection function.
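For intuition, Eq. 1 can be solved generically with a robust nonlinear least-squares solver, as in the sketch below; ORB-SLAM itself performs this optimization in g2o with analytic Jacobians, so this SciPy version is only a didactic stand-in:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def motion_only_ba(X, x, K, rvec0, t0, huber_delta=2.0):
    """Optimize the camera pose only, minimizing the Huber-robustified
    reprojection error of Eq. (1).  X: Nx3 world points, x: Nx2
    keypoints, K: 3x3 intrinsics, (rvec0, t0): initial pose guess."""
    def residuals(p):
        R = Rotation.from_rotvec(p[:3]).as_matrix()
        Xc = X @ R.T + p[3:]           # world -> camera frame
        uv = Xc @ K.T
        proj = uv[:, :2] / uv[:, 2:3]  # perspective division, pi()
        return (proj - x).ravel()

    res = least_squares(residuals, np.concatenate([rvec0, t0]),
                        loss="huber", f_scale=huber_delta)
    return Rotation.from_rotvec(res.x[:3]).as_matrix(), res.x[3:]
```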

The poses of the camera w.r.t. the target object are obtained directly, since no environmental points exist in the 3D object map. In case the tracking is lost and relocalization fails repeatedly, the deep learning-based object detector is invoked to detect candidate objects; the inverted index files are then used to recognize the object in the input frame, and relocalization is conducted again. In our method, when tracking has been lost for four seconds, object recognition is conducted until relocalization succeeds.
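The recovery behavior can be summarized as a small timeout loop; the sketch below is schematic, and the `tracker`/`recognizer`/`camera` interfaces are hypothetical:

```python
import time

LOST_TIMEOUT = 4.0  # seconds, as stated above

def run(tracker, recognizer, camera):
    """Track; after 4 s of lost tracking, fall back to detection and
    recognition until relocalization succeeds."""
    lost_since = None
    while True:
        frame = camera.grab()
        if tracker.track(frame):
            lost_since = None
            continue
        lost_since = lost_since or time.time()
        if time.time() - lost_since >= LOST_TIMEOUT:
            if recognizer.recognize_and_relocalize(frame):
                lost_since = None
```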

4 Experiments

In this section, we evaluate the accuracy and robustness of the proposed method, and make comparisons and discussions. The experiments are conducted on a PC with an Intel(R) Core(TM) i5-4460 processor (3.2 GHz), an NVIDIA GeForce GTX 1080 graphics card, and a calibrated Logitech Pro C920 (\(640 \times 480\)) webcam. The proposed method achieves real-time performance.

4.1 Accuracy

3D Object Detection. In this section, we evaluate the accuracy of the detection of commodities in daily environments.

We captured and labeled more than 2.1 million bounding boxes for 75 kinds of commodities. These cover the different colors, shapes, and textures of commodities that are common in daily life. One tenth of the dataset is held out as the test set. We classified these goods into the 4 categories described in Sect. 3.1 during training. Detection accuracy is evaluated by the Intersection over Union (IoU) score. The Average Precision (AP) of each category and the mean Average Precision (mAP) across all 4 categories are exhibited in Table 2, and some detection results for different kinds of commodities are shown in Fig. 4.
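For reference, the IoU score used here is the standard one; a minimal sketch for axis-aligned boxes in (x0, y0, x1, y1) form:

```python
def iou(a, b):
    """Intersection over Union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```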

Table 2. Accuracies of the 4 categories.
Fig. 4. The detected commodities in the images. Each commodity is detected with high confidence.

3D Object Mapping. In this section, we evaluate the accuracy of the mapping on four commodities, which have different textures and geometries. Two factors are evaluated: the accuracy of the 2D tracking during reconstruction, and the accuracy of the reconstructed 3D object map.

For the accuracy of 2D tracking, we calculate the average IoU score on the keyframes of each map. During reconstruction, we store the keyframe images and record the 2D tracking results for the target object, and we manually select bounding boxes of the target object as the ground truth. The average IoU score for each kind of commodity is shown in Table 3, from which we can see that accurate tracking results are achieved.

For the accuracy of the reconstructed 3D object map, two metrics are adopted: the repetition ratio \({L_r}\) and the reference error ratio \({\delta _r}\). Let the reconstructed 3D object map be \({M_R}\). We manually segment the object in the keyframes and extract the 3D points that are reconstructed from these object regions; these 3D points are regarded as the ground truth 3D object map \({M_G}\). Let \({N_L}\) denote the number of points shared between \({M_R}\) and \({M_G}\); the repetition ratio \({L_r}\) is then the ratio between \({N_L}\) and \({N_R}\), where \({N_R}\) is the number of points in the reconstructed map. The reference error \({\varepsilon _r}\) is the average Euclidean distance between each point of \({M_R}\) and its nearest point of \({M_G}\), as shown in Eq. 2. The reference error ratio \({\delta _r}\) expresses \({\varepsilon _r}\) as a percentage of the diameter of \({M_G}\). The distances between the reconstructed map and the ground truth map of two commodities are shown in Fig. 5, from which we can see that most points in the reconstructed map belong to the ground truth map. The results of \({L_r}\) and \({\delta _r}\) for the four commodities are listed in Table 3; accurate results are achieved.

$$\begin{aligned} {\varepsilon _r} = \frac{1}{{{N_R}}}\sum \limits _{{p_r} \in {M_R}} {\mathop {\min }\limits _{{p_g} \in {M_G}} \left\| {{p_r} - {p_g}} \right\| } \end{aligned}$$
(2)
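The three map metrics can be computed straightforwardly with a k-d tree, as in the sketch below; the matching tolerance `tol` that decides when a reconstructed point counts as repetitive is our assumption, since the paper does not state one:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import pdist

def map_accuracy(M_R, M_G, tol):
    """Compute L_r, epsilon_r, and delta_r for a reconstructed map
    M_R (Nx3) against a ground truth map M_G (Mx3)."""
    nn, _ = cKDTree(M_G).query(M_R)      # nearest-GT distance per point
    L_r = float(np.mean(nn <= tol))      # repetition ratio N_L / N_R
    eps_r = float(nn.mean())             # reference error, Eq. (2)
    delta_r = eps_r / pdist(M_G).max()   # fraction of the GT diameter
    return L_r, eps_r, delta_r
```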
Fig. 5. The distances between the reconstructed map and the ground truth map.

Table 3. Accuracies of the tracking results and the reconstructed map.

3D Object Tracking. In this section, we evaluate the accuracy of our tracking system by conducting an experiment similar to the one in Ref. [13]. A drink box, whose map has been reconstructed, is moved on a planar desk, and the trajectory of the box is estimated by our system. The estimated trajectory of the centroid of the target box is shown in Fig. 6; our system indeed retrieves a trajectory that approximately lies on a plane.

Fig. 6. Estimated trajectories of the centroid of the target box.

4.2 Robustness

In this section, we evaluate the robustness of our tracking system. Tracking is considered stable when at least 15 pairs of matched feature points are found in the current frame. In order to show the tracking results vividly, we designed an augmented reality system in which a virtual toy is attached to the target object according to the poses obtained from our system. The criteria we use include robustness to scale change, small visible regions, different viewing angles, dynamic backgrounds, fast motion, and partial occlusion. From Fig. 7, we conclude that our tracking system has strong resistance to these situations.

4.3 Comparisons and Discussions

In this section, we compare our algorithm with existing methods. The main advantage of our system is that we do not need a 3D model of the target object, which extends the range of objects that can be augmented, whereas most current methods [2, 3] need existing 3D models. The work most similar to ours is that of Feng et al. [6]. Compared with theirs, our method is more versatile and allows the reuse of 3D maps. Besides, the reconstruction of the target object utilizes ORB-SLAM, which includes loop closure detection, so the maps built are more precise. What's more, our method supports multiple objects, as long as their 3D maps have been reconstructed.

Fig. 7. Robustness in different situations: scale, dynamic environment, different angles, and small visible regions.

There exist situations where our reconstruction fails. Since ORB-SLAM relies on feature points for finding correspondences, our algorithm cannot handle commodities without texture or with high reflection. This is a case where all current feature point-based methods face difficulties.

5 Conclusions

Most current 3D object detection and tracking methods require existing 3D models as a prior condition. However, this requirement limits the creativity of individual users. In this paper, we present a method for 3D object detection and tracking without the need for existing 3D models. Instead, we reconstruct coarse models in advance and then conduct recognition and tracking. Our method combines deep learning-based 2D object detection, visual SLAM, and template-based object tracking. Users can first "scan" the object and then track it. The deep learning-based detection framework can handle objects in cluttered environments and supports multiple objects. Our method has been verified on commodities and shows satisfactory results.