1 Introduction

Simultaneous Localization and Mapping (SLAM) systems are widely used for autonomous robot exploration in both indoor and outdoor environments. One major application field for SLAM-based mobile robots is Ambient Assisted Living (AAL) [7]: assistive robots designed to help disabled individuals or people with reduced mobility move more easily in daily life. In most cases, these robots provide their service on level ground in indoor environments (from room to room, or inside a building with corridors) in the form of wheelchairs [22], smart walkers [33] or robot coaches [12]. AAL robots often combine inputs from multiple sensors (e.g. LiDAR, sonar and cameras) to achieve more robust localization, which leads to complex hardware integration and excessive cost [6].

With the rapid development of visual SLAM [35], many systems are now able to track the camera and build a map in real time from purely visual information. Visual SLAM has been an active research topic for more than twenty years, with contributions coming from robotics, computer vision and other related fields.

The emergence of visual SLAM systems like ORB-SLAM [25] makes it possible to build mobile AAL robots with low-cost hardware, e.g. a single camera running on an embedded system. Visual SLAM also builds a 3D map of the environment, which provides more useful information about the surroundings for tasks like obstacle avoidance than traditional 2D maps. Since AAL robots only perform in-plane navigation on the level ground of indoor environments, visual SLAM systems like ORB-SLAM can be further optimized by reducing 6DoF tracking to 3DoF. For example, ORB-SLAM is based on the ORB feature [27], a fast alternative to SIFT [19] or SURF [2]. ORB uses a rotation-invariant descriptor, rotated-BRIEF, which is useful for 6DoF tracking (e.g. with a hand-held camera) but not necessary for in-plane navigation.

Aiming to build a monocular SLAM system for AAL robots, which usually run on embedded systems, we want to further optimize a state-of-the-art visual SLAM framework by finding appropriate lightweight descriptors that improve real-time tracking performance and reduce computational cost for in-plane navigation. In this paper, based on the framework of ORB-SLAM, a milestone of feature-point-based SLAM systems, we compare different lightweight local descriptors by evaluating their influence on system performance in level ground navigation scenarios.

2 Related Work

2.1 Monocular SLAM

Visual SLAM can be performed with a single monocular camera, the simplest and cheapest sensor setup among all choices. This simplicity allows monocular SLAM to run on embedded systems or smartphones with minimal hardware integration effort, which has encouraged many years of research on the topic. Monocular SLAM algorithms have evolved from filtering to keyframe-based bundle adjustment (BA), with many implementations lying in the middle ground between the two. Filtering methods maintain a probability distribution over the information gained from all past frames: every frame is processed by the filter to jointly estimate the map feature locations and the camera pose [8]. Keyframe-based approaches [23], by contrast, estimate the map using global bundle adjustment over only a small number of past frames, which remains relatively efficient even when processing a large number of features from the keyframes. The work of Strasdat et al. [29] demonstrated that keyframe bundle adjustment outperforms filtering in terms of accuracy per unit of computing time, measured by entropy reduction and tracking error.

The most representative keyframe-based system is PTAM [16], which first introduced the idea of splitting camera tracking and mapping into parallel threads. Various systems have been proposed in recent years targeting different issues in the front end and back end, such as iSAM [14] and FrameSLAM [17]. Another family of methods, standing apart from both the filtering and keyframe frameworks, is direct SLAM, e.g. LSD-SLAM [9]. Direct SLAM methods build large-scale semi-dense maps by optimizing directly over image pixel intensities instead of performing bundle adjustment over features, which offers more potential for related applications.

However, some intrinsic problems of monocular vision, e.g. scale drift and failure under pure rotation, still make monocular SLAM difficult to initialize despite its simple hardware setup, which has led to the development of stereo and RGB-D vision systems.

2.2 Keypoint Features

Keypoint features are generally salient points (e.g. corners) encoded with information from local image regions that is invariant to viewpoint and lighting changes. Many visual SLAM systems use corner detectors in their tracking pipeline; e.g. FAST [26], a detector built with a machine learning approach, is often used in real-time applications, and an improved version of it is integrated in methods like ORB [27]. Besides corner detectors, another popular local descriptor is the Scale Invariant Feature Transform (SIFT) [19], which first achieved scale-invariant keypoint detection using histograms that capture the main properties of local appearance. However, the high-dimensional descriptor of SIFT makes it difficult to use in real-time settings, which has led to variants such as Speeded-Up Robust Features (SURF) [2], PCA-SIFT [15] and other lightweight local descriptors.

Lightweight local descriptors are designed primarily to be computation-efficient, so that descriptor generation and matching can run at frame rate. For example, the BRIEF descriptor [5] directly generates bit strings from simple binary tests in a smoothed image patch, and is augmented with rotation invariance by rotated-BRIEF (ORB). Unlike BRIEF, BRISK [18] and its successor FREAK [1] use a circular sampling pattern to compute intensity comparisons between point pairs. Another descriptor, LDB [34], computes a binary string for an image patch using simple intensity and gradient difference tests on pairwise grid cells, and is reported to achieve greater accuracy and faster speed on tracking tasks than state-of-the-art algorithms.
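To make this family of descriptors concrete, the following sketch computes a BRIEF-like bit string from pairwise intensity tests; the smoothing kernel, sampling distribution and patch size are illustrative assumptions rather than the exact parameters of the original BRIEF paper:

```python
import cv2
import numpy as np

def brief_like_descriptor(image, keypoint_xy, n_bits=256, patch_size=31, seed=0):
    """Toy BRIEF-style descriptor: n_bits pairwise intensity comparisons
    inside a Gaussian-smoothed patch centered on the keypoint.
    `image` is a 2D grayscale array; the keypoint is assumed to lie far
    enough from the image border that the whole patch fits."""
    smoothed = cv2.GaussianBlur(image, (9, 9), 2.0)
    half = patch_size // 2
    rng = np.random.default_rng(seed)  # fixed seed -> same pattern for every keypoint
    # Point pairs drawn from an isotropic Gaussian, clipped to the patch.
    pairs = np.clip(rng.normal(0.0, patch_size / 5.0, size=(n_bits, 4)),
                    -half, half).astype(int)
    x, y = int(keypoint_xy[0]), int(keypoint_xy[1])
    bits = np.empty(n_bits, dtype=np.uint8)
    for i, (x1, y1, x2, y2) in enumerate(pairs):
        # Binary test: is the intensity at p1 smaller than at p2?
        bits[i] = smoothed[y + y1, x + x1] < smoothed[y + y2, x + x2]
    return np.packbits(bits)  # 256 bits -> 32 bytes, matched by Hamming distance
```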

Many evaluations and comparisons of keypoint detectors and descriptors have been conducted to help choose among the numerous options for a given application. Some surveys compare a specific group of algorithms, like Juan & Gwun's work [13] on SIFT-related methods, while others cover a broader range of detectors and descriptors [20, 32]. In the field of visual SLAM, there is also much existing work comparing the performance of interest point detectors and descriptors [3, 10, 24]. The common conclusion that can be drawn from these surveys is that there is a trade-off between accuracy and computational cost: SIFT and related methods offer better matching performance at high computational cost, while lightweight descriptors provide less precise matching at much higher speed [21].

The aforementioned evaluations cover a wide range of detectors and descriptors, but some recent advances like LDB have not been compared alongside them. Moreover, these studies mostly target general 6DoF tracking scenarios; the case of 3DoF in-plane level ground navigation has not been addressed yet.

3 Experiment

To find an optimized configuration of monocular SLAM systems for level ground navigation scenarios, we compared different lightweight local descriptors by evaluating their influence on system performance within the framework of ORB-SLAM. The descriptor used in ORB-SLAM is rotated-BRIEF (rBRIEF), i.e. BRIEF enhanced with rotation invariance. Since level ground scenarios involve only yaw rotation, BRIEF is already sufficient, and we expect more efficient tracking with BRIEF as rotation is not considered. Another lightweight, reportedly ultra-fast descriptor that we included in the evaluation is LDB [34]. As mentioned in Sect. 2, LDB is an efficient binary descriptor that has the same length as BRIEF (32 bytes) and is much shorter than BRISK (64 bytes) or SURF (64 floating-point dimensions). Other popular descriptors exceeding this length are excluded from the comparison.

In this experiment, we therefore compare three lightweight descriptors: BRIEF, ORB (rotated-BRIEF) and LDB (without rotation invariance).
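For reference, the ORB and BRIEF conditions can be prototyped with stock OpenCV components (a sketch assuming the opencv-contrib package for the BRIEF extractor; LDB has no built-in OpenCV implementation and would be plugged in via its authors' reference code):

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

# Single-scale FAST keypoints (ORB-SLAM uses a scale pyramid; simplified here).
detector = cv2.FastFeatureDetector_create(threshold=20)
kps = detector.detect(img)

# ORB condition: rotation-aware rBRIEF, 32-byte descriptors.
orb = cv2.ORB_create()
kps_orb, desc_orb = orb.compute(img, kps)

# BRIEF condition: plain BRIEF without orientation (needs opencv-contrib).
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create(bytes=32)
kps_brief, desc_brief = brief.compute(img, kps)
# The LDB condition would be wired in the same way using its reference code.
```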

3.1 Dataset

Existing Datasets. We first considered existing public visual SLAM datasets for the evaluation task. Datasets that satisfy our testing requirements should involve only yaw rotation and in-plane translation (3DoF), which excludes most hand-held sequences such as the TUM RGB-D benchmark [30] and the NYU Depth dataset [28]. Moreover, we prefer video recordings of indoor environments, as AAL robots are mostly designed for indoor service, which further filters out datasets for large-scale outdoor environments, e.g. the KITTI dataset [11] for car driving and the EuRoC dataset [4] for aerial vehicle navigation.

Finally, we selected two sequences from the TUM RGB-D dataset (using only the color images) that are designed for testing and debugging purposes: fr1/xyz and fr2/xyz. These two sequences contain only translational movement within a small range; this is not strictly “in-plane”, but no rotation is involved. The TUM dataset also provides a tool implementing two methods for calculating the error between the estimated trajectory and the ground truth, namely Absolute Trajectory Error (ATE) and Relative Pose Error (RPE), both of which are useful for comparing tracking performance.

Level Ground Sequences. Since we found few existing datasets for level ground indoor navigation, we decided to make recordings that satisfy the requirements mentioned above. We mounted a monocular camera on a robotic walker: a standard four-wheel (no motor control) assistive walker combined with different sensors. The user stands behind the walker and walks forward while pushing it by the handles. A laptop computer running the SLAM algorithm is placed on the walker and connected via USB to the camera mounted at the front.

We chose three types of trajectories to test: straight line, zigzag and octagon paths (Fig. 1). These have increasing complexity, and their combinations can represent most use cases that we encounter in level ground navigation. The length of each segment was chosen according to the room size.

Fig. 1. Trajectories of the level ground video sequences, from left to right: line, zigzag and octagon.

The level ground sequences used in this experiment were captured by a Logitech C525 camera with auto-focus turned off. The intrinsic parameters of the camera are: focal lengths \(f_x = 820.2028\) and \(f_y = 819.9700\), principal point \((u, v) = (255.4357, 222.3254)\), and radial distortion coefficients \(K_1 = 0.0378\) and \(K_2 = -0.3324\). The three sequences were recorded as 640 × 480 images at 30 fps. The line, zigzag and octagon sequences last 33, 54 and 110 s respectively, and are saved as 803, 1291 and 2647 images.
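For reproducibility, these intrinsics map onto OpenCV's pinhole model as follows (a minimal sketch; the tangential distortion terms are assumed zero since only \(K_1\) and \(K_2\) were calibrated, and the frame file name is hypothetical):

```python
import numpy as np
import cv2

# Camera matrix built from the calibrated intrinsics above.
K = np.array([[820.2028,   0.0,    255.4357],
              [  0.0,    819.9700, 222.3254],
              [  0.0,      0.0,      1.0   ]])
# Distortion vector [k1, k2, p1, p2]; tangential terms assumed zero.
dist = np.array([0.0378, -0.3324, 0.0, 0.0])

frame = cv2.imread("line_0001.png")          # hypothetical frame from a sequence
undistorted = cv2.undistort(frame, K, dist)  # remove radial distortion
```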

3.2 Performance Metrics

Time and accuracy are the two fundamental aspects that represent the real-time responsiveness and quality of a SLAM system. The performance metrics we use to evaluate the influence of the different descriptors are thus divided into the following groups:

Time: We logged the time used for descriptor generation and matching, since these directly reflect a descriptor's time efficiency. We also want to see the impact of the keypoint descriptor on overall system performance, so we measured the execution time of the whole SLAM process along with the time spent in each state: initialization, tracking and relocalization. A good SLAM system should spend less time initializing and relocalizing, leaving more time for tracking.
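A minimal sketch of the bookkeeping behind these measurements (ORB-SLAM itself is C++, so this only illustrates the accounting; the task names are illustrative):

```python
import time
from collections import defaultdict

task_time = defaultdict(float)

def timed(task, fn, *args, **kwargs):
    """Accumulate wall-clock time per task (e.g. 'generate', 'match', 'track')."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    task_time[task] += time.perf_counter() - t0
    return result

# After a full run, per-task proportions of the total SLAM time:
# total = sum(task_time.values())
# shares = {task: t / total for task, t in task_time.items()}
```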

Matching Accuracy: Keypoint matching between frames is used to recover the camera's change of pose, so we counted the number of matched keypoints, as more correct matches generally lead to a more accurately recovered pose. Regarding descriptor matching as a classification problem, with each keypoint an individual class, we can use J3 (Eq. 1) to quantify class separability based on the within-class and between-class scatter matrices \(S_w\) and \(S_b\) (Eq. 2) [31].

$$\begin{aligned} J_{3} = \operatorname{trace}\{S_{w}^{-1}S_{m}\} \end{aligned}$$
(1)
$$\begin{aligned} S_{m}=S_{w}+S_{b}=\sum _{i=1}^{M}p_{i}s_{i}+\sum _{i=1}^{M}p_{i}(\mu _{i}-\mu _{0})(\mu _{i}-\mu _{0})^{T} \end{aligned}$$
(2)

where \(S_{m}\) is the global covariance matrix. In \(S_{w}\), \(p_{i}\) and \(s_{i}\) are the prior probability and covariance matrix of class i; in \(S_{b}\), \(\mu_{i}\) is the mean feature vector of class i and \(\mu_{0}\) is the mean vector over all classes. A higher J3 value computed from all the binary strings of a descriptor indicates better matching capability. To compute J3, we selected 50 images at the end of each sequence and collected all descriptor binaries for keypoints extracted from the very first image. Only keypoints with more than 30 binary strings under all three descriptors were included.
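A sketch of the J3 computation from the collected descriptor binaries (the bit unpacking and the pseudo-inverse guard against a singular \(S_w\) are our implementation assumptions, not prescribed by Eqs. 1 and 2):

```python
import numpy as np

def j3_score(classes):
    """J3 separability score. `classes` holds one (n_i, d) float array per
    keypoint (class); binary descriptors are unpacked to bit vectors first,
    e.g. np.unpackbits(desc, axis=1).astype(float)."""
    n_total = sum(len(c) for c in classes)
    d = classes[0].shape[1]
    mu0 = np.vstack(classes).mean(axis=0)          # global mean vector
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        p = len(c) / n_total                       # class prior p_i
        mu = c.mean(axis=0)
        Sw += p * np.cov(c, rowvar=False, bias=True)   # within-class scatter
        Sb += p * np.outer(mu - mu0, mu - mu0)         # between-class scatter
    Sm = Sw + Sb                                   # global covariance matrix
    # pinv instead of inv: Sw can be singular when some bits never vary.
    return np.trace(np.linalg.pinv(Sw) @ Sm)
```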

Tracking Accuracy: Since we use part of the TUM RGB-D dataset, we can make use of the tools provided by its authors. Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) are two methods well suited to measuring the performance of visual SLAM systems when a ground-truth trajectory is available. Our level ground recordings have no ground-truth data compatible with the TUM tools, so these two methods are applied only to the TUM dataset.
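For intuition, ATE reduces to an RMSE after least-squares rigid alignment of the two trajectories; the sketch below is a minimal reimplementation of that idea (Umeyama alignment with an optional scale fit, since monocular SLAM recovers no metric scale), not the TUM evaluation script itself:

```python
import numpy as np

def ate_rmse(est, gt, with_scale=True):
    """RMSE of absolute trajectory error after least-squares alignment.
    est, gt: (n, 3) arrays of timestamp-associated camera positions.
    with_scale=True also fits a global scale (needed for monocular SLAM)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g                   # centered trajectories
    U, S, Vt = np.linalg.svd(E.T @ G)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = (U @ D @ Vt).T                             # rotation mapping est -> gt
    s = (S * np.diag(D)).sum() / (E ** 2).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    residuals = gt - (s * est @ R.T + t)           # per-pose alignment residuals
    return np.sqrt((residuals ** 2).sum(axis=1).mean())
```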

4 Results

We performed our tests on a laptop with an Intel(R) Core(TM) i7-5700HQ CPU @ 2.70 GHz and 8 GB of RAM, running Ubuntu 16.04 LTS. For each video sequence, we ran the SLAM system under each testing condition 10 times and averaged the performance. Hereafter we name each testing condition after the descriptor in use, i.e. the LDB, BRIEF and ORB conditions.

4.1 Time

Figure 2 shows the time performance over the whole video sequences under each condition for the two descriptor-related tasks: descriptor generation and matching. The results show that LDB is slightly quicker at keypoint matching but takes more time to generate the binary code than the other two methods, and this result is largely consistent across the different video sequences.

In addition to the absolute time spent on each task, we also computed the task's proportion of the total SLAM processing time, since the total time differs between conditions. On average, LDB has the highest proportion for descriptor generation (57.9%) and the lowest for keypoint matching (5.9%). ORB has the highest matching time proportion (7.6%), but a descriptor generation proportion (52.3%) similar to BRIEF's (51.5%).
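Such per-task timings can be reproduced outside the full SLAM pipeline with a small benchmark (a sketch; brute-force Hamming matching stands in for ORB-SLAM's internal matching, so absolute numbers will differ):

```python
import time
import cv2

def bench(extractor, detector, img1, img2):
    """Time descriptor generation vs. brute-force Hamming matching
    for a pair of grayscale frames."""
    kps1, kps2 = detector.detect(img1), detector.detect(img2)

    t0 = time.perf_counter()
    _, d1 = extractor.compute(img1, kps1)
    _, d2 = extractor.compute(img2, kps2)
    t_generate = time.perf_counter() - t0

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    t0 = time.perf_counter()
    matches = matcher.match(d1, d2)
    t_match = time.perf_counter() - t0
    return t_generate, t_match, len(matches)
```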

Fig. 2. Descriptor-related time performance for the different video sequences (in seconds).

Fig. 3. Total and state-wise execution time for the different video sequences (in seconds).

As mentioned in the previous section, we collected the execution time for the whole process as well as for each system state. As shown in Fig. 3, all conditions perform well on fr2/xyz, with most time spent on tracking (from 98.8% to 99.3%), while on the other sequences all conditions take more time to initialize, among which ORB suffers the steepest increase (up to 12.6%).

Both the zigzag and octagon sequences include yaw rotations, and relocalization occurred under all conditions on these two sequences. In the zigzag sequence, initialization time remains acceptable for LDB and BRIEF (6.9% and 7.4%), whereas for ORB it increases sharply (41.3%). LDB spends the most time on tracking (85.7%) and the least on relocalization (7.4%); BRIEF (52.9%) has tracking time similar to ORB (44.2%) but spends much more time on relocalization (39.7%). The octagon sequence contains multiple in-place rotations with relatively short transitions, and as a consequence all conditions perform poorly. The best condition in this case, BRIEF, manages to track for about half of the total time (54.8%), while the others have to relocalize from time to time.

Summing over all video sequences, the ORB condition spends more time on initialization than LDB and BRIEF, while the BRIEF condition outperforms the others in tracking, relocalization and total time by a slight margin.

Table 1. Average number of matched keypoints per frame and J3 score for each condition.

4.2 Matching Accuracy

Table 1 shows the average number of matched keypoints per frame and the J3 score over the whole sequences. For the number of matched keypoints, LDB (mean = 235) performs similarly to BRIEF (mean = 250), while ORB (mean = 199) scores much lower than both. Regarding the J3 score, computed from frames where all three conditions were tracking, LDB has the highest score on all sequences (mean = 94.30), and the mean score of ORB (mean = 59.16) is far lower than the other two.

4.3 Tracking Accuracy

We used ATE and RPE to compute the trajectory error on fr1/xyz and fr2/xyz. As shown in Table 2 and Fig. 4, all conditions have similar ATE (0.3625 \(\sim \) 0.3666 m) on fr2/xyz; however, BRIEF achieves a much smaller error than the other two on fr1/xyz, at 0.057 m. For RPE, we report translational and rotational error separately. As with ATE, all conditions are close on fr2/xyz, but BRIEF still outperforms the others on fr1/xyz.

Table 2. Tracking accuracy for each condition (in meters and degrees).

Fig. 4. Measurement of tracking accuracy for each video sequence.

5 Discussion

From the above results, we can see that descriptor generation remains the most time-consuming task in a local-feature-based SLAM system, taking more than half of the total running time. BRIEF provides faster binary code generation, which leaves more time for tracking and yields a lower total time than the other two descriptors.

The ORB descriptor is in effect rotated-BRIEF, adding rotation invariance to BRIEF. According to our tests, however, this augmentation greatly reduces the number of matched keypoints per frame, which hurts not only time efficiency but also matching ability (as more matched keypoints lead to better tracking results). The reduction is mainly due to the additional angular constraints during keypoint matching. Since rotation invariance is not required in level ground navigation, the ORB descriptor is not recommended for this type of application.

Across all the tests, we find that the performance of the different descriptors tends to diverge as camera motion becomes more complicated (from line to octagon), and stays at the same level under very smooth and slow motion (e.g. in fr2/xyz). Overall, BRIEF retains robust tracking performance in difficult situations, although more sequences should be included to further confirm this observation on trajectory estimation quality.

In fact, when running pilot tests of ORB-SLAM on our robotic walker, we found that the system struggled to initialize in indoor environments surrounded by white walls: the keypoints the system could extract at runtime were too few to support functional tracking, and we had to paste texture-rich pictures on the walls to facilitate keypoint extraction. In contrast, if the tests were run outdoors, the number of keypoints should no longer be a problem; in that case LDB would be an appropriate choice, since it has the highest J3 score among the tested descriptors.

6 Conclusion

In this work, we conducted an experiment to test the influence of different lightweight local descriptors on the performance of a monocular SLAM system, aiming to find the best choice among LDB, BRIEF and ORB for level ground indoor navigation. The results indicate that BRIEF outperforms the others in terms of both time and trajectory accuracy, though it provides slightly lower matching quality than LDB. In conclusion, BRIEF is the preferred component for monocular SLAM systems designed for indoor level ground navigation.

In the future, with advances from the computer vision community, more lightweight descriptors can be included in the comparison, and the impact of keypoint extraction methods can be evaluated as well. Since monocular SLAM systems are sometimes difficult to initialize, stereo, RGB-D and even inertial sensors could be considered to further improve the usability of SLAM systems on robotic walkers.