1 Introduction

In vision-based robot navigation, Simultaneous Localization And Mapping (SLAM) is a central task. The entire SLAM process relies on recognizing places the robot has already visited in order to achieve visual loop closure detection. The two major steps are representing the frames by visual descriptors and then judging the similarity between frames based on those descriptors. In this work, place recognition refers to recognising whether a place has been visited previously or not. Researchers have followed various approaches, and some of the major ones are as follows.

1.1 BoW Based Approaches

The BoW (Bag-of-Words) approach was first successfully applied to image classification and retrieval [19]. Here, a fixed-size vocabulary is used as a vector quantizer to classify the descriptors in an image frame. The vectors consist of image patches that act as features and are generally chosen randomly from patches with a textured neighbourhood. The FABMAP model [4] considers a sequence of non-overlapping frames and checks whether each frame belongs to an already visited place. It suffers from perceptual aliasing. To deal with this issue, methods like SeqSLAM [15] perform correlation-based matching on short sequences of images instead of depending directly on individual frames. Voting-based methods [9, 13, 20] perform a nearest neighbour search in the image descriptor space to identify potential matches, which is quite similar to the original bag-of-words approach. Sometimes image descriptors like SIFT [12], BRISK [11] or FREAK [1] are also used to form the descriptor vector. For fast and accurate nearest neighbour search in loop closure detection, it is essential to reduce the feature dimensions, and several methods [2, 13] have been presented in this direction. By means of majority voting, similar images are identified, and loop closure is detected by thresholding the similarity value.

1.2 Deep Learning Based Approaches

Convolutional neural network (CNN) based approaches have also been developed for loop closure detection. Chen et al. [3] used the Overfeat network [17] to extract features from the image frames. Using a sequence of convolution and pooling operations, it is possible to obtain dense representations of the images and to search in the resulting low-dimensional vector space. However, in this approach the network was pre-trained on the ImageNet dataset [5], so it is optimized mainly for object recognition and is not oriented towards place recognition as required for loop closure detection. Denoising autoencoders (DA) have also been used for localization tasks [18]. That approach uses a denoising autoencoder with fully connected layers to extract features for comparing the structural similarity of two images.

Designing descriptors suitable for loop closure detection is quite challenging, as the choice depends on the scenes and conditions under consideration. This has motivated us to rely on deep learning, which can extract features automatically that can then be used for place recognition. The paper is organized as follows. This brief introduction and survey is followed by the proposed methodology in Sect. 2. Experimental results and the conclusion are presented in Sects. 3 and 4 respectively.

2 Proposed Methodology

In this work we propose an autoencoder-based deep learning network that extracts a lower-dimensional vector representation of an image. With an autoencoder trained to encode and decode an image, the task of loop detection reduces to finding the distance between the encoded vectors of the query image and the input image. Whenever the distance falls below a certain threshold, a loop closure can be reported. The value of the threshold can be either learned or tuned based on previous experience of how much the environment is expected to change. The reconstruction process uses a switch matrix, which holds the position of the pixel selected in each pooling layer of the encoder so that the value can be mapped back to the correct position during decoding. The methodology is detailed in the subsequent subsections.
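
The decision rule described above can be illustrated with a minimal sketch. It assumes the Euclidean distance (the distance reported in Sect. 3) and a plain linear scan over previously stored vectors; the function name `find_loop_closure` and the toy 3-dimensional vectors are hypothetical and stand in for the 200-dimensional encodings.

```python
import math


def find_loop_closure(query_vec, stored_vecs, threshold):
    """Return the index of the closest stored vector if it lies within
    `threshold` of the query, otherwise None."""
    best_index, best_dist = None, math.inf
    for i, vec in enumerate(stored_vecs):
        d = math.dist(query_vec, vec)  # Euclidean distance (assumed metric)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index if best_dist < threshold else None


# Toy 3-dimensional vectors stand in for the 200-dimensional encodings.
stored = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
revisit = find_loop_closure([9.0, 0.0, 0.0], stored, threshold=5.0)
new_place = find_loop_closure([30.0, 30.0, 30.0], stored, threshold=5.0)
```

Here `revisit` points at the stored frame nearest to the query, while `new_place` is `None` because no stored vector lies within the threshold.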

Some of the important aspects of our research are:

  • Our 12-layer architecture with LCA layers reduces an input image of \(96\times 336\), i.e. 32256 pixels, to a feature vector of only 200 dimensions.

  • The quality of reconstruction from the decoder part of the network ensures that the 200 features extracted by our method capture important structural properties of the image.

  • While detecting loop closures, the objects of a previously visited place have often shifted by a few metres (or pixels) by the time we arrive back at that place. The camera poses at the two time instants are also likely to differ. Hence, it is important that the features extracted from the image are translationally invariant to some extent, a property provided by the pooling layers of the encoder network.

  • In contrast to traditional approaches that compute expensive features from an image, our deep features are generated by a series of dot products, non-linearities and pooling operations, which boosts real-time performance significantly.

2.1 Architecture

At the heart of the proposed architecture lies a deconvolution net. It is modified by adding a layer of locally connected autoencoders that maps an image frame into a representation vector of n dimensions. The higher the value of n, the greater the capability of the vector to encode unique macro-level features of the scene in each of its elements. The optimal size of the vector is a subject for further research. The value of n is empirically chosen as 200 in this work. We discuss the architecture in the following two subsections.

Deconvolution Net: Deep autoencoders were initially studied by Hinton et al. [8] for reducing the dimensionality of raw input data with neural networks. This approach was later extended to image [10] and document retrieval [7] tasks. When working with images, however, fully connected autoencoders ignore the local 2D image structure and hence suffer from redundancy in the learned parameters. The visual field of each feature is made to span the entire input, which destroys local structural information. Enforcing local connectivity and weight sharing [14, 21] not only scales well to realistic image sizes, but also removes redundancies in the input so that discriminative representations can be modelled.

The proposed architecture is essentially a 12-layer deep deconvolution net in which only the middle layer is a layer of locally connected autoencoders. The first six layers perform encoding, and the last six layers, which structurally mirror the encoding network, reconstruct the input. The features of the 6\(^{\text {th}}\) layer (the layer of locally connected autoencoders) are used as representations of the image frames. Here, the stride of a convolution or pooling layer defines the number of pixels by which the kernel shifts. The pad defines the number of extra zero-valued pixels added on the boundary of the input before convolution or pooling. In our case we have chosen a zero-pad of 1, a convolution stride of 1 and kernels of dimensions \(3\times 3\), similar to the architecture of Noh et al. [16].
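
As a quick check of these settings, the standard output-size relation for a convolution can be applied. The symbols \(W\) (input width), \(k\) (kernel size), \(p\) (zero-pad on each side) and \(s\) (stride) are introduced here for illustration and assume symmetric padding:

\[
W_{\text{out}} = \left\lfloor \frac{W + 2p - k}{s} \right\rfloor + 1
= \left\lfloor \frac{W + 2\cdot 1 - 3}{1} \right\rfloor + 1 = W .
\]

Under this assumption, each \(3\times 3\) convolution with pad 1 and stride 1 preserves the spatial size of its input (for example, \(96\times 336\) stays \(96\times 336\)), so any reduction in resolution comes from the pooling layers.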

As deep learning involves a huge amount of matrix computation, pooling is used to speed up the process without significant loss of accuracy. The input to the next convolution layer is reduced in size by selecting one pixel value from each patch of the output of the previous convolution layer. Max-pooling selects the maximum pixel value within the patch. Table 1 presents the complete architecture in tabular form.
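
Max-pooling, together with the switch positions that Sect. 2 mentions for decoding, can be sketched as follows. This is an illustrative pure-Python version, not the implementation used in the experiments. The \(2\times 2\) non-overlapping patch size is an assumption, and the function names `max_pool_with_switches` and `unpool` are hypothetical.

```python
def max_pool_with_switches(image, size=2):
    """Non-overlapping max-pooling over size x size patches (size is assumed).

    Returns the pooled map and a 'switch' map holding, for each pooled
    value, the (row, col) position of the maximum in the input.
    """
    rows, cols = len(image), len(image[0])
    pooled, switches = [], []
    for r in range(0, rows - size + 1, size):
        pooled_row, switch_row = [], []
        for c in range(0, cols - size + 1, size):
            # Scan the patch and remember where the maximum sits.
            best_pos = (r, c)
            for dr in range(size):
                for dc in range(size):
                    if image[r + dr][c + dc] > image[best_pos[0]][best_pos[1]]:
                        best_pos = (r + dr, c + dc)
            pooled_row.append(image[best_pos[0]][best_pos[1]])
            switch_row.append(best_pos)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, rows, cols):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    restored = [[0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            restored[r][c] = value
    return restored


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, switches = max_pool_with_switches(image)
restored = unpool(pooled, switches, 4, 4)
```

The unpooled map is sparse: each maximum returns to the position it came from, and the deconvolution layers that follow are what fill in the remaining values.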

Table 1. Description of the proposed network: The first half consists of convolution (conv) and pool layers, followed by encoding and decoding (LCA) and finally a number of deconvolution (deconv) and unpooling layers.

Locally Connected Autoencoders: The feature maps at the output of the 6\(^{\text {th}}\) layer are passed through a layer of locally connected autoencoders (LCA) to learn a still lower-dimensional representation. An LCA is a fully connected 2-layer feedforward neural network in which the number of input neurons equals the number of output neurons and the number of hidden neurons equals the dimension of the autoencoder, which in this case is 40. Each of the 5 feature maps is passed through an autoencoder and projected into a representation vector of 40 dimensions. These representations are stacked on top of one another to form the 200-dimensional representation of the image frame. Using local connections instead of a fully connected layer not only helps capture distinguishing features from each feature map separately but also reduces the number of parameters to be learned.
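
The encoding half of this layer can be sketched as below. The numbers of maps and hidden units come from the text. The flattened feature-map size `MAP_SIZE`, the tanh non-linearity, the random untrained weights and the use of a separate autoencoder per feature map are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

NUM_MAPS = 5    # feature maps entering the LCA layer (from the text)
CODE_DIM = 40   # hidden units per autoencoder (from the text)
MAP_SIZE = 60   # flattened size of one feature map: hypothetical value


def make_encoder(in_dim, code_dim, rng):
    """Random encoder weights and zero biases for one (untrained) autoencoder."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(code_dim)]
    biases = [0.0] * code_dim
    return weights, biases


def encode(feature_map, encoder):
    """Hidden activation of one autoencoder: tanh(W x + b) (assumed form)."""
    weights, biases = encoder
    return [math.tanh(sum(w * x for w, x in zip(row, feature_map)) + b)
            for row, b in zip(weights, biases)]


def lca_representation(feature_maps, encoders):
    """Encode each map with its own autoencoder and concatenate the codes."""
    code = []
    for fmap, enc in zip(feature_maps, encoders):
        code.extend(encode(fmap, enc))
    return code


rng = random.Random(0)
encoders = [make_encoder(MAP_SIZE, CODE_DIM, rng) for _ in range(NUM_MAPS)]
feature_maps = [[rng.random() for _ in range(MAP_SIZE)] for _ in range(NUM_MAPS)]
representation = lca_representation(feature_maps, encoders)

# Encoder parameter count versus one fully connected layer that maps
# all maps jointly to the same number of outputs.
local_params = NUM_MAPS * (MAP_SIZE * CODE_DIM + CODE_DIM)
full_params = (NUM_MAPS * MAP_SIZE) * (NUM_MAPS * CODE_DIM) + NUM_MAPS * CODE_DIM
```

With these assumed sizes the stacked code has \(5 \times 40 = 200\) entries, and the locally connected encoders need roughly one fifth of the parameters of the fully connected alternative.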

The proposed approach is to some extent similar to the approach presented in [18]. However, instead of a fully connected autoencoder, the proposed architecture uses a deconvolution net for the following reasons: (a) weight sharing (as done in convolution layers) extracts more meaningful and significant features when dealing with raw image data; (b) the features extracted by the proposed model are six layers deep and hence more abstract than the features extracted by their 1-layer deep denoising autoencoder; (c) more importantly, pooling operations in the encoder of the deconvolution net introduce some translational invariance in the extracted features, which is an essential characteristic for detecting loop closures.

2.2 Training Methodology

Training follows the commonly used two-stage process [8], namely greedy layer-wise unsupervised pretraining followed by global fine-tuning. Because jointly training a deep neural network with respect to a global objective is difficult, in the first stage each layer is pretrained in an unsupervised fashion by taking the output of the previous layer and producing a new representation as output. This phase is called layer-wise because only the parameters of one layer are updated at a time while the others are kept fixed. Normally the fine-tuning phase is supervised. However, it has been shown [8] that for autoencoder networks, test accuracy improves significantly when the fine-tuning phase is also unsupervised. Based on that observation, we have also adopted unsupervised fine-tuning. Once the full autoencoder network is trained with respect to a global objective, the output of the LCA (locally connected autoencoder) layer is used as the learnt representation of the images in the dataset.

Fig. 1. Confusion matrices for sequences 5, 6, 7 and 8 of the KITTI dataset. The darker the value, the more similar the images.

3 Results and Analysis

To carry out the experiments we worked with the KITTI Odometry dataset [6]. For training, we used sequences 0–4 with dataset augmentation (approximately 100,000 images). Sequences 9 and 10 were used for validation, and sequences 5–8 for testing. Training took approximately 3.5 days for pretraining and 1.5 days for fine-tuning on an NVIDIA Quadro M5000 GPU with 8 GB VRAM. At test time, the proposed method can process approximately 150 frames per second. On a machine with a 1.4 GHz Intel i5 CPU and 4 GB RAM, FABMAP has a maximum speed of 40 FPS whereas our method operates at 110 FPS, the image dimension being \(336 \times 96\).

For each of the test sequences 5, 6, 7 and 8, the generated confusion matrices are shown in Fig. 1 on a scale of 0 to 255. Each matrix shows the Euclidean distance between the learnt representation vectors of the images in the sequence. Along the diagonal the distance is zero, as it is the distance of an image to itself. Loop closure is detected by applying a threshold on the distances. The threshold should be low enough to avoid false detections, but it has to be chosen keeping in mind that when the robot revisits a place there may be changes in illumination or angle of view, and dynamic objects may have shifted. In our experiment, it is empirically taken as 5. Image vectors with a distance less than the threshold qualify for a loop closure. In the KITTI dataset, sequence 5 contains loop closure. Hence we have tested on it and compared the outcome with OpenFabmap [4] and OpenSeqSlam [15]. The loop closure detection matrices are shown in Fig. 2, where white denotes loop closure. Fig. 2 shows that, like the other methods, the proposed methodology detects the closures successfully. OpenSeqSlam suffers from over-detection, and significant misses are present in both OpenSeqSlam and OpenFabmap. In contrast, both misses and false detections are fewer for the proposed methodology. The comparison was done by thresholding and then comparing the result with the ground truth confusion matrix on a pixel-to-pixel basis.
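
The evaluation steps described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the evaluation code behind Figs. 1 and 2: the threshold is applied to the raw Euclidean distances, the 0 to 255 scaling is a linear rescaling used only for display, and the function names, toy vectors and toy ground truth are hypothetical.

```python
import math


def distance_matrix(vectors):
    """Pairwise Euclidean distances between representation vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def scale_to_255(matrix):
    """Linearly rescale distances to 0..255 for display (assumed scaling)."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [[0 for _ in row] for row in matrix]
    return [[round(255 * d / largest) for d in row] for row in matrix]


def detect(matrix, threshold):
    """Binary detection matrix: 1 where the distance is below the threshold."""
    return [[1 if d < threshold else 0 for d in row] for row in matrix]


def compare(detected, truth):
    """Entry-by-entry comparison with a ground-truth matrix."""
    misses = false_detections = 0
    for det_row, truth_row in zip(detected, truth):
        for det, gt in zip(det_row, truth_row):
            if gt and not det:
                misses += 1
            elif det and not gt:
                false_detections += 1
    return misses, false_detections


# Toy 2-dimensional vectors: frames 0 and 3 view the same place.
vectors = [[0.0, 0.0], [20.0, 0.0], [40.0, 0.0], [1.0, 2.0]]
distances = distance_matrix(vectors)
image = scale_to_255(distances)
detected = detect(distances, threshold=5.0)
truth = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]]
misses, false_detections = compare(detected, truth)
```

In this toy case the off-diagonal detections at entries (0, 3) and (3, 0) mark the revisit, and the comparison reports no misses and no false detections.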

Fig. 2. Loop closure detection: ground truth matrix (top left) and matrices for the proposed methodology (top right), OpenSeqSlam [15] (bottom left) and OpenFabmap [4] (bottom right).

4 Conclusion

In this work we have proposed a deep learning autoencoder network that represents an image with a significantly lower dimension while preserving much of its contextual and spatial information. As a result, such a representation is useful for applications like loop closure detection in SLAM. In our approach, we tried to combine the best of both deep learning approaches (weight sharing in CNNs and unsupervised feature learning in DAs) in a deconvolution net. The advantage of this approach is that the vectors generated for two frames of the same scene, which differ geometrically but are similar in context and content, are quite close to each other. Thus the approach also works for general place recognition tasks and holds promise for extension to context- and content-based image matching problems.