1 Introduction

This paper presents a method for segmenting three-dimensional (3D) point clouds representing unstructured and natural terrains. The approach is applied to underwater bathymetric point cloud data collected using an autonomous underwater vehicle (AUV). The segmentation is tested on point clouds acquired by two different sensors: a downward facing stereo camera pair and a structured light laser system (Roman et al. 2010). The latter provides range/angle 1D scans accumulated into point clouds as the platform moves. Examples of such 3D scans are shown in Fig. 1.

Fig. 1

Example point cloud scans which were manually labelled for the evaluation (in this figure the colors are mapped to height). a A shipwreck site with oblong amphoras. b Another shipwreck site with cylindrical pipe segments. a, b Are captured using the structured light system. c A dense stereo point cloud example; this point cloud can also be seen with additional texture in Fig. 6b. The data sets are further detailed in Sect. 4.1 (Color figure online)

As underwater platforms are increasingly being used for high-resolution mapping tasks in scientific, archaeological, industrial and defense applications, accurate 3D point cloud segmentation is critical to high resolution map building, to classification tasks involving natural scenes (e.g. algae, corals, sand, canyons) and man-made objects (e.g. shipwrecks, pipes), and to navigation. The footprint and effective range of underwater sensors, however, are very limited relative to aerial and terrestrial equivalents. In many cases the survey areas also do not satisfy the assumptions typically made for terrestrial and aerial robots, such as the presence, type or density of features in the observed 3D data. These factors impede the precision of segmentation algorithms developed for structured environments using high resolution scanning sensors.

The core of the proposed segmentation algorithm lies in a novel ground extraction process. The 3D point cloud of the terrain is converted to a two-dimensional (2D) discrete signal and then transformed to the frequency domain using the discrete Fourier transform (DFT) for analysis. In the frequency domain, the lower frequency components are assumed to represent the slower varying undulations of the underlying ground, while the smaller objects lying on top of the ground are represented by higher frequencies. The extraction of the ground data can then be accomplished similarly to denoising/low pass filtering of the signal. The cut-off frequency, below which the frequency components associated with the ground are selected, is automatically determined using peak detection on the frequency signal. The intuition that a given frequency directly maps to a feature size in the spatial domain is exploited for automatic segmentation. A user can also specify a maximum admissible object size to drive the automatic detection of the cut-off frequency. The segmentation process is fast since the bulk of the calculation is carried out using Fourier transforms and, furthermore, it does not require any training. The output of the process is an estimated separation layer or ground surface. Object segments are formed from the non-ground points above this surface by applying a standard voxel based clustering (i.e. proximity based clustering). The extracted segments can be used in downstream processing modules, for instance as features for classification or for segment based map alignment.

The proposed approach was initially introduced by Douillard et al. (2012a). This publication adds the following elements. The approach is compared against vision based segmentation techniques such as Graph Cuts by interpreting the point cloud as a depth image. It is also applied to point clouds generated from dense stereo, in addition to the original tests on the structured light point clouds. The method is first evaluated in terms of segmentation performance based on hand-labelled point clouds. It is then evaluated in the context of an alignment pipeline. In both cases, the comparisons are favorable to the proposed approach.

The remainder of the paper is organized as follows. In Sect. 2 we review the related literature before introducing our segmentation approach in Sect. 3. We present the experimental setup, the data sets used and the benchmark segmentation methods in Sect. 4. Results are presented and discussed in Sect. 5. Conclusions are proposed in Sect. 6.

2 Related work

Segmentation has been studied for several decades and in particular in computer vision, where it is often formulated as graph clustering, such as in Graph Cuts (Boykov and Funka-Lea 2006). Graph Cuts segmentation has been extended to 3D point clouds by Golovinskiy and Funkhouser (2009) using \(k\)-nearest neighbors (kNN) to build a 3D graph with edge weights assigned according to an exponential decay in length. The method requires prior knowledge on the location of the objects to be segmented.

In ground robot applications, the segmentation algorithm of Felzenszwalb (2004) (FH) for natural images has gained popularity due to its efficiency (Schoenberg et al. 2010; Strom et al. 2010; Triebel et al. 2010; Zhu et al. 2010). Zhu et al. (2010) build a 3D graph with kNN while assuming the ground to be flat for removal during preprocessing. 3D partitioning is obtained with the FH algorithm, which results in under-segmentation. This is corrected via a posterior classification that includes the class “under-segmented”. Triebel et al. (2010) explore unsupervised probabilistic segmentation in which the FH algorithm is modified for range images and provides an over-segmentation during preprocessing. Segmentation is cast into a probabilistic inference process both in the range image and in feature space using conditional random fields (CRF). Their evaluation does not involve ground truth data. Schoenberg et al. (2010) and Strom et al. (2010) have applied the FH algorithm to colored 3D data obtained from a co-registered camera/laser pair. The weights on the image graph are computed based on a number of features, including point distances, pixel intensity differences and angles between surface normals estimated at each 3D point. The FH algorithm is then run on a graph representing either the range image or the color image. In practice the evaluation is only done on road segments, or visually. In contrast, the evaluation presented here is performed on scans with point-wise labels assigned manually. All of the mentioned algorithms reason locally and attempt to identify boundaries between objects in the scene, whereas the approach proposed here reasons globally across the whole depth image by analyzing the overall frequency content to find underlying larger scale patterns.

A maximum spanning tree approach to the segmentation of 3D point clouds is proposed by Pauling et al. (2009). Each node in the graph is a 3D Gaussian ellipsoid; ellipsoids are thus used as geometric primitives. The merging of ellipsoids during the growing of the tree is based on two novel distance metrics, each producing a different segmentation “style”. The resulting segmentation is similar to a super voxel type of partitioning (by analogy to super pixels in computer vision), with voxels of ellipsoidal shapes and various sizes. This approach does not explicitly identify a ground surface, which, in the approach proposed here, facilitates segment matching for sub-map alignment.

Three segmentation techniques for 3D scans acquired in urban environments are proposed by Douillard et al. (2010). This suite of algorithms is able to process point clouds of different densities, but the algorithms all assume the platform to be located on a drivable surface so that ground seeds are available. Such seeds are not readily available in the underwater scans considered here, as the vehicle is floating above the seafloor (as can be inferred from Fig. 1 for instance).

The graphics community has been investigating the problem of segmenting 3D point clouds and concurrently labelling the extracted sub-parts. A benchmark approach for 3D mesh segmentation was proposed by Chen et al. (2009). Building on this benchmark, Kalogerakis et al. (2010) proposed a CRF based technique for segmentation and labelling. The approach is shown to accurately classify object sub-parts, where sub-parts are assumed to belong to a pre-defined set of labels (e.g. arm, leg, torso, head). As pointed out by the authors, a potential limitation of the approach is its requirement for point-wise labelling of the 3D meshes used for training. It is shown, however, that a small number of labelled meshes suffice to obtain high accuracy. The segmentation approach proposed in this paper does not provide sub-part classification, but it does not require learning and expects only one parameter to be specified. In addition, it is fast, since it is based on Fourier transforms.

Segmentation of underwater terrain using acoustic and range data typically has focused on larger scale, lower resolution applications such as generating habitat maps of the seafloor based on multi-beam sonar data (Brown et al. 2011). The objective in these cases is to group areas of the seafloor with similar textures (derived from bathymetry or backscatter) as representing a common habitat, each segment representing hundreds or thousands of square meters of seafloor. Another use for segmentation is in acoustic side-scan sonar imagery as part of mine-detection systems (Reed et al. 2006), where segmentation is used to separate the returns into ground, target and shadow. While the focus in these cases is the detection of particular man-made structures, side-scan data usually has swaths of tens to hundreds of meters in width and targets appear as clusters of relatively few pixels. Segmentation has also been employed to aid object detection and tracking for obstacle avoidance (Petillot et al. 2001). In this case the segmentation is usually applied at the individual sensor readings as a way of dealing with noise or reducing computational demands further along the processing pipeline, rather than considering 3D surfaces observed through multiple scans. To our knowledge this work is the first to address segmentation of ground and non-ground objects in short-range, high resolution 3D point clouds from underwater scenes.

The contribution of this work lies in the definition of a novel ground extraction method for unstructured terrain. The approach analyses the frequency content of the terrain and only requires the user to set one parameter to specify the maximum size of objects beyond which objects are identified as part of the underlying ground (or background signal). This approach is introduced in detail next.

3 Ground-object segmentation

This section describes the core contribution of this article: a mechanism to segment the ground and objects from 3D point cloud data of unstructured terrains without ground seeds. The proposed segmentation approach involves three main steps. (1) Pre-processing of the point cloud to form a 2D elevation map aligned with the DC component (the zero frequency term) of the elevation signal; this is described in Sect. 3.1. (2) The resulting elevation map is transformed into the frequency domain via the DFT. Filtering operations are subsequently applied to extract an underlying ground surface. These operations are described in Sect. 3.2. (3) Finally, ground and object points are separated following the process described in Sect. 3.3.

3.1 Generation of elevation maps

The aim of the preprocessing steps presented next is to generate a discrete 2D signal on a regular grid whose DC component is aligned with the horizontal plane. The obtained 2D discrete signal can then be transformed to the frequency domain with the fast Fourier transform (FFT). Note that the requirement of a regular grid is core to the FFT (Kunis 2006). This has implications on the type of point clouds that can be processed with the proposed approach. Alternatively, non-uniform DFTs or polar FFTs (Kunis 2006) can be employed to overcome this limitation. This is left as future work.

To obtain a 2D discrete signal from the input point cloud, a reference frame local to the point cloud is first defined as follows. Its origin is set at the centre of mass of the point cloud. The three eigenvectors of the point cloud are computed and used to define the axes of the reference frame. Each point cloud comes in the format of a set of raster scans, with a known raster direction and raster normal. These allow the identification of the two eigenvectors corresponding to the x and y axes, respectively. The z axis is defined by the remaining eigenvector.
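The frame construction above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the use of the covariance eigendecomposition, and the rule for disambiguating the axes with the raster direction are assumptions consistent with the text.

```python
import numpy as np

def local_frame(points, raster_dir):
    """Build a sub-map reference frame from the point cloud's principal axes.

    points     : (N, 3) array of 3D points
    raster_dir : approximate raster direction used to disambiguate the axes
    """
    centroid = points.mean(axis=0)                  # frame origin: centre of mass
    cov = np.cov((points - centroid).T)
    eigval, eigvec = np.linalg.eigh(cov)            # columns are eigenvectors
    # The eigenvector most aligned with the raster direction becomes x
    x_idx = np.argmax(np.abs(eigvec.T @ raster_dir))
    x_axis = eigvec[:, x_idx]
    # z is the remaining eigenvector with the smallest variance
    rest = [i for i in range(3) if i != x_idx]
    z_idx = rest[np.argmin(eigval[rest])]
    z_axis = eigvec[:, z_idx]
    y_axis = np.cross(z_axis, x_axis)               # completes a right-handed frame
    R = np.stack([x_axis, y_axis, z_axis])          # rows: local frame axes
    return centroid, R

# Points are expressed in the local frame via: local = (points - centroid) @ R.T
```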

In this reference frame, an axis aligned elevation grid is populated by voxelizing the point cloud. The maximum height in each column of the voxel grid is kept as the elevation. A resolution of \(2\, {\mathrm{cm}}\) is used in all the experiments reported here. Some of the cells in the elevation grid may not contain any points. The height in these cells is estimated via nearest neighbor interpolation. Interpolation in the data sets processed here is not likely to modify the frequency content of the original elevation signal since empty cells represent at most 10 % of the elevation grid. However, this is not true in sparser scans, such as side-looking Velodyne scans (Douillard et al. 2012b), where the polar pattern of the scans induces a significant number of empty cells in a regular grid. Non-trivial interpolation mechanisms would be required to estimate the height in these cells. Alternatively, other frequency transformation approaches, as explained above, can be employed.
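A minimal sketch of the elevation map generation, assuming the points are already expressed in the local frame. The nearest-neighbour fill via a k-d tree is an implementation choice for illustration, not necessarily the authors':

```python
import numpy as np
from scipy.spatial import cKDTree

def elevation_grid(local_pts, res=0.02):
    """Rasterise a point cloud into an elevation grid (max height per cell),
    filling empty cells by nearest-neighbour interpolation.

    local_pts : (N, 3) points in the sub-map frame; res : cell size in metres.
    """
    ij = np.floor((local_pts[:, :2] - local_pts[:, :2].min(axis=0)) / res).astype(int)
    h, w = ij.max(axis=0) + 1
    grid = np.full((h, w), -np.inf)
    for (i, j), z in zip(ij, local_pts[:, 2]):
        if z > grid[i, j]:
            grid[i, j] = z                          # keep the max height per cell
    filled = np.isfinite(grid)
    empty = ~filled
    if empty.any():                                 # nearest-neighbour fill
        tree = cKDTree(np.argwhere(filled))
        _, nn = tree.query(np.argwhere(empty))
        grid[empty] = grid[filled][nn]
    return grid
```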

3.2 Ground surface extraction

The formulation of the ground extraction process presented in this section is based on the observation that the ground surface in a scene can be interpreted as the underlying lower frequency undulations on which smaller objects with higher frequency components sit. This “ground” surface may not physically exist in all natural scenes. For instance, larger objects such as rocks often merge with their support at their boundary to form continuous surfaces. However, the estimation of a separation layer—which we refer to as the ground surface—defined as the lower frequency content of the elevation signal (Fig. 2a, b) allows us to identify distinct segments above the “ground” (Fig. 2c, d). These segments can then be used for further analysis (e.g. classification) or as landmarks for navigation (see Sect. 5.2). Intuitively, this segmentation process also relates to a saliency detection mechanism and thus to information theory; however, the definition of the theoretical relationships between the two is beyond the scope of this paper. In the remainder of this paper, the word “ground” will be used to refer to both the separation layer estimated with the mechanisms described below and the points below this layer.

Fig. 2

Ground object segmentation. a, b Examples of ground surfaces extracted with the proposed segmentation method (colors are mapped to height). The two sub-maps used in this example have an extent of about \([4\,\, {\mathrm{m}} \times 4\,\, {\mathrm{m}}]\). c, d Resulting segmentation. Grey indicates ground points and blue shades object points. The estimated ground surfaces a, b are more sparse than the point clouds c, d as they are estimated from the elevation grid. In both cases most of the parts of the terrain belonging to the underlying seafloor are correctly identified. This is further quantified in Sect. 5.1. e, f Segmentation of objects above the ground, where different segments are indicated by different colours, and the ground is again indicated in grey (Color figure online)

3.2.1 Frequency domain filtering

The aim of performing filtering in the frequency domain is to identify those lower frequency components that form the ground surface. We employ 2D DFT and inverse DFT to transform the elevation map to and from the frequency domain. An example of frequency response given by an elevation map is shown in Fig. 4a. In this section we assume that the cut-off frequency (i.e. the frequency beyond which the frequency components are considered to be objects) is given. The next section explains how this cut-off frequency is automatically determined.

Applying a hard threshold in the frequency domain creates the well known ringing effect in the reconstructed spatial signal (Proakis and Manolakis 1996). To avoid this ringing effect a low-pass filter with smooth cut-off can be used. We achieve this by constructing a linear phase frequency domain filter with Butterworth-like magnitude response. Another advantage of this filter is that the magnitude in the passband region is almost constant (Proakis and Manolakis 1996), hence the ratios of the frequencies below the cut-off are maintained. The equation of this Butterworth-like filter is:

$$\begin{aligned} \begin{array}{l c r} |T(d)| = \frac{1}{\sqrt{1 + \epsilon ^{2} \cdot \left( \frac{d}{d_{c}} \right) ^{2 \cdot n}}}&\text { , }&\qquad n \ge 1 \end{array} \end{aligned}$$
(1)

where \(T\) is the filter response, \(\epsilon = \sqrt{ 10^{A/10} - 1 }\) is the band edge value with \(A\) being the passband variation. Here, \(d\) is the distance from DC, \(d_{c}\) is the cut-off point and \(n\) is the filter order. The filter order defines how sharp the damping is; \(n=2\) is used in our implementation. Fig. 3 shows a 3D example of a Butterworth-like filter response. Notice the flatness at DC and the smoothness at the cut-off, the properties we are interested in.

Fig. 3

3D Butterworth-like filter response for a \(2{\text {nd}}\) order filter with a cut-off at \(0.5\cdot \omega _{N}\). The location of the cut-off is shown with a dashed magenta line

The 2D frequency response of an image has the same dimension as the image itself. The frequencies evaluated by the DFT along one dimension of the image are \([0,1,2,\dots ,\frac{N}{2}]\frac{1}{N}\), where \(N\) is the number of pixels along that dimension. In the case of non-square images (such as the elevation grids processed here) the range of frequencies evaluated along each dimension is different as it depends on \(N\). Hence, the cut-off frequency is here encoded as a ratio (a unitless number in \([0, 1]\)) of the range of frequencies. This implies that consistent filtering of objects, independent of their orientation and location in the image, can be obtained in non-square images by modulating the cut-off frequency by an elliptical gain. The extent of the major and minor axes of the ellipse is given by: \(a = f_{c} \cdot \frac{w}{2}\) and \(b = f_{c} \cdot \frac{h}{2}\), where \(f_{c} \in [0, 1]\) is the cut-off frequency ratio and \(w\)/\(h\) the image width/height. The 2D filter with elliptical weighting is then obtained by replacing the ratio \(\frac{d}{d_{c}}\) in (1) by:

$$\begin{aligned} \left( \frac{d}{d_{c}} \right) ^{2} = \left( \frac{u-u_{o}}{a} \right) ^{2} + \left( \frac{v-v_{o}}{b} \right) ^{2} \end{aligned}$$
(2)

where \((u_{o}, v_{o})\) is the ellipse origin—this is the center of the shifted frequency response image. As before, a shifted image is re-arranged so that the DC is in the centre of the image. An example of a filtered frequency response image is shown in Fig. 4b.

Fig. 4

Bode magnitude plots of KnidosH sub-map 16 (Fig. 2f). a The original frequency response image and b the filtered frequency response image. The frequencies above the cut-off (here set to \(7.99\,\%\) and shown with a magenta star in the plots) are suppressed by \(>40\,\, {\mathrm{dB}}\)
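The filtering and reconstruction steps (Sects. 3.2.1 and 3.2.3) together amount to a filter-and-invert operation, sketched below. The standard Butterworth magnitude \(1/\sqrt{1 + \epsilon^{2}(d/d_{c})^{2n}}\) is assumed for the gain, and the passband variation value is illustrative; this is a sketch, not the authors' implementation.

```python
import numpy as np

def ground_surface(elev, f_c, n=2, A=0.1):
    """Low-pass filter the elevation grid in the frequency domain with an
    elliptical Butterworth-like gain (Eqs. 1-2) and reconstruct the ground
    surface by inverse DFT.

    elev : 2D elevation grid; f_c : cut-off frequency ratio in [0, 1];
    n : filter order; A : passband variation in dB (illustrative value).
    """
    h, w = elev.shape
    F = np.fft.fftshift(np.fft.fft2(elev))            # shift DC to the image centre
    eps = np.sqrt(10.0 ** (A / 10.0) - 1.0)           # band edge value
    a, b = f_c * w / 2.0, f_c * h / 2.0               # elliptical cut-off (Eq. 2)
    v, u = np.mgrid[0:h, 0:w]
    r2 = ((u - w / 2.0) / a) ** 2 + ((v - h / 2.0) / b) ** 2
    T = 1.0 / np.sqrt(1.0 + eps ** 2 * r2 ** n)       # zero-phase magnitude gain
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * T)))
```

Because the gain at DC is exactly one, the mean elevation is preserved, while high-frequency content such as a sharp spike is attenuated.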

3.2.2 Cut-off frequency estimation

The cut-off frequency is the threshold in the frequency domain that defines the ground/non-ground segmentation. As previously mentioned, the ground is assumed to correspond to the slower undulations of the terrain, with objects sitting on top of the ground adding higher frequency components to the relief. To find the cut-off frequency, we exploit the relationship between spatial scales and frequency components of the terrain. Slower terrain undulations correspond to larger scale structures in the relief, and vice versa. Thus one can define a minimum size \(m_c\) of a structure so that this structure is part of the ground. Given some additional modelling, this minimum size can then be mapped to a frequency. The latter is our cut-off frequency \(f_c\).

A simple step function model is used here to map spatial scales to frequency components. It is illustrated in Fig. 5. The frequency response of a step function is a \(\mathrm{sinc}\) function (see, for example, Proakis and Manolakis 1996) (Fig. 5b). Correspondingly, the response to a 2D step (a cube) is \(J_{1}(z)/z\), where \(J_{1}(z)\) is the first order Bessel function (Horn 1986). Knowing the size of the cube \(\tau \), the location of the first zero crossing of the frequency response is given by \(2/\tau \) (Fig. 5). The wider the object, the closer the location of the first zero crossing to the DC. This is consistent with the analysis above identifying larger objects with lower frequency components. The first zero crossing of the first order Bessel function is followed by progressively decreasing undulations (Fig. 5b). The range of frequencies up to the first zero crossing contains more than \(90\,\%\) of the signal energy. Therefore, we assume that the overall shape of the object is maintained when applying a low-pass filter with a cut-off frequency at the first zero crossing. Given this modeling, a cut-off frequency \(f_c > \frac{2}{\tau }\) implies that an object of size \(\tau \) is included in the ground. On the other hand, if \(f_c \le \frac{2}{\tau }\), an object of size \(\tau \) is partly filtered out during ground extraction. This also implies that the maximum size of objects \(m_c\) beyond which objects are included in the ground surface can be approximated as \(\frac{2}{f_c}\). Strictly speaking, objects smaller than \(\frac{2}{f_c}\) are not removed but only partly filtered out, with this effect being stronger for smaller objects. The latter equation, transposed from the space defined by the elevation grid to the metric domain, becomes:

$$\begin{aligned} m_c = 2 / f_c \cdot \text {res}_g, \end{aligned}$$
(3)

where \(\text {res}_g\) is the resolution of the elevation grid.

Fig. 5

By modeling an object of size \(\tau \) in the spatial domain as a square (a), we can estimate the cut-off frequency at the end of the first undulation in the frequency domain by \(2/\tau \) (b)

This formula allows the ground extraction algorithm to automatically adjust the detected cut-off frequency given a maximum object size (or conversely a minimum ground structure size). The ground extraction algorithm is formulated in such a way that the user provides a relative value \(m_u \in [0,1]\), as opposed to an absolute value \(m_c\). \(m_u\) represents the maximum object size with respect to the extent, \(e\), of the sub-map. As illustrated in Fig. 4, the frequency content of a terrain signal tends to contain numerous peaks and troughs. The peak furthest from the DC obeying \(m_c \le m_u \cdot e\) is selected as the cut-off point.

The outcome of this process is a cut-off frequency selected in such a way that the maximum size of objects considered as non-ground are \(m_u \cdot e\). As an example, in the case of sub-map 6 from the KnidosH data set (shown in Fig. 2), with \(m_u=0.5\) the ground extraction algorithm sets \(f_c\) to the second closest peak. Due to the overall constant upward curvature of this sub-map, a range of frequency components is required to model the underlying ground.
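Eq. (3) can be inverted to choose the cut-off ratio from a desired maximum object size. A minimal sketch (the function name is ours):

```python
def cutoff_from_object_size(m_c, res_g=0.02):
    """Invert Eq. (3): cut-off frequency ratio f_c for a maximum object size
    m_c (metres) on a grid of resolution res_g (metres per cell).
    """
    return 2.0 * res_g / m_c
```

For example, with the 2 cm grid used in the experiments, a maximum object size of 0.5 m yields \(f_c = 0.08\).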

3.2.3 Ground surface reconstruction

Once \(f_c\) is defined, and filtering is applied to the frequency domain image, the estimate of the ground surface is built by applying the inverse DFT. Examples of reconstructed ground surfaces are shown in Fig. 2c, d.

3.3 Ground/object separation

Given an estimated ground surface, the objects above this surface are obtained by comparing the height of the points in the original point cloud to the height of the surface. If a point is on or below the ground surface, it is labelled as ground; if it is above it is labelled as object:

$$\begin{aligned} \forall p: \qquad \text {label}(p) = {\left\{ \begin{array}{ll} \text {ground} &{} \text {if } z_{p} \le z_{gs},\\ \text {object} &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4)

where \(z_{p}\) and \(z_{gs}\) are the height of a 3D point \(p\) and the height of the ground surface at that point, respectively.

This set of rules only identifies positive obstacles. Negative obstacles can also be detected with the proposed approach by defining a margin above and below the ground surface, within which points are identified as ground, and outside which points are non-ground. The simple set of rules above proved sufficient in our experiments.

Once the object points are identified, they can be clustered using a standard voxel based clustering: non-empty voxels in contact with each other are gathered in the same cluster. The resulting set of clusters forms segments of object points. Object segmentation is illustrated in Fig. 2e, f. These segments are used as landmarks for navigation in Sect. 5.2.
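The separation rule of Eq. (4) followed by voxel clustering can be sketched as follows. This is an assumed implementation: the ground grid is taken to be axis aligned with the voxelisation, and 26-connected components stand in for "voxels in contact"; the authors' exact connectivity is not specified.

```python
import numpy as np
from scipy import ndimage

def segment_objects(local_pts, ground, res=0.02):
    """Label points against the reconstructed ground surface (Eq. 4), then
    cluster the object points via connected components of non-empty voxels.

    local_pts : (N, 3) points in the sub-map frame
    ground    : 2D ground elevation grid aligned with the same voxelisation
    Returns per-point segment ids (0 = ground) and the number of segments.
    """
    origin = local_pts.min(axis=0)
    idx = np.floor((local_pts - origin) / res).astype(int)
    # Eq. (4): compare each point's height with the ground height at its cell
    z_gs = ground[idx[:, 0], idx[:, 1]]
    is_object = local_pts[:, 2] > z_gs
    # Voxelise object points and gather touching voxels into clusters
    occ = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    occ[tuple(idx[is_object].T)] = True
    labels, n_seg = ndimage.label(occ, structure=np.ones((3, 3, 3)))
    point_seg = np.where(is_object, labels[tuple(idx.T)], 0)
    return point_seg, n_seg
```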

4 Experimental setup

The proposed segmentation method was previously tested (Douillard et al. 2012a) on two data sets generated by a structured light sensor (Roman et al. 2010) and compared against three segmentation algorithms from the Point Cloud Library (Bogdan Rusu and Cousins 2011). In this paper we extend the experiments to include a stereo camera data set gathered by our AUV Sirius during a campaign off the east coast of Tasmania, Australia. Furthermore, since we convert the 3D point cloud to a 2D signal—which can be represented as a gray scale image—we also compare the proposed approach to image based segmentation methods. We therefore compare the segmentation capabilities of two image based approaches on these three data sets. Sects. 4.1 and 4.2 detail the data sets and the benchmark algorithms. Sect. 4.3 briefly reiterates the segment based point cloud matching algorithm employed during the alignment tests.

4.1 Data sets

The basic concept of the structured light sensor (Roman et al. 2010) consists of a green laser sheet projected on the seafloor and a calibrated camera imaging the laser line. The data set consists of shipwrecks surveyed at an altitude of approximately three meters with the vehicle traveling 10–15 cm/s. An initial odometry based navigation solution is provided using a Doppler velocity log, fiber-optic gyrocompass, and depth sensor. Sub-maps are broken before the accumulated odometry error has become significant relative to the bathymetry sensor resolution (Roman and Singh 2007). An ultra-short baseline acoustic positioning system is used to reference the survey to a geodetic datum, but is otherwise too inaccurate to localize the sub-maps or correct the navigation within a sub-map. Once extracted from the collected images and projected, the resulting point density is approximately seven points per square centimeter. The individual points have a range resolution better than one centimeter. The two following data sets are processed: (1) KnidosH, which contains 101 sub-maps; (2) Pipewreck, which contains 42 sub-maps. In each set, sub-maps contain around 600,000 points.

The stereo camera pair is part of a large sensor suite onboard the AUV Sirius (Williams et al. 2010). It is mounted underneath the AUV, which travels at an altitude of \(\sim 2\,\, {\mathrm{m}}\) at \(0.5\,\, {\mathrm{m/s}}\). The resolution of the cameras is \(1,360 \times 1,024\,\, {\mathrm{pixels}}\) and the resulting footprint is \(\sim 1.5\,\, {\mathrm{m}} \times 1.2\,\, {\mathrm{m}}\). Four dense stereo reconstruction algorithms were tested, including the efficient large scale stereo matching algorithm (Geiger et al. 2010), the semi global block matching (SGBM) algorithm (Hirschmuller 2008), the standard block matching algorithm (Bradski 2000), and a Graph Cut based disparity matching algorithm (Kolmogorov and Zabih 2001); the last three were run using OpenCV implementations (Bradski 2000). The SGBM algorithm clearly and consistently provided better and less noisy reconstructions across the data set, which confirms the results presented by Steingrube et al. (2009). With this reconstruction technique, we were able to produce dense stereo point clouds, each containing approximately one million points. To evaluate the performance of the segmentation algorithm, a data set gathered from the east coast of Tasmania, Australia, was used. From the Tasmania data set, two stereo point clouds were hand labelled to evaluate the segmentation algorithm. In the remainder of this document we also refer to a point cloud generated from each stereo image pair as a sub-map.

Figure 6a shows an overview of the KnidosH data set and Fig. 6b shows the two stereo point clouds from the Tasmania data set textured with the RGB image.

Fig. 6

a Overview of the KindosH shipwreck. The site is approximately 16 m long and 10 m wide. b Example of stereo point clouds from the Tasmania data set. The point clouds are textured with the colour information contained in the associated images; colours are shown here to facilitate the visual interpretation of the point clouds but are not used in the segmentation process. Left sub-map 7, right sub-map 8 (Color figure online)

4.2 Benchmark segmentation techniques

This section presents the three point cloud segmentation methods and the two image based segmentation methods which were used as benchmarks.

4.2.1 Naive ground extraction

In this first approach, the mean height of the point cloud is computed, and the ground is simply defined as the set of points below the mean height. This approach will be referred to as the “Naive” method.

4.2.2 MLESAC-based plane extraction

In this second approach, a plane is fitted to the point cloud and the ground is defined as the points below this plane. The Point Cloud Library (Bogdan Rusu and Cousins 2011) implementation of a MLESAC (Maximum Likelihood Estimation SAmple Consensus) based plane fitter is used in the proposed experiments. This method will be referred to as the “Planar” method.

4.2.3 Grid-based ground extraction

In this third approach, the point cloud is voxelised, and the ground points are identified as the ones contained in the bottom voxel of each column of the voxel grid. The grid resolution relates, to some extent, to the value of the cut-off frequency automatically estimated by the proposed frequency based segmentation method. A higher resolution (i.e. a smaller grid separation) will result in finer details of the terrain being included in the estimated ground surface, which, to some extent, corresponds to a larger cut-off frequency in the context of the frequency based method. However, as developed in Sect. 3, the Fourier formalism allows the algorithm to reason about the range of scales contained in the terrain relief and, given a maximum admissible object size, it allows the selection of the relief scales relevant to the definition of the ground to be automated. Such reasoning is not readily applicable in a grid based elevation segmentation, whose resolution would have to be manually adjusted as a function of the data set. The grid resolution is set to \(2\,\, {\mathrm{cm}}\) in our experiments. This approach is referred to as the “Grid” method.

4.2.4 k-Means image segmentation

The first image based segmentation algorithm uses k-Means clustering. The k-Means algorithm is an iterative algorithm which tries to minimize the sum of the point-to-centroid distances for \(k\) clusters by reassigning points to their nearest cluster centroid and recalculating the cluster centroids (Seber 1984). The process is repeated until a (local) minimum is found. However, by replicating the search several times with random starting points, a solution that is close to the global minimum can be achieved. In our experiments we set the number of clusters \(k=2\) and replicate the search three times. This approach is referred to as the “KMeans” algorithm.
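The baseline above can be sketched as a one-dimensional k-means on the elevation values, mirroring the \(k=2\), three-restart setting described in the text. This is an illustrative implementation, not the authors' exact code:

```python
import numpy as np

def kmeans_segment(elev, k=2, restarts=3, iters=50, seed=0):
    """Cluster elevation values with k-means, keeping the best of several
    randomly initialised runs; returns a label image of shape elev.shape."""
    z = elev.ravel()
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(restarts):
        centroids = rng.choice(z, size=k, replace=False)   # random starting points
        for _ in range(iters):
            # Assign each value to its nearest centroid, then recompute centroids
            assign = np.argmin(np.abs(z[:, None] - centroids[None, :]), axis=1)
            for c in range(k):
                if (assign == c).any():
                    centroids[c] = z[assign == c].mean()
        cost = np.sum((z - centroids[assign]) ** 2)        # sum of squared distances
        if cost < best_cost:
            best, best_cost = assign, cost
    return best.reshape(elev.shape)
```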

4.2.5 Continuous max-flow segmentation

The second image based segmentation algorithm is the fast continuous max-flow algorithm (Yuan et al. 2010), an extension of classic graph cuts (Boykov and Funka-Lea 2006). In this more recent work, the authors solve the minimum image partition problem efficiently by posing it as a convex, constrained, non-smooth optimization problem. We employ the implementation by the authorsFootnote 4 and pose the segmentation problem as a two-label max-flow problem. We refer to this approach as the “CMF” method.

4.3 Alignment algorithm

The segmentation process presented in this work is developed as a preprocessing step to allow accurate alignment of underwater scans. The survey of 3D alignment techniques presented by Salvi et al. (2007) shows that most approaches involve two main steps: coarse alignment followed by fine alignment. Coarse alignment is the step this study is concerned with, since fine alignment is usually addressed with the standard iterative closest point (ICP) algorithm (Besl and McKay 1992). We employ the S-ICP algorithm by Douillard et al. (2012b), which performs point cloud matching by matching segments of the data.

The main steps involved in this matching algorithm are as follows. A shape distance metric akin to an ICP residual is used to compute shape distances between every segment of one scan and every segment of the other. Given this set of distances, an optimal assignment is computed with the Hungarian algorithm. This provides pairwise segment associations across the two scans to be aligned. Potentially incorrect segment associations are then filtered out by checking the consistency of the relative positions of associated segments. The process results in a set of segment-to-segment associations across two scans or, as here, across two sub-maps. The alignment is then computed using ICP with point-to-point associations constrained to lie within associated segments. In the context of this alignment technique, as in most feature based alignment techniques, consistent extraction of segments/features is critical to the accurate registration of different but overlapping sub-maps. Sect. 5.2 shows that the proposed approach achieves a level of segmentation consistency sufficient for alignment, owing to its ability to globally approximate the underlying structure of the seafloor in a fully data driven manner.
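The assignment step can be illustrated as follows. For clarity, this sketch solves the assignment by brute force over permutations, which is optimal for a handful of segments; the Hungarian algorithm used in the paper computes the same optimum in \(O(n^3)\). The distance matrix is assumed precomputed from the shape metric.

```python
import itertools
import numpy as np

def associate_segments(D):
    """Optimal one-to-one assignment minimising the total segment-to-segment distance.

    D[i, j] holds the shape distance between segment i of scan A and segment j
    of scan B (D must have at least as many columns as rows).
    """
    n = D.shape[0]
    best = min(itertools.permutations(range(D.shape[1]), n),
               key=lambda p: sum(D[i, p[i]] for i in range(n)))
    return list(enumerate(best))
```

The resulting (i, j) pairs would then be filtered for geometric consistency before being passed to the constrained ICP step.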

A number of ICP variants could also be used for comparisons. Generalized ICP (Segal et al. 2009), for instance, is an ICP variant that also attempts to limit incorrect point-to-point associations. While generalized ICP adds a probabilistic layer to the original ICP algorithm, it does not implement explicit shape matching, but rather outlier removal. It can be seen as a robustified point-to-plane ICP. Matching is still performed using point-to-plane associations and thus, while it can better deal with noise, incorrect point matches are generated similarly to standard point-to-plane ICP. The aim of the experiments presented here is to show the benefit of explicit shape matching over point-to-point or point-to-plane matching. One version of ICP is thus chosen to compare the basic point association mechanism common to most ICP variants against explicit shape matching. Note also that point-to-plane matching does not readily apply in unstructured (underwater) environments.

5 Experiments

This section presents two sets of experimental results. First, segmentation evaluation is performed using ground-truth hand labelled data (Sect. 5.1). Second, the segmentation is evaluated in a feature based alignment where segments are used as features and fed to the S-ICP alignment algorithm (Sect. 5.2). In both sets of experiments, the proposed frequency based segmentation approach is compared to the five other techniques; these were described in Sect. 4.2.

5.1 Comparisons to hand labelled segmentation

From each of the KnidosH, Pipewreck and Tasmania data sets, two sub-maps were chosen and manually labelled. Each point was assigned one of the labels ground or object. These sub-maps (Fig. 7) were chosen because their content is representative of the different data sets.

Fig. 7
figure 7

The six sub-maps hand-labelled as ground (blue) or object (red). a, d KnidosH sub-maps 6 and 16. b, e Pipewreck sub-maps 13 and 41. c, f Tasmania sub-maps 7 and 8 (Color figure online)

5.1.1 Segmentation quality metric

The segmentation methods are compared to the hand labelled segmentation to determine which points are correctly or incorrectly labelled as object and ground. The evaluation is performed by quantifying the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates in the binary classification task of identifying the classes ground and object, where positive refers to the object class and negative to the ground class. The results are presented in terms of the \(F_{1}\) score, the harmonic mean of precision and recall, computed as follows:

$$\begin{aligned} F_{1} = \frac{2 \cdot TP}{2 \cdot TP + FN + FP} \end{aligned}$$
(5)
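Equation (5) is equivalent to \(2PR/(P+R)\) with precision \(P = TP/(TP+FP)\) and recall \(R = TP/(TP+FN)\). As a one-line sketch:

```python
def f1_score(tp, fp, fn):
    """F1 score of Eq. (5): the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fn + fp)

f1_score(8, 1, 3)  # → 0.8
```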

5.1.2 Segmentation results

The results of the segmentations are presented in Table 1. The best \(F_{1}\) score for each sub-map is indicated in bold. For the purpose of detailed analysis the true positive rate (TPR) and the true negative rate (TNR) are also given.

Table 1 Comparison of segmentation methods with ground truth in terms of the \(F_{1}\) score

The data sets used here represent very different environments, scales and complexities, and as such the performance of the various algorithms differs significantly. On the whole, these figures indicate that the proposed method compares favourably to the other approaches. The Planar method generally produced the lowest proportion of FNs. However, this comes at the expense of the largest proportion of FPs, which is not surprising given that the sub-maps are not flat. The Grid method performed well on the structured light data sets but poorly on the stereo point clouds. In the case of the dense stereo point cloud, the Grid method finds ground points both at the top and the bottom of the steps in the terrain relief. The Grid method reasons per elevation column, that is locally, and thus is not able to find a global underlying ground surface.

The two image based segmentation methods display little difference in performance. CMF performs poorly on the two most challenging data sets (KnidosH sub-map 16 and Tasmania sub-map 8). In contrast, the frequency based method produced the lowest number of FPs while also having very low FNs, resulting in significantly higher \(F_{1}\) scores. The results from this experiment show that the proposed method provides the most consistent performance in identifying the underlying ground surface compared to the five benchmark techniques. The segmentations produced by the six techniques on sub-map 8 from the Tasmania data set are shown in Fig. 8.

Fig. 8
figure 8

The result of the six segmentation algorithms on Tasmania sub-map 8. The colors in the figures correspond to TP (red), FN (cyan), FP (magenta) and TN (blue). a Naive, b Planar, c Grid, d KMeans, e CMF, f FFT (Color figure online)

5.2 Comparisons of alignment results

This section presents an evaluation of the segmentation methods in terms of the quality of the alignment they lead to. The extracted segments are fed to a feature based alignment method (Douillard et al. 2012b) and the quality of the resulting alignment is used as a second metric to evaluate segmentation. Intuitively, this second metric expresses the ability of the segmentation to consistently extract landmarks on the seafloor in different, but overlapping, sub-maps. Consistent segmentation in turn allows matching and thus alignment. From a higher level perspective, this second metric captures the ability of the segmenters to provide segments/landmarks useful for navigation. The proposed segmentation approach and the benchmark techniques detailed in Sect. 4.2 are applied to each sub-map of the two structured light data sets (illustrated in Fig. 1a, b). The alignment pipeline described in Sect. 4.3 is then run to provide alignment quality results. The dense stereo data set is not used in this second set of experiments since the overlap between successive sub-maps in this data set is in general too small to allow re-alignment.

The aim of the experiments presented here is to show that pairwise alignment through accurate terrain feature segmentation can locally improve sub-map registration. The use of these local pairwise registrations in a global scan alignment process (Borrmann et al. 2008) or in online navigation algorithms (Roman and Singh 2007) is left for future work.

To define pairs of sub-maps for alignment, each sub-map set is traversed as follows. The first sub-map \(s_1\) is paired up with its closest neighbour sub-map (closest in terms of the distance between their centroids); this sub-map then becomes \(s_2\). \(s_2\) is paired up with the closest sub-map amongst the sub-maps not already in a pair. Each sub-map in a data set is assigned to a pair by repeating this process.
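One literal reading of this greedy traversal can be sketched as follows (names are hypothetical):

```python
import numpy as np

def pair_submaps(centroids):
    """Greedy chaining: pair the current sub-map with its nearest not-yet-paired neighbour."""
    centroids = np.asarray(centroids, dtype=float)
    unpaired = list(range(1, len(centroids)))
    pairs, current = [], 0
    while unpaired:
        d = [np.linalg.norm(centroids[current] - centroids[j]) for j in unpaired]
        nxt = unpaired.pop(int(np.argmin(d)))
        pairs.append((current, nxt))
        current = nxt
    return pairs
```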

5.2.1 Alignment quality metric

To evaluate the quality of the alignment, the metric employed attempts to capture the crispness of the aligned scan pairs. This metric consists of voxelizing the aligned point cloud and counting the number of occupied voxels, \(Nx\). The lower the \(Nx\) value, the crisper the point cloud and, in turn, the more accurate the alignment. The rationale for using this metric as opposed to the ICP residual was developed by Douillard et al. (2012b). The voxel resolution used in all the experiments presented here is \(2\, {\mathrm{cm}}\).
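This crispness metric reduces to counting distinct occupied voxels; a minimal sketch:

```python
import numpy as np

def crispness_nx(points, res=0.02):
    """Nx metric: number of occupied voxels (fewer occupied voxels = crisper cloud)."""
    voxels = np.floor(np.asarray(points, dtype=float) / res).astype(int)
    return len({tuple(v) for v in voxels})
```

Two well-aligned copies of a surface fall into the same voxels and barely increase \(Nx\); a misaligned copy roughly doubles it.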

A Hausdorff distance based metric has been proposed in the context of bathymetric SLAM to evaluate the crispness of SLAM maps (Roman and Singh 2006). The metric proposed here captures a similar quantity and is chosen for the simplicity of its implementation. Crispness of 3D point clouds has also been defined using an entropy based formulation (Sheehan et al. 2010). The formulation is elegant and principled but was not used here for two reasons: (1) it is expensive to compute and is effectively only computed in 2D in Sheehan et al. (2010); (2) it suffers from limitations similar to those of the metric proposed here: the resolution (or window size) of the Parzen estimator affects the crispness value in the same way as the resolution of the grid based estimator used here, and different resolutions will produce different crispness values. Given its simplicity, the grid based crispness metric was chosen for the proposed experiments.

5.2.2 Alignment results

The results of the alignments are summarized in Tables 2 and 3. Naive, Planar, Grid, KMeans and CMF refer to the techniques described in Sects. 4.2.1–4.2.5. FFT refers to the proposed segmentation algorithm.

A number of quantities are used to evaluate the quality of the alignments associated with each segmentation method. The first is the mean of the variation of the \(Nx\) metric (\({\varDelta } Nx\)) before and after the alignment is applied:

$$\begin{aligned} {\varDelta } Nx = \frac{Nx_{after}-Nx_{before}}{Nx_{after}}, \end{aligned}$$
(6)

where the indices “before” and “after” refer to before and after applying the alignment. This and the associated standard deviation are given as percentages in Table 2. A negative value of \({\varDelta } Nx\) corresponds to an improvement of the crispness of the point cloud. A positive value, on the contrary, corresponds to the point cloud losing sharpness during the alignment. This may, for instance, be due to segments being mismatched across sub-maps (i.e. a data association failure).

Table 2 Mean and std deviation of the delta crispness after alignment (\(\mu \left( {\varDelta } Nx \right) [\%]\) / \(\sigma \left( {\varDelta } Nx \right) [\%]\))

S-ICP only computes an alignment if at least \(N_{seg}\) segments are matched across sub-maps. \(N_{seg}\) is specified by the user and is set to 3 in our experiments. If fewer than \(N_{seg}\) segments are associated during a run of S-ICP, the alignment is not computed and \({\varDelta } Nx\) is set to 0. The number of sub-maps where the alignment has degraded (\({\varDelta } Nx > 0\)), has not been computed (\({\varDelta } Nx = 0\)), or has improved (\({\varDelta } Nx < 0\)) is reported in Table 3. The mean and the standard deviation provided in Table 2 are computed using only the sub-map pairs for which the alignment is computed (i.e. \({\varDelta } Nx \ne 0\)).
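The quantities of Tables 2 and 3 can be reproduced from per-pair voxel counts as follows (a sketch; variable names are ours):

```python
import numpy as np

def delta_nx(nx_before, nx_after):
    """Relative change in occupied-voxel count, Eq. (6); negative means crisper."""
    return (nx_after - nx_before) / nx_after

def summarize(deltas):
    """Mean/std over aligned pairs only (delta != 0), plus the Table 3 counts."""
    d = np.asarray(deltas, dtype=float)
    aligned = d[d != 0]
    counts = (int((d > 0).sum()), int((d == 0).sum()), int((d < 0).sum()))
    return aligned.mean(), aligned.std(), counts
```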

In the case of rather flat terrain, only small segments will be extracted, which may not be large enough to be accepted for processing. The minimum segment extent is fixed to \(2\, {\mathrm{cm}}\) in our implementation. The absence of large enough segments in any of the sub-maps in a pair results in the pair not being processed. These cases are shown as the fourth value in Table 3. The best results for each data set are indicated in bold in both tables.

Table 3 Count of the number of sub-maps that had their crispness worsened, unchanged, improved or were not processed (\(\#{\varDelta } Nx > 0\) / \(\#{\varDelta } Nx = 0\) / \(\#{\varDelta } Nx < 0\) / \(\#unprocessed\))

Table 2 shows that for both data sets the proposed method (last row) outperforms the benchmark methods by providing on average the best improvement in crispness when used in conjunction with S-ICP. The associated standard deviation (column 2) is of the same order of magnitude as the mean, suggesting a non-negligible range of variation in \({\varDelta } Nx\). This is due to the quality of the initial navigation solution, which varies across the data set and is on average better in the KnidosH data set. In some cases, the provided navigation is accurate (in particular along the linear segments of the trajectory) and little improvement is obtained in terms of point cloud sharpness. In other cases, in particular for areas re-visited from different transects (near loop closures), the alignment provided by S-ICP leads to a more significant improvement in the point cloud crispness. Examples of alignment results are shown in Fig. 9. The other indicators, shown in Table 3, confirm the relative performance of the methods. For both data sets, the proposed approach leads to one of the smallest numbers of incorrect alignments (\(\#{\varDelta } Nx > 0\)), to the smallest number (to within one sub-map pair) of non-aligned sub-map pairs (\(\#{\varDelta } Nx = 0\)), and to the largest number of improving alignments (\(\#{\varDelta } Nx < 0\)).

The percentages in Table 2 are small in absolute value for the following reason. As can be seen in Fig. 9, an alignment moves overlapping segments relative to one another compared with the original, uncorrected alignment provided by the navigation solution. The resulting change in the number of occupied cells corresponds to a small number of points compared to the number of points in the two sub-maps, which explains the small values of \({\varDelta } Nx\) expressed in %. Figure 9 gives an intuition of how \({\varDelta } Nx\) values map to actual alignments.

The Naive segmentation method is not expected to provide high performance but is used here to gauge the merits of a simple model. The corresponding results show that accurate alignment requires more terrain modelling than a simple threshold based separation. The Pipewreck data set (Fig. 9a, b) contains large sections of flat terrain. This implies that the mean height plane lies within these flat sections (as opposed to above them) and that the flat sections themselves are identified as non-ground. These large segments cannot be matched across sub-maps since their extent and shape are less representative of the terrain itself than of the extents of the sub-maps they were extracted from. A similar behaviour can be observed in the KnidosH data set.

The Planar segmentation method is well suited to the Pipewreck data, in which most of the sub-maps contain non-negligible visible portions of rather flat ground (Fig. 9a, b). The extraction of a planar surface via a RANSAC type method is in this case appropriate and leads to an accurate ground segmentation (see results in Sect. 5.1). The application of the same method to the KnidosH data set, which contains more diverse relief (Fig. 9c, d), shows that simple plane extraction is not general enough to provide accurate segmentation/alignment in a variety of terrains. This method provides on average an improvement of the point cloud crispness in the Pipewreck data set (\({\varDelta } Nx = -2.04\)), but tends to lead to point clouds with degraded sharpness in the KnidosH data set (\({\varDelta } Nx = 0.43\)).

Fig. 9
figure 9

Examples of alignment results using the DFT ground segmentation method in conjunction with S-ICP. a, c Pairs of sub-maps before alignment; b, d the same sub-maps after alignment. The two top plots show a pair of sub-maps from the Pipewreck data set. The reference sub-map (i.e. the sub-map not being aligned) is shown in blue with the segments generated by the DFT-based method shown in turquoise. The sub-map being aligned is shown in red with the segments generated by the DFT-based method shown in yellow. The black segments in the aligned pair indicate the segments matched by S-ICP across the two point clouds and used to compute the alignment. As can be seen in (b), corresponding segments are brought right on top of each other after alignment. This alignment corresponds to \({\varDelta } Nx = -5.4\,\%\). Note that if the alignment were performed using standard (dense) ICP, the two sub-maps would be brought completely on top of each other due to the ground points dominating the cost function in the ICP optimisation. c, d Show a pair of sub-maps from the KnidosH data set. The same colour coding as in a and b is used, except that the ground points of the aligned map (red in a and b) are not shown for the sake of clarity. In d it can be seen that corresponding segments are brought right on top of each other by the alignment. This alignment corresponds to \({\varDelta } Nx = -12.6\,\%\) (Color figure online)

The next step up in terms of complexity of the terrain model corresponds to the Grid method. The more accurate modelling of the terrain relief obtained with this method improves performance with respect to the two previous techniques. For both data sets, the Grid method leads to an improvement of the point cloud crispness: \({\varDelta } Nx = -0.25\) for the KnidosH data set and \({\varDelta } Nx = -3.47\) for the Pipewreck data set. As explained in Sect. 4.2.3, however, automating the choice of a grid resolution that captures the larger scales of the relief defining the underlying ground is not trivial without resorting to a spectral type of analysis.

The two (depth) image based segmentation algorithms, KMeans and CMF, are not able to provide consistent alignment improvements. As can be seen in Table 3, the extracted segments are not consistent enough across sub-maps to be matched. This is particularly true for the Pipewreck data set, where no matches were found (\(\#{\varDelta } Nx = 0\) is equal to the number of sub-maps for both methods). The segments typically extracted by these two methods tend to be large pieces of terrain extending to the edges of the sub-map, as opposed to smaller pieces of terrain well contained within the sub-maps. They are thus not consistent across sub-maps and do not allow matching.

The proposed frequency based segmentation method performs better than the methods assuming a (partially) flat terrain (Naive and Planar) and the method requiring a resolution to be fixed a priori (Grid). It can reconstruct non-planar terrains (as illustrated in Fig. 2c) and reason globally on the point cloud (up to some formatting into an elevation grid) as opposed to reasoning only locally within a column of heights. The set of results presented in Tables 2 and 3 provides experimental evidence for the applicability of the proposed ground segmentation method to a range of different terrains and its ability to generate features useful for navigation.
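The core of the frequency based extraction (described in Sect. 3) can be sketched in a few lines. In this sketch the cut-off is fixed by hand, whereas the paper selects it automatically via peak detection on the frequency signal; all names are ours.

```python
import numpy as np

def fft_ground(height, cutoff):
    """Estimate the ground surface by keeping only the low-frequency components
    of the elevation grid; cells well above that surface are object points."""
    H, W = height.shape
    F = np.fft.fft2(height)
    fy = np.fft.fftfreq(H) * H                  # integer frequency indices
    fx = np.fft.fftfreq(W) * W
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    F[radius > cutoff] = 0.0                    # low-pass: keep slow undulations
    return np.fft.ifft2(F).real

# Flat seafloor with a small box-shaped object on top.
height = np.zeros((32, 32))
height[15:17, 15:17] = 1.0
ground = fft_ground(height, cutoff=3)
objects = height > ground + 0.5                 # margin absorbs filter ringing
```

The cells flagged in `objects` are the non-ground points, which would then be clustered by proximity into object segments.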

For a video illustrating the proposed segmentation method please visit http://youtu.be/n0F42jliKdY.

6 Conclusion

This paper has introduced a method for segmenting 3D scans of underwater unstructured terrains. The method extracts a ground surface by selecting the lower frequency components in the frequency domain that define the slower varying undulations of the terrain. The points above the estimated ground surface are clustered via standard proximity clustering to form object segments. The experimental results show that the approach is applicable to a range of different terrains and is able to generate segments that are useful landmarks for navigation. Future work will integrate segmentation based registration into a global scan alignment process and into online navigation algorithms. The proposed ground extraction mechanism will also be extended to perform non-regular DFT or polar FFT in order to process polar scans that generate irregular grids.