
Pattern Recognition

Volume 86, February 2019, Pages 354-367

Scene wireframes sketching for Unmanned Aerial Vehicles

https://doi.org/10.1016/j.patcog.2018.09.017

Highlights

  • New method for obtaining a 3D reconstruction of objects and scenes based on lines.

  • The model does not require a previous result from other SfM pipelines.

  • Resilient to low texture and a low number of images; aimed at UAVs.

Abstract

This paper introduces novel insights to improve state-of-the-art line-based unsupervised observation and abstraction models of man-made environments. The increasing use of autonomous UAVs inside buildings and around human-made structures demands accurate and comprehensive representations of their operation environments. Most 3D scene abstraction methods use invariant feature point matching; however, sparse 3D point clouds do not concisely represent the structure of the environment. The presented approach is based on observation and representation models that use straight line segments. The goal of this work is a complete method based on the matching of lines that provides a complementary approach to state-of-the-art methods when facing 3D scene representation of poorly textured environments for future autonomous UAVs. In contrast to other recently published methods for obtaining 3D line abstractions, the proposed method produces a 3D segment abstraction in the absence of a previously generated point-based reconstruction. Another advantage is the ability to group the resulting 3D lines according to different planes, in order to exploit coplanar line intersections. These intersections are used as feature points in the reconstruction process. It is shown that this method, based exclusively on lines, can obtain spatial information in adverse situations where a SIFT-like SfM pipeline fails to generate a dense point cloud.

Introduction

Modern unmanned aerial vehicles (UAVs), commonly known as drones, feature on-board video cameras that stream real-time visual information about their environment. The frames captured by UAVs need to be processed to allow a feasible understanding of the environment in real time. Obtaining a usable representation of the scene is crucial, and the result conditions the success of ongoing mission-specific analysis. A proper representation of the environment eases the programming of autonomous movement routines, by identifying the vehicle's location in space, finding obstacle-free trajectories, or learning movement routines. In this work, we consider the first stage of computing structural 3D information for an autonomous drone from its own video captures or, in general, from multiple views of an environment of interest.

The vast majority of current approaches for 3D scene reconstruction are based on point clouds [1]. Commonly, points are matched between pairs of views based on their descriptors, then triangulated [2] to make an initial estimation of their location in 3D space, and finally their poses are adjusted by least-squares minimization [3]. We can distinguish three common stages of a reconstruction process: the first stage is the estimation of camera poses by Structure-from-Motion (SfM), which outputs the estimated camera poses and reconstructed features. The second is the computationally more expensive Multi-View Stereo (MVS) [4], [5], which can add up to millions of points to a dense cloud. The third stage is intended for post-processing, and is based on fitting more complex 3D elements to the estimated point clouds, including planes [6], [7] or lines [8], [9]. However, these approaches are heavily dependent on dense point clouds, and do not permit a feasible solution when a dense point cloud is not available. That applies in the case of UAVs when an in-flight real-time reconstruction is required, because expensive adjustments cannot be delivered on time, when the video stream lacks high definition, when digital noise is persistent, or when the received images have to be converted from an analog video source. Even with the appropriate setup, conditions such as poor texture of the surrounding objects, poor environment illumination, blurring, or the lack of sufficient high-definition pictures of the scene when the vehicle is moving fast might compromise the building of a dense cloud.
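
For contrast, the point-based baseline just described can be sketched with off-the-shelf OpenCV calls. This is a minimal illustration of the pipeline the proposed method complements, not part of the paper's implementation; when few SIFT matches survive (low texture, blur), the triangulated cloud is too sparse for the later MVS and fitting stages.

```python
# Minimal sketch (not the paper's method): the classical SIFT-based
# two-view point pipeline that struggles on low-texture scenes.
import cv2
import numpy as np

def two_view_point_cloud(img1, img2, K):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    # Lowe's ratio test keeps only distinctive descriptor matches
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # Essential matrix with RANSAC, then relative pose and triangulation
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (X[:3] / X[3]).T  # Nx3 points, valid up to scale
```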

The logical evolution of environment abstraction from multiple views is to incorporate line-based pipelines that do not require a detailed point-based description of the areas of interest, and are resilient to difficulties in point cloud creation. Point-based 3D reconstructions may be accurate when created from high-definition frames and after expensive bundle adjustments. Nevertheless, in addition to providing geometrical information richer than interest points, lines and planes summarize the limits of man-made structures. Besides, coplanar line primitives can be intersected to further reveal spatial information. These advantages make lines a good candidate to team with point feature detectors and descriptors [8], [10], [11]. They offer the possibility of combining individual similarities of pairs of segments with strong constraints of parallelism and orthogonality [12], [13], and especially coplanarity constraints [14]. This paper exploits the latter, and the inlier groups are determined based on the homography between different views. The constraint that can be exploited is that the intersection of a pair of coplanar lines, even if it does not correspond to a physical point, is still geometrically invariant under perspective projection.
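
This invariance can be checked directly in homogeneous coordinates, where the line through two points and the intersection of two lines are both cross products. The short sketch below is our own illustration in plain numpy, not code from the paper; it verifies that the intersection of two lines transforms like an ordinary point under a homography H.

```python
# Sketch of the geometric fact exploited here: the intersection of two
# coplanar image lines maps like a point under a homography, so it can
# serve as a virtual feature point.
import numpy as np

def line_through(p, q):
    return np.cross(p, q)          # homogeneous line through two points

def intersection(l1, l2):
    return np.cross(l1, l2)        # homogeneous intersection point

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3))    # an (almost surely) invertible homography
p, q, r, s = (np.append(rng.random(2), 1.0) for _ in range(4))
l1, l2 = line_through(p, q), line_through(r, s)
x = intersection(l1, l2)
# Lines map with H^-T, points with H; both routes give the same point.
l1w, l2w = np.linalg.inv(H).T @ l1, np.linalg.inv(H).T @ l2
x_via_lines = intersection(l1w, l2w)
x_via_point = H @ x
assert np.allclose(x_via_lines / x_via_lines[2], x_via_point / x_via_point[2])
```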

Our approach is rooted in line-based abstraction in the absence of high-density point clouds [15], and it takes advantage of the robustness of a state-of-the-art line matching algorithm that works on pairs of images [16]. The inputs are the images and the camera intrinsic parameters, and the outputs are the camera extrinsics and the line-based 3D sketch. Line detection and matching algorithms are prone to errors, so the system is designed for outlier rejection, is robust to changes in illumination conditions and to blurring due to fast camera movements, and is also capable of real-time reconstruction from streamed video.

A 3D reconstruction, abstraction or spatial sketch is the estimation of the position of singular primitives captured in several images, by using geometrical relationships and minimization adjustments. The first problem is to filter the geometrical features that have been detected, put them in correspondence among views, and segment them into clusters. The second task is to apply geometrical projection of line endpoints with just two views, by assuming a Gaussian noise model for the perturbation of their location [2]. The third objective involves adjustment algorithms to recover all the camera poses and estimate the whole environment abstraction. The fourth and last stage is to use the adjusted spatial configuration to improve the abstraction itself, attending to structural properties. The main hypothesis of this work is that straight segments represent a source of geometrical information that can be employed, independently of dense point clouds, to create spatial 3D reconstructions. The premise was to use line matchings as the cross-view relationship on which to build the abstraction, as opposed to fitting these primitives to described feature points [10]. Our method does not assume any Manhattan configuration; that is, no alignment is performed according to any preferred direction. The model is engineered considering the most common case for UAVs of unknown camera poses. Therefore, camera extrinsics are not provided for the line matching process, nor any parameter or limitation on the degrees of freedom that could ease their estimation. The camera poses must also be recovered through an SfM process. On the other hand, just the camera intrinsic parameters and the target images are required for the abstraction. The main result of this work is a set of improvements for the complete sequence of algorithms required for 3D abstraction using lines, mainly focused on extracting the most information from the images.
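
As an illustration of the second task, a 3D line can be triangulated from two views without any point features: each observed image line back-projects to a plane through its camera center, and the 3D line is the intersection of the two planes. The sketch below uses our own notation and one standard formulation, not necessarily the paper's exact algorithm.

```python
# Illustrative sketch: triangulate a 3D line from two views by
# intersecting the back-projected planes of its two image lines.
import numpy as np

def backproject_plane(P, l):
    """Plane through the camera center and image line l: pi = P^T l."""
    return P.T @ l                       # 4-vector of plane coefficients

def triangulate_line(P1, l1, P2, l2):
    """Return two homogeneous 3D points spanning the triangulated line."""
    A = np.vstack([backproject_plane(P1, l1),
                   backproject_plane(P2, l2)])   # 2x4 plane constraints
    # The line is the 2D null space of A; the two right singular vectors
    # with (near-)zero singular value span that null space.
    _, _, Vt = np.linalg.svd(A)
    return Vt[2], Vt[3]                  # homogeneous points X1, X2
```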

The goal of this work is to obtain a real-time three-dimensional representation of a scene by using a limited number of matched straight segments. The optimal abstraction the authors look for is built from meaningful descriptive lines that resemble the limits of the 3D planes of the scene, avoiding representations made of several redundant, atomic short segments. In order to show that we improve state-of-the-art scene abstraction based on lines, we need to prove two claims: firstly, that the implemented multi-scale line detection and the structure-based matching algorithms are profitable in poorly illuminated environments and low-texture scenes; secondly, that the proposed exploitation of intersections brings valuable information into the least-squares adjustment. For the first task, the observed lines are first detected in frames, and then put in correspondence, dragging outliers along during the matching process. Our approach takes advantage of multi-scale line detection and matching [16] to increase the accuracy of the triangulation of line endpoints among pairs of line-matched frames. Secondly, our method goes one step further in the least-squares adjustment of cameras and lines by exploiting geometrical relationships of the coplanar lines. After classifying the spatial lines according to their coplanarity, the intersections of the observed lines are brought into a second run of the least-squares adjustment of cameras and lines.

Lines need to be detected, and optionally, junctions [17]. Segments can be put in correspondence by characterizing their appearance and structure [18], [19]. Sparse Bundle Adjustment (SBA) is a method for simultaneously optimizing a set of camera poses and visible points, based on the minimization of an objective function. The input in the case of lines can be the endpoints of the matched lines or the probability distribution of their location (Shen, 2000).

Bartoli and Sturm [20] go a step further and bring in a 4-parameter representation of lines based on Plücker coordinates, in pursuit of a least-squares optimization (SBA) better suited to lines. They also include the trifocal tensor [21], because they use only lines as inputs. The same is done in [8], which in addition avoids enforcing the Plücker constraints, improving the SBA efficiency by employing the Cayley representation of the Plücker coordinates. Their cost function computes the squared re-projection error as the sum of the squared shortest distances from each observed endpoint to the reprojected infinite line. Several approaches base the line reconstruction on previous results of a feature-point-based SfM pipeline. In this group, [22] matches segments to other images according to the distances from their back-projected endpoints to neighboring observed line endpoints. This is also the case of the stereo line reconstruction introduced in [23], based on block matching. That work is able to deal with both lines and feature points, and uses RANSAC for outlier rejection. Hofer et al. [10] teamed the accurate feature point detector and descriptor SIFT with line detection based on LSD [24], and matching based on the previously run point-based SfM. Camera extrinsics are an input to the algorithm, and are obtained from the third-party feature-point-based reconstruction before matching the lines [10]. The limitation is the requirement of a dense SfM reconstruction as a base for the spatial line abstraction, with the camera extrinsics given as inputs. They define a line cluster as a group of spatial lines discriminated by spatial proximity in the reconstruction, and the segmentation is done by projecting from the input camera poses. They also implemented a cost function for the SBA optimization similar to that presented in [8]. On the other hand, there are other approaches based on geometric relationships in the structure of the lines, proposing to locate discrete point counterparts on a line over different views [25]. The present work is framed in the group of studies that are not dependent on feature-point-based SfM pipelines, and uses 4 parameters for updating the line representation during the non-linear optimization.
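
The endpoint-to-line residual described above (as in [8]) can be sketched as follows; names and shapes are our own, and a full SBA would stack these residuals over all cameras and matched lines and hand them to a non-linear least-squares solver such as scipy.optimize.least_squares, together with a parameterization of the cameras and the 3D lines.

```python
# Hedged sketch of the endpoint-to-reprojected-line residual.
import numpy as np

def line_residuals(P, X1, X2, p, q):
    """P: 3x4 camera; X1, X2: homogeneous 3D points spanning the line;
    p, q: observed 2D line endpoints (inhomogeneous pixels)."""
    l = np.cross(P @ X1, P @ X2)      # reprojected infinite image line
    l = l / np.linalg.norm(l[:2])     # normalize so dot products are distances
    d_p = l @ np.append(p, 1.0)       # signed endpoint-to-line distances
    d_q = l @ np.append(q, 1.0)
    return np.array([d_p, d_q])       # squared and summed by the solver
```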

We propose a method with a set of improvements over state-of-the-art algorithms for automatic 3D scene abstraction based on lines:

  • 1.

    The proposed method does not require a dense point cloud. It makes feasible a reconstruction based exclusively on line matchings, performed independently over pairs of images. Our approach estimates the camera extrinsics, but does not root the line matching in these spatial projections, preventing the uncertainty from SfM from propagating and merging with the uncertainty related to 2D line detection and matching. Previous reconstruction methods rely on a third-party SfM pipeline [10], [23], which provides the camera poses and dense point clouds on which line matching and 3D reconstruction are based.

  • 2.

    We propose to segment the set of spatial lines by fitting them to different planes through RANSAC (see the sketch after this list). The observed intersections of 2D coplanar lines are then described according to the observed matched lines. These segmented groups of intersections are projected from every camera plane. They are finally included in the cost function for a second SBA run, taking advantage of this accurate source of observed points in correspondence. Most of the published methods are intended for urban environments, where many lines are coplanar. Nevertheless, they do not retrieve additional information from the images according to the spatial structure; hence, the projected lines tend to be the sole primitive input to the cost function for least-squares minimization.

  • 3.

    Our proposed Line Observation is built by merging independent line matchings over pairs of scale-space images. These groups of matched 2D lines define unique global entities before the 3D stages of reconstruction, in order to avoid the problem of redundant lines and to reduce the number of matching outliers when dealing with multiple views. In contrast, some recent methods verify the matching candidates based on their support in neighboring views [10], and later cluster them based on their spatial proximity, instead of performing a global matching of the observed lines individually. This is a source of uncertainty, as the matching criterion is tied to the accuracy of every camera pose that was adjusted with point-based SfM and used as input.
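
The plane segmentation in contribution 2 can be illustrated with a standard RANSAC loop over endpoint triples. The sketch below is our own simplified version, assuming 3D lines are stored as endpoint pairs; it stands in for whichever robust estimator the implementation actually uses.

```python
# Minimal RANSAC sketch: group 3D lines (Nx2x3 endpoint pairs) by the
# plane that best explains them, one dominant plane at a time.
import numpy as np

def fit_plane(points):
    """Least-squares plane through >= 3 points: returns (n, d), n unit."""
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c)
    n = Vt[2]                                    # normal = least-variance axis
    return n, -n @ c

def ransac_coplanar_lines(lines, iters=500, tol=0.01, rng=None):
    """Return indices of the lines supporting the best plane hypothesis."""
    rng = rng or np.random.default_rng()
    best = np.array([], dtype=int)
    pts = lines.reshape(-1, 3)                   # 2N endpoints
    for _ in range(iters):
        sample = pts[rng.choice(len(pts), 3, replace=False)]
        n, d = fit_plane(sample)
        dist = np.abs(lines @ n + d)             # Nx2 endpoint distances
        inl = np.where((dist < tol).all(axis=1))[0]  # both endpoints near plane
        if len(inl) > len(best):
            best = inl
    return best
```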

The current implementation includes the following differences compared to previous methods:

  • 1.

    The whole set of algorithms is implemented in a ROS node that includes capturing streamed video from a commercial drone (a minimal sketch follows this list). This allows experimental testing without the requirement of external image-processing pipelines. To our knowledge, there are no other line-based reconstruction pipelines completely integrated in ROS. It runs in real time up to approximately 800 × 600 resolution, depending on how many images are stored in the buffer for the reconstruction. The illumination conditions and the number of edges in the image are also crucial for the processing times.

  • 2.

    The complete sequence of algorithms runs without a human in the loop, and the process goes from the detection of lines to the output of the scene abstraction described by a set of 3D line segments. It can also run in real time if image resolutions are not too large. It must be stressed that several published works employ human assistance, mainly during the line matching stage [26], or need camera poses and an external point-feature-based SfM pipeline as input, precluding a real-time execution [8], [10].
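
As an illustration of the ROS integration in point 1, a minimal node of this kind could look as follows. The topic name and the line-detection hook are assumptions for illustration; the authors' actual node is not reproduced here.

```python
# Hedged sketch of a ROS (rospy) node buffering drone video frames for
# a line-based reconstruction. Topic name is a placeholder.
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

class LineSketchNode:
    def __init__(self):
        self.bridge = CvBridge()
        self.buffer = []                 # frames kept for the reconstruction
        # "/camera/image_raw" is hypothetical; a commercial drone driver
        # would advertise its own image topic.
        rospy.Subscriber("/camera/image_raw", Image, self.on_frame,
                         queue_size=1)

    def on_frame(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="mono8")
        self.buffer.append(frame)
        # ... detect and match lines over the buffered views here ...

if __name__ == "__main__":
    rospy.init_node("line_sketch")
    LineSketchNode()
    rospy.spin()
```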

The rest of the paper is structured as follows. A high-level overview of our method, along with the definitions and problems to solve, is presented in Section 2. Section 3 goes through the low-level details of the method, the explanation of the algorithms and their implications. In Section 4, the embedding of intersections of coplanar lines and their contribution to the improvement of the 3D abstraction is described. Experimental results are presented and discussed in Section 5. Finally, Section 6 presents the main conclusions of this work.

Section snippets

High level overview

The architecture of the line sketching model comprises three layers, as shown in Fig. 1. (i) The 2D layer extracts and unifies the observed projections of edges from the set of views of a scene, where in general there might be large camera translations and rotations among the different views. (ii) The 3D abstraction layer builds stereo subsystems for every possible pair of views, which are fed into a first SBA that outputs a 3D wireframe-based sketch and the camera poses, valid up to scale.

Low level description

The 3D line-based sketch {ϒ, Γ} is built from the knowledge of correspondences among observed lines l and the intrinsics of all the cameras. These observed lines are obtained from edge detection on scale-spaces, which is a novelty of the present method compared to other recently published ones [8], [10], [20]. This detection method yields the number of image scales at which each line was detected, an additional segment appearance attribute that is used in the matching algorithm. The
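
The scale-space detection described in this snippet can be approximated with a Gaussian pyramid and an off-the-shelf segment detector. The sketch below uses OpenCV's LSD as a stand-in (note that LSD is unavailable in some OpenCV builds for licensing reasons) and merely records detections per scale; the paper's detector/matcher [16] fuses them into a per-line scale count used as an appearance attribute.

```python
# Sketch of multi-scale line detection over a Gaussian pyramid.
import cv2

def multiscale_lines(img, n_scales=4):
    """img: 8-bit grayscale image. Returns per-scale segment lists,
    with endpoints mapped back to the original resolution."""
    lsd = cv2.createLineSegmentDetector()
    detections = []
    for s in range(n_scales):
        lines = lsd.detect(img)[0]           # Mx1x4 arrays (x1, y1, x2, y2)
        if lines is not None:
            # rescale endpoints back to the original image resolution
            detections.append(lines.reshape(-1, 4) * (2 ** s))
        img = cv2.pyrDown(img)               # next (coarser) scale
    # a matcher would now fuse near-identical segments across scales and
    # keep the per-line count of scales as an appearance attribute
    return detections
```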

Line intersections

Any line segment detection process is prone to misalignments, wrong segmentation, incorrect placement of endpoints, and also cases of unsuited merging of two curved edges resembling a real straight line. Likewise, some line endpoints are not well defined in an image, for instance when they correspond to the edges of shadows or to over-exposed photos. Even under these non-optimal conditions, edge segments might be correctly matched on several pairs of views. While this kind of inlier matchings would be welcome in a

Experimental results

The experimental section of this work aims to test the proposed line-based reconstruction method and compare it against other methods under the same conditions. Our fully automatic approach performs all the processes from image datasets or video frame capture, through line detection and matching, to camera pose estimation and 3D abstraction. This section aims to prove the following facts:

  • 1.

    The proposed method, if based exclusively on lines, returns the pose of the cameras and a

Conclusions

This paper presents a novel integration of a set of algorithms to create a line-based spatial sketch, showing the main structures of a man-made environment, and the camera poses. It receives as input the camera intrinsic parameters and at least 3 pictures. The set of methods includes novel observation relations of groups of straight segments that are captured from different camera poses. The method features the novelty of employing coplanar line intersections as feature points. The first

Acknowledgment

This work has received financial support from the Xunta de Galicia through grant ED431C 2017/69, the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016–2019), and the European Union (European Regional Development Fund - ERDF) through grant ED431G/08.

Roi Santos received his M.Sc. in Physics in 2010. His first job was at the New University of Lisbon, supported by a Leonardo da Vinci grant. He then moved to Indra (www.indracompany.com), where he worked for half a year. During 2012–2013 he carried out his next research work at the Von Karman Institute (Belgium). He started his Ph.D. at CiTIUS in 2014.

References (32)

  • Y. Furukawa et al., Accurate, dense, and robust multiview stereopsis, IEEE PAMI (2010).

  • M. Rothermel et al., SURE: photogrammetric surface reconstruction from imagery, Proceedings of the LC3D Workshop (2012).

  • C. Kim et al., Planar structures from line correspondences in a Manhattan world, Proceedings of ACCV (2014).

  • C. Raposo et al., Piecewise-planar StereoScan: structure and motion from plane primitives, Proceedings of ECCV (2014).

  • L. Zhang et al., Structure and motion from line correspondences: representation, projection, initialization and sparse bundle adjustment, J. Vis. Commun. Image Represent. (2014).

  • M. Hofer et al., Efficient 3D scene abstraction using line segments, CVIU (2017).

Xose M. Pardo is an associate professor of Software and Computer Systems at the University of Santiago de Compostela (Spain). He received a Ph.D. in Physics from this university in 1998, for his research on 3D medical image analysis. He was a postdoctoral research fellow at the Computer Vision Center of Barcelona (Spain) and at INRIA Sophia Antipolis (France) between 1998 and 2000. In recent years, his interest has shifted to biologically inspired computer vision, and includes visual saliency, object and scene recognition, human activity recognition and machine learning. At the moment, he is mostly working on projects related to robot vision, object and scene recognition, photogrammetry and visual inspection.

Xose R. Fdez-Vidal received the M.S. and Ph.D. degrees, both in Physics, from the University of Santiago de Compostela (Spain) in 1991 and 1996, respectively. He has been a member of the Applied Physics Department at the University of Santiago de Compostela since 1992, where he is currently an associate professor. In recent years, his interest has shifted to biologically inspired computer vision, and includes visual saliency and object and scene recognition.
