Computers & Graphics

Volume 27, Issue 2, April 2003, Pages 293-301

Technical section
Interactive reconstruction of virtual environments from video sequences

https://doi.org/10.1016/S0097-8493(02)00285-6

Abstract

There are many real-world applications of Virtual Reality requiring the construction of complex and accurate three-dimensional models that represent real environments. In this paper, we describe a rapid and robust semi-automatic system that allows such environments to be quickly and easily built from video sequences captured with standard consumer-level digital cameras. The system combines an automatic camera calibration algorithm with an interactive model-building phase, followed by automatic extraction and synthesis of surface textures from frames of the video sequence. The capabilities of the system are illustrated using a variety of example reconstructions.

Introduction

Building three-dimensional models that resemble real environments has always been a difficult problem. Generally, this requires modelling accurate three-dimensional geometry, as well as the surface materials or textures covering each surface. In addition to the appearance of the environment, modelling the behaviour of objects is also very important if a Virtual Environment (VE) is to allow any kind of user interaction. Typically, a scene hierarchy is constructed by specifying the relationships between objects in the scene; these relationships can then be used to assist the user when interacting with the environment.

Traditional methods of constructing models have involved a skilled user and a three-dimensional computer-aided design (CAD) program. Accurately modelling a real environment in such a way can only be done if the user has obtained maps or blueprints of the scene, or has access to the scene in order to take precise physical measurements. Either way, the process is slow and laborious for anything but the simplest of scenes. Obtaining surface texture information has also traditionally been very difficult, requiring manual warping and mapping of photographic images onto each surface.

In the field of computer vision, automatic techniques have recently been developed that allow three-dimensional information to be constructed directly from photographs of the scene [1], [2], [3], [4]. Typically, these algorithms analyse multiple images of an environment in order to infer the position and attributes of each camera, as well as the three-dimensional location of a dense set of points corresponding to important features in the images. In order to build more useful polygonal models, these points must be triangulated and subsequently segmented into separate objects. Similar triangulation and segmentation algorithms are required when expensive laser-range scanners are used to sample the scene geometry [5], [6].

Automatic reconstruction techniques are, however, not yet robust enough to build useful VEs, which must be simple enough to be rendered in real-time and must contain enough structure for a user to interact with important objects. Additionally, automatic algorithms can only reconstruct geometry that is explicitly visible in the images. This causes problems such as holes and gaps in the model in regions that are not visible in any image, but which may well become visible when the model is examined from different viewpoints.

In order to overcome problems such as these, semi-automatic approaches have been proposed that effectively combine user-guided segmentation of objects with automatic calculation of the position of those objects and the camera parameters [7], [8], [9], [10], [11], [12]. The benefit of using semi-automatic, rather than fully automatic, algorithms is that we can employ user knowledge when modelling the environment: the walls of a room may be identified by the user and modelled as single large polygons, thereby overcoming problems caused by object occlusion. An object hierarchy may also be easily maintained during the construction of the scene, and environments may be constructed in an incremental fashion, with large features specified at the start of the construction process and extra details added as necessary depending upon the envisaged use of the model.

The approach described in this paper falls between these automatic and semi-automatic categories. One significant disadvantage of current semi-automatic techniques is that they are limited to small numbers of input images due to the large amount of user interaction required to identify enough common features for camera calibration. In contrast, the main source of input data for our system is video sequences. Because of this, we can combine automated feature tracking [13], robust structure-from-motion [14] and self-calibration algorithms to automatically determine the camera parameters for each frame of the sequence. An overview of the calibration process is given in Fig. 1.
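
To make the tracking stage concrete, reference [13] is the Shi-Tomasi "good features to track" detector, and a pyramidal Lucas-Kanade tracker is its usual companion. The following Python sketch uses OpenCV's implementations to gather 2D feature trajectories from a video sequence; the file name and parameter values are placeholders, and this illustrates the general technique rather than the system's own tracker.

import cv2

cap = cv2.VideoCapture("scene.avi")          # hypothetical input video
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi corners: the "good features to track" of [13].
pts = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                              qualityLevel=0.01, minDistance=7)
tracks = [[tuple(p)] for p in pts.reshape(-1, 2)]

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade optical flow carries each corner forward.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
    alive = status.ravel() == 1
    # Drop features whose track was lost; extend the surviving tracks.
    tracks = [t + [tuple(p)] for t, p, a in
              zip(tracks, nxt.reshape(-1, 2), alive) if a]
    pts, prev = nxt[alive], gray

# 'tracks' now holds 2D feature trajectories across frames: the raw
# input to robust structure-from-motion and self-calibration [14].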

Once these calibration data have been obtained, we use semi-automatic techniques to build geometric descriptions of the objects in the scene. This is achieved by interactively manipulating the position, orientation and size of simple objects such as polygons, boxes and cylinders so that their projections into the frames of the video sequence match the projections of real objects. This manipulation is achieved using hierarchical parent–child constraints and image-based constraints that indicate the preferred projections of object vertices into frames of the sequence. We will describe a non-linear optimization algorithm that is capable of manipulating the position of these objects in real-time, so that the constraints specified by the user are best satisfied. Combining automatic camera calibration algorithms with interactive geometry reconstruction allows the user to spend more time modelling important features of the scene, rather than preparing the system for camera calibration.
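
As an illustration of this optimization step (a minimal sketch only: the box parameterization, solver and names below are assumptions, not the paper's formulation), consider recovering the centre and size of a single box so that its projected vertices match user-pinned image points, given calibrated 3x4 camera matrices:

import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    """Project homogeneous 3D points X (Nx4) with a 3x4 camera P."""
    x = (P @ X.T).T
    return x[:, :2] / x[:, 2:3]

def box_vertices(params):
    """Axis-aligned box from centre (tx, ty, tz) and size (sx, sy, sz)."""
    t, s = params[:3], params[3:]
    corners = np.array([[dx, dy, dz] for dx in (-.5, .5)
                        for dy in (-.5, .5) for dz in (-.5, .5)])
    return np.hstack([t + corners * s, np.ones((8, 1))])

def residuals(params, constraints):
    """constraints: (P, vertex_index, observed_xy) triples, i.e.
    'vertex i should project to pixel (u, v) in this frame'."""
    X = box_vertices(params)
    return np.concatenate([project(P, X[[i]])[0] - uv
                           for P, i, uv in constraints])

# Synthetic example: one camera at the origin looking along +z.
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
true_box = box_vertices(np.array([0., 0., 5., 1., 1., 1.]))
constraints = [(P, i, project(P, true_box[[i]])[0]) for i in range(8)]

# Solve for the box that best satisfies the image constraints.
fit = least_squares(residuals, x0=np.array([0., 0., 4., .5, .5, .5]),
                    args=(constraints,))
print(fit.x)   # converges to centre (0, 0, 5) and unit size

A real-time version would re-run such a solve as the user drags constraint points, which is plausible at interactive rates given the small parameter vectors involved.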

We are interested in applying these techniques and algorithms to real-world problems, where the ability to quickly construct an accurate VE corresponding to a real scene can be an extremely powerful tool. In particular, we are studying the problem of constructing VEs of crime scenes, using forensic photographs [15] and video images, and are cooperating with the Greater Manchester Police force (UK) and the UK's National Training Centre for Scientific Support to Crime Investigation (NTCSSCI) [16].

An earlier pilot project [17], [18] examined the feasibility of using VE reconstructions for police work. In the pilot project we undertook an entirely manual creation of a VE corresponding to a real crime scene, using architectural drawings and forensic photographs. The construction was extremely labour intensive; however, discussions with police officers, forensic scientists and trainers who had evaluated the pilot project demonstrated that such reconstructions can be of great benefit, offering new possibilities for analysis, training and briefing presentations. For such applications, the accuracy and fidelity of the VE are clearly very important, although different applications will require different degrees of fidelity. The techniques described in this paper are currently being evaluated with reference to this area, although the algorithms described here are general enough to build reconstructions for a variety of different applications. Due to confidentiality requirements, the example reconstructions we show at the end of the paper are not from real crime scenes. Instead, we show examples of the system in use for typical indoor and outdoor reconstructions.

The remainder of this paper is structured as follows: in the next section we describe our approach to automatic calibration of video sequences. Following that, Section 3 describes how these calibration data may be used in an interactive setting to reconstruct geometric descriptions of objects in the scene. Once geometry has been reconstructed, we describe our automatic algorithm for extracting surface texture data in Section 4. Finally, different examples of the system in use are given in Section 5, and the paper closes with a discussion of current and future work in Section 6.

Section snippets

Video sequence calibration

Before we can start to reconstruct a three-dimensional model of the scene, we need to calibrate the camera used to record the video sequence. This involves estimating values for lens distortion, the camera's intrinsic parameters such as focal length and principal point, as well as its extrinsic parameters such as position and orientation, for each frame of the sequence. This section provides an overview of the techniques used. Further details can be found in [14].

Geometric lens distortion must
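
Although the snippet above is truncated, the quantities being estimated fit the standard pinhole-plus-radial-distortion model. The exact parameterization used by the system is given in [14]; the following is only the common textbook form, in LaTeX:

\[
  \mathbf{x} \;\simeq\; K\,[\,R \mid \mathbf{t}\,]\,\mathbf{X},
  \qquad
  K = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix}
\]
\[
  \mathbf{x}_d = \mathbf{x}_u\,\bigl(1 + k_1 r^2 + k_2 r^4\bigr),
  \qquad r^2 = \lVert \mathbf{x}_u \rVert^2
\]

Here f is the focal length and (c_x, c_y) the principal point (the intrinsic parameters), (R, \mathbf{t}) are the per-frame orientation and position (the extrinsic parameters), and k_1, k_2 are radial distortion coefficients applied to the undistorted image point \mathbf{x}_u.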

Model reconstruction

Once the camera calibration data have been obtained for each frame of a video sequence using the algorithms described above, the process of interactive model reconstruction can begin. The user builds the model by interactively specifying the position, orientation and size of objects from a user-extensible library of shapes. As these primitives are created, a scene graph is maintained that describes the layout of the scene. The user manipulates these primitives in image space, attempting to
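
Although the full section is not reproduced here, the scene graph it refers to can be sketched with a minimal parent-child structure (hypothetical names and types, not the system's actual data structures): each primitive stores a transform relative to its parent, so that moving a parent carries its children with it.

import numpy as np

class SceneNode:
    """One primitive in the scene graph, e.g. a box or cylinder."""
    def __init__(self, name, local=None):
        self.name = name
        # 4x4 transform relative to the parent node.
        self.local = np.eye(4) if local is None else local
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def world(self, parent_world=None):
        """Yield (name, world transform) by composing down the tree."""
        w = (np.eye(4) if parent_world is None else parent_world) @ self.local
        yield self.name, w
        for c in self.children:
            yield from c.world(w)

def translate(x, y, z):
    m = np.eye(4)
    m[:3, 3] = [x, y, z]
    return m

# Moving the desk moves the monitor with it: the hierarchical
# parent-child constraint described in the introduction.
room = SceneNode("room")
desk = room.add(SceneNode("desk", translate(2.0, 0.0, 1.0)))
desk.add(SceneNode("monitor", translate(0.0, 0.75, 0.0)))
for name, w in room.world():
    print(name, w[:3, 3])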

Texture extraction

Once a geometric description of the environment is available, texture maps may be extracted from the video footage and mapped onto surfaces. In order to achieve this, a buffer is constructed for each frame and the primitives are scan-converted into each buffer, colour-coded with a unique identification number for each polygon. This allows us to quickly determine the region in each frame that contributes to the texture map for each surface.

For every polygon in the scene, a texture fragment is
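
The item-buffer idea described above can be sketched as follows (rasterization is stubbed out, and all names are illustrative): once polygon IDs have been rendered into a per-frame buffer, the frame pixels belonging to each polygon can be gathered as the raw material for that surface's texture map.

import numpy as np

def gather_fragments(frame, id_buffer, num_polygons):
    """frame: HxWx3 image; id_buffer: HxW integer array holding the
    ID of the polygon visible at each pixel (-1 where none is).
    Returns, per polygon, the pixel coordinates and colours that
    this frame contributes to the polygon's texture."""
    fragments = {}
    for pid in range(num_polygons):
        ys, xs = np.nonzero(id_buffer == pid)
        if len(xs):
            fragments[pid] = (xs, ys, frame[ys, xs])
    return fragments

# Example with a dummy 4x4 frame where polygon 0 covers the top half:
frame = np.random.randint(0, 255, (4, 4, 3), dtype=np.uint8)
ids = np.full((4, 4), -1)
ids[:2, :] = 0
print(gather_fragments(frame, ids, num_polygons=1)[0][2].shape)  # (8, 3)

In the full system, such per-frame fragments would then be warped into each surface's texture space and combined across frames.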

Results

Fig. 3, Fig. 4, Fig. 5 show some more complex examples of reconstructions performed using the techniques described in this paper. In each figure, two frames of the video sequence are shown on the left and in the middle, with a wireframe representation of the model overlaid. On the right-hand side is a texture-mapped rendering of the final reconstruction, shown from a novel viewpoint. Calibration and reconstruction times are given with each figure.

Fig. 5 was reconstructed from a video sequence

Conclusion and future work

In this paper, we have presented methods for reconstructing Virtual Environments from video sequences. We have briefly discussed the methods we use to automatically calibrate the camera for each frame of a sequence, as well as our interactive method for model reconstruction. We have also shown examples of reconstructions that have been built using these methods.

There are still several weaknesses of the approach that will be addressed in the future. Most importantly, reconstructing complex,

Acknowledgements

The authors would like to thank their colleagues in the Advanced Interfaces Group for helpful discussions. This research has been funded by the UK's Engineering and Physical Sciences Research Council (EPSRC), under Grant Number GR/M14531, “REVEAL: Reconstruction of Virtual Environments with Accurate Lighting”. We would also like to thank Detective Inspector David Heap and Pat Davis from Greater Manchester Police, and Nick Sawyer from the NTCSSCI for their continued support.

References (30)

  • Torr PH, Zisserman A. MLESAC: a new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding (2000)
  • Beardsley P, Torr P, Zisserman A. 3D model acquisition from extended image sequences. In: Buxton B, Cipolla R (Eds.),...
  • Fitzgibbon A, Zisserman A. Automatic camera recovery for closed or open image sequences. In: Burkhard H, Neumann B...
  • Hartley R. Euclidean reconstruction from uncalibrated views. Applications of Invariance in Computer Vision, Lecture...
  • Pollefeys M, Koch R, van Gool L. Structure and motion from image sequences. In: Kahmen G, editor. Proceedings of the...
  • Yu Y et al. Extracting objects from range and radiance images. IEEE Transactions on Visualization and Computer Graphics (2001)
  • 3rdTech, Deltasphere 3d scene digitizer,...
  • Becker S, Bove VM Jr. Semi-automatic 3-d model extraction from uncalibrated 2-d camera views. SPIE Symposium on...
  • Debevec P, Taylor C, Malik J. Modeling and rendering architecture from photographs: a hybrid geometry- and image-based...
  • Bougnoux S, Robert L. Totalcalib: a fast and reliable system for off-line calibration of image sequences. In:...
  • Cipolla R, Boyer E. 3d model acquisition from uncalibrated images. Proceedings of the IAPR Workshop on Machine Vision...
  • Poulin P, Ouimet M, Frasson M. Interactive modeling with photogrammetry. In: Proceedings of the Eurographics Workshop...
  • El-Hakim S. A practical approach to creating precise and detailed 3d models from single and multiple views. In:...
  • Shi J, Tomasi C. Good features to track. In: Proceedings of the IEEE Conference on Computer Vision and Pattern...
  • Gibson S, Cook J, Hubbold R, Howard T. Accurate camera calibration for off-line, video-based augmented reality. In: ACM...