A hierarchy of cameras for 3D photography

https://doi.org/10.1016/j.cviu.2004.03.013

Abstract

The view-independent visualization of 3D scenes is most often based on rendering accurate 3D models or on image-based rendering techniques. To compute the 3D structure of a scene from a moving vision sensor, or to use image-based rendering approaches, we need to estimate the motion of the sensor from the recorded image information with high accuracy, a problem that has been well studied. In this work, we investigate the relationship between camera design and our ability to perform accurate 3D photography by examining the influence of camera design on the estimation of the motion and structure of a scene from video data. By relating the differential structure of the time-varying plenoptic function to different known and new camera designs, we can establish a hierarchy of cameras based upon the stability and complexity of the computations necessary to estimate structure and motion. At the low end of this hierarchy is the standard planar pinhole camera, for which the structure from motion problem is non-linear and ill-posed. At the high end is a camera we call the full field of view polydioptric camera, for which the motion estimation problem can be solved independently of the depth of the scene, leading to fast and robust algorithms for 3D photography. In between are multiple-view cameras with a large field of view, which we have built, as well as omni-directional sensors.

Introduction

The concept of 3D photography and imaging has always been of great interest to humans. Early attempts to record and recreate images with depth were the stereoscopic drawings of Giovanni Battista della Porta around 1600 and the stereoscopic viewers devised by Wheatstone and Brewster in the 19th century. As described in [34], in the 1860s Francois Villème invented a process known as photo sculpture, which used 24 cameras to capture the notion of a 3D scene. Later, a 3D photography and imaging technique was invented by Lippmann in 1908 under the name of integral photography, in which the object was observed by a large number of small lenses arranged on a photographic sheet, resulting in many views of the object from different directions [35]. Today, modern electronic display techniques enable the observer to view objects from arbitrary viewpoints and explore virtual worlds freely. These worlds need to be populated with realistic renderings of real-life objects to give the observer the feel of truly spatial immersion. This need fuels the demand for accurate ways to recover the 3D shape and motion of real-world objects. In general, approaches to recover the structure of an object are based on either active or passive vision sensors, i.e., sensors that interact with their environment or sensors that merely observe without interference. The main examples of the former are laser range scanners [38] and structured-light stereo configurations, in which a pattern is projected onto the scene and the sensor uses the image of the projection on the structure to recover depth by triangulation [5], [12], [44]. For a recent overview of different approaches to active range sensing and available commercial systems see [6]. The category of passive approaches consists of stereo algorithms based on visual correspondence, where the cameras are separated by a large baseline [39], and structure from motion algorithms, on which we will concentrate [17], [24]. Since correspondence is a hard problem for widely separated views, we believe that the structure from motion paradigm offers the best approach to 3D photography [16], [37]: it interferes the least with the scene being imaged, and the recovery of the sensor motion enables us to integrate depth information from far-apart views for greater accuracy while taking advantage of the easier correspondence afforded by dense video. In addition, the estimation of the motion of the camera is a fundamental component of most image-based rendering algorithms.

There are many approaches to structure from motion (e.g., see [17], [24] for an overview), but essentially all of them disregard the fact that the way images are acquired already determines to a large degree how difficult it is to solve for the structure and motion of the scene. Since systems have to cope with limited resources, their cameras should be designed to optimize subsequent image processing.

The biological world gives a good example of task-specific eye design. It has been estimated that eyes have evolved no fewer than 40 times, independently, in diverse parts of the animal kingdom [13]. These eye designs, and therefore the images they capture, are highly adapted to the tasks the animal has to perform. The sophistication of these eyes suggests that we should not just focus our efforts on designing algorithms that optimally process a given visual input, but also optimize the design of the imaging sensor with regard to the task at hand, so that the subsequent processing of the visual information is facilitated. This focus on sensor design has already begun; we mention as one example the influential work on catadioptric cameras [28].

In [32] we presented a framework to relate the design of an imaging sensor to its usefulness for a given task. Such a framework allows us to evaluate and compare different camera designs in a scientific sense by using mathematical considerations.

To design a task-specific camera, we need to answer the following two questions:

  • (1) How is the relevant visual information that we need to extract to solve our task encoded in the visual data that a camera can capture?

  • (2) What is the camera design and image representation that optimally facilitates the extraction of the relevant information?

To answer the first question, we first have to think about what we mean by visual information. When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own, that is, perspective images acquired by camera-type eyes based on the pinhole principle. These images enable an easy interpretation of the visual information by a human observer. Therefore, most work on sensor design has focused on designing cameras that would result in pictures with higher fidelity (e.g., [18]). Image fidelity has a strong impact on the accuracy with which we can make quantitative measurements of the world, but the qualitative nature of the images we capture (e.g., single versus multiple viewpoint images) also has a major impact on the accuracy of measurements, an impact that cannot be captured by a display-based fidelity measure. Since nowadays most processing of visual information is done by machines, there is no need to confine ourselves to the usual perspective images. Instead we propose to study how the relevant information is encoded in the geometry of the time-varying space of light rays, which allows us to determine how well we can perform a task given any set of light ray measurements.

To answer the second question we have to determine how well a given eye can capture the necessary information. We can interpret this as an approximation problem where we need to assess how well the relevant subset of the space of light rays can be reconstructed based on the samples captured by the eye, our knowledge of the transfer function of the optical apparatus, and our choice of function space to represent the image. By modeling eyes as spatio-temporal sampling patterns in the space of light rays we can use well-developed tools from signal processing and approximation theory to evaluate the suitability of a given eye design for the proposed task and determine the optimal design. The answers to these two questions then allow us to define a fitness function for different camera designs with regard to a given task.
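
As an illustration of this evaluation idea, the following toy sketch (not the paper's implementation) models an eye as a sampling pattern over a one-dimensional slice of the space of light rays and scores the design by the error of reconstructing a densely evaluated reference signal from its samples; the function names and the test signal are illustrative assumptions.

import numpy as np

def reconstruction_error(sample_positions, signal, domain, dense_samples=2000):
    """Score a sampling pattern by how well linear interpolation of its
    samples reproduces a densely evaluated reference signal (RMS error)."""
    dense_x = np.linspace(domain[0], domain[1], dense_samples)
    reference = signal(dense_x)
    samples = signal(np.asarray(sample_positions))
    reconstruction = np.interp(dense_x, sample_positions, samples)
    return np.sqrt(np.mean((reconstruction - reference) ** 2))

# Example: a coarse and a fine sampling of the same brightness slice.
brightness = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.sin(10 * np.pi * x)
print(reconstruction_error(np.linspace(0.0, 1.0, 8), brightness, (0.0, 1.0)))   # coarse eye: larger error
print(reconstruction_error(np.linspace(0.0, 1.0, 64), brightness, (0.0, 1.0)))  # fine eye: smaller error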

In this work, we will extend our study of the structure of the time-varying plenoptic function captured by a rigidly moving imaging sensor to analyze how the ability of a sensor to estimate its own rigid motion is related to its design, what effect this has on triangulation accuracy, and thus on the quality of the shape models that can be captured.

Plenoptic video geometry: how is 3D motion information encoded in the space of light rays?

The space of light rays is determined by the geometry and motion of the objects in space, their surface reflection properties, and the light sources in the scene. The most general representation for the space of light rays is the plenoptic parameterization. At each location $x \in \mathbb{R}^3$ in free space, the radiance, that is, the light intensity or color observed at $x$ from a given direction $r \in S^2$ at time $t \in \mathbb{R}^+$, is measured by the plenoptic function $L(x; r; t)$, $L: \mathbb{R}^3 \times S^2 \times \mathbb{R}^+ \to \Gamma$. $\Gamma$ denotes here the spectral energy,
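
As a minimal illustration, a sampled plenoptic measurement can be represented by its viewpoint, viewing direction, time stamp, and observed radiance. The sketch below is only an illustrative data structure; the container and field names are assumptions, not part of the paper.

import numpy as np
from dataclasses import dataclass

@dataclass
class PlenopticSample:
    x: np.ndarray         # viewpoint in R^3
    r: np.ndarray         # unit viewing direction in S^2
    t: float              # acquisition time in R^+
    radiance: np.ndarray  # observed spectral energy / color value in Gamma

def make_sample(x, r, t, radiance):
    """Build a sample, normalizing the direction so that r lies on S^2."""
    r = np.asarray(r, dtype=float)
    return PlenopticSample(np.asarray(x, dtype=float), r / np.linalg.norm(r),
                           float(t), np.asarray(radiance, dtype=float))

# A pinhole camera contributes rays through a single viewpoint x, whereas a
# polydioptric camera contributes rays with many distinct viewpoints.
sample = make_sample(x=[0.0, 0.0, 0.0], r=[0.0, 0.0, 1.0], t=0.0, radiance=[0.8, 0.6, 0.4])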

Ray incidence constraint

The ray incidence constraint is defined in terms of a scene point $P$ and a set of rays $l_i := (x_i, r_i)$. The incidence relation between the scene point $P \in \mathbb{R}^3$ and the rays $l_i$, defined by their origins $x_i \in \mathbb{R}^3$ and directions $r_i \in S^2$, can be written as $[r_i]_\times x_i = [r_i]_\times P$ for all $i$, where $[r]_\times$ denotes the skew-symmetric matrix such that $[r]_\times x = r \times x$. The rays that satisfy this relation form a 3D line pencil in the space of light rays.
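
A minimal sketch of how this constraint can be used: stacking the linear equations $[r_i]_\times x_i = [r_i]_\times P$ over all rays and solving in the least-squares sense triangulates the scene point $P$. The function names are illustrative assumptions.

import numpy as np

def skew(r):
    """Return [r]_x, the skew-symmetric matrix with [r]_x v = r x v."""
    return np.array([[0.0, -r[2], r[1]],
                     [r[2], 0.0, -r[0]],
                     [-r[1], r[0], 0.0]])

def triangulate(origins, directions):
    """Recover the scene point P that best satisfies [r_i]_x x_i = [r_i]_x P
    for all rays, in the least-squares sense."""
    A = np.vstack([skew(r) for r in directions])                        # stack [r_i]_x
    b = np.concatenate([skew(r) @ x for r, x in zip(directions, origins)])
    P, *_ = np.linalg.lstsq(A, b, rcond=None)
    return P

# Two rays that both pass through the point (1, 1, 5).
P = triangulate(origins=[np.zeros(3), np.array([2.0, 0.0, 0.0])],
                directions=[np.array([1.0, 1.0, 5.0]) / np.sqrt(27.0),
                            np.array([-1.0, 1.0, 5.0]) / np.sqrt(27.0)])
print(P)  # approximately [1, 1, 5]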

The geometric incidence relations for light rays lead to extensions of the familiar multi-view

Ray identity constraint

In a static world, where the albedo of every scene point does not change over time, the brightness structure of the space of light rays is time-invariant. Thus, if a camera moves rigidly and captures two overlapping sets of light rays at two different time instants, a subset of these rays should match exactly, which allows us to recover the rigid motion from the light ray correspondences. Note that this is a true brightness constancy constraint, because we compare each light ray to itself.
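
As a small geometric sketch of this idea (with an assumed convention for the direction of the rigid transform), one can test whether a ray captured at the second time instant, mapped by a candidate rigid motion (R, T), coincides with a ray captured at the first one:

import numpy as np

def rays_identical(x1, r1, x2, r2, R, T, tol=1e-6):
    """True if the ray (x2, r2), mapped by the rigid motion (R, T), and the
    ray (x1, r1) describe the same line of sight in space."""
    x2m, r2m = R @ x2 + T, R @ r2
    same_direction = np.linalg.norm(np.cross(r1, r2m)) < tol
    # The mapped origin must also lie on the line through x1 with direction r1.
    on_same_line = np.linalg.norm(np.cross(r1, x2m - x1)) < tol
    return same_direction and on_same_line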

Feature computation in the space of light rays

To utilize the constraints described above, we need to define the notion of correspondence in mathematical terms. In the case of the ray identity constraint, we have to evaluate whether two sets of light rays are identical. If the scene is static, we can use as our matching criterion the difference between the sets of light rays that are aligned according to the current motion estimate. This criterion is integrated over all rays, and we expect that at the correct solution we will have a minimum. As
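
A hedged sketch of such a matching criterion, assuming a rotation-vector/translation parameterization of the motion and an interpolator lightfield_t1 for the brightness of the rays captured at the second time instant (both assumptions for illustration, not the paper's implementation):

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def matching_cost(params, rays_t, lightfield_t1):
    """Sum of squared brightness differences between the rays captured at
    time t and the rays they map to at time t+1 under the candidate motion."""
    w, T = params[:3], params[3:]
    R = Rotation.from_rotvec(w).as_matrix()
    cost = 0.0
    for x, r, brightness in rays_t:
        cost += (brightness - lightfield_t1(R @ x + T, R @ r)) ** 2
    return cost

def estimate_motion(rays_t, lightfield_t1):
    """Search over the six rigid-motion parameters for the cost minimum."""
    result = minimize(matching_cost, x0=np.zeros(6), args=(rays_t, lightfield_t1))
    return result.x[:3], result.x[3:]  # (rotation vector, translation)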

Hierarchy of cameras for 3D photography

Another important criterion for the sensitivity of the motion estimation problem is the size of the field of view (FOV) of the camera system. The basic understanding of these difficulties has attracted few investigators over the years [10], [11], [20], [21], [27], [36]. These difficulties are rooted in the geometry of the problem, and they exist for small and large baselines between the views, that is, for the case of continuous motion as well as for the case of discrete displacements

Sensitivity of motion and depth estimation using perturbation analysis

To assess the performance of different camera designs, we have to make sure that the algorithms we use to estimate the motion and shape are comparable. In this work we restrict our analysis to the case of instantaneous motion of the camera. We compare a number of standard algorithms for ego-motion estimation, as described in [43], to solving a linear system based on the plenoptic motion flow equation (Eq. (7)). This linear system relates the plenoptic motion flow to the rigid motion
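
Although Eq. (7) is not reproduced in this excerpt, the structure of such a linear system can be sketched as follows: under plenoptic brightness constancy, each ray (x, r) with measured plenoptic derivatives g_x = ∂L/∂x, g_r = ∂L/∂r, and L_t = ∂L/∂t contributes one equation that is linear in the rigid motion (v, ω) and independent of scene depth; stacking the equations and solving by least squares yields the motion. The sign conventions and the exact constraint form below are assumptions, not necessarily the paper's Eq. (7).

import numpy as np

def estimate_rigid_motion(rays):
    """rays: iterable of (x, r, g_x, g_r, L_t) tuples, one per measured light
    ray.  Returns (v, w), the translational and rotational velocity estimates."""
    A, b = [], []
    for x, r, g_x, g_r, L_t in rays:
        # g_x . v + w . (x x g_x + r x g_r) + L_t = 0   (one row per ray)
        A.append(np.concatenate([g_x, np.cross(x, g_x) + np.cross(r, g_r)]))
        b.append(-L_t)
    solution, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return solution[:3], solution[3:]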

Experimental results

To assess the performance of different camera models with regard to motion estimation, we compare a number of standard algorithms for ego-motion estimation, as described in [43], against a multi-camera stereo system. We used Jepson and Heeger's linear subspace algorithm and Kanatani's normalized minimization of the epipolar constraint. We assume similar error distributions for the optical flow and the disparities, both varying from 0 to 0.04 radians (0–2 degrees) in angular error. This
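
One simple way to realize such an angular-error model (a sketch under assumed conventions, not the paper's exact protocol) is to rotate each measured unit direction by a random angle drawn between 0 and 0.04 radians about a random axis:

import numpy as np
from scipy.spatial.transform import Rotation

def perturb_directions(directions, max_angle_rad=0.04, rng=None):
    """Apply a random angular error, uniform in [0, max_angle_rad], to each
    unit direction; used to simulate noisy flow/disparity measurements."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = []
    for d in directions:
        axis = rng.normal(size=3)
        axis /= np.linalg.norm(axis)
        angle = rng.uniform(0.0, max_angle_rad)
        noisy.append(Rotation.from_rotvec(angle * axis).apply(d))
    return np.asarray(noisy)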

Conclusion

According to ancient Greek mythology, Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, alone defeated a whole army of Cyclopes, one-eyed giants. Inspired by the mythological power of many eyes, we proposed in this paper a mathematical framework for the design of cameras used for 3D photography. Based on this framework we developed hierarchies of cameras for 3D motion and 3D shape estimation. We analyzed the structure of the space of light rays, and found that large field of view

Acknowledgment

The support through the National Science Foundation Award 0086075 is gratefully acknowledged.

References (44)

  • S.J. Maybank, Algorithm for analysing optical flow based on the least-squares method, Image Vision Comput. (1986)
  • P. Baker, R. Pless, C. Fermuller, Y. Aloimonos, A spherical eye from multiple cameras (makes better models of the...
  • P. Baker, Y. Aloimonos, Structure from motion of parallel lines, in: Proc. European Conf. Computer Vision, vol. 4,...
  • P. Baker, R. Pless, C. Fermuller, Y. Aloimonos, Camera networks for building shape models from video, in: Workshop on...
  • G. Baratoff, Y. Aloimonos, Changes in surface convexity and topology caused by distortions of stereoscopic visual...
  • P.J. Besl, Active optical range imaging sensors, Mach. Vis. Appl. (1988)
  • F. Blais, A review of 20 years of range sensor development, in: Videometrics VII, Proceedings of SPIE-IST Electronic...
  • R.C. Bolles et al., Epipolar-plane image analysis: an approach to determining structure from motion, Internat. J. Comput. Vision (1987)
  • J. Chai, X. Tong, H. Shum, Plenoptic sampling, in: Proc. of ACM SIGGRAPH, 2000, pp....
  • A.R. Chowdhury et al., Stochastic approximation and rate-distortion analysis for robust structure and motion estimation, Internat. J. Comput. Vision (2003)
  • K. Daniilidis, On the Error Sensitivity in the Recovery of Object Descriptions, PhD thesis, Department of Informatics,...
  • K. Daniilidis et al., Visual navigation: from biological systems to unmanned ground vehicles
  • J. Davis, R. Ramamoorthi, S. Rusinkiewicz, Spacetime stereo: a unifying framework for depth from triangulation, in:...
  • R. Dawkins, Climbing Mount Improbable (1996)
  • A. Edelman et al., The geometry of algorithms with orthogonality constraints, SIAM J. Matrix Anal. Appl. (1998)
  • C. Fermüller et al., Observability of 3D motion, Internat. J. Comput. Vision (2000)
  • P. Fua, Regularized bundle-adjustment to model heads from image sequences without calibration data, Internat. J. Comput. Vision (2000)
  • R. Hartley et al., Multiple View Geometry in Computer Vision (2000)
  • F. Huck et al., Visual Communication (1997)
  • A.D. Jepson, D.J. Heeger, Subspace methods for recovering rigid motion II: Theory, Technical Report RBCV-TR-90-36,...
  • J. Kosecka, Y. Ma, S.S. Sastry, Optimization criteria, sensitivity and robustness of motion and structure estimation,...