A hierarchy of cameras for 3D photography
Introduction
The concept of 3D photography and imaging has always been of great interest to humans. Early attempts to record and recreate images with depth include the stereoscopic drawings of Giovanni Battista della Porta around 1600 and the stereoscopic viewers devised by Wheatstone and Brewster in the 19th century. As described in [34], in the 1860s François Willème invented a process known as photo sculpture, which used 24 cameras to capture the notion of a 3D scene. In 1908 Lippmann invented a 3D photography and imaging technique known as integral photography, in which the object is observed by a large number of small lenses arranged on a photographic sheet, resulting in many views of the object from different directions [35].
Today, modern electronic display techniques enable the observer to view objects from arbitrary viewpoints and explore virtual worlds freely. These worlds need to be populated with realistic renderings of real-life objects to give the observer the feel of truly spatial immersion. This need fuels the demand for accurate ways to recover the 3D shape and motion of real-world objects. In general, approaches to recovering the structure of an object are based on either active or passive vision sensors, i.e., sensors that interact with their environment or sensors that merely observe without interference. The main examples of the former are laser range scanners [38] and structured-light stereo configurations, in which a pattern is projected onto the scene and the sensor uses the image of the projection on the structure to recover depth by triangulation [5], [12], [44]. For a recent overview of different approaches to active range sensing and available commercial systems see [6]. The category of passive approaches consists of stereo algorithms based on visual correspondence, where the cameras are separated by a large baseline [39], and structure from motion algorithms, on which we will concentrate [17], [24].
Since correspondence is a hard problem for widely separated views, we believe that the structure from motion paradigm offers the best approach to 3D photography [16], [37]: it interferes the least with the scene being imaged, and recovering the sensor motion lets us integrate depth information from far-apart views for greater accuracy while benefiting from the easier correspondence afforded by dense video. In addition, estimating the motion of the camera is a fundamental component of most image-based rendering algorithms.
There are many approaches to structure from motion (e.g., see [17], [24] for an overview), but essentially all of them disregard the fact that the way images are acquired already determines, to a large degree, how difficult it is to solve for the structure and motion of the scene. Since systems have to cope with limited resources, their cameras should be designed to optimize subsequent image processing.
The biological world gives a good example of task-specific eye design. It has been estimated that eyes have evolved no fewer than 40 times, independently, in diverse parts of the animal kingdom [13]. These eye designs, and therefore the images they capture, are highly adapted to the tasks the animal has to perform. The sophistication of these eyes suggests that we should not just focus our efforts on designing algorithms that optimally process a given visual input, but also optimize the design of the imaging sensor with regard to the task at hand, so that the subsequent processing of the visual information is facilitated. This focus on sensor design has already begun; as just one example, we mention the influential work on catadioptric cameras [28].
In [32] we presented a framework to relate the design of an imaging sensor to its usefulness for a given task. Such a framework allows us to evaluate and compare different camera designs in a scientific sense by using mathematical considerations.
To design a task specific camera, we need to answer the following two questions:
- (1) How is the relevant visual information that we need to extract to solve our task encoded in the visual data that a camera can capture?
- (2) What is the camera design and image representation that optimally facilitates the extraction of the relevant information?
To answer the first question, we first have to think about what we mean by visual information. When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own, that is, perspective images acquired by camera-type eyes based on the pinhole principle. These images enable an easy interpretation of the visual information by a human observer. Therefore, most work on sensor design has focused on designing cameras that would result in pictures with higher fidelity (e.g., [18]). Image fidelity has a strong impact on the accuracy with which we can make quantitative measurements of the world, but the qualitative nature of the captured image (e.g., single- versus multiple-viewpoint images) also has a major impact on measurement accuracy, an impact that a display-based fidelity measure cannot capture. Since nowadays most processing of visual information is done by machines, there is no need to confine oneself to the usual perspective images. Instead we propose to study how the relevant information is encoded in the geometry of the time-varying space of light rays, which allows us to determine how well we can perform a task given any set of light ray measurements.
To answer the second question we have to determine how well a given eye can capture the necessary information. We can interpret this as an approximation problem where we need to assess how well the relevant subset of the space of light rays can be reconstructed based on the samples captured by the eye, our knowledge of the transfer function of the optical apparatus, and our choice of function space to represent the image. By modeling eyes as spatio-temporal sampling patterns in the space of light rays we can use well-developed tools from signal processing and approximation theory to evaluate the suitability of a given eye design for the proposed task and determine the optimal design. The answers to these two questions then allow us to define a fitness function for different camera designs with regard to a given task.
In this work, we will extend our study of the structure of the time-varying plenoptic function captured by a rigidly moving imaging sensor to analyze how a sensor's ability to estimate its own rigid motion is related to its design, and what effect this has on triangulation accuracy, and thus on the quality of the shape models that can be captured.
Plenoptic video geometry: how is 3D motion information encoded in the space of light rays?
The space of light rays is determined by the geometry and motion of the objects in space, their surface reflection properties, and the light sources in the scene. The most general representation for the space of light rays is the plenoptic parameterization. At each location x in free space, the radiance, that is, the light intensity or color observed at x from a given direction r at time t, is measured by the plenoptic function L(x, r, t; Γ). Γ denotes here the spectral energy,
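Concretely, the plenoptic parameterization can be thought of as a map from a ray, given by an origin x and a unit direction r, to a radiance value. A minimal stdlib-only sketch for a static toy scene (a single textured plane; the scene, albedo pattern, and all names here are illustrative choices, not from the paper):

```python
import math

def plenoptic(x, r):
    """Radiance seen from position x along unit direction r in a static toy
    scene: a single Lambertian plane z = 0 with a sinusoidal albedo pattern.
    A stand-in for the plenoptic function L(x, r, t; Gamma) with t fixed."""
    if abs(r[2]) < 1e-12:
        return 0.0          # ray parallel to the plane: sees background
    s = -x[2] / r[2]        # ray parameter of the intersection with z = 0
    if s <= 0:
        return 0.0          # plane is behind the viewer
    px = x[0] + s * r[0]
    py = x[1] + s * r[1]
    return 0.5 + 0.5 * math.sin(px) * math.cos(py)

# The same physical light ray yields the same radiance when sampled from
# any point along it -- radiance is a function of the ray, not the viewer:
x1, r = (0.0, 0.0, 2.0), (0.3, 0.1, -1.0)
n = math.sqrt(sum(c * c for c in r))
r = tuple(c / n for c in r)                         # normalize direction
x2 = tuple(x1[i] + 0.5 * r[i] for i in range(3))    # slide along the ray
print(plenoptic(x1, r), plenoptic(x2, r))           # identical values
```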
Ray incidence constraint
The ray incidence constraint is defined in terms of a scene point P and a set of rays li := (xi, ri). The incidence relation between the scene point P and the rays li, defined by their origins xi and directions ri, can be written as
[ri]× (P − xi) = 0,
where [r]× denotes the skew-symmetric matrix such that [r]× x = r × x. The rays that satisfy this relation form a 3D line pencil in the space of light rays.
The geometric incidence relations for light rays lead to extensions of the familiar multi-view
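The incidence relation is straightforward to check numerically: since [r]× x = r × x, a ray with origin x and direction r passes through a scene point P exactly when r × (P − x) = 0. A small pure-Python sketch (helper names are our own):

```python
def cross(a, b):
    """Cross product of two 3-vectors, i.e., [a]x applied to b."""
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def incidence_residual(P, x, r):
    """Residual of the ray incidence constraint [r]x (P - x) = 0.
    It is the zero vector iff the ray (origin x, direction r) meets P."""
    return cross(r, sub(P, x))

P = (1.0, 2.0, 3.0)
# Two rays constructed to pass through P, and one that misses it:
print(incidence_residual(P, (0.0, 0.0, 0.0), (1.0, 2.0, 3.0)))  # zero vector
print(incidence_residual(P, (1.0, 0.0, 0.0), (0.0, 2.0, 3.0)))  # zero vector
print(incidence_residual(P, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))  # nonzero
```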
Ray identity constraint
In a static world, where the albedo of every scene point does not change over time, the brightness structure of the space of light rays is time-invariant. Thus, if a camera moves rigidly and captures two overlapping sets of light rays at two different time instants, a subset of these rays should match exactly, which allows us to recover the rigid motion from the light ray correspondences. Note that this is a true brightness constancy constraint, because we compare each light ray to itself.
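The identity constraint can be made concrete as follows. If the camera pose at the second time instant is a rotation R followed by a translation T, a ray measured in the camera frame with origin x and direction r corresponds to the world ray (R x + T, R r); in a static world this must coincide, as a line with a direction, with a ray captured earlier. A sketch of that coincidence test (rotation restricted to the z-axis for brevity; all names are illustrative):

```python
import math

def rotz(theta, v):
    """Rotate a 3-vector about the z-axis by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c*v[0] - s*v[1], s*v[0] + c*v[1], v[2])

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def same_ray(x1, r1, x2, r2, tol=1e-9):
    """Two rays coincide iff their directions are parallel and the offset
    between their origins lies along that common direction."""
    offset = tuple(b - a for a, b in zip(x1, x2))
    return (max(abs(c) for c in cross(r1, r2)) < tol and
            max(abs(c) for c in cross(r1, offset)) < tol)

# Camera pose at time 2: rotation about z by theta, then translation T.
theta, T = 0.3, (0.2, -0.1, 0.05)

# A ray measured in the camera frame at time 2 ...
xc, rc = (0.1, 0.0, 0.0), (0.0, 0.0, 1.0)
# ... mapped back into the world (= time-1) frame:
xw = tuple(a + b for a, b in zip(rotz(theta, xc), T))
rw = rotz(theta, rc)

# In a static world this is the very same light ray the camera saw at
# time 1 (here we construct that earlier ray directly on the same line):
x1 = tuple(xw[i] + 2.0 * rw[i] for i in range(3))
print(same_ray(x1, rw, xw, rw))  # the identity constraint holds
```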
Feature computation in the space of light rays
To utilize the constraints described above we need to define the notion of correspondence in mathematical terms. In the case of the ray identity constraint, we have to evaluate whether two sets of light rays are identical. If the scene is static we can use the difference between the sets of light rays, aligned according to the current motion estimate, as our matching criterion. This criterion is integrated over all rays, and we expect it to attain a minimum at the correct solution. As
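The matching criterion described above, i.e., integrating the difference between the two ray sets aligned by the current motion estimate, can be illustrated on a toy example: a camera translating along x in front of a textured plane. The cost vanishes at the true translation and grows away from it (the scene, pattern, and one-parameter motion are our own toy choices, not the paper's setup):

```python
import math

def plenoptic(x, r):
    """Toy plenoptic function: radiance from a textured plane z = 0.
    All viewing directions below are chosen to intersect the plane."""
    s = -x[2] / r[2]
    px, py = x[0] + s*r[0], x[1] + s*r[1]
    return 0.5 + 0.5 * math.sin(3.0*px) * math.cos(2.0*py)

# Bundle of viewing directions (a coarse "camera"), all looking down.
dirs = [(0.1*i, 0.1*j, -1.0) for i in range(-3, 4) for j in range(-3, 4)]

p1 = (0.0, 0.0, 2.0)
true_t = 0.5                               # camera slides along x
p2 = (true_t, 0.0, 2.0)
I2 = [plenoptic(p2, r) for r in dirs]      # rays captured after the move

def cost(t):
    """Sum of squared differences between the second ray set and the
    first one re-aligned under a candidate translation t."""
    return sum((plenoptic((t, 0.0, 2.0), r) - i2) ** 2
               for r, i2 in zip(dirs, I2))

# The matching criterion is minimal at the true motion:
print(cost(true_t), cost(0.0), cost(1.0))
```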
Hierarchy of cameras for 3D photography
Another important criterion for the sensitivity of the motion estimation problem is the size of the field of view (FOV) of the camera system. The basic understanding of these difficulties has attracted a few investigators over the years [10], [11], [20], [21], [27], [36]. These difficulties stem from the geometry of the problem and exist for small as well as large baselines between the views, that is, for the case of continuous motion as well as for the case of discrete displacements
Sensitivity of motion and depth estimation using perturbation analysis
To assess the performance of different camera designs we have to make sure that the algorithms we use to estimate motion and shape are comparable. In this work we will restrict our analysis to the case of instantaneous camera motion. We will compare a number of standard ego-motion estimation algorithms, as described in [43], with solving a linear system based on the plenoptic motion flow equation (Eq. (7)). This linear system relates the plenoptic motion flow to the rigid motion
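The exact coefficients of the plenoptic motion flow equation (Eq. (7)) are not reproduced in this snippet, but the computational pattern is the same regardless: stacking one linear constraint per measured ray gives an overdetermined system A m ≈ b in the six rigid-motion parameters m = (v, ω), solved in the least-squares sense. A stdlib-only sketch of that solve via the normal equations, with synthetic stand-in coefficients rather than the actual Eq. (7) terms:

```python
import random

def solve(M, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(M)
    A = [row[:] + [y[i]] for i, row in enumerate(M)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= f * A[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (A[k][n] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]
    return x

random.seed(0)
m_true = [0.1, -0.2, 0.3, 0.01, 0.02, -0.03]   # six rigid-motion parameters (v, omega)
# One synthetic linear constraint per "ray" (stand-ins for Eq. (7) rows):
A = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(200)]
b = [sum(a * m for a, m in zip(row, m_true)) for row in A]

# Least squares via the normal equations: (A^T A) m = A^T b
AtA = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(6)]
       for i in range(6)]
Atb = [sum(A[k][i] * b[k] for k in range(len(A))) for i in range(6)]
m_est = solve(AtA, Atb)
print(m_est)  # recovers m_true up to round-off (noiseless constraints)
```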
Experimental results
To assess the performance of different camera models with regard to motion estimation, we compare a number of standard algorithms for ego-motion estimation, as described in [43], against a multi-camera stereo system. We used Jepson and Heeger’s linear subspace algorithm and Kanatani’s normalized minimization of the epipolar constraint. We assume similar error distributions for the optical flow and the disparities, both varying from 0 to 0.04 radians (0–2 degrees) in angular error. This
Conclusion
According to ancient Greek mythology Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, alone defeated a whole army of Cyclopes, one-eyed giants. Inspired by the mythological power of many eyes, we proposed in this paper a mathematical framework for the design of cameras used for 3D photography. Based on this framework we developed hierarchies of cameras for 3D motion and 3D shape estimation. We analyzed the structure of the space of light rays and found that large field of view
Acknowledgment
The support through the National Science Foundation Award 0086075 is gratefully acknowledged.
References (44)
- Algorithm for analysing optical flow based on the least-squares method, Image Vision Comput. (1986)
- P. Baker, R. Pless, C. Fermuller, Y. Aloimonos, A spherical eye from multiple cameras (makes better models of the...
- P. Baker, Y. Aloimonos, Structure from motion of parallel lines, in: Proc. European Conf. Computer Vision, vol. 4,...
- P. Baker, R. Pless, C. Fermuller, Y. Aloimonos, Camera networks for building shape models from video, in: Workshop on...
- G. Baratoff, Y. Aloimonos, Changes in surface convexity and topology caused by distortions of stereoscopic visual...
- Active optical range imaging sensors, Mach. Vis. Appl. (1988)
- F. Blais, A review of 20 years of range sensor development, in: Videometrics VII, Proceedings of SPIE-IST Electronic...
- et al., Epipolar-plane image analysis: an approach to determining structure from motion, Internat. J. Comput. Vision (1987)
- J. Chai, X. Tong, H. Shum, Plenoptic sampling, in: Proc. of ACM SIGGRAPH, 2000, pp....
- et al., Stochastic approximation and rate-distortion analysis for robust structure and motion estimation, Internat. J. Comput. Vision (2003)
- Visual navigation: from biological systems to unmanned ground vehicles
- Climbing Mount Improbable
- The geometry of algorithms with orthogonality constraints, SIAM J. Matrix Anal. Appl.
- Observability of 3D motion, Internat. J. Comput. Vision
- Regularized bundle-adjustment to model heads from image sequences without calibration data, Internat. J. Comput. Vision
- Multiple View Geometry in Computer Vision
- Visual Communication