Tracking objects with generic calibrated sensors: An algorithm based on color and 3D shape features

https://doi.org/10.1016/j.robot.2010.02.010

Abstract

We present a color and shape based 3D tracking system suited to a large class of vision sensors. The method is applicable, in principle, to any known calibrated projection model. The tracking architecture is based on particle filtering methods where each particle represents the 3D state of the object, rather than its state in the image, therefore overcoming the nonlinearity caused by the projection model. This allows the use of realistic 3D motion models and easy incorporation of self-motion measurements. All nonlinearities are concentrated in the observation model so that each particle projects a few tens of special points onto the image, on (and around) the 3D object’s surface. The likelihood of each state is then evaluated by comparing the color distributions inside and outside the object’s occluding contour. Since only pixel access operations are required, the method does not require the use of image processing routines like edge/feature extraction, color segmentation or 3D reconstruction, which can be sensitive to motion blur and optical distortions typical in applications of omnidirectional sensors to robotics. We show tracking applications considering different objects (balls, boxes), several projection models (catadioptric, dioptric, perspective) and several challenging scenarios (clutter, occlusion, illumination changes, motion and optical blur). We compare our methodology against a state-of-the-art alternative, both in realistic tracking sequences and with ground truth generated data.

Introduction

Omnidirectional and wide-angle vision sensors have been widely used in the last decade for robotics and surveillance systems. These sensors gather information from a large portion of the surrounding space, thus reducing the number of cameras required to cover a certain spatial extent. Their classical applications include mobile robot self-localization and navigation [1], [2], video surveillance [3] and humanoid foveal vision [4]. One drawback is that images suffer strong distortions and perspective effects, demanding non-standard algorithms for target detection and tracking.

In scenarios where the shape of objects can be modeled accurately a priori, 3D model based techniques are among the most successful in tracking and pose estimation applications [5]. However, classical 3D model based tracking methods are strongly dependent on the projection models and, thus, are not easily applicable to omnidirectional images. Often nonlinear optimization methods are employed: a cost function expressing the mismatch between the predicted and observed object points is locally minimized with respect to the object’s pose parameters [5]. This process involves the linearization of the relation between state and measurements, which can be very complex with omnidirectional sensor geometries. These approaches have limited convergence basins, requiring either small target motions or very precise prediction models.

In this paper we try to overcome these problems by addressing the pose estimation and tracking problem in a Monte Carlo sampling framework [6], through the use of a Particle Filter (PF).

Particle Filters (PF) have become a popular tool in the image processing and computer vision communities as a way to infer time varying properties of a scene from a set of input images. A PF computes a sampled representation of the posterior probability distribution over the scene properties of interest. It is capable of coping with nonlinear dynamics and nonlinear observation equations, and can easily maintain multi-modal uncertainty over the state.

The principle of the PF is simple. The goal is to compute the posterior probability distribution p(x_t | y_{1:t}) over an unknown state x_t, conditioned on the image observations up to that time instant, y_{1:t}. Particle filtering works by approximating the posterior density with a discrete set of samples, i.e., hypothesized states. Each sample corresponds to some hypothesized set of model parameters and is typically weighted by its likelihood p(y_t | x_t), the probability that the current observations were generated by the hypothesized state. Each sample can thus be considered a state hypothesis whose weight depends on the corresponding image data. The method operates by “testing” state hypotheses, thus avoiding the linearization between state and measurements required in gradient-based optimization techniques. In the context of our problem, this allows the utilization of arbitrarily complex projection models.
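
For illustration, the prediction/update/resampling cycle just described can be sketched in a few lines of Python/NumPy. This is a generic sampling-importance-resampling step, not the exact filter used in this work; the propagate and likelihood arguments and the resampling threshold are placeholders to be supplied by the application.

```python
import numpy as np

def sir_step(particles, weights, propagate, likelihood, rng):
    """One predict/update/resample cycle over an (N, d) array of state samples."""
    n = len(particles)
    # Prediction: draw each particle from the transition density p(x_t | x_{t-1}).
    particles = propagate(particles, rng)
    # Update: weight every hypothesis by the observation likelihood p(y_t | x_t).
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()
    # Resampling: if the effective sample size degenerates, duplicate strong
    # hypotheses, drop weak ones and reset the weights to uniform.
    if 1.0 / np.sum(weights ** 2) < n / 2:
        idx = rng.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights
```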

More precisely, the 3D object localization hypotheses generated by the particle filter allow us to obtain the 2D appearance of the objects, provided one has the projection models of the imaging sensors. The imaging sensors used in our experiments range from conventional perspective cameras to lens-mirror (catadioptric) and fisheye-lens based omnidirectional cameras. Note that while the perspective projection is already a nonlinear model, in the sense that constant 3D motion increments of an object do not imply constant 2D image increments, it is still a simple model in the sense that the projection of any 3D point is ray-traced as a straight line passing through a single point. This is not true in general for fisheye or catadioptric cameras. Fisheye lenses bend the incoming principal optical rays progressively more towards the periphery of the field of view. Catadioptric cameras use simple lenses, but the mirror introduces a reflection according to its local slope. In all cases one can have strong, anisotropic geometrical distortions associated with the projection, which bend and blur the object’s visual features in a space-variant manner, creating difficulties in the search for the object’s features. In our approach, adopting a 3D tracking context, we can use the projection model to predict and test the location of features in the images instead of locally searching for them.
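
The nonlinearity mentioned above is easy to verify numerically. The short check below is an illustration only, with an assumed equidistant fisheye model ρ = f·θ and arbitrary values for the focal length, depth and step size: constant 3D increments of a point moving parallel to the image plane produce image increments that shrink towards the periphery.

```python
import numpy as np

# Equidistant fisheye model: rho = f * theta (illustrative parameters only).
f = 300.0                               # focal length in pixels
z = 1.0                                 # fixed depth in metres
x = np.arange(0.0, 2.01, 0.25)          # constant 3D increments of 0.25 m
theta = np.arctan2(x, z)                # angle from the optical axis
rho = f * theta                         # radial image coordinate in pixels
print(np.round(np.diff(rho), 1))        # image increments decrease towards the periphery
```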

Particle filtering methods have been extensively used during the last decade in 2D tracking applications. One of the first algorithms applying particle filters in the 2D context was the Condensation algorithm [7]. In that work targets were modeled with a contour based appearance template. The approach proved not very robust to occlusion and hard to initialize. To address such limitations, more recent works added other types of features such as color histograms [8] and subspace methods [9]. In [10] a hierarchical particle filter is used to track objects (persons) described by color rectangle features and edge orientation histogram features. Since the use of multiple cues increases the computational demands, two optimized algorithmic techniques are employed: integral images are used to speed up feature computation and an efficient cascade scheme is used to speed up particle likelihood evaluation. To deal with occlusions, [11] fuses color histogram features acquired from multiple cameras. Cameras are pre-calibrated with affine transformations and the state of the filter is composed of the 2D coordinates and bounding box dimensions of the target in one of the cameras. Each particle in the filter generates a bounding box on each camera and the observation model is formed by concatenating the color histograms in all bounding boxes. It is shown that tracking with multiple cameras and data fusion can handle situations in which the target is occluded in one of the cameras.

Despite the success of particle filters in 2D tracking applications, not many works have proposed their use in a 3D model based context. In [12] a system is proposed for estimating the time-varying 3D shape of a moving person with 28 degrees of freedom. To deal with the high dimensionality, a hybrid Monte Carlo (HMC) filter is used. However, in that work observations were obtained directly from the 2D projections of markers suitably placed on the human body, and the method is therefore only applicable in restricted cases. In general settings, the true appearance of an object (e.g. color, shape, contours) must be taken into account. For instance, the work in [13] uses a particle filter [7] to implement a full 3D edge tracking system, relying on fast GPU computation for real-time rendering of each state hypothesis’ image appearance (visible edges) and for applying edge detectors to the incoming video stream. In [14] the computation of edges in the full image is avoided by grouping line segments from a known model into 3D junctions and forming fast inlier/outlier counts on projected junction branches. A local search for edges around the expected values must be performed at each time step. In [15] a particle filter is used to estimate the 3D positions of humans. The environment is explicitly modeled to handle occlusions caused by fixed objects. Multiple fixed cameras and background subtraction techniques are used for the computation of the likelihood.

Most previous works on 3D model based tracking, both those based on nonlinear optimization and those relying on sampling methods, require the extraction of edge points from the images, either by processing the full image with edge detectors or by performing a local search for edge points. We stress that the primary disadvantage of edge based tracking algorithms is the lack of robustness to motion and optical blur. These effects are frequent in robotics applications due to the mobility of the devices and the frequent out-of-focus situations.

We formulate the problem differently. Rather than determining explicitly the location of edge points in the image, we compute differences between color histograms on the inside and outside regions of the projected target surfaces. We do not require explicit image rendering, in the sense of creating an image of the expected appearance of the target as in [13]. Instead we just need to compute the 3D to 2D projection of some selected points inside and outside the target’s visible surfaces. This can be easily done with any single projection center sensor, and allows a fast evaluation of each particle’s likelihood. Because explicit edge or contour extraction is avoided, the method is more robust to blur arising either from fast object motions or optical defocus. Altogether, our approach facilitates the application to arbitrary nonlinear image projections and settings with fast target/robot motion.
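
A minimal sketch of such an inside/outside comparison is given below, assuming an RGB image and pixel coordinates obtained by projecting points just inside and just outside the hypothesized contour. The joint RGB quantization and the Bhattacharyya coefficient are illustrative choices rather than the exact histograms and distance adopted in this work.

```python
import numpy as np

def inside_outside_separation(image, inner_px, outer_px, bins=8):
    """Return a value in [0, 1] that is high when the colors sampled just inside
    the hypothesized contour differ from those sampled just outside it."""
    def hist(px):
        # Quantize the RGB values at the sampled (x, y) pixel locations into bins^3 cells.
        q = (image[px[:, 1], px[:, 0]] // (256 // bins)).astype(int)
        idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
        h = np.bincount(idx, minlength=bins ** 3).astype(float)
        return h / h.sum()
    h_in, h_out = hist(inner_px), hist(outer_px)
    bhattacharyya = np.sum(np.sqrt(h_in * h_out))   # 1 = identical, 0 = disjoint
    return 1.0 - bhattacharyya
```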

In our approach we use color features to compute state likelihoods. The reasons for using color features are the following: color features cope well with motion and optical blur, which are frequent in the scenarios we consider; color features do not require local image processing for extracting edges or corners, but simply pixel evaluations, which makes the system suitable for real-time applications; finally, many robotics research problems assume objects with distinctive colors to facilitate the figure-ground segmentation problem, for instance in cognitive robotics [16] or robotic competitions [17]. Notwithstanding, the approach is general and, in other scenarios, additional features could be used.

In [17] we presented the first application of our method for tracking balls (spheres) and robots (cylinders) in the RoboCup Middle Size League scenario with catadioptric sensors. In [18] we compared, within our tracking framework, two well known approximations of the projection function for the class of perspective catadioptric mirrors: the unified projection model and the perspective model. In [19] we extended the approach to consider not only the 3D position but also the orientation of the targets, and presented applications with dioptric and perspective cameras with radial distortion to track both spherical and convex polyhedral shapes. In [20] we extended the observation model to track objects without an initial color model (only the shape model is used) and extended the motion model to consider the observer’s self-motion. In the present work we synthesize the main results derived in previous work and provide a better characterization of the method’s performance in challenging scenarios, including a comparison with an alternative state-of-the-art technique and quantitative evaluations with ground truth data.

The paper is organized as follows: Section 2 presents common imaging systems used in robotics and corresponding projection models. Section 3 describes the particle filtering approach and the state representation in our problem. In Section 4 we detail the 3D shape and color based observation model used in the tracking filter. In Section 5 we show several experiments that illustrate the performance of the system in realistic scenarios, including a comparison to an alternative approach with ground truth data. In Section 6 we draw conclusions and present directions for future work.

Section snippets

Imaging systems

We start with a brief introduction to the most common imaging geometries employed in robotics that were used in our experiments. We focus on imaging systems with axial symmetry, both dioptric and catadioptric. Cameras with axial symmetry can be described by a projection function P: ρ = P([r φ z]^T; ϑ), where the z axis coincides with the optical axis, [r φ z]^T represents a 3D point in cylindrical coordinates, ρ is the radial coordinate of the imaged point (the angular coordinate coincides with angular
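
A minimal sketch of this axially symmetric structure is shown below (an assumed form for illustration): the image azimuth equals the 3D azimuth φ, and the radial image coordinate depends on (r, z) only through a sensor-specific calibrated profile. The perspective profile given as an example is one possible choice; catadioptric and fisheye sensors substitute their own.

```python
import numpy as np

def project_axial(point_xyz, radial_profile, center):
    """Axially symmetric projection: rho = P([r phi z]^T; calibration), with the
    image azimuth equal to the 3D azimuth phi."""
    x, y, z = point_xyz
    r, phi = np.hypot(x, y), np.arctan2(y, x)       # cylindrical coordinates of the 3D point
    rho = radial_profile(r, z)                      # calibrated radial mapping
    return np.array([center[0] + rho * np.cos(phi),
                     center[1] + rho * np.sin(phi)])

# Example radial profile: an ideal perspective camera with focal length f (illustrative).
perspective = lambda r, z, f=500.0: f * r / z
```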

3D tracking with particle filters

We are interested in estimating, at each time step, the 3D pose of the target. Thus, the state vector of the target, denoted X_t, contains its 3D pose and derivatives up to a desired order. It represents the object evolution along time, which is assumed to be an unobserved Markov process with some initial distribution p(x_0) and a transition distribution p(x_t | x_{t-1}). The observations {y_t; t ∈ ℕ}, y_t ∈ ℝ^{n_y}, are conditionally independent given the process {x_t; t ∈ ℕ} with distribution p(y_t | x_t), where n_y
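
As one concrete example of a transition distribution p(x_t | x_{t-1}) (an assumption for illustration, not necessarily the dynamics adopted here), a constant-velocity model over a six-dimensional state of 3D position and velocity can be sampled as follows; sigma_a controls the Gaussian acceleration noise.

```python
import numpy as np

def propagate_constant_velocity(particles, dt, sigma_a, rng):
    """Sample p(x_t | x_{t-1}) for states [position, velocity] in R^6: velocity is
    perturbed by Gaussian acceleration noise and position integrates velocity."""
    pos, vel = particles[:, :3], particles[:, 3:6]
    vel = vel + rng.normal(0.0, sigma_a, size=vel.shape) * dt
    pos = pos + vel * dt
    return np.concatenate([pos, vel], axis=1)
```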

Observation model

In this section we describe the observation model, as expressed by p(y_t | x_t) in (6). We propose a methodology that associates likelihood values to each of the samples in the particle filter using the target’s 3D shape and color models. Recall that each particle represents a hypothesis of the target’s position and pose. The likelihood function assigns a high weight to a particle if the image information is coherent with its 3D position and pose hypothesis, and a low weight otherwise. For
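
The sketch below illustrates, under assumptions, how such a likelihood can be assembled for one particle: sample_contour_points, project and separation stand for the 3D point sampling, the calibrated projection and the inside/outside color comparison, and the exponential mapping with sharpness lam is only one plausible way to turn a mismatch into a weight.

```python
import numpy as np

def particle_likelihood(image, state, sample_contour_points, project,
                        separation, lam=10.0):
    """Shape-and-color likelihood p(y_t | x_t) for one pose hypothesis."""
    inner_3d, outer_3d = sample_contour_points(state)   # 3D points on / around the surface
    inner_px = project(inner_3d).astype(int)            # projected pixel locations
    outer_px = project(outer_3d).astype(int)
    s = separation(image, inner_px, outer_px)           # in [0, 1]; high = sharp color boundary
    return np.exp(-lam * (1.0 - s) ** 2)                # unnormalized particle weight
```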

Experimental results

This section presents an evaluation of the proposed methods. Firstly we present results taken with omnidirectional cameras: the tracking of a ball performed with a catadioptric setup and the tracking of a cuboid in a dioptric setup. Secondly we present results with conventional cameras and perform a comparison between our method and a competing alternative based on 2D tracking followed by 3D reconstruction. In this set of experiments we show results with real and artificial images, both with

Conclusions

We presented a 3D model based tracking system, based on a particle filter framework. The method requires the knowledge of the 3D shape of the target and the imaging sensor calibration. We stress that the system is particularly suited for omnidirectional vision systems, as it only requires the projection of isolated points arising from likely posture hypotheses for the target. In practice the method can be used with any projection model given the availability of ways to compute the projection of

Acknowledgements

We would like to thank Dr. Luis Montesano, Dr. Alessio Del Bue and Giovanni Saponaro for the fruitful discussions.


References (34)

  • T. Boult et al., Omni-directional visual surveillance, Image and Vision Computing (2004)
  • P. Lima, A. Bonarini, C. Machado, F. Marchese, F. Ribeiro, D. Sorrenti, Omni-directional catadioptric vision for soccer...
  • J. Gaspar et al., Vision-based navigation and environmental representations with an omnidirectional camera, IEEE Transactions on Robotics and Automation (2000)
  • Yasuo Kuniyoshi et al., Active stereo vision system with foveated wide angle lenses
  • V. Lepetit et al., Monocular model-based 3D tracking of rigid objects: A survey, Foundations and Trends in Computer Graphics and Vision (2005)
  • A. Doucet et al.
  • M. Isard et al., Condensation: conditional density propagation for visual tracking, International Journal of Computer Vision (1998)
  • Y. Wu, Robust visual tracking by integrating multiple cues based on co-inference learning, International Journal of Computer Vision (2004)
  • Z. Khan, T. Balch, F. Dellaert, A Rao-Blackwellized particle filter for eigentracking, in: Proc. of Int. Conf. on...
  • Changjiang Yang, Ramani Duraiswami, Larry Davis, Fast multiple object tracking via a hierarchical particle filter, in:...
  • Ya-Dong Wang, Jian-Kang Wu, Ashraf A. Kassim, Particle filter for visual tracking using multiple cameras, in: Proc. of...
  • K. Choo, D.J. Fleet, People tracking using hybrid Monte Carlo filtering, in: Proc. Int. Conf. on Computer Vision,...
  • G. Klein, D. Murray, Full-3D edge tracking with a particle filter, in: Proc. of BMVC 2006, Edinburgh, Scotland, 2006,...
  • M. Pupilli, A. Calway, Real-time camera tracking using known 3D models and a particle filter, in: Proc. of ICPR (1),...
  • Tatsuya Osawa et al., Human tracking by particle filtering using full 3D model of both target and environment
  • L. Montesano et al., Learning object affordances: From sensory motor maps to imitation, IEEE Transactions on Robotics (2008)
  • M. Taiana, J. Gaspar, J. Nascimento, A. Bernardino, P. Lima, 3D tracking by catadioptric vision based on particle...

M. Taiana received his M.Sc. degree in Computer Engineering from Politecnico di Milano - Italy, in 2007. He is currently a Ph.D. student at the Computer Vision Laboratory (VisLab), which belongs to the Institute for Systems and Robotics (ISR) of Instituto Superior Técnico (IST) - Lisbon. His research interests are in Computer and Robot Vision and Robotics.

J. Santos received his M.Sc. degree in Electrotechnical and Computer Engineering from Instituto Superior Técnico - Portugal, in 2008. He is a co-founder of selfTech - Engenharia de Sistemas e Robótica, a spin-off of the Institute for Systems and Robotics (ISR), currently focused on deploying autonomous robotic solutions. His research interests are in Mobile Robot Localization and Computer Vision.

J. Gaspar received his Ph.D. degree in Electrical and Computer Engineering from Instituto Superior Técnico (IST), Technical University of Lisbon - Portugal, in 2003. He is currently an Auxiliary Professor at IST, and a Researcher at the Computer Vision Laboratory (VisLab), Institute for Systems and Robotics (ISR). His research interests are in Computer and Robot Vision, Robotics and Control.

J. Nascimento (M’06) received the EE degree from Instituto Superior de Engenharia de Lisboa, in 1995, and the M.Sc. and Ph.D. degrees from Instituto Superior Técnico (IST), Technical University of Lisbon, in 1998 and 2003, respectively. Presently, he is a postdoctoral researcher with the Institute for Systems and Robotics (ISR) at IST. His research interests include image processing, shape tracking, robust estimation, medical imaging, and video surveillance. Dr. Nascimento has co-authored over 50 publications in international journals and conference proceedings (many of them IEEE), has served on program committees of many international conferences, and has been a reviewer for several international journals.

A. Bernardino received the Ph.D. degree in Electrical and Computer Engineering in 2004 from Instituto Superior Técnico (IST). He is an Assistant Professor at IST and Researcher at the Institute for Systems and Robotics (ISR-Lisboa) in the Computer Vision Laboratory (VisLab). He participates in several national and international research projects in the fields of robotics, cognitive systems, computer vision and surveillance. He published several articles in international journals and conferences, and his main research interests focus on the application of computer vision, cognitive science and control theory to advanced robotic and surveillance systems.

P. Lima got his Ph.D. (1994) in Electrical Engineering at the Rensselaer Polytechnic Institute, Troy, NY, USA. Currently, he is an Associate Professor at Instituto Superior Técnico, Lisbon Technical University. He is also a member of the Institute for Systems and Robotics, a Portuguese research institute, where he is coordinator of the Intelligent Systems group and a member of the Scientific Board. Pedro Lima is a Trustee of the RoboCup Federation, and was the General Chair of RoboCup2004, held in Lisbon. He currently serves as President of the Portugal Robotics Society for the 2009/10 period. He is also a founding member of the Portugal Robotics Society (2006) and of the IEEE RAS Portugal Chapter (2005), and a senior member of the IEEE. He is the co-author of two books, regularly serves as a member of international conference program committees, and has coordinated national and international (ESA, EU) R&D projects. He has also been very active in the promotion of Science and Technology to society, through the organization of Robotics events in Portugal, including the Portuguese Robotics Open since 2001. His research interests lie in the areas of discrete event models of robot plans, reinforcement learning and planning under uncertainty, with applications to multi-robot systems.

    Parts of this manuscript were previously presented at the British Machine Vision Conference (BMVC’08) and at the Workshop on Omnidirectional Robot Vision, held in conjunction with the SIMPAR 2008 conference.


    This work was supported by the European Commission, Projects IST-004370 RobotCub and ICT-231640 HANDLE, and by the Portuguese Government - Fundação para a Ciência e Tecnologia (ISR/IST pluriannual funding) through the PIDDAC program funds, and through project BIO-LOOK, PTDC/EEA-ACR/71032/2006.
