Relative scale method to locate an object in cluttered environment
Introduction
Locating an object in a cluttered three-dimensional environment is a challenging problem in computer vision. Many applications, such as object manipulation, visual inspection, landmark localization for mobile robot navigation, and object tracking, need to locate objects in cluttered environments. To locate an object, we first need to identify it in the scene image and then determine its orientation and position with respect to a reference coordinate system. This process is also known as object localization.
The commonly accepted solution in such situations is a local-feature-based approach [1], [2], [3], [4], [5], [6], [7], [8], owing to its flexibility in localizing partially occluded objects in cluttered environments. Moreover, the amount of information required to represent the model is significantly reduced. However, in most model-based methods [1], [2], [3], [4], [7], [9], features from the reference images are extracted together with their 3D locations with respect to a given reference frame. The model is represented by these three-dimensional features. To localize the object, two-dimensional features are extracted from a single image and iterative methods (e.g. Newton’s method) are applied. Localization is performed with respect to the same reference frame. As a result, these methods are best suited to environment-specific applications. They also run contrary to the human visual system, which is the most successful vision system so far for locating an object in a cluttered environment. In the human visual system, we do not memorize the depth information of image features. Instead, we memorize only 2D features (e.g. shape) of the object and localize it by stereovision using correspondences of local features. Only a few works [6] follow this approach.
In order to locate an object in a cluttered environment, we need to consider several important issues, such as the different kinds of geometric transformations the object may undergo in the image plane, variation of light intensity, etc. In particular, the scale change of the object is one of the main concerns for localization systems. To cope with scale change, multi-scale or scale-space methods have been developed [15], [27], [28], [38]. In these multi-scale methods, an image is analyzed at multiple scale levels in both the model representation and recognition phases. As a direct consequence, the model representation requires a large amount of memory, and the matching process becomes computationally expensive. However, such multi-scale methods may not always be necessary for localizing objects in many practical applications.
In model-based object recognition, efficiency could be improved by minimizing the number of scale levels, preferably to a single scale. Since model-based object recognition requires some reference images to represent the model of an object and recognition is carried out on a scene image (test image), a relation between the scale of the object in the reference image and in the scene image can be established. In some applications, the distance of the object from the camera can be measured and a relation between scale and distance obtained. For example, in an intruder detection system [42] the distance of the intruder could be measured by a proximity sensor. From this distance measurement, the relative scale of the object in the image plane (with respect to the object in the reference image) could be established through calibration or analytic approaches. Similarly, for visual inspection of an object on a conveyor belt, when the object reaches a specific position on the belt the relative scale can be estimated from the known distance. For an object-following application, the relative scale could be updated based on the motion of the object. Again, exploration for an object can be carried out at a constant relative scale (Section 5). In these ways, the relative scale of an object of interest in the scene image can be predicted, estimated, or pre-assigned for many applications.
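As an illustration of the distance-to-scale relation, under a pinhole camera model the apparent size of an object is inversely proportional to its distance from the camera, so a measured distance maps directly to a relative scale. The following sketch is ours, not from the paper; the function name is hypothetical:

```python
def relative_scale_from_distance(d_ref, d_scene):
    """Relative scale of the object in the scene image with respect to the
    reference image. Under a pinhole model, apparent size scales as
    1/distance, so the relative scale is d_ref / d_scene."""
    if d_ref <= 0 or d_scene <= 0:
        raise ValueError("distances must be positive")
    return d_ref / d_scene
```

For instance, an intruder detected at twice the distance used for the reference image appears at relative scale 0.5.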
In this paper, we propose a relative scale method to locate a 3D object in a cluttered environment. We assign an arbitrary reference scale σR to the given reference object. The model of the object is represented by the local features extracted at the reference scale. As a result, the model representation requires a relatively small amount of memory. Localization of an object is performed at the relative scale, which is estimated or assigned a priori, making the process efficient.
Geometric transformations between a point in a reference image and the corresponding point in the scene image are adequately approximated by a planar projective transformation. For a rigid object having free-form surfaces, a small surface patch (except patches that include edge and corner regions) can be considered planar. When the camera is relatively far from the viewed object, the planar projective transformation for such a surface patch can be further approximated by an affine transformation [24], [29], [30]. If we assume a small viewpoint change, this projective/affine deformation may be negligible. Hence, a point p′ = (x′, y′)T on the object in the scene image IT is related to a point p = (x, y)T in the reference image IR by the following transformations:

p′ = sR(θ)p + (a, b)T,  IT(p′) = cIR(p),

where R(θ) is the 2D rotation matrix, c > 0 is an arbitrary contrast factor, s > 0 is an arbitrary scaling factor, 0° ⩽ θ < 360° is an arbitrary rotation, and (a, b) is an arbitrary translation.
The relative scale σI of the object in the scene image IT, with respect to the scale of the object in the reference image IR, is a linear function of the scaling factor s, i.e.

σI = sσR,

where the reference scale σR is constant during the recognition process. Throughout this work, we have used the reference scale σR = 2.0. This value is arbitrarily selected; other values could be used as well.
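The similarity transformation and the scale relation above can be written out in a few lines. This is an illustrative sketch with function names of our own choosing:

```python
import math

def transform_point(p, s, theta_deg, a, b):
    """Map a reference-image point p = (x, y) to the scene image:
    p' = s * R(theta) * p + (a, b)  (scaled rotation plus translation)."""
    x, y = p
    th = math.radians(theta_deg)
    xp = s * (x * math.cos(th) - y * math.sin(th)) + a
    yp = s * (x * math.sin(th) + y * math.cos(th)) + b
    return xp, yp

SIGMA_R = 2.0  # reference scale used throughout the paper

def relative_scale(s, sigma_r=SIGMA_R):
    """sigma_I = s * sigma_R: relative scale is linear in the scaling factor."""
    return s * sigma_r
```

The contrast factor c acts only on pixel intensities and does not affect point coordinates, so it is omitted here.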
Theoretically, the relative scale of the object could vary from 0 to ∞. In practice, a very large scale change may suppress the visual information significantly, making identification and localization of the object difficult. Therefore, the valid range of scale change for the matching process is finite and depends on parameters such as the focal length of the camera, the image resolution, etc.
A modular architecture of the relative scale localization method is illustrated in Fig. 1. The localization process consists of two phases: off-line model representation and on-line identification and localization. The first step of both phases is the detection of suitable local features on the object of interest. Features are extracted by detecting interest points and then computing an invariant descriptor for each of them. The method of feature extraction is discussed in Section 2.
In the off-line model representation phase, local features are extracted at the reference scale from reference images captured from significantly different viewpoints with uniform backgrounds. A PCA-based technique is used for efficient access of model features. The hashing technique is described in Section 3.
Section 4 describes the on-line localization phase. First, a pair of stereo images is captured by calibrated cameras. The object is then identified in both images by extracting features and matching them with reference features, clustering the matched features using the generalized Hough transform, and verifying clusters using spatial relations between the features. The position of the object is estimated by a 3D reconstruction method using the corresponding features of the object in the two stereo images. Some experimental results are also demonstrated. Section 5 describes an application of the relative scale method.
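For a calibrated pair with a small baseline, the 3D reconstruction step can be illustrated by standard triangulation from disparity. The sketch below assumes a rectified stereo pair with focal length f in pixels and baseline B in meters; the paper's actual reconstruction may instead use full calibrated projection matrices:

```python
def triangulate_rectified(xl, xr, y, f, baseline):
    """Triangulate a 3D point from a rectified stereo correspondence.
    xl, xr: x-coordinates of the matched feature in left/right images (px),
    y: common y-coordinate (px), f: focal length (px), baseline: in meters.
    Depth follows Z = f * B / (xl - xr); X, Y follow from the pinhole model."""
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at or behind infinity")
    Z = f * baseline / disparity
    X = xl * Z / f
    Y = y * Z / f
    return X, Y, Z
```

Image coordinates here are taken relative to the principal point, a simplifying assumption.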
Relative scale local features
This section gives a brief overview of the relative scale method for extracting local features of an object [10], [11], [12]. Local features are extracted by detecting interest points and then computing an invariant descriptor for each of them from a small image patch around the interest point. In this method, both steps are carried out at the estimated relative scale of the object.
First, we detect interest points on the object. The classic work on interest point detection is the Harris corner detector [13]
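A minimal NumPy sketch of the Harris corner response illustrates the idea. For brevity it uses a simple 3×3 box window in place of the usual Gaussian weighting, an assumption on our part; [13] and the paper may differ in the window and parameter choices:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is the
    second-moment matrix of image gradients accumulated over a 3x3 window."""
    Iy, Ix = np.gradient(img.astype(float))  # gradients along rows, columns
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box(a):
        # 3x3 box filter; border pixels are left at zero for simplicity
        out = np.zeros_like(a)
        out[1:-1, 1:-1] = sum(
            a[1 + dy:a.shape[0] - 1 + dy, 1 + dx:a.shape[1] - 1 + dx]
            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
        return out

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace ** 2
```

Pixels whose response is a large positive local maximum are taken as interest points; edges give negative responses and flat regions give responses near zero.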
Hashing of local features
To represent the model of a 3D object we need several reference images (views) from different viewpoints. The number of required reference images depends on the object itself. Kovacic et al. [20] proposed a method for planning the sequence of views required to represent the model. For simplicity, in this work we consider views at a constant interval of pan-tilt angle (j). As the feature detection and description are invariant to rotation of the object in the image plane, we do not need to
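One way to realize PCA-based hashing of model features is to project the descriptors onto their leading principal components and quantize the projections into coarse buckets, so that a query descriptor only needs to be compared against the features in its bucket. The sketch below is illustrative only; the function name and the exact bucketing scheme are our own assumptions, not the paper's:

```python
import numpy as np

def build_pca_hash(descriptors, n_components=2, n_bins=8):
    """Hash model feature descriptors by their leading PCA projections.
    Returns a bucket table {key: [descriptor indices]} and the projection
    parameters needed to hash a query descriptor the same way."""
    X = np.asarray(descriptors, dtype=float)
    mean = X.mean(axis=0)
    # principal axes of the centered descriptor set via SVD
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]
    proj = (X - mean) @ basis.T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    table = {}
    for i, p in enumerate(proj):
        # quantize each projected coordinate into one of n_bins buckets
        key = tuple(np.clip(((p - lo) / (hi - lo + 1e-9) * n_bins).astype(int),
                            0, n_bins - 1))
        table.setdefault(key, []).append(i)
    return table, (mean, basis, lo, hi)
```

At query time a scene descriptor is centered, projected, and quantized with the stored parameters, and only its bucket (and possibly neighboring buckets) is searched.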
Object localization
The object to be located is usually situated in a cluttered environment with an arbitrary position and orientation. A pair of images of the environment is captured simultaneously by two calibrated stereo cameras. The baseline distance of the cameras is small with respect to the distance of the object. The object present in the images may be geometrically and photometrically transformed with respect to the same object in the corresponding reference image. There could also be small
Relative scale exploration
The proposed relative scale localization method may be applied to an exploration task where a robot needs to search a cluttered environment for an object of interest. For example, a robot may be employed to find a fire extinguisher situated somewhere in a hallway environment. There are two problems involved here: navigating through the hallway and locating the object of interest. Several vision-controlled hallway navigation techniques [32], [33], [34], [35] have been
Discussion
In this paper, we have described a novel method to locate 3D objects at a relative scale. We have also introduced a PCA-based hashing technique and a knowledge-based stereo-correspondence method. Together these constitute all the essential components of a relative scale localization procedure. The method is able to handle different kinds of situations such as clutter, partial occlusion, change of background, and geometric and photometric transformations of the object. We have implemented
References (43)
- et al., Feature-based object recognition and localization in 3D-space using a single video image, Computer Vision and Image Understanding (1999)
- Optimal pose estimation in two and three dimension, Computer Vision and Image Understanding (1999)
- et al., Detection of arbitrary planar shapes with 3D pose, Image and Vision Computing (2001)
- et al., An indexing scheme for efficient data-driven verification of 3D pose hypotheses, Image and Vision Computing (2002)
- et al., Planning sequences of views for 3-D object recognition and pose determination, Pattern Recognition (1998)
- et al., Shape-adapted smoothing in estimation of 3-D shape cues from affine deformations of local 2-D brightness structure, Image and Vision Computing (1997)
- et al., Moment invariants for recognition under changing viewpoint and illumination, Computer Vision and Image Understanding (2004)
- et al., Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties, Computer Vision, Graphics and Image Processing: Image Understanding (1992)
- et al., Introductory Techniques for 3-D Computer Vision (1998)
- A.K.C. Wong, L. Rong, X. Liang, Robotic vision: 3D object recognition and pose determination, in: Proceedings of...