
Image and Vision Computing

Volume 26, Issue 2, 1 February 2008, Pages 259-274

Relative scale method to locate an object in cluttered environment

https://doi.org/10.1016/j.imavis.2007.06.001

Abstract

This paper proposes an efficient method to locate a three-dimensional object in a cluttered environment. The model of the object is represented at a reference scale by local features extracted from several reference images. A PCA-based hashing technique is introduced for efficient access to the database of reference features. Localization is performed at an estimated relative scale. First, a pair of stereo images is captured simultaneously by calibrated cameras. The object is then identified in both images by extracting features and matching them with the reference features, clustering the matched features with a generalized Hough transformation, and verifying the clusters using spatial relations between the features. After identification, knowledge-based correspondences between the object's features in the stereo images are used to estimate its 3D position. The localization method is robust to various geometric and photometric transformations as well as to clutter, partial occlusion, and background changes. Because both model representation and localization are single-scale processes, the method is efficient in memory usage and computing time. The proposed relative scale method has been implemented and tested on a set of objects. It achieves very good accuracy and takes only a few seconds per localization with our preliminary implementation. An application of the relative scale method to the exploration of an object in a cluttered environment is demonstrated, and the method could be useful for many other practical applications.

Introduction

Locating an object in a cluttered three-dimensional environment is a challenging problem in computer vision. Many applications, such as object manipulation, visual inspection, landmark localization for mobile-robot navigation, and object tracking, need to locate objects in cluttered environments. To locate an object, we first need to identify it in the scene image and then determine its position and orientation with respect to a reference coordinate system. This process is also known as object localization.

The commonly accepted solution for such a situation is a local-feature-based approach [1], [2], [3], [4], [5], [6], [7], [8], owing to its flexibility in localizing a partially occluded object in a cluttered environment. Moreover, the amount of information required to represent the model is significantly reduced. However, in most model-based methods [1], [2], [3], [4], [7], [9], features from the reference images are extracted together with their 3D locations with respect to a given reference frame, and the model is represented by these three-dimensional features. To localize the object, two-dimensional features are extracted from a single image and iterative methods (e.g. Newton's method) are applied; localization is performed with respect to the same reference frame. As a result, these methods seem suited to environment-specific applications. All of these methods also run contrary to the human visual system, which is so far the most successful vision system for locating an object in a cluttered environment. Humans do not need to memorize the depth of image features; we memorize only 2D features (e.g. shape) of the object and localize it by stereovision using correspondences of local features. Only a few works [6] follow this approach.

To locate an object in a cluttered environment, we need to consider several important issues, such as the various geometric transformations of the object in the image plane and variations in light intensity. In particular, the scale change of the object is one of the most critical issues for localization systems. To cope with scale change, multi-scale or scale-space methods have evolved [15], [27], [28], [38]. In these methods, an image is analyzed at multiple scale levels in both the model representation and recognition phases. As a direct consequence, the model representation requires a large amount of memory, and the matching process becomes computationally expensive. However, such multi-scale methods may not always be necessary for localizing objects in many practical applications.

In model-based object recognition, efficiency could be improved by minimizing the number of scale levels, preferably to a single scale. Since model-based recognition requires some reference images to represent the model of an object and recognition is carried out on a scene (test) image, a relation between the scale of the object in the reference image and in the scene image can be established. In some applications, the distance of the object from the camera can be measured, yielding a relation between scale and distance. For example, in an intruder detection system [42], the distance of the intruder could be measured by a proximity sensor; from this measurement, the relative scale of the object in the image plane (with respect to the object in the reference image) could be established through calibration or analytic approaches. Similarly, for visual inspection of objects on a conveyor belt, when an object reaches a specific position on the belt, the relative scale can be estimated from the known distance. For an object-following application, the relative scale could be updated based on the motion of the object. Again, exploration of an object can be carried out at a constant relative scale (Section 5). In these ways, the relative scale of an object of interest in the scene image can be predicted, estimated, or pre-assigned for many applications.
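As an illustration of how such a distance measurement could be turned into a relative scale, the sketch below assumes a simple pinhole-camera model in which apparent scale is inversely proportional to distance, and uses the relation σI = sσR introduced below; the function and its parameters are hypothetical, not part of the paper.

```python
def relative_scale_from_distance(d_reference, d_scene, sigma_r=2.0):
    """Estimate the relative scale of an object from distance measurements.

    Under a pinhole-camera model, the image-plane size of an object is
    roughly inversely proportional to its distance from the camera, so the
    scaling factor s between the reference and scene images is approximately
    d_reference / d_scene. sigma_r is the reference scale (2.0 in this work).
    """
    s = d_reference / d_scene   # scaling factor of the object in the image plane
    return s * sigma_r          # relative scale: sigma_I = s * sigma_R

# Object modeled at 1.0 m; a proximity sensor now reports it at 2.0 m.
print(relative_scale_from_distance(1.0, 2.0))  # -> 1.0 (half the reference scale)
```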

In this paper, we propose a relative scale method to locate a 3D object in a cluttered environment. We assign an arbitrary reference scale σR to the given reference object, and the model of the object is represented by local features extracted at this reference scale. As a result, the model representation needs a relatively small amount of memory. Localization of an object is performed at the relative scale, which is estimated or assigned a priori, making the process efficient.

Geometric transformations between a point in a reference image and the corresponding point in a scene image are adequately approximated by a planar projective transformation. For a rigid object with free-form surfaces, a small surface patch (except patches that include edge and corner regions) can be considered a planar surface. When the camera is relatively far from the viewed object, the planar projective transformation for such a surface patch can be further approximated by an affine transformation [24], [29], [30]. If we assume a small viewpoint change, this projective/affine deformation may be negligible. Hence, a point p′ = (x′, y′)^T on the object in the scene image I_T is related to a point p = (x, y)^T in the reference image I_R by the following transformations:

$$I_T(\mathbf{p}') \cong c\, I_R(\mathbf{p}),$$

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} a \\ b \end{pmatrix},$$

where c > 0 is an arbitrary contrast factor, s > 0 is an arbitrary scaling factor, 0° ≤ θ < 360° is an arbitrary rotation, and (a, b) is an arbitrary translation.

The relative scale σI of the object in the scene image I_T, with respect to the scale of the object in the reference image I_R, is a linear function of the scaling factor s:

$$\sigma_I = s\,\sigma_R,$$

where the reference scale σR is constant during the recognition process. Throughout this work we have used the value σR = 2.0; this value is arbitrarily selected, and other values could be used as well.
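The transformation above can be applied directly. The following minimal sketch (a hypothetical helper, not from the paper) maps a reference-image point to its scene-image location for given s, θ, and (a, b):

```python
import numpy as np

def transform_point(p, s, theta_deg, a, b):
    """Map a reference-image point p = (x, y) to its scene-image location
    p' = s * R(theta) @ p + (a, b), i.e. the similarity transform above."""
    theta = np.deg2rad(theta_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return s * R @ np.asarray(p, dtype=float) + np.array([a, b])

# A point at (10, 0), scaled by 1.5, rotated 90 degrees, shifted by (5, 5):
print(transform_point((10, 0), 1.5, 90.0, 5.0, 5.0))  # -> [ 5. 20.]
```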

Theoretically, the relative scale of the object could vary from 0 to ∞. In practice, a very large scale change may suppress the visual information significantly, making identification and localization of the object difficult. Therefore, the valid range of scale change for the matching process is finite and depends on parameters such as the focal length of the camera, the image resolution, etc.

A modular architecture of the relative scale localization method is illustrated in Fig. 1. The localization process consists of two phases: off-line model representation, and on-line identification and localization. The first step of both phases is the detection of suitable local features on the object of interest. Features are extracted by detecting interest points and then computing an invariant descriptor for each of them. The feature extraction method is discussed in Section 2.

In the off-line model representation phase, local features are extracted at the reference scale from reference images captured from significantly different viewpoints against uniform backgrounds. A PCA-based hashing technique is used for efficient access to the model features; it is described in Section 3.
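The paper's exact hashing scheme is given in its Section 3; the sketch below shows only the generic idea of a PCA-based hash, assuming keys are formed by quantizing descriptor projections onto a few principal components. The component count and bin width are illustrative choices, not the paper's parameters.

```python
import numpy as np

def build_pca_hash(descriptors, n_components=3, bin_width=0.5):
    """Index reference descriptors by quantized PCA projections.

    descriptors: (N, D) array of reference feature descriptors.
    Returns (mean, components, table), where `table` maps a quantized
    projection tuple to the indices of the descriptors in that bucket.
    """
    mean = descriptors.mean(axis=0)
    # Principal directions from the SVD of the centered descriptor matrix.
    _, _, vt = np.linalg.svd(descriptors - mean, full_matrices=False)
    components = vt[:n_components]
    table = {}
    for i, d in enumerate(descriptors):
        key = tuple(np.floor(components @ (d - mean) / bin_width).astype(int))
        table.setdefault(key, []).append(i)
    return mean, components, table

def query(descriptor, mean, components, table, bin_width=0.5):
    """Return candidate reference features sharing the query's hash bucket."""
    key = tuple(np.floor(components @ (descriptor - mean) / bin_width).astype(int))
    return table.get(key, [])
```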

Section 4 describes the on-line localization phase. First, a pair of stereo images is captured by calibrated cameras. The object is then identified in both images by extracting features, matching them with the reference features, clustering the matched features using a generalized Hough transformation, and verifying the clusters using spatial relations between the features. The position of the object is estimated by a 3D reconstruction method using the corresponding features of the object in the two stereo images. Some experimental results are also presented. Section 5 describes an application of the relative scale method.
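As a rough illustration of the clustering step, the sketch below bins feature matches by the pose they imply (scale, rotation, translation), in the spirit of generalized-Hough pose clustering; the bin sizes and the match representation are assumptions, not the paper's actual parameters.

```python
import numpy as np
from collections import defaultdict

def hough_cluster(matches, s_bins=0.25, theta_bins=30.0, xy_bins=32.0):
    """Cluster feature matches by their implied pose (s, theta, tx, ty)
    using a coarse Hough accumulator; each match votes for one bin.
    `matches` is a list of dicts with keys 's', 'theta', 'tx', 'ty'."""
    accumulator = defaultdict(list)
    for i, m in enumerate(matches):
        key = (int(np.floor(np.log2(m["s"]) / s_bins)),        # quarter-octave scale bins
               int(np.floor(m["theta"] / theta_bins)) % int(360 / theta_bins),
               int(np.floor(m["tx"] / xy_bins)),
               int(np.floor(m["ty"] / xy_bins)))
        accumulator[key].append(i)
    # Bins sorted by vote count; well-populated bins are object hypotheses.
    return sorted(accumulator.values(), key=len, reverse=True)
```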

Section snippets

Relative scale local features

In this section, a brief overview of the relative scale method for extracting local features of an object [10], [11], [12] is given. Local features are extracted by detecting interest points and then computing an invariant descriptor for each of them from a small image patch around the interest point. In this method, both steps are performed at the estimated relative scale of the object.

First, we detect interest points of the object. The classic work on interest point detection is the Harris corner detector [13].
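For reference, a minimal Harris response computation might look like the sketch below; this is a textbook version, not the paper's implementation, and the smoothing scale and constant k = 0.04 are conventional choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=2.0, k=0.04):
    """Harris corner measure R = det(M) - k * trace(M)^2, where M is the
    Gaussian-smoothed second-moment matrix of the image gradients."""
    img = image.astype(float)
    ix = sobel(img, axis=1)   # horizontal gradient
    iy = sobel(img, axis=0)   # vertical gradient
    # Gaussian-weighted autocorrelation (second-moment) matrix entries.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2   # local maxima above a threshold are corners
```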

Hashing of local features

To represent the model of a 3D object, we need several reference images (views) from different viewpoints. The number of required reference images depends on the object itself. Kovacic et al. [20] proposed a method for planning the sequence of views required to represent the model. For simplicity, in this work we consider views at a constant interval of pan-tilt angle (j). As the feature detection and description are invariant to rotation of the object in the image plane, we do not need to
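For instance, a pan-tilt grid of reference viewpoints at a constant interval could be enumerated as below; the 30° interval and ±60° tilt range are purely illustrative, not the values used in the paper.

```python
import numpy as np

def reference_viewpoints(interval_deg=30.0, tilt_max=60.0):
    """Enumerate reference viewpoints on a pan-tilt grid at a constant
    angular interval; rotation in the image plane needs no extra views
    because the features are rotation invariant."""
    pans = np.arange(0.0, 360.0, interval_deg)
    tilts = np.arange(-tilt_max, tilt_max + 1e-9, interval_deg)
    return [(pan, tilt) for tilt in tilts for pan in pans]

print(len(reference_viewpoints()))  # 12 pan x 5 tilt positions = 60 views
```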

Object localization

The object to be located is usually situated in a cluttered environment with arbitrary position and orientation. A pair of images of the environment is captured simultaneously by two calibrated stereo cameras. The baseline distance of the cameras is small with respect to the distance of the object. The object present in the images can be geometrically and photometrically transformed with respect to the same object in the corresponding reference image. There could also be small
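The 3D position is eventually recovered from corresponding features in the two views; one standard way to carry out that reconstruction step is linear (DLT) triangulation, sketched below under the assumption that the calibrated cameras' 3×4 projection matrices are available. This is a generic formulation, not necessarily the paper's exact reconstruction method.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from a calibrated
    stereo pair. P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel
    coordinates of corresponding features in the two images."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```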

Relative scale exploration

The proposed relative scale localization method may be applied to an exploration task, where a robot needs to search a cluttered environment for an object of interest. For example, a robot may be employed to find a fire extinguisher situated somewhere in a hallway environment. There are two problems here: navigating through the hallway and locating the object of interest. Several techniques for vision-controlled hallway navigation [32], [33], [34], [35] have been

Discussion

In this paper, we have described a novel method to locate 3D objects at a relative scale. We have also introduced a PCA-based hashing technique and a knowledge-based stereo-correspondence method; in fact, we have described all the essential components of a relative scale localization procedure. The method is able to handle clutter, partial occlusion, background changes, and geometric and photometric transformations of the object. We have implemented

References (43)

  • J.L. Chen et al., Determining pose of 3D objects with curved surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence (1996).
  • P.L. Rosin, Robust pose estimation, IEEE Transactions on Systems, Man, and Cybernetics – Part B (1999).
  • D.D. Sheu, A generalized method for 3D object location from single 2D images, Pattern Recognition (1992).
  • M.S. Islam, A. Sluzek, L. Zhu, Representing and matching the local shape of an object, in: Proceedings of Mirage 2005...
  • M.S. Islam et al., Detecting and matching interest points in relative scale, Machine Graphics & Vision (2005).
  • M.S. Islam, L. Zhu, Matching interest points of an object, in: Proceedings of IEEE International Conference on Image...
  • C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of 4th Alvey Vision Conference,...
  • C. Schmid et al., Evaluation of interest point detectors, International Journal of Computer Vision (2000).
  • K. Mikolajczyk et al., Scale & affine invariant interest point detectors, International Journal of Computer Vision (2004).
  • S. Maitra, Moment invariants, Proceedings of IEEE (1979).
  • A. Abo-Zaid, O. Hinton, E. Horne, About moment normalization and complex moment descriptors, in: Proceedings of 4th...