Appearance-based active object recognition

https://doi.org/10.1016/S0262-8856(99)00075-X

Abstract

We present an efficient method within an active vision framework for recognizing objects which are ambiguous from certain viewpoints. The system is allowed to reposition the camera to capture additional views and, therefore, to improve the classification result obtained from a single view. The approach uses an appearance-based object representation, namely the parametric eigenspace, and augments it by probability distributions. This enables us to cope with possible variations in the input images due to errors in the pre-processing chain or changing imaging conditions. Furthermore, the use of probability distributions gives us a gauge to perform view planning. Multiple observations lead to a significant increase in recognition rate. Action planning is shown to be of great use in reducing the number of images necessary to achieve a certain recognition performance when compared to a random strategy.

Introduction

Most computer vision systems found in the literature perform object recognition on the basis of the information gathered from a single image. Typically, a set of features is extracted and matched against object models stored in a database. Much research in computer vision has gone into finding features that are capable of discriminating objects [1]. However, this approach faces problems once the features available from a single view are simply not sufficient to determine the identity of the observed object. Such a case arises, for example, when the database contains objects that look very similar from certain views or that share a similar internal representation (ambiguous objects or object data), a difficulty that is compounded by large object databases.

A solution to this problem is to utilize the information contained in multiple sensor observations. Active recognition provides the framework for collecting evidence until a sufficient level of confidence in one object hypothesis is obtained. The merits of this framework have already been recognized in various applications, ranging from land-use classification in remote sensing [16] to object recognition [2], [3], [4], [15], [18], [19].

Active recognition accumulates evidence collected from a multitude of sensor observations. The system has to provide tentative object hypotheses for each single view. Combining observations over a sequence of active steps moves the burden of object recognition slightly away from the process used to recognize a single view and towards the processes responsible for integrating the classification results of multiple views and for planning the next action.
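As a minimal illustration of such evidence integration, the following Python sketch fuses per-view object posteriors with a naive-Bayes product rule over conditionally independent views; this rule is an assumption made for illustration, not necessarily the fusion scheme developed later in the article.

    import numpy as np

    def fuse(posterior_so_far, view_posterior):
        """Combine the accumulated object posterior with the tentative
        posterior obtained from a single new view (elementwise product,
        renormalized)."""
        combined = posterior_so_far * view_posterior
        return combined / combined.sum()

    # Two objects that are ambiguous from the first viewpoint; a second,
    # more discriminative view resolves the tie.
    belief = np.array([0.5, 0.5])
    belief = fuse(belief, np.array([0.9, 0.1]))
    print(belief)  # -> [0.9 0.1]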

In active recognition we have a few major modules whose efficiency is decisive for the overall performance (see also Fig. 1):

  • The object recognition system (classifier) itself.

  • The fusion task, combining hypotheses obtained at each active step.

  • The planning and termination procedures.

Each of these modules can be realized in a variety of different ways. This article establishes a specific coherent algorithm for the implementation of each of the necessary steps in active recognition. The system uses a modified version of Murase and Nayar's [12] appearance-based object recognition system to provide object classifications for a single view and augments it with active recognition components. Murase and Nayar's method was chosen because it not only yields object classifications but also gives reasonable pose estimates (a prerequisite for active object recognition). It should be emphasized, however, that the presented work is not limited to the eigenspace recognition approach. The algorithm can also be applied to other view-based object recognition techniques that rely on unary, numerical feature spaces to represent objects.
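The interaction of these modules can be summarized in a short control-loop sketch; all function names below are placeholders for the concrete components developed in the remainder of the article, not actual library calls, and belief is a NumPy array of object probabilities.

    def active_recognition(capture, classify, fuse, plan, move,
                           prior, confidence=0.95, max_views=10):
        belief = prior
        for _ in range(max_views):
            # classifier: tentative hypotheses for the current view
            belief = fuse(belief, classify(capture()))
            if belief.max() >= confidence:   # termination criterion
                break
            move(plan(belief))               # reposition the camera
        return belief.argmax(), belief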


Related research

Previous work in planning sensing strategies may be divided into off-line and on-line approaches [21]. Murase and Nayar, for example, have presented an approach for off-line illumination planning in object recognition that searches for regions in eigenspace where the object manifolds are best separated [11]. A conceptually similar but methodologically more sophisticated strategy for on-line active view planning will be presented below.

The other area of research is concerned with choosing on-line a

Object recognition in parametric eigenspace

Appearance-based approaches to object recognition, and especially the eigenspace method, have experienced a renewed interest in the computer vision community due to their ability to handle combined effects of shape, pose, reflection properties and illumination [12], [13], [22]. Furthermore, appearance-based object representations can be obtained through an automatic learning procedure and do not require the explicit specification of object models. Efficient algorithms are available to extend
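In outline, the eigenspace is obtained by principal component analysis of the normalized training images. The following sketch uses a plain SVD-based PCA, a simplification of the authors' setting rather than their optimized implementation, to construct the basis and to project an image to its eigenspace point g.

    import numpy as np

    def build_eigenspace(X, k):
        """X holds one flattened, brightness-normalized training image per
        row; return the mean image and the k leading eigenvectors."""
        mean = X.mean(axis=0)
        # SVD of the centred data: the rows of Vt are eigenvectors of the
        # image covariance matrix, ordered by decreasing eigenvalue.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:k]

    def project(image, mean, basis):
        """Map a flattened image to its k-dimensional eigenspace point g."""
        return basis @ (image - mean)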

Probability distributions in eigenspace

Before going on to discuss active fusion in the context of eigenspace object recognition, we extend Murase and Nayar's concept of manifolds by introducing probability densities in eigenspace. Let us assume that we have constructed an eigenspace of all considered objects. We denote by p(g|oi,ϕj) the likelihood of ending up at point g in the eigenspace when projecting an image
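One simple way to realize such densities, sketched here purely for illustration, is to fit a Gaussian to the eigenspace projections of several perturbed sample images per object and pose; the Gaussian form is an assumption of this sketch, since the approach only requires some estimated density per (object, pose) pair.

    import numpy as np

    def fit_density(points):
        """points: (n_samples, k) eigenspace projections g obtained from
        perturbed sample images of object i at pose j."""
        mu = points.mean(axis=0)
        # A small ridge keeps the covariance invertible for few samples.
        cov = np.cov(points, rowvar=False) + 1e-6 * np.eye(points.shape[1])
        return mu, cov

    def likelihood(g, mu, cov):
        """Evaluate the fitted Gaussian density p(g|oi,ϕj) at point g."""
        k = len(mu)
        diff = g - mu
        norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
        return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)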

Active object recognition

Active steps in object recognition will lead to striking improvements if the object database contains objects that share similar views. The key process for disambiguating such objects is a planned movement of the camera to a new viewpoint from which the objects appear distinct. We will tackle this problem now within the framework of eigenspace-based object recognition. Nevertheless, the following discussion of active object recognition is largely independent of the employed feature space. In order
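One common way to cast such view planning, sketched here under the assumption that the learned densities let us predict the posterior a camera action would produce, is to choose the action with the lowest expected posterior entropy; predict_posterior is a hypothetical model, not part of the original system's interface.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def plan_next_view(actions, belief, predict_posterior):
        """Pick the action whose predicted outcome is, averaged over the
        current object belief, least ambiguous. predict_posterior(a, i)
        returns the object posterior expected after action a if the true
        object were i."""
        def expected_H(a):
            return sum(belief[i] * entropy(predict_posterior(a, i))
                       for i in range(len(belief)))
        return min(actions, key=expected_H)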

The complexity of the algorithm

In the following we denote by no the number of objects, by nϕ the number of possible discrete manifold parameters (the total number of viewpoints) and by nf the number of degrees of freedom of the setup. Since nϕ depends exponentially on the number of degrees of freedom, we introduce nv, the mean number of views per degree of freedom, such that nϕ=nv^nf. Finally, let us denote by na the average number of possible actions. If all movements are allowed we will usually have na=nϕ.
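To illustrate with hypothetical numbers: a setup with nf=2 degrees of freedom (say, turntable rotation and camera height) and nv=36 views per degree of freedom yields nϕ=36^2=1296 viewpoints, and if all movements are allowed, na=nϕ=1296 candidate actions have to be evaluated at each active step.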

Before starting the

Continuous values for the pose estimation

The fundamental likelihoods p(g|oi,ϕj) are based upon learned sample images for a set of discrete pose parameters ϕj, with j=1,…,nϕ. It is possible to deal with intermediate poses if images from views ϕj±Δϕj are also used in the construction of p(g|oi,ϕj). However, the system in its current form is limited to final pose estimates P(ϕj|oi,I1,…,In) at the accuracy of the trained subdivisions.
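A minimal sketch of accumulating such a discrete pose posterior for a fixed object hypothesis oi, assuming conditionally independent views, a uniform pose prior, and pose indices already registered across the camera's own movements:

    import numpy as np

    def pose_posterior(likelihoods):
        """likelihoods[v, j] = p(g_v|oi,ϕj) for view v and trained pose ϕj."""
        post = likelihoods.prod(axis=0)   # combine the n observations
        return post / post.sum()          # P(ϕj|oi,I1,…,In)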

For some applications one is not only interested in recognizing the object but also in estimating its

Experiments

An active vision system has been built that allows for a variety of different movements (see Fig. 5). In the experiments described below, the system changes the vertical position of the camera, the camera tilt, and the orientation of the turntable.

Conclusions

We have presented an active object recognition system for single-object scenes. Depending on the uncertainty in the current object classification, the recognition module acquires new sensor measurements in a planned manner until the confidence in a certain hypothesis reaches a pre-defined level or another termination criterion is met. The well-known object recognition approach using eigenspace representations was augmented by probability distributions in order to capture possible variations

References (23)

  • S.A. Hutchinson et al., Multisensor strategies using Dempster–Shafer belief accumulation


We gratefully acknowledge support by the Austrian ‘Fonds zur Förderung der wissenschaftlichen Forschung’ under grant S7003, and the Austrian Ministry of Science (BMWV Gz. 601.574/2-IV/B/9/96).
