Image and Vision Computing

Volume 17, Issue 11, September 1999, Pages 845-858
3D object recognition from static 2D views using multiple coarse data channels

https://doi.org/10.1016/S0262-8856(98)00159-0

Abstract

A 3D object recognition system is described that employs a novel multiresolution representation and coarse encoding of feature information. Classic feature extraction methods are modified by using wavelet transform maxima to direct the actions of the feature extraction modules. The reasons behind the use of a multi-channel architecture are described, together with the feature extraction and coarse modules. Since the targeted field of application is the automatic categorisation of natural objects, the proposed system is designed to run on ordinary hardware platforms and to process an input in a short timeframe. The system has been evaluated on a variety of 2D views of a set of 5 synthetic objects designed to present various degrees of similarity, as rated by a panel of human subjects. Parallels between these ratings and the system’s behaviour are drawn. Additionally, a small set of photomicrographs of fish larvae has been used to assess the system’s performance when presented with very similar, non-rigid shapes. For comparison, the parameters extracted from each image were fed into two categorisers: discriminant analysis and a multilayer feedforward neural network with backpropagation of error. Experimental evidence is presented which demonstrates the efficacy of the methods. The satisfactory categorisation performance of the system is reported, and conclusions are drawn about the system’s behaviour.

Introduction

Computational studies of vision in the past decades have highlighted the complexity of the processes involved in performing visual tasks and the inherent difficulty of building computational systems that perform 3D object recognition and scene comprehension. A series of computational studies and theories reviewed by Hildreth and Ullman [1] describe vision as a chain of processes that, based on the retinal image, yield increasingly complex representations of the visible world. Among the theories on visual information representation in biological systems (the work of Marr [2], Biederman [3], Ullman [4], [5] and Edelman and Weinshall [6]), at present those discussing viewer-centred representation seem to have more experimental evidence in their support (Tarr and Pinker [7], Edelman and Bulthoff [8]). The experimental results reviewed by Edelman [9] also contradict the theories centred around the idea of representation with reconstruction [2]. Hence there was a gradual shift in vision research from theories postulating the necessity of using extremely detailed, often complete representations of the world (the work of Freuder, Tenenbaum and Barrow cited in [2]), towards more relaxed frameworks based on viewpoint-dependent encoding of views and storage of multiple views (as suggested by Ullman and Basri [5]).

Arriving at this fundamental problem of representing the visible world, an important task in the design of a recognition system is the selection of descriptors that would constitute the building blocks of the internal representation of the analysed objects. Still, the virtual impossibility of obtaining a universally valid set of features and categorisation criteria based on these has been pointed out by studies in the field of taxonomy. Sokal [10] highlighted the existence of individual differences in taxonomic judgement, since a group of human classifiers can arrive at correct categorisation of objects based on quite different sets of features considered to be salient.

In analysing images for object recognition purposes, researchers have usually tended to focus on object and image properties that are also salient to human enquiry. Thus texture descriptions [11], edge positions [12] and statistical descriptions of pixel densities [13] have all been used to segment images into their component parts. Categorisation follows, providing object recognition. Most of these methods rely on extracting very precise measurements of, for example, symmetries or shape description (as illustrated by methods proposed by Brady [14], Khotanzad and Liou [15]). None have proved reliable analysis tools for understanding natural images or images with noise and clutter obscuring the objects of interest. As an alternative, a method was developed by Ellis et al. [16] that draws on the concept of Ullman’s multiple visual routines [17]. The principle of operation is the registration at low resolution of multiple parameters that describe the object scene in an image. If many of these ‘coarse channels’ are analysed in concert, a solution to the particular analysis may be found, one which may not be apparent when using high-resolution data. This is similar in concept to finding a global minimum in a multidimensional descriptor space, so often described in artificial neural network research (a good example being Rumelhart and McClelland’s work [18]). The coarse channel principle has been applied successfully to the automatic categorisation of 23 species of field-collected marine plankton, in a system developed by Culverhouse et al. [19]. It is also applied here to the task of three-dimensional object recognition.
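The coarse channel principle described above can be illustrated with a minimal sketch. It is not the authors' implementation: the channel names (`orientation_deg`, `radius`), the value ranges and the choice of four bins are hypothetical, chosen only to show how several measurements, each registered at low resolution, might be concatenated into one descriptor for a categoriser.

```python
def coarse_histogram(values, lo, hi, bins):
    """Quantise measurements into a few wide bins and normalise the counts."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def coarse_channel_vector(channels, bins=4):
    """Concatenate the coarse histograms of all channels, in name order."""
    vector = []
    for name in sorted(channels):
        values, lo, hi = channels[name]
        vector.extend(coarse_histogram(values, lo, hi, bins))
    return vector


# Hypothetical measurements taken from one 2D view (assumed for illustration).
channels = {
    "orientation_deg": ([10, 20, 100, 170], 0, 180),
    "radius": ([0.1, 0.9], 0.0, 1.0),
}
vector = coarse_channel_vector(channels)
print(vector)
```

Each channel on its own is deliberately imprecise. In this sketch, any discriminating power would have to come from analysing the concatenated channels together.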

This approach of non-exact feature description and low-resolution encoding of features also constitutes the central concept of other recently developed systems that do not necessarily employ multiple data channels. Bradsky and Grossberg [20] describe a system that uses in its preprocessing stage an array of Gaussian receptive fields in order to decrease the dimensionality of the data. The system developed by Mel [21] employs a large array of filters placed on the input image, the outputs being coarse coded as histograms for achieving viewpoint-invariance. In a conceptually related way, Schiele and Crowley [22], [23] have used multidimensional receptive field histograms characterising 2D views of objects in classification and in determination of favourable viewpoints for recognition. Edelman’s Chorus scheme [24] utilises a receptive field array that provides low-dimensional description of the input data. The main difference between these approaches and the authors’ system is that the attention of the system is directed by a module employing Mallat’s [25] multiresolution analysis (MRA) towards areas of the image that contain potentially relevant features for categorisation. Therefore it does not analyse the entire surface of the input image (e.g. by placing a large array of receptive fields on the image).

The proposed system constitutes an engineering solution, since the processing algorithms were designed to run on widely available hardware platforms and to perform the analysis in a timeframe short enough to make the system usable in laboratory conditions.

Section snippets

Overview of the system

The recognition system has three components: (i) a multi-resolution feature extractor that uses wavelet filter banks, (ii) a coarse channel feature analyser and (iii) an object categoriser. Features are defined in this context as areas of high contrast or high curvature, the extraction of these being directed by low-resolution information, following work on visual inspection through eye tracking by Niemann et al. [26] and Rao et al. [27]. The spatial organisation of these features is analysed

The preprocessing and coarse coding methods

In this section the mathematics and algorithms behind the feature extraction and coarse coding methods are described, with emphasis on the novel way of representing and encoding the scale-space topology of the wavelet transform’s local maxima.
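As a rough illustration of what local maxima of a wavelet transform across scales look like, the following sketch works on a 1D signal. It is an assumption-laden stand-in, not the method of the paper: a Haar-like difference of local means replaces the wavelet filter bank, and the function names and the dyadic scales (1, 2, 4) are hypothetical.

```python
def haar_response(signal, scale):
    """Mean of `scale` samples to the right minus mean of `scale` to the left."""
    n = len(signal)
    out = [0.0] * n
    for i in range(scale, n - scale + 1):
        right = sum(signal[i:i + scale]) / scale
        left = sum(signal[i - scale:i]) / scale
        out[i] = right - left
    return out


def modulus_maxima(response, threshold=1e-9):
    """Indices where the response magnitude is a local maximum above threshold."""
    mags = [abs(v) for v in response]
    peaks = []
    for i in range(1, len(mags) - 1):
        if mags[i] > threshold and mags[i] > mags[i - 1] and mags[i] >= mags[i + 1]:
            peaks.append(i)
    return peaks


def maxima_across_scales(signal, scales=(1, 2, 4)):
    """Map each scale to the positions of its modulus maxima."""
    return {s: modulus_maxima(haar_response(signal, s)) for s in scales}


# A step edge at index 16: the maximum persists at the same position
# at every scale, which is what marks it as a candidate feature.
step = [0.0] * 16 + [1.0] * 16
maxima = maxima_across_scales(step)
print(maxima)
```

A maximum that persists from coarse to fine scales, as the step edge does here, is the kind of structure that low-resolution information could use to direct finer analysis towards a region of the image.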

Classification experiments

In order to evaluate the MRA/coarse-coded data channel image analyser, three test data sets were used. The results were fed into two categorisers for training and testing. An 8-object and a 5-object data set comprised computer-generated 2D views of 3D objects. The Aberdeen data set held multiple 2D views of natural images of fish larvae. Images in the Aberdeen set were typically of much poorer image quality than the first two sets of images. The 8-object data set was used to evaluate the θ

Conclusions

A three-dimensional object recogniser was presented that operates on coarse coded data obtained from multi-resolution analysis of 2D views of 3D objects. The system has been tested on a variety of synthetic objects, some of which present self-occlusion during rotation. The implemented feature extraction and coarse coding techniques led to good results in classifying views of similar synthetic 3D objects, in conditions of wide variations of viewpoint. Also, in the case of a difficult set of

Acknowledgements

The authors are grateful to Paul Rankine from the Marine Laboratory, Agriculture and Fisheries Department, The Scottish Office, Aberdeen for providing the set of photomicrographs of fish larvae.

References (47)

  • Marr, D., Vision – A computational investigation into the human representation and processing of visual information,...
  • I. Biederman, Recognition by components: a theory of human image understanding, Psychol. Review (1987)
  • S. Ullman et al., Recognition by linear combinations of models, IEEE Transactions on Pattern Analysis and Machine Intelligence (1991)
  • S. Edelman et al., A Self-organising multiple-view representation of 3D objects, Biological Cybernetics (1991)
  • R.R. Sokal, Classification: purposes, principles, progress, prospects, Science (1974)
  • J. Canny, A Computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (1986)
  • J.D. Helterbrand et al., A statistical approach to identifying closed object boundaries in images, Advances in Applied Probability (1994)
  • Brady, M., Representing shape. In: Parallel architectures and computer vision workshop, Somerville college, Oxford,...
  • A. Khotanzad et al., Recognition and pose estimation of unoccluded three-dimensional objects from a two-dimensional perspective view by banks of neural networks, IEEE Transactions on Neural Networks (1996)
  • Ellis, R., Simpson, R., Culverhouse, P.F., Parisini, T., Williams, R., Reguera, B., Moore, B., Lower, D., Expert visual...
  • Rumelhart, D.E., McClelland, J.L., Parallel distributed processing: Explorations in the microstructure of cognition...
  • P.F. Culverhouse et al., Automatic categorisation of 23 species of Dinoflagellate by artificial neural network, Mar. Ecol. Prog. Ser. (1996)
  • G. Bradsky et al., Fast-learning VIEWNET architectures for recognizing three-dimensional objects from multiple two-dimensional views, Neural Networks (1995)