Cite – A trainable image annotation system

https://doi.org/10.1016/S0167-8655(97)00119-0

Abstract

This paper explores techniques for building adaptive and trainable domain-specific image annotation systems. Learning and relaxation-based techniques are used to bind domain knowledge to data-driven sensing, with both utilising hierarchical lattice-type data structures.

Introduction

Image interpretation/annotation usually refers to the process of labelling images and, in particular, image regions or features, with symbolic descriptions drawn from some domain of reference or knowledge base. Two recent and representative systems are Schema and Sigma. In Schema (Draper et al., 1989), both knowledge and computation are partitioned at a coarse-grained semantic level. The knowledge base contains a set of schemas (frames, objects), each of which is designed to recognise one particular class of object. The Sigma system (Matsuyama and Hwang, 1990) is an image understanding system designed for aerial image analysis. It contains three modules for low-level vision, model selection and geometric reasoning, as well as a query module through which the user interacts. These modules are interconnected such that top-down and bottom-up vision processes are closely integrated. Interpretation within Sigma takes the form of a part-of hierarchy with image features at the leaf nodes and spatial relations describing the higher-level structure. Sigma's knowledge base is object-oriented in the sense that instances of an object are created (instantiated) dynamically from base object classes. However, as with Schema, the knowledge base for Sigma is hand-coded by an expert.

Our approach, embodied here in our system Cite, differs from Sigma and Schema in a number of ways. It extends the conventional knowledge base to a fully hierarchical one with no depth limitation. Multiple hypotheses are generated for each scene element and are resolved using a hierarchical extension to relaxation labelling. Traditional feed-forward segmentation is augmented with knowledge-driven resegmentation, which closes the control loop on low-level vision processes. Cite also extends the typical learn-phase then run-phase operational environment with incremental learning algorithms which build and improve the knowledge base after each scene has been fully analysed, as shown in Fig. 1, where the numbers beside each function block give the approximate order of operation for each operator.

An initial segmentation (1) of the image causes the unary (part) feature processor to compute features for each of the low-level regions (2). These features are then matched against the knowledge base (3) to provide initial labelling hypotheses, which are represented in an indexing structure called the scene interpretation. Clique resolving and hierarchical binary matching then operate on these initial hypotheses (5) using binary (relational) features calculated from the hierarchical segmentation (4). The higher-level scene hypotheses are added into the scene interpretation structure, and hierarchical relaxation labelling begins to resolve the multiple ambiguous labels for each object (6). As the labels resolve, nodes are individually resegmented (7) using parameters stored in the knowledge base. These resegmentations replace the initial segmentations in the visual interpretation structure, triggering a repeat of the unary and binary feature extraction and matching (stages (2) through (6)). This cycle continues until the interpretation becomes stable. If Cite's final interpretation is incorrect, the user may invoke incremental learning of the correct object labelling (8) by selecting the incorrectly labelled nodes and the desired knowledge base node; the updated knowledge base is then available as the next scene is viewed.

The relaxation labelling, knowledge-driven resegmentation, and hierarchical clique resolving and matching provide very tight closed-loop feedback within Cite. The hierarchical knowledge base provides a rich scene description which can include contextual, taxonomic and deep decomposition information. Finally, incremental supervised learning gives Cite the ability to increase the descriptive power and accuracy of its analyses, and to add new world knowledge as it becomes available, rather than requiring the full set of scene objects to be present during an initial learning phase.
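The shape of the interpretation cycle just described can be conveyed by a small runnable sketch. Everything below is an illustrative assumption rather than Cite's implementation: the knowledge base is flattened to a dict of feature prototypes, the relaxation step is a crude sharpening rule, and stages (4), (5) and (7) are only marked in comments.

```python
# A minimal, runnable sketch of the interpretation cycle in Fig. 1.
# All names, data structures and numbers are illustrative assumptions,
# not Cite's actual operators.

def segment(image):
    # (1) Stand-in for the initial segmentation: each item is one region.
    return [{"id": i, "pixels": px} for i, px in enumerate(image)]

def unary_features(region):
    # (2) Toy per-region (part) features.
    px = region["pixels"]
    return {"size": len(px), "mean": sum(px) / len(px)}

def match(features, kb):
    # (3) Score every knowledge-base node against the region's features,
    # yielding multiple weighted label hypotheses per region.
    scores = {label: 1.0 / (1.0 + abs(features["mean"] - proto["mean"]))
              for label, proto in kb.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def relax(hypotheses, iterations=5):
    # (6) Crude stand-in for hierarchical relaxation labelling: repeatedly
    # sharpen each region's label distribution towards its best label.
    for _ in range(iterations):
        for rid, labels in hypotheses.items():
            best = max(labels, key=labels.get)
            boosted = {l: (p * 1.5 if l == best else p)
                       for l, p in labels.items()}
            norm = sum(boosted.values())
            hypotheses[rid] = {l: p / norm for l, p in boosted.items()}
    return hypotheses

def interpret(image, kb):
    regions = segment(image)                             # (1)
    hypotheses = {r["id"]: match(unary_features(r), kb)  # (2)-(3)
                  for r in regions}
    # (4)-(5): binary (relational) features and clique resolving, and
    # (7): knowledge-driven resegmentation, would close the loop here.
    return relax(hypotheses)                             # (6)

kb = {"road": {"mean": 90.0}, "tree": {"mean": 40.0}}    # toy prototypes
image = [[85, 95, 90], [35, 42, 38]]                     # two toy "regions"
print(interpret(image, kb))
```

Running the toy drives each region's label distribution towards a single confident label, which mirrors, in spirit only, how Cite's hierarchical relaxation resolves its multiple initial hypotheses.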

Section snippets

Knowledge representation

Cite represents world knowledge as a (directed) lattice in which each node represents an object or visual concept that is either a part-of, a view-of or a type-of its parent or parents. Within each node, information is stored about the optimal segmentation, feature extraction and matching algorithms used to recognise that object in an image. Cite represents taxonomies using the type-of node.
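To make the representation concrete, here is a minimal sketch under our own assumptions: the field names, operator strings and the example fragment are illustrative, not Cite's actual data structures.

```python
# Hypothetical sketch of one node in the knowledge lattice: typed arcs to
# one or more parents, plus per-object choices of segmentation, feature
# and matching operators.

from dataclasses import dataclass, field

@dataclass
class KBNode:
    name: str
    # Arcs to parents, each tagged "part-of", "view-of" or "type-of".
    # A lattice permits several parents, unlike a tree.
    parents: list = field(default_factory=list)
    segmenter: str = "default"           # per-node segmentation algorithm
    features: tuple = ()                 # unary features used for matching
    matcher: str = "nearest-prototype"   # per-node matching strategy

# A small fragment mixing taxonomy (type-of) and decomposition (part-of):
vegetation = KBNode("vegetation")
tree = KBNode("tree", parents=[("type-of", vegetation)])
trunk = KBNode("trunk", parents=[("part-of", tree)],
               segmenter="edge-based", features=("elongation",))
canopy = KBNode("canopy", parents=[("part-of", tree)],
                segmenter="colour-clustering", features=("texture", "hue"))
```

Because the structure is a lattice rather than a tree, a shared sub-part may list several parents, each arc carrying its own relation type.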

Interpretation modules

Cite contains hierarchically segmented image data, again in the form of a lattice, called the visual interpretation.
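A correspondingly minimal sketch of one node in this lattice, again with hypothetical field names, might pair a segmented region with its sub-regions and its competing label hypotheses:

```python
# Hypothetical sketch of a visual-interpretation node: a region, its finer
# sub-regions, and its competing knowledge-base label hypotheses awaiting
# relaxation. Field names are our assumptions, not Cite's.

from dataclasses import dataclass, field

@dataclass
class VINode:
    region_id: int
    children: list = field(default_factory=list)     # finer sub-regions
    hypotheses: dict = field(default_factory=dict)   # label -> weight
```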

System performance and results

Here we illustrate the performance of Cite on a street scene, although we have examined four different scene types: offices, streets, outdoor scenes and airports. In total, 253 scene elements were detected in 38 images across the four scenarios. The knowledge bases for the scenarios were constructed from a collection of 85 images, and themselves contain a total of 155 scene elements at different levels of abstraction. Of the 253 detected scene elements, an average classification…

Conclusions

In this paper we have presented a new system for domain-specific image annotation/interpretation which integrates bottom-up and top-down processing to combine object recognition and scene understanding capabilities within the same framework. The more novel aspects of this work are the use of hierarchical structures to describe world knowledge in addition to scene and visual decompositions, knowledge-driven resegmentation, incremental supervised learning methods, and hierarchical relaxation labelling.

References

Draper, B.A., Collins, R.T., Brolio, J., Hanson, A.R., Riseman, E.M., 1989. The schema system. International Journal of Computer Vision 2 (3), 209–250.

Matsuyama, T., Hwang, V.S.-S., 1990. SIGMA: A Knowledge-Based Aerial Image Understanding System. Plenum Press, New York.
