Neurocomputing

Volumes 58–60, June 2004, Pages 801-806

A unifying framework for natural image statistics: spatiotemporal activity bubbles

https://doi.org/10.1016/j.neucom.2004.01.130

Abstract

Recently, different models of the statistical structure of natural images (and image sequences) have been proposed. Maximizing the sparseness, or alternatively the temporal coherence, of linear filter outputs leads to the emergence of simple-cell properties. Taking account of the basic dependencies of linear filter outputs enables modelling of complex-cell and topographic properties as well. Here, we propose a unifying framework for all these statistical properties, based on the concept of spatiotemporal activity bubbles.

Introduction

Natural images are not white noise; they have some robust regularities. Previous research has built statistical models of natural images, and utilized them either for modelling the receptive fields of neurons in the visual cortex, or for developing new image processing methods. The following three properties seem to be the most important found so far: sparseness, temporal coherence, and topographic dependencies. This paper proposes a new framework for modelling the statistical structure of natural image sequences, combining these three properties. It leads to models where activation of the simple cells takes the form of “bubbles”, which are regions of activity that are localized both in time and in space (space meaning the cortical surface). First, we will review some of the existing literature and known properties of natural images.

Linear filters whose outputs maximize sparseness come to mimic simple-cell receptive fields [7]. Sparseness means that the random variable takes very small (absolute) values or very large values more often than a gaussian random variable would; to compensate, it takes values in between relatively more rarely. Thus the random variable is activated, i.e. significantly nonzero, only rarely. The probability density of the absolute value of a sparse random variable is often modelled as an exponential density, which has a higher peak at zero than a gaussian density.

Sparseness has nothing to do with the variance (scale) of the random variable. To measure the sparseness of a random variable s_i with zero mean, let us first normalize its scale so that the variance E{s_i^2} equals some given constant. Then sparseness can be measured as the expectation E{G(s_i^2)} of a suitable nonlinear function of the square. Typically, G is chosen to be convex, i.e. its second derivative is positive, e.g. G(s_i^2) = (s_i^2)^2. Convexity implies that this expectation is large when s_i^2 typically takes values that are either very close to 0 or very large, i.e. when s_i is sparse.
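As a minimal numerical sketch of this measure (synthetic data, not from the paper): with the variance normalized to one and G(u) = u^2, the expectation E{G(s_i^2)} reduces to the fourth moment, which is about 3 for a gaussian variable and larger for a sparse (e.g. Laplacian) one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two zero-mean variables: one gaussian, one sparse (Laplacian).
gauss = rng.normal(size=n)
sparse = rng.laplace(size=n)

def sparseness(s):
    """E{G(s^2)} with G(u) = u^2, after normalizing the variance to 1."""
    s = s / s.std()
    return np.mean(s**4)

print(sparseness(gauss))   # ~3: gaussian fourth moment
print(sparseness(sparse))  # ~6: larger value => sparser variable
```

The Laplacian's fourth moment (6) exceeds the gaussian's (3), so the convex measure correctly ranks it as sparser.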

An alternative to sparseness is given by temporal coherence [2], [9], [10]. When the input consists of natural image sequences, i.e. video data, the simple-cell receptive fields optimize this criterion as well. Temporal coherence as defined in [2] is a nonlinear form of correlation, measured, for example, as the temporal correlation of the squared outputs. This means that the general activity level (variance) changes smoothly in time, although the actual cell outputs cannot be predicted.
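A small simulation illustrates this form of coherence (the smooth variance envelope is an assumed toy construction, not the paper's model): the signal itself is temporally uncorrelated, but its squared output is not.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200_000

# Slowly varying activity level: smoothed noise, rectified to be nonnegative.
envelope = np.abs(np.convolve(rng.normal(size=T), np.ones(50) / 50, mode="same"))

# Filter output: white gaussian carrier modulated by the smooth envelope.
s = envelope * rng.normal(size=T)

def lag1_corr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_corr(s))      # ~0: the raw output is unpredictable
print(lag1_corr(s**2))   # clearly positive: the activity level is coherent in time
```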

It must be noted that ordinary linear correlation is not able to produce well-defined filters. Receptive fields maximizing linear correlation are more similar to Fourier components and lack the localization properties of simple-cell receptive fields [2].

Consider a number of representational components s_i, i = 1, …, n, such as outputs of simple cells. Now, we consider their statistical dependencies, assuming that the joint distribution of the s_i is dictated by the natural image input. Again, we must consider nonlinear correlations as in the case of temporal coherence, since linear correlations are typically constrained to zero. In image data, the principal dependency between two simple-cell outputs seems to be captured by the correlation of their energies s_i^2, that is, the general activity levels or variances [8], [3], [4].
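This energy dependency can be sketched with two synthetic cell outputs that share a common activity level (the shared-variance construction is an illustrative assumption): the outputs are linearly uncorrelated, yet their squares are correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# A common "activity level" v drives both cells; the carriers are independent.
v = np.abs(rng.normal(size=n)) + 0.1
s1 = v * rng.normal(size=n)
s2 = v * rng.normal(size=n)

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print(corr(s1, s2))        # ~0: no linear correlation
print(corr(s1**2, s2**2))  # positive: the energies are correlated
```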

The dependencies of simple-cell outputs can be used to define a topographic organization. Let us assume that the s_i are arranged on a two-dimensional grid or lattice, as is typical in topographic models. We have proposed a model [3], [4] in which the energies are strongly positively correlated for neighboring cells. This means simultaneous activation of neighboring cells; such simultaneous activation is implicit in much of the work on cortical topography.
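A toy version of such topographic dependencies (grid size, smoothing window, and sample count are arbitrary illustrative choices): each cell's variance is a local average over an underlying field, so adjacent cells share activity levels while distant cells do not.

```python
import numpy as np

rng = np.random.default_rng(3)
T, G = 50_000, 8   # T samples of a G x G grid of cells

# Per-sample variance field: a 3x3 local average over an underlying gaussian
# field, so that neighboring grid positions share their activity level.
u = rng.normal(size=(T, G + 2, G + 2))
v = np.abs(sum(u[:, i:i + G, j:j + G] for i in range(3) for j in range(3)) / 9.0)

s = v * rng.normal(size=(T, G, G))   # cell outputs on the grid

def energy_corr(a, b):
    return np.corrcoef(a**2, b**2)[0, 1]

near = energy_corr(s[:, 3, 3], s[:, 3, 4])  # adjacent cells: overlapping windows
far = energy_corr(s[:, 0, 0], s[:, 7, 7])   # distant cells: disjoint windows
print(near, far)   # near is clearly positive, far is near zero
```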

The statistical properties discussed above are usually utilized in the framework of a generative model. Denote by I(x,y,t) the observed data whose components are pixel gray-scale values (point luminances) in an image patch at time point t. The models that we consider here express a monochrome image patch as a linear superposition of some features or basis vectors a_i:

I(x,y,t) = Σ_{i=1}^{n} a_i(x,y) s_i(t).

The s_i(t) are stochastic coefficients, different from patch to patch. In a cortical interpretation, the s_i model the responses of (signed) simple cells, and the a_i are closely related to their classical receptive fields [7]. For simplicity, we consider only spatial receptive fields in this paper. Estimation of the model consists of determining the values of both s_i and a_i for all i, given a sufficient number of observed patches I_t.
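The generative model can be sketched as follows. The Gabor-shaped basis vectors here stand in for the features a_i as a common illustrative choice; the paper's actual basis would be estimated from data rather than specified by hand.

```python
import numpy as np

rng = np.random.default_rng(4)
P = 16   # patch is P x P pixels

def gabor(theta, phase, freq=0.25):
    """Hypothetical Gabor-shaped basis vector a_i(x, y) (simple-cell-like)."""
    y, x = np.mgrid[-P // 2:P // 2, -P // 2:P // 2]
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2 * 3.0**2))
    return env * np.cos(2 * np.pi * freq * xr + phase)

# A small dictionary of basis vectors a_i and sparse coefficients s_i(t).
basis = [gabor(th, ph) for th in (0, np.pi / 4, np.pi / 2) for ph in (0, np.pi / 2)]
s = rng.laplace(size=len(basis))   # sparse: mostly small, occasionally large

# I(x,y,t) = sum_i a_i(x,y) s_i(t): one synthesized patch.
I = sum(si * ai for si, ai in zip(s, basis))
print(I.shape)   # (16, 16)
```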

In the most basic models, the s_i are assumed to be statistically independent, i.e. the value of s_j cannot be used to predict s_i for i ≠ j. Then we can use either sparseness or temporal coherence to estimate the receptive fields [6]. If sparseness is used [7], the temporal structure of the data is ignored; indeed, the data need not have any temporal structure in the first place. The resulting model is called independent component analysis (ICA) [6], and it can be considered a nongaussian version of factor analysis. Temporal coherence leads to quite similar receptive fields [2]. When topography is used, the s_i are no longer assumed to be independent; instead, they have the topographic dependencies defined above. This leads to the topographic ICA model [3], [4], which combines the properties of sparse components and topographic dependencies in a single model.
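As an illustration of estimating such a model, here is a one-unit fixed-point iteration in the style of FastICA on a synthetic two-source mixture; the mixing matrix, source distribution, and nonlinearity g = tanh are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Two independent sparse (Laplacian) sources, linearly mixed.
S = rng.laplace(size=(2, n))
A = np.array([[2.0, 1.0], [1.0, 1.5]])   # hypothetical mixing (basis) matrix
X = A @ S

# Whiten the data (zero mean, identity covariance): standard ICA preprocessing.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = (E / np.sqrt(d)) @ E.T @ X

# One-unit fixed-point iteration: w <- E{z g(w'z)} - E{g'(w'z)} w, normalized.
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(100):
    wz = w @ Z
    w = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz)**2).mean() * w
    w /= np.linalg.norm(w)

# The recovered component should match one source up to sign and scale.
est = w @ Z
corrs = [abs(np.corrcoef(est, s)[0, 1]) for s in S]
print(max(corrs))   # close to 1
```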

Temporal bubbles

As discussed above, both maximization of the sparseness of linear filter outputs and the maximization of their temporal coherence lead to receptive fields that have the principal properties of simple cells. How is it possible that two quite different criteria give quite similar receptive fields? What is the connection between the two criteria?

To answer these questions, we propose a model of the linear filter outputs that combines the two properties. The model explains why both criteria give similar receptive fields.
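A toy version of the bubble idea (entirely synthetic; grid size, event rate, and blur width are arbitrary): rare point events, smoothed over both time and a one-dimensional "cortical" line of cells, gate gaussian noise. The resulting outputs are simultaneously sparse, temporally coherent, and energy-correlated across neighbors, while the raw outputs remain uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 30, 50_000   # N cells on a 1-D cortical line, T time steps

# Bubble process: rare point events blurred over both time and the grid,
# giving activity regions localized in time and in space.
events = (rng.random(size=(N, T)) < 0.002).astype(float)
kern = np.ones(9)
blur = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, events)
blur = np.apply_along_axis(lambda c: np.convolve(c, kern, mode="same"), 0, blur)

s = blur * rng.normal(size=(N, T))   # cell outputs: noise gated by the bubbles

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
z = s[10] / s[10].std()
print(np.mean(z**4))                       # sparseness: well above the gaussian 3
print(corr(s[10, :-1]**2, s[10, 1:]**2))   # temporal coherence of the energy
print(corr(s[10]**2, s[11]**2))            # energy correlation of neighbors
print(abs(corr(s[10, :-1], s[10, 1:])))    # raw outputs stay uncorrelated
```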

Discussion

Why would the visual system bother with such a sophisticated model of natural image statistics? First and foremost, bubble coding provides a suitable internal model of the structure of natural stimuli. If we consider visual processing in a Bayesian framework, it is paramount to obtain statistical models of the input that are as accurate as possible. Second, estimating the bubble process may be more interesting for higher areas than the activations of the single cells.

References (10)
