
1 Introduction

In recent years, the class of artificial neural networks (ANNs) known as deep learning [7, 20] has revolutionized the field of computer vision, with unprecedented results [12, 23, 24]. One of the application domains that has clearly benefited from the rise of deep learning is that of autonomous vehicles [1, 21]. Despite this great progress, autonomous driving remains an unsolved problem: a major challenge for image processing is to achieve an integration with motor commands that is reliable enough for an acceptable level of safety.

Contrary to common belief, humans are very reliable drivers: in the US there is about one fatality per 100,000,000 miles driven. Such considerations lead one to reflect on why the human brain is so efficient at solving the driving task, and on whether it is possible to take inspiration from the mechanisms by which the brain learns to perform such a complex task. This is the aim of the European project Dreams4Cars, in which we are developing an artificial driving agent inspired by the neurocognition of human driving; for further details refer to [2]. The work presented here is a component of the Dreams4Cars project, addressing the visual information collected by a camera on a vehicle.

Artificial neural networks are not a faithful model of how the brain works merely because their basic computational units are named “neurons”, as is often supposed. However, in deep convolutional neural networks [12] there is some resemblance between the alternating convolutional and pooling layers and the composition of simple and complex cells found in the visual cortex [8]. Still, CNNs adhere to a neat division between the visual process and other cognitive tasks, which is a critical departure from the behavior of living agents, including driving. Our effort is to leverage the most established current neurocognitive theories on how the brain develops the ability to drive, in order to derive the neural network architecture presented here.

Fig. 1. Schematic representation of the CDZ framework by Meyer and Damasio. Neuron ensembles in early sensorimotor cortices of different modalities send converging forward projections (red arrows) to higher-order association cortices, which, in turn, project back divergently (black arrows) to the early cortical sites, via several intermediate steps. (Color figure online)

2 Simulation, Imagery, and Their Artificial Counterpart

The ability to drive is just one of many highly specialized human sensorimotor behaviors. What is remarkable in humans (and in part in other mammals) is the aptitude for learning new motor skills without any innate scheme, a capability that involves sophisticated computational mechanisms [5, 27]. In principle, ANN models are among the most appropriate artificial tools for replicating this ability, being grounded on a strong empiricist paradigm of cognition [13]. However, to turn this general principle into workable models, many details need to be unfolded.

2.1 Simulation Theory and Convergence-Divergence Zones

A first step can be taken by adopting the proposal of Jeannerod and Hesslow, the so-called simulation theory of cognition, which holds that thinking is essentially a simulated interaction with the environment [6, 9]. In the view of Hesslow, simulation is a general principle of cognition, explicated in at least three different components: perception, action, and anticipation. Perception can be simulated by internal activation of sensory cortex in a way that resembles its normal activation during perception of external stimuli. Simulation of actions can be performed by activating motor structures, as during normal behavior, but suppressing their actual execution. The simplest case of simulation is mental imagery, especially in the visual modality. This is the case, for example, when a person tries to picture an object or a situation. During this phenomenon, the primary visual cortex (V1) is activated with a simplified representation of the object of interest, but the visual stimulus is not actually perceived [15].

A second step is to identify how, at the neural level, simulation can take place. A prominent proposal in this direction has been formulated in terms of convergence-divergence zones (CDZs) [14]. The primary purpose of “convergence” is to record, by means of synaptic plasticity, which patterns of features – coded as knowledge fragments in the early cortices – occur in relation to a specific concept. Such records are built through experience, by interacting with objects. A requirement for convergence zones is the ability to reciprocate feedforward projections with feedback projections in a one-to-many fashion – the “divergence” path. The convergent flow is dominant during perceptual recognition, while the divergent flow dominates imagery. Convergent-divergent connectivity patterns can be identified for specific sensory modalities, but also in higher-order association cortices, as shown in the hierarchical structure of Fig. 1.

2.2 The Predictive Theory

The reason why cognition, according to Hesslow or Jeannerod, is mainly explicated as simulation is that, through simulation, the brain can achieve the most precious information for an organism: a prediction of the future state of affairs in the environment. The need for prediction, and how it molds the entire cognition, has become the core of a different but related theory which has gained large attention in the last decade, popularized under the terms “predictive brain” or “free-energy principle for the brain”. The leading figure of this theory is Karl Friston [3, 4], who argues that the behavior of the brain, and of an organism as a whole, can be conceived as the minimization of free energy. This concept originated in thermodynamics, as a measure of the amount of work that can be extracted from a system. What Friston borrows is not the thermodynamic meaning of free energy, but its mathematical form, deriving from the framework of variational Bayesian methods in statistical physics [26]. This basic framework is adapted by Friston to abstract entities of cognitive value; for example, this is his free-energy formulation in the case of perception [4, p. 427]:

$$\begin{aligned} F_P = \varDelta_{\mathrm{KL}}\Big(\check{p}(\varvec{c}|\varvec{z}) \,\Vert\, p(\varvec{c}|\varvec{x},\varvec{a})\Big) - \log p(\varvec{x}|\varvec{a}) \end{aligned}$$
(1)

where \(\varvec{x}\) is the sensorial input of the organism, \(\varvec{c}\) is the collection of the environmental causes producing \(\varvec{x}\), \(\varvec{a}\) are actions that act on the environment to change sensory samples, and \(\varvec{z}\) are inner representations of the brain. The quantity \(\check{p}(\varvec{c}|\varvec{z})\) is the encoding in the brain of the estimate of causes of sensorial stimuli. The difference between this encoding and the distribution \(p(\varvec{c}|\varvec{x},\varvec{a})\) in the environment is computed by the Kullback–Leibler divergence \(\varDelta_{\mathrm{KL}}\) [26]. The minimization of \(F_P\) in Eq. (1) optimizes \(\varvec{z}\).
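As a toy numerical illustration of the divergence term \(\varDelta_{\mathrm{KL}}\) (not part of Friston's formulation; the two distributions below are made up for the example), the following Python snippet computes the Kullback–Leibler divergence between two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) for two discrete probability distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]  # e.g. the brain's estimate over three possible causes
q = [0.5, 0.3, 0.2]  # e.g. the actual distribution in the environment
print(kl_divergence(p, q))  # ~0.085: a small mismatch between the two
print(kl_divergence(p, p))  # 0.0: the divergence vanishes when they coincide
```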

2.3 Autoencoder-Based CDZs and Free-Energy Models

Over the years, the CDZ hypothesis has found the support of a large body of neurocognitive and neurophysiological evidence; however, it is a purely descriptive model. In our opinion, a computational idea that bears significant similarities with the CDZ scheme is the autoencoder. Autoencoder architectures have been the cornerstone of the evolution from shallow to deep neural architectures [7, 25], and were later exploited for capturing compact information from visual inputs [11]. In this kind of model, the task to be solved by the network is to reproduce as output the same picture fed as input. The advantage is that, while learning to reconstruct the input image, the model develops a very compact internal representation of the visual scene. Models able to learn such a representation are closely connected with the cognitive activity of mental imagery.

A remarkable improvement over the original autoencoders is the concept of the variational autoencoder [10], where the internal representation is implemented in probabilistic terms, adopting the variational Bayesian framework [26]. The encoder part is held to provide an approximation \(\check{p}_\varPhi(\varvec{z}|\varvec{x})\) of the unknown posterior distribution of the latent variables \(\varvec{z}\) given the input \(\varvec{x}\), depending on the set of parameters \(\varPhi\) of the encoder. The decoder part has its own set of parameters \(\varTheta\), and from a fixed internal representation \(\varvec{z_0}\) produces an output \(\varvec{y}=d_{\varTheta}(\varvec{z_0})\). The typical loss function for a variational autoencoder with parameters \(\varPhi\) and \(\varTheta\) can be written as:

$$\begin{aligned} \mathcal{L}\left(\varPhi,\varTheta,\varvec{x}\right) = \varDelta_{\mathrm{KL}}\big(\check{p}_{\varPhi}(\varvec{z}|\varvec{x}) \,\Vert\, p(\varvec{z})\big) - \log p_{\varTheta}(\varvec{x}|\varvec{z}) \end{aligned}$$
(2)

where, on the right-hand side of the equation, the first term is the Kullback–Leibler divergence between the approximate distribution of \(\varvec{z}\) produced by the encoder and the prior distribution \(p(\varvec{z})\), while the second term is the element-wise likelihood of the decoder generating as output the same data \(\varvec{x}\) given as input. It can be easily seen that Eq. (2) has exactly the same form as Friston’s “free-energy”, shown in Eq. (1); therefore the variational autoencoder captures both the CDZ scheme and the idea of predicting by minimization of the free energy.
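As a concrete, minimal sketch (not the exact code of our implementation), Eq. (2) for a Gaussian latent space is typically realized in TensorFlow as follows; the function name vae_loss and the use of binary cross-entropy as the reconstruction likelihood are illustrative assumptions:

```python
import tensorflow as tf

def vae_loss(x, x_decoded, z_mean, z_log_var):
    # Closed-form KL divergence between N(z_mean, exp(z_log_var)) and the
    # standard normal prior p(z) = N(0, I), per sample in the batch.
    kl = -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
    # Reconstruction term: -log p_Theta(x|z) as pixel-wise binary cross-entropy
    rec = tf.reduce_sum(
        tf.keras.losses.binary_crossentropy(x, x_decoded), axis=[1, 2])
    return tf.reduce_mean(kl + rec)
```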

Fig. 2. The architecture of our model. The variational autoencoder has an encoder compressing an RGB image to a compact high-level feature representation. Then 3 different decoders map the latent space back to separate output spaces: the decoder on top of the figure outputs into the same visual space as the input; the other two decoders project into conceptual space, producing binary images containing, respectively, car entities and lane marking entities.

3 Implementation

Here we present the implementation of our model of artificial visual imagery, derived from the neurocognitive concepts just described. We implement the model as an artificial neural network with an encoder-decoder architecture, choosing Keras with the TensorFlow backend as the deep learning framework.

We describe our network as a semi-supervised variational autoencoder with multiple decoding branches. As Fig. 2 shows, the network is composed of a single encoder, which takes as input an RGB image and compresses the information down to a latent space of 128 neurons. Since the images fed to the network have dimensions of \(256\times 128\times 3\), the compression performed by the network is of nearly three orders of magnitude (98,304 values down to 128), a significant achievement compared to similar approaches [19], which limit the compression of the encoder to only one order of magnitude. The architecture of the encoder is defined by a stack of 4 convolutions followed by 2 dense layers.
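A minimal Keras sketch of such an encoder follows. The text above fixes only the input shape, the 4-convolution/2-dense structure, and the 128-dimensional latent space, so the filter counts, strides, and the 128 (rows) × 256 (columns) orientation of the input are our illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 128

# Hypothetical encoder: 4 strided convolutions + 2 dense layers.
# Input assumed as 128 rows x 256 columns x 3 RGB channels.
inputs = keras.Input(shape=(128, 256, 3))
x = inputs
for filters in (32, 64, 128, 256):  # each convolution halves the resolution
    x = layers.Conv2D(filters, 3, strides=2, padding="same",
                      activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)
# Parameters of the approximate posterior over the 128 latent neurons
z_mean = layers.Dense(LATENT_DIM, name="z_mean")(x)
z_log_var = layers.Dense(LATENT_DIM, name="z_log_var")(x)
encoder = keras.Model(inputs, [z_mean, z_log_var], name="encoder")
```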

The rest of the network is divided into three separate decoders. The input of each decoder is a tensor of 128 elements, and all decoders have an architecture symmetric to that of the encoder, with 2 dense layers and 4 stacked deconvolutions. What differs is the output space of each branch.
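Continuing the encoder sketch above, each decoder branch could be built as follows; again, the layer sizes are our assumptions, while the output spaces (3 RGB channels for the visual branch, a single mask channel for each conceptual branch) follow the description in the text:

```python
def make_decoder(out_channels, name):
    # Symmetric to the encoder: 2 dense layers, then 4 deconvolutions
    z = keras.Input(shape=(LATENT_DIM,))
    x = layers.Dense(512, activation="relu")(z)
    x = layers.Dense(8 * 16 * 256, activation="relu")(x)
    x = layers.Reshape((8, 16, 256))(x)
    for filters in (128, 64, 32):  # upsample back toward 128 x 256
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same",
                                   activation="relu")(x)
    # The 4th deconvolution emits the output map; sigmoid keeps values in [0, 1]
    x = layers.Conv2DTranspose(out_channels, 3, strides=2, padding="same",
                               activation="sigmoid")(x)
    return keras.Model(z, x, name=name)

visual_decoder = make_decoder(3, "visual_space")        # reconstructs the RGB input
car_decoder = make_decoder(1, "car_concept")            # binary mask of cars
lane_decoder = make_decoder(1, "lane_marking_concept")  # binary mask of lane markings
```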

Similarly to the hierarchical arrangement of CDZs in the brain, autoencoder-based models can be placed at a level depending on the distance covered by the processing path, from the lowest primary cortical areas to the output of the simulation. The first decoder, the one on top of Fig. 2, can be considered the lowest level: the processes that start from the raw image data and converge up to simple visual features. It is trained to reconstruct the same RGB image fed as input; therefore this “visual-space branch” makes up a standard variational autoencoder, which can be trained in a totally unsupervised manner.

At an intermediate level, the convergent processing path leads to representations that are no longer in terms of visual features, but rather in terms of “concepts”, where the local perceptual features are pruned, and neural activations code the nature of the entities in the environment that produced the stimuli [16]. In our model we considered two concepts only, those of cars and lane markings, which are the essential ones for the higher level, where the divergent path is in the format of action representations. This higher level is under development [17], and is not the focus of this paper.

Therefore, the output of the two “conceptual-space branches” of the network is a binary image in which white pixels belong to the concept in question (other cars or lane markings), while black pixels represent all the rest of the scene. This is not the case in a standard variational autoencoder, where the model output is trained as the reconstruction of the input. In our case, instead, the conceptual-space decoders are still trained together with the encoder using RGB images, because this should correspond to the sensorial input information. That is the reason why semi-supervised training is needed here: we give the network both the input RGB image and the corresponding target binary images for each concept.
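The overall semi-supervised setup could be assembled as in the following sketch, which continues the encoder/decoder sketches above; the Sampling layer is the standard reparameterization trick, while the optimizer, batch size, epochs, and the dataset variables x_rgb, y_car, y_lane are illustrative assumptions:

```python
import tensorflow as tf

class Sampling(layers.Layer):
    # Reparameterization trick: draw z ~ N(z_mean, exp(z_log_var))
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

z = Sampling()([z_mean, z_log_var])
model = keras.Model(inputs,
                    [visual_decoder(z), car_decoder(z), lane_decoder(z)])

# Shared KL term of Eq. (2), attached once to the whole model
kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
    1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
model.add_loss(kl)

# One RGB input, three targets: the image itself plus one binary mask per concept
model.compile(optimizer="adam", loss=["binary_crossentropy"] * 3)
model.fit(x_rgb, [x_rgb, y_car, y_lane], batch_size=32, epochs=50)
```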

The loss functions for the three branches are all derived from the basic Eq. (2). For the two “conceptual-space branches”, a variation is introduced to account for the imbalance between pixels that do not belong to either concept and pixels that do. Following [22], we weighted the second component in Eq. (2), the cross entropy \(\log p_{\varTheta}(\varvec{x}|\varvec{z})\), assigning the following coefficient to the true-value class:

$$\begin{aligned} P = \left(\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{i,j}\right)^{\frac{1}{k}} \end{aligned}$$
(3)

where \(N\) is the number of pixels in an image and \(M\) is the number of images in the training dataset, so that the quantity inside the parentheses is the ratio of true-value pixels over all the pixels in the dataset. The parameter \(k\) is used to smooth the effect of weighting by the probability of the ground truth; we empirically found \(k=4\) to be a valid value.
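A minimal sketch of Eq. (3), together with one plausible way of using the resulting coefficient: here the rare true-value class is upweighted by \(1/P\). The exact assignment of the coefficient follows [22], so the specific weighting scheme below should be read as an assumption rather than our verbatim implementation:

```python
import numpy as np
import tensorflow as tf

def class_coefficient(masks, k=4):
    # Eq. (3): k-th root of the ratio of true-value pixels in the whole dataset.
    # masks: array of shape (M, height, width) with binary ground-truth masks.
    return float(np.mean(masks) ** (1.0 / k))

def make_weighted_bce(p_coef):
    # Cross-entropy with the rare true-value class upweighted by 1/P.
    # NOTE: this assignment of the coefficient is an assumption; the paper
    # refers to [22] for the exact scheme.
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        pos = -(1.0 / p_coef) * y_true * tf.math.log(y_pred)
        neg = -(1.0 - y_true) * tf.math.log(1.0 - y_pred)
        return tf.reduce_mean(pos + neg)
    return loss

# Example: if lane-marking pixels are 1% of the dataset, P = 0.01 ** 0.25 ≈ 0.32
```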

Fig. 3. Samples from the SYNTHIA dataset. All images show the same frame of a driving sequence, but under different environmental conditions. Starting from the top left we have: fall, winter, spring; summer, dawn, sunset; night, winter night, fog; soft rain, rain, night rain.

Fig. 4. Results of our model for two driving sequences of the SYNTHIA dataset: city centre and freeway driving, each under 9 different environmental conditions. In the table, odd columns show the input frames and even columns show the outputs of our neural network. In the output images, the background is the result of the visual-space decoder; the output of the car conceptual-space decoder is highlighted in cyan, and the output of the lane marking conceptual-space decoder in yellow. (Color figure online)

4 Results

In our experiments for training and testing the presented model, we adopted the SYNTHIA dataset [18], a large collection of synthetic images representing various urban scenarios. The dataset is realized using the Unity game engine, and is composed of \(\sim\)100k frames of driving sequences recorded from a simulated camera on the windshield of the ego car. We found this dataset to be well suited for our experiment because, despite being generated in 3D computer graphics, it offers a wide variety of illumination and weather conditions, occasionally resulting in very adverse driving conditions. Each driving sequence is replicated under a set of different environmental conditions covering seasons, weather, and time of day. Figure 3 gives an example of the variety of data coming from the same frame of a driving sequence. Moreover, the urban environment is very diverse as well, ranging from freeways, tunnels, and congestion to a “New York-like” city and a “European” town – as the authors describe them. Overall, this dataset poses a worthy challenge for our variational autoencoder.

Figure 4 shows the results of our artificial CDZ model for a set of driving sequences. The images produced by the model are processed to show, at the same time, the results on conceptual space and on visual space. The colored overlays highlight the concepts computed by the network: the cyan regions are the output of the car divergent path, and the yellow overlays the output of the lane marking divergent path. These results show how the projection of the sensorial input (original frames) into the conceptual representation is very effective in identifying and preserving the salient features of cars and lane markings, despite the large variations in lighting and environmental conditions.

Table 1 displays the IoU (Intersection over Union) scores obtained by the network over the SYNTHIA dataset. The table shows how the task of recognizing the “car concept” generally yields better scores than the “lane marking concept”. An explanation of why the latter task is more difficult may be the very low ratio of pixels belonging to the lane marking class over the entire image. Nevertheless, the performance of the model is satisfactory, exhibiting the best accuracy on the highway driving sequences and in the sunniest lighting conditions (spring and summer sequences).
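For reference, a minimal sketch of how the IoU metric is typically computed for a single binary mask; the 0.5 threshold on the decoder output is an assumption:

```python
import numpy as np

def iou(pred, target, threshold=0.5):
    # Binarize the decoder output, then compute intersection over union
    pred = pred >= threshold
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union > 0 else 1.0
```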

To demonstrate the generative capabilities of our model, we examined the result of interpolating two latent-space representations. The images on the left and right of Fig. 5 are the two input images, while in the middle are the images generated from the interpolation of the compact latent representations of the inputs. Even in the case of very different input images, the interpolation generates novel and plausible scenarios, demonstrating the robustness of the learned latent representation.
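A sketch of this interpolation, reusing the hypothetical encoder and visual_decoder from the sketches in Sect. 3; the number of steps and the choice of interpolating the latent means rather than samples are our assumptions:

```python
import numpy as np

def interpolate(img_a, img_b, steps=6):
    # Encode both images and keep the latent means (z_mean) as their codes
    z_a, _ = encoder.predict(img_a[None])
    z_b, _ = encoder.predict(img_b[None])
    frames = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b  # linear path in latent space
        frames.append(visual_decoder.predict(z)[0])
    return frames
```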

Lastly, we would like to stress again that the purpose of our network is not the mere segmentation of visual input. The segmentation task is to be considered a support task, used to force the network to learn a more robust latent-space representation, one that now explicitly takes into consideration two of the concepts fundamental to the driving task.

Table 1. IoU scores over the SYNTHIA dataset, grouped by the 5 different driving sequences of the dataset (top) and by 9 different environmental and lighting conditions (bottom). The results are given for the two “concepts” of cars and lane markings, and for their joint mean.
Fig. 5. Results of interpolation between latent-space representations. The images at the extreme left and right are the inputs; the others are obtained by interpolating the two latent representations of the input images.

5 Conclusions

We presented an artificial neural network inspired by the neuroscientific foundations of mental imagery, the main form of simulation grounding sensorimotor learning. Specifically, we addressed two theories: the convergence-divergence zones proposed by Meyer and Damasio, and the free-energy minimization principle put forward by Friston. We identified in the variational autoencoder the artificial mechanism closest to these two neuroscientific concepts. In the domain of autonomous driving, we implemented the network as a CDZ, at a level of immediate perception and at a level of intermediate concepts, those of cars and lane markings. The proposed model has been evaluated on the SYNTHIA dataset, showing reliable results over a wide range of driving and illumination conditions. This model is a component of the Dreams4Cars project, sitting immediately below a higher-level model, also based on an autoencoder as CDZ, which computes motor commands from the conceptual representation of the environment presented in this work.