Abstract
In the last decades, research in autonomous vehicles has greatly advanced thanks to the success of artificial neural models. Yet, self-driving cars are far from reaching human performance. It is our opinion that it would be wise to reflect on why the human brain is so effective in learning tasks as complex as driving, and to draw inspiration from it when designing new artificial driving agents. To this aim, we consider two relevant and related neurocognitive theories: the convergence-divergence zones (CDZs) mechanism of mental simulation, and the predictive brain theory. We then propose an implementation of a semi-supervised variational autoencoder for visual perception, with an architecture that closely approximates those two neurocognitive theories.
1 Introduction
In recent years, the kind of artificial neural networks (ANNs) known as deep learning [7, 20] has revolutionized the field of computer vision, with unprecedented results [12, 23, 24]. One of the application domains that has definitely benefited from the rise of deep learning is that of autonomous vehicles [1, 21]. Despite this great progress, autonomous driving is still an unsolved problem; a major challenge for image processing is to achieve an integration with motor commands reliable enough for an acceptable level of safety.
Contrary to common belief, humans are very reliable at driving: in the US there is about one fatality per 100,000,000 miles driven. Such considerations lead us to reflect on why the human brain is so efficient at solving the driving task, and on whether it is possible to take inspiration from the mechanisms whereby the brain learns to perform such a complex task. This is the aim of the European project Dreams4Cars, in which we are developing an artificial driving agent inspired by the neurocognition of human driving; for further details refer to [2]. The work presented here is a component of the Dreams4Cars project, addressing the visual information collected by a camera on a vehicle.
Artificial neural networks are not a faithful model of how the brain works just because their basic computational entities are named “neurons”, as is often supposed. However, in deep convolutional neural networks [12], there is some resemblance between the alternating convolutional and pooling layers and the composition of simple and complex cells found in the visual cortex [8]. Still, CNNs adhere to a neat division between the visual process and other cognitive tasks, which is clearly a critical departure from the behavior of living agents, including driving. Our effort is to leverage the most established current neurocognitive theories on how the brain develops the ability to drive, in order to derive the neural network architecture presented here.
2 Simulation, Imagery, and Their Artificial Counterpart
The ability to drive is just one of many highly specialized human sensorimotor behaviors. What is remarkable in humans (and, in part, in other mammals) is the aptitude for learning new motor skills without any innate scheme, a capability that involves sophisticated computational mechanisms [5, 27]. In principle, ANN models are among the most appropriate artificial tools for replicating this ability, being grounded in a strong empiricist paradigm of cognition [13]. However, for turning this general principle into workable models, many details need to be unfolded.
2.1 Simulation Theory and Convergence-Divergence Zones
A first step can be taken by adopting the proposal of Jeannerod and Hesslow, the so-called simulation theory of cognition, which holds that thinking is essentially a simulated interaction with the environment [6, 9]. In Hesslow's view, simulation is a general principle of cognition, manifested in at least three different components: perception, action, and anticipation. Perception can be simulated by internal activation of the sensory cortex in a way that resembles its normal activation during perception of external stimuli. Actions can be simulated by activating motor structures, as during normal behavior, while suppressing their actual execution. The simplest case of simulation is mental imagery, especially in the visual modality. This is the case, for example, when a person tries to picture an object or a situation. During this phenomenon, the primary visual cortex (V1) is activated with a simplified representation of the object of interest, but the visual stimulus is not actually perceived [15].
A second step is to identify how, at the neural level, simulation can take place. A prominent proposal in this direction has been formulated in terms of convergence-divergence zones (CDZs) [14]. The primary purpose of “convergence” is to record, by means of synaptic plasticity, which patterns of features – coded as knowledge fragments in the early cortices – occur in relation to a specific concept. Such records are built through experience, by interacting with objects. A requirement for convergence zones is the ability to reciprocate feedforward projections with feedback projections in a one-to-many fashion – the “divergence” path. The convergent flow is dominant during perceptual recognition, while the divergent flow dominates imagery. Convergent-divergent connectivity patterns can be identified for specific sensory modalities, but also in higher-order association cortices, as shown in the hierarchical structure in Fig. 1.
2.2 The Predictive Theory
The reason why cognition, according to Hesslow or Jeannerod, is mainly explicated as simulation is that through simulation the brain can obtain the information most precious to an organism: a prediction of the future state of affairs in the environment. The need for prediction, and how it molds the entire cognition, have become the core of a different but related theory which has gained wide attention in the last decade, made popular under the terms “predictive brain” or “free-energy principle for the brain”. The leading figure of this theory is Karl Friston [3, 4], who argues that the behavior of the brain, and of an organism as a whole, can be conceived as minimization of free energy. This concept originated in thermodynamics, as a measure of the amount of work that can be extracted from a system. What Friston borrows is not the thermodynamic meaning of free energy, but its mathematical form, deriving from the framework of variational Bayesian methods in statistical physics [26]. This basic framework is adapted by Friston to abstract entities of cognitive value; for example, this is his free-energy formulation in the case of perception [4, p. 427]:
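In the notation introduced below, this formulation can be written (as a reconstruction from the surrounding description; see [4] for the exact expression) as:

$$F_P(\varvec{x}, \varvec{a}; \varvec{z}) \;=\; \varDelta _{\mathrm {KL}}\!\left(\check{p}(\varvec{c}|\varvec{z}) \,\Vert \, p(\varvec{c}|\varvec{x},\varvec{a})\right) \qquad \mathrm{(1)}$$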
where \(\varvec{x}\) is the sensorial input of the organism, \(\varvec{c}\) is the collection of the environmental causes producing \(\varvec{x}\), \(\varvec{a}\) are actions that act on the environment to change sensory samples, and \(\varvec{z}\) are inner representations of the brain. The quantity \(\check{p}(\varvec{c}|\varvec{z})\) is the encoding in the brain of the estimate of causes of sensorial stimuli. The difference between this encoding and the distribution \(p(\varvec{c}|\varvec{x},\varvec{a})\) in the environment is computed by the Kullback–Leibler divergence \(\varDelta _{\mathrm {KL}}\) [26]. The minimization of \(F_P\) in Eq. (1) optimizes \(\varvec{z}\).
2.3 Autoencoder-Based CDZs and Free-Energy Models
The CDZ hypothesis has over the years found support in a large body of neurocognitive and neurophysiological evidence; however, it is a purely descriptive model. In our opinion, a computational idea that bears significant similarities with the CDZ scheme is the autoencoder. Autoencoder architectures have been the cornerstone of the evolution from shallow to deep neural architectures [7, 25], and were later exploited for capturing compact information from visual inputs [11]. In this kind of model, the task to be solved by the network is to reproduce as output the same picture fed as input. The advantage is that while learning to reconstruct the input image, the model develops a very compact internal representation of the visual scene. Models able to learn such representations are closely connected with the cognitive activity of mental imagery.
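As a minimal illustration of this reconstruction objective (a toy sketch, not the network used in this work), even a linear autoencoder trained by gradient descent develops a compact internal code for data lying near a low-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "images" of 16 pixels lying in a 2-dimensional
# subspace, so a 2-unit bottleneck suffices to reconstruct them.
basis = rng.normal(size=(2, 16))
data = rng.normal(size=(200, 2)) @ basis

# Linear autoencoder: encoder W_e (16 -> 2), decoder W_d (2 -> 16)
W_e = rng.normal(scale=0.1, size=(16, 2))
W_d = rng.normal(scale=0.1, size=(2, 16))

def loss(X):
    # Mean squared reconstruction error: output vs. input
    return np.mean((X @ W_e @ W_d - X) ** 2)

initial = loss(data)
lr = 0.1
for _ in range(1000):
    Z = data @ W_e                       # compact internal code
    R = Z @ W_d                          # reconstruction of the input
    G = 2.0 * (R - data) / data.size     # dLoss/dR
    W_d -= lr * Z.T @ G                  # gradient step on the decoder
    W_e -= lr * data.T @ (G @ W_d.T)     # gradient step on the encoder
final = loss(data)
print(initial, final)
```

The bottleneck `Z` plays the role of the compact internal representation discussed above.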
A remarkable improvement over the original autoencoders is the variational autoencoder [10], where the internal representation is cast in probabilistic terms, adopting the variational Bayesian framework [26]. The encoder part provides an approximate distribution \(\check{p}_\varPhi (\varvec{z}|\varvec{x})\) of the unknown \(\varvec{z}\), depending on the set of parameters \(\varPhi \) of the encoder. The decoder part has its own set of parameters \(\varTheta \), and from a fixed internal representation \(\varvec{z_0}\) produces an output \(\varvec{y}=d_{\varTheta }(\varvec{z_0})\). The typical loss function for a variational autoencoder with parameters \(\varPhi \) and \(\varTheta \) can be written as:
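In the notation just introduced, this loss takes the form (a reconstruction consistent with the description that follows; see [10] for the exact derivation):

$$F(\varPhi ,\varTheta ; \varvec{x}) \;=\; \varDelta _{\mathrm {KL}}\!\left(\check{p}_\varPhi (\varvec{z}|\varvec{x}) \,\Vert \, p(\varvec{z})\right) \;-\; \mathbb {E}_{\varvec{z}\sim \check{p}_\varPhi (\varvec{z}|\varvec{x})}\!\left[\log {p_{\varTheta }(\varvec{x}|\varvec{z})}\right] \qquad \mathrm{(2)}$$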
where on the right-hand side of the equation the first term is the Kullback–Leibler divergence between the approximate distribution of \(\varvec{z}\) produced by the encoder and the prior distribution \(p(\varvec{z})\), while the second term is the element-wise likelihood of the decoder generating as output the same input data \(\varvec{x}\). It can easily be seen that Eq. (2) has exactly the same form as Friston's “free-energy”, shown in Eq. (1); therefore variational autoencoders capture both the CDZ scheme and the idea of predicting by minimization of free energy.
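To make the two terms of Eq. (2) concrete, the following NumPy sketch computes a single-sample estimate of the loss under the standard choices of [10], a diagonal Gaussian encoder with a standard normal prior and a Bernoulli decoder; the function and variable names are illustrative only:

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Single-sample estimate of the variational autoencoder loss of
    Eq. (2): closed-form KL divergence between the encoder's diagonal
    Gaussian N(mu, diag(exp(log_var))) and the prior N(0, I), plus the
    element-wise Bernoulli negative log-likelihood of the decoder."""
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # Binary cross-entropy reconstruction term (decoder likelihood)
    eps = 1e-7
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    return kl + recon
```

With `mu = 0` and `log_var = 0` the KL term vanishes and only the reconstruction term remains, matching the intuition that a perfectly prior-matching encoder pays no divergence cost.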
3 Implementation
Here we present the implementation of our model of artificial visual imagery, derived from the neurocognitive concepts just described. We implement the model as an artificial neural network with an encoder-decoder architecture, choosing Keras with a TensorFlow backend as the deep learning framework.
We describe our network as a semi-supervised variational autoencoder with multiple decoding branches. As Fig. 2 shows, the network is composed of a single encoder, which takes as input an RGB image and compresses the information down to a latent space of 128 neurons. Since the images fed to the network have dimension \(256\times 128\times 3\), the compression performed by the network is of almost three orders of magnitude (a factor of 768), a significant achievement compared to similar approaches [19], which limit the compression of the encoder to only one order of magnitude. The architecture of the encoder is a stack of 4 convolutions followed by 2 dense layers.
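The compression factor, and the spatial grid reaching the dense layers, can be checked with a few lines of arithmetic (the four stride-2 convolutions assumed below are an illustrative reading of the architecture, not the exact hyperparameters):

```python
h, w, c = 128, 256, 3      # input image: 256 x 128 RGB
latent = 128               # latent space: 128 neurons

input_size = h * w * c     # total number of input values
ratio = input_size / latent
print(input_size, ratio)   # 98304 768.0

# Spatial grid after four stride-2 convolutions (assumed strides):
for _ in range(4):
    h, w = h // 2, w // 2
print(h, w)                # 8 16
```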
The rest of the network is divided into three separate decoders. The input of each decoder is a tensor of 128 values, and all decoders have an architecture symmetric to the encoder, with 2 dense layers and 4 stacked deconvolutions. What differs is the output space of each branch.
Similarly to the hierarchical arrangement of CDZs in the brain, autoencoder-based models can be placed at a level depending on the distance covered by the processing path, from the lowest primary cortical areas to the output of the simulation. The first decoder, the one at the top of Fig. 2, can be considered the counterpart of the lowest-level processes, which start from the raw image data and converge up to simple visual features. It is trained to reconstruct the same RGB image fed as input; therefore this “visual-space branch” makes up a standard variational autoencoder, which can be trained in a totally unsupervised manner.
At an intermediate level, the convergent processing path leads to representations that are no longer in terms of visual features but rather in terms of “concepts”, where the local perceptual features are pruned, and neural activations code the nature of the entities in the environment that produced the stimuli [16]. In our model we considered only two concepts, cars and lane markings, those essential for the higher level, where the divergent path is in the format of action representations. This higher level is under development [17], and is not the focus of this paper.
Therefore, the output of the two “conceptual-space branches” of the network is a binary image in which white pixels belong to the concept in question (other cars or lane markings), while black pixels represent all the rest of the scene. This is not the case in a standard variational autoencoder, where the model output is trained as the reconstruction of the input. In our case, instead, the conceptual-space decoders are still trained together with the encoder using RGB images, because this should correspond to the sensorial input information. That is why semi-supervised training is needed here: we give the network both the input RGB image and the corresponding target binary image for each concept.
The loss functions for the three branches are all derived from the basic Eq. (2). For the two “conceptual-space branches” a variation is introduced to account for the imbalance between pixels that do not belong to either concept and pixels that do. We weighted the second component in Eq. (2), the cross entropy \(\log {p_{\varTheta }(\varvec{x}|\varvec{z})}\), following [22], assigning the following coefficient to the true-value class:
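One plausible form of the coefficient, consistent with the variables defined below, is an inverse-frequency weight smoothed by the exponent 1/k (a reconstruction; the exact expression may differ):

$$w = \left(\frac{1}{P}\right)^{1/k}, \qquad P = \frac{\text{number of true-value pixels in the dataset}}{N\,M} \qquad \mathrm{(3)}$$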
where N is the number of pixels in an image, M is the number of images in the training dataset, and P is the ratio of true-value pixels over all the pixels in the dataset. The parameter k smooths the effect of weighting by the probability of the ground truth; a value of 4 was empirically found to work well.
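The weighted cross-entropy term can be sketched as follows; the \((1/P)^{1/k}\) weight is one plausible reading of the description above (the exact formula may differ), and the function name is illustrative:

```python
import numpy as np

def weighted_bce(y_true, y_pred, P, k=4.0, eps=1e-7):
    """Pixel-wise binary cross-entropy with the true (foreground)
    class weighted by an inverse-frequency factor smoothed by 1/k.
    P is the ratio of true-value pixels over all pixels in the
    training dataset; the (1/P)**(1/k) weight is an assumed form."""
    w = (1.0 / P) ** (1.0 / k)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_pixel = -(w * y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_pixel.mean()
```

With `P = 1` the weight reduces to 1 and the loss is plain cross-entropy; the rarer the foreground class, the more its misclassification costs.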
4 Results
In our experiments for training and testing the presented model, we adopted the SYNTHIA dataset [18], a large collection of synthetic images representing various urban scenarios. The dataset is realized using the game engine Unity, and is composed of \(\sim \)100k frames of driving sequences recorded from a simulated camera on the windshield of the ego car. We found this dataset to be well suited for our experiments because, despite being generated in 3D computer graphics, it offers a wide variety of illumination and weather conditions, occasionally resulting in very adverse driving conditions. Each driving sequence is replicated under a set of different environmental conditions, including season, weather, and time of day. Figure 3 gives an example of the variety of data coming from the same frame of a driving sequence. Moreover, the urban environment is very diverse as well, ranging from driving on freeways, through tunnels, in congestion, and in “New York-like” city and “European” town settings, as its authors describe. Overall, this dataset poses a suitable challenge for our variational autoencoder.
Figure 4 shows the results of our artificial CDZ model for a set of driving sequences. The images produced by the model are processed to show at the same time the results in conceptual space and visual space. The colored overlays highlight the concepts computed by the network: the cyan regions are the output of the car divergent path, and the yellow overlays are the output of the lane-marker divergent path. These results nicely show how the projection of the sensorial input (original frames) into the conceptual representation is very effective in identifying and preserving the salient features of cars and lane markings, despite the large variations in lighting and environmental conditions.
Table 1 displays the IoU (Intersection over Union) scores obtained by the network over the SYNTHIA dataset. The table shows that the task of recognizing the “car concept” generally yields better scores than the “lane marking concept”. An explanation of why the latter task is more difficult may be the very low ratio of pixels belonging to the lane-marking class over the entire image size. Overall, the performance of the model is satisfactory, exhibiting the best accuracy in driving sequences on highways and in the sunniest lighting conditions (spring and summer sequences).
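For reference, the IoU score is computed per concept as the intersection over the union of the predicted and ground-truth binary masks; a minimal sketch:

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, target).sum() / union
```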
To demonstrate the generative capabilities of our model, we examined the result of interpolating two latent-space representations. The images on the left and right of Fig. 5 are the two input images, while in the middle are the images generated from the interpolation of the compact latent representations of the inputs. Even in the case of very different input images, the interpolation generates novel and plausible scenarios, demonstrating the robustness of the learned latent representation.
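The interpolation itself is straightforward once the encoder has produced the two latent codes; a sketch of the linear version (linear interpolation is assumed here; spherical interpolation is a common alternative for Gaussian latent spaces):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=5):
    """Linearly interpolate between two latent codes; each intermediate
    code would then be fed to the decoder to generate a novel frame."""
    return [(1.0 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]
```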
Lastly, we would like to stress again that the purpose of our network is not mere segmentation of the visual input. The segmentation task is to be considered a support task, used to encourage the network to learn a more robust latent-space representation, which now explicitly takes into account two of the concepts fundamental to the driving task.
5 Conclusions
We presented an artificial neural network inspired by the neuroscientific foundations of mental imagery, the main form of simulation grounding sensorimotor learning. Specifically, we addressed the two theories of convergence-divergence zones, proposed by Meyer and Damasio, and of free-energy minimization, proposed by Friston. We identified in the variational autoencoder the artificial mechanism closest to these two neuroscientific concepts. In the domain of autonomous driving, we implemented the network as a CDZ, at the level of immediate perception and at the level of the intermediate concepts of cars and lane markers. The proposed model has been evaluated on the SYNTHIA dataset, showing reliable results over a wide range of driving and illumination conditions. This model is a component of the Dreams4Cars project, sitting immediately below a higher-level model, also based on an autoencoder as CDZ, that computes motor commands from the conceptual representation of the environment presented in this work.
References
Bojarski, M., et al.: Explaining how a deep neural network trained with end-to-end learning steers a car. CoRR abs/1704.07911 (2017)
Da Lio, M., Plebe, A., Bortoluzzi, D., Rosati Papini, G.P., Donà, R.: A system for human-like driving learning. In: Proceedings of the 25th Intelligent Transport Systems World Congress (2018)
Friston, K., Fitzgerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G.: Active inference: a process theory. Neural Comput. 29, 1–49 (2017)
Friston, K., Stephan, K.E.: Free-energy and the brain. Synthese 159, 417–458 (2007)
Grillner, S., Wallén, P.: Innate versus learned movements–a false dichotomy. Prog. Brain Res. 143, 1–12 (2004)
Hesslow, G.: The current status of the simulation theory of cognition. Brain Res. 1428, 71–79 (2012)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006)
Hubel, D., Wiesel, T.: Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195, 215–243 (1968)
Jeannerod, M.: Neural simulation of action: a unifying mechanism for motor cognition. NeuroImage 14, S103–S109 (2001)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Proceedings of International Conference on Learning Representations (2014)
Krizhevsky, A., Hinton, G.E.: Using very deep autoencoders for content-based image retrieval. In: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 489–494 (2011)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1090–1098 (2012)
Mareschal, D., Johnson, M.H., Sirois, S., Spratling, M.S., Thomas, M.S.C., Westermann, G. (eds.): Neuroconstructivism: How the Brain Constructs Cognition, vol. I. Oxford University Press, Oxford (2007)
Meyer, K., Damasio, A.: Convergence and divergence in a neural architecture for recognition and memory. Trends Neurosci. 32, 376–382 (2009)
Moulton, S.T., Kosslyn, S.M.: Imagining predictions: mental imagery as mental emulation. Philos. Trans. R. Soc. B 364, 1273–1280 (2009)
Olier, J.S., Barakova, E., Regazzoni, C., Rauterberg, M.: Re-framing the characteristics of concepts and their relation to learning and cognition in artificial agents. Cogn. Syst. Res. 44, 50–68 (2017)
Plebe, A., Da Lio, M., Bortoluzzi, D.: On reliable neural network sensorimotor control in autonomous vehicles. IEEE Trans. Intell. Transp. Syst. early access, 1–12 (2019)
Ros, G., Vazquez, D., Sellart, L., Materzynska, J., Lopez, A.M.: The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3234–3243 (2016)
Santana, E., Hotz, G.: Learning a driving simulator. CoRR abs/1608.01230 (2016)
Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
Schwarting, W., Alonso-Mora, J., Rus, D.: Planning and decision-making for autonomous vehicles. Annu. Rev. Control Rob. Auton. Syst. 1, 8:1–8:24 (2018)
Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Jorge Cardoso, M.: Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Cardoso, M., et al. (eds.) DLMIA 2017, ML-CDS 2017. LNCS, vol. 10553, pp. 240–248. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67558-9_28
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
VanRullen, R.: Perception science in the age of deep neural networks. Front. Psychol. 8, 142 (2017)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
Šmídl, V., Quinn, A.: The Variational Bayes Method in Signal Processing. Springer, Berlin (2005). https://doi.org/10.1007/3-540-28820-1
Wolpert, D.M., Diedrichsen, J., Flanagan, R.: Principles of sensorimotor learning. Nat. Rev. Neurosci. 12, 739–751 (2011)
Acknowledgements
This work was developed inside the EU Horizon 2020 Dreams4Cars Research and Innovation Action project, supported by the European Commission under Grant 731593.
Plebe, A., Da Lio, M. (2019). Variational Autoencoder Inspired by Brain’s Convergence–Divergence Zones for Autonomous Driving Application. In: Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., Sebe, N. (eds) Image Analysis and Processing – ICIAP 2019. ICIAP 2019. Lecture Notes in Computer Science(), vol 11751. Springer, Cham. https://doi.org/10.1007/978-3-030-30642-7_33