
1 Introduction

To this day, nobody doubts the potential of Deep Learning for addressing Artificial Intelligence (AI) challenges. The Computer Vision field in particular has experienced a revolution in which Deep Learning models have substantially outperformed the previous state of the art, not only in image classification and detection but also in other domains such as image processing, 3D modelling or Natural Language Processing. Despite this success, an important fraction of the community has strongly criticised the inability to provide a clear explanation of how CNNs work internally. Important research has been conducted on the visualisation of filters and activations [2, 10, 14]. These works provide tools that enable better diagnosis of issues and identification of failure modes. However, we are still far from a good understanding of neural networks, especially during training.

Apart from high-level metrics such as accuracy or the cross-entropy loss, we know very little about how the filters and classifiers of the model evolve during the course of training. For instance, it would be helpful to know how the image features are distributed in the latent space, or whether the parameters have been initialised to locations that ensure good convergence. By better understanding the behaviour of the model and how it evolves during training, we can expect to design better training strategies that improve accuracy.

Latent Features and Loss Strategies. To extend the understanding of CNNs it is common to divide the network into two blocks: the block in charge of extracting features of interest from the input image, whose output is known as the embedding or encoding, and the classifier block that receives the embedding and predicts the correct class. A well-trained network is expected to generate similar embeddings for images that belong to the same class and dissimilar embeddings for images of different classes. Generating representative and discriminative features is key to achieving good accuracy in image classification and image retrieval tasks. Training a CNN for image classification with the standard cross-entropy loss does not ensure good separability of classes in the embedding space [6, 7, 12]. A common approach is to train the embedding space directly using pairs of images [11] or triplets [9]. These losses tend to produce more discriminative embeddings than the standard cross-entropy loss. However, since the number of possible pairs/triplets explodes with the size of the dataset, these methods require a non-trivial mining process to generate the pairs or triplets of interest for training. A popular approach that avoids such mining is the center loss [13], yet it requires extra computation to re-calculate the class centres and intra-class distances at every iteration. Alternatively, some researchers propose variations of the cross-entropy loss that aim at reducing the intra-class distance and increasing the inter-class distance. The work of Liu et al. [6] adds a margin to the angle between the embedding vector and the correct classifier, much as the Hinge loss enforces maximum margins between embeddings and the classifier's boundaries. Close to that approach, Wan and coworkers [12] write the softmax loss as a cosine loss by renormalising the \(\ell ^2\)-norms of feature vectors and weights, again introducing a margin to maximise the separability between classes. What is common to all of these proposals and to our work is the geometric formalism used to describe the latent space. Furthermore, Ranjan et al. [7] noted that the \(\ell ^2\)-norm of a feature vector is a good indicator of how representative the image is of its class. They proposed the Crystal loss, which computes the cross-entropy loss over features constrained to the same norm. Another interesting approach was proposed by Wan et al. [12], where the embedding space is modelled as a mixture of Gaussian distributions and the loss function aims at increasing the probability of each instance under its class distribution.

Parameters Initialisation. A significant amount of work has been conducted on the initialisation of parameters and how it can help mitigate exploding or vanishing gradients, as well as avoid slow convergence [3, 5]. Ayinde et al. [1] studied how the initialisation method affects the number of redundant filters learned. These well-known techniques have a random component that makes each training run start from a different configuration, and therefore likely to reach a different training state. It is thus worth studying in more depth the variability across training repetitions and how this variability can be reduced.

Using the previous research as the seed of our study, we investigate how the backbone of the network and the final classification layer evolve during training. From a geometrical point of view, we treat the classifier's weights as vectors that live in the embedding space. This perspective allows us to focus on the geometrical evolution of both the classifier weight vectors and the image embeddings. We conduct a series of ablation studies to better understand the interplay between these vectors. Moreover, we explore the variability across initialisations in unbalanced datasets with a long-tail shape. Finally, we propose a novel initialisation of the classifier vectors based on the distribution of the train set; this method reduces the variability across initialisations by 12% on a long-tail version of MNIST. The paper is arranged as follows: Sect. 2 introduces the geometrical approach of this study. Sect. 3 identifies issues associated with standard training techniques and presents a method that mitigates them through a guided initialisation of the last-layer vectors. Lastly, Sect. 4 presents the conclusions and future work.

Fig. 1. (a) Schematic diagram of a convolutional neural network. The network is mainly composed of two parts: the backbone block and the classifier block. (b) Example of an embedded space for a subset of the MNIST dataset. (Color figure online)

2 Background

Convolutional Neural Networks (CNNs) can be divided into two blocks, as shown in Fig. 1a, namely the feature extraction block and the classification block:

  • The feature extraction block receives an input image and applies a series of convolutions and pooling operations with the goal of identifying discriminative features. The output of this block is a one-dimensional vector referred to as the image encoding or embedding. If this block is well trained, we expect encodings from the same class to be close together. Figure 1b depicts image encodings from a subset of the MNIST train set; note that we use 2D vectors for visualisation purposes. Each point in the plot represents a different image, and its colour corresponds to the ground-truth label. These vectors are the input to the classification block.

  • The classification block is a linear classifier followed by the softmax function. Although this block can in general be composed of several dense layers, throughout the paper we refer to the last layer of the CNN as the classification block. This layer calculates the class probabilities for the input image. Its linear classifier performs the linear transformation given by

    $$\begin{aligned} z_c = \sum _{j=1}^N W_{c,j} \cdot x_j + b_c, \end{aligned}$$
    (1)

    where \({\varvec{W}} \in \mathbb {R}^{C \times N}\) is the classifier weight matrix, with C the number of classes and N the size of the image encoding, \({\varvec{b}} \in \mathbb {R}^C\) is the bias term, \({\varvec{x}} \in \mathbb {R}^{N}\) is the image embedding, i.e. the outcome of the feature extraction block, and \({\varvec{z}} \in \mathbb {R}^{C}\) is the class prediction vector. We note that the bias term \({\varvec{b}}\) is omitted throughout for simplicity. The block also applies the softmax function, a non-linear transformation that produces a probability distribution across all classes. This function \(f({\varvec{z}})\) is defined as

    $$\begin{aligned} \left[ f({\varvec{z}})\right] _c = \frac{\exp {\left( z_c\right) }}{\sum \nolimits _{c'=1}^C \exp {\left( z_{c'}\right) }}, \end{aligned}$$
    (2)

    where \(\left[ f({\varvec{z}})\right] _c\) is the probability of the \(c^{\text {th}}\) class. Note that the performance of the classifier block is highly dependent on the quality of the features: the classifier benefits from well separated class-wise features. A minimal sketch of this block is given after the list.
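As an illustration, the following minimal sketch implements the classification block of Eqs. (1) and (2) as a bias-free linear layer followed by the softmax. The use of PyTorch and the variable names are our own assumptions; the paper does not specify an implementation.

import torch
import torch.nn.functional as F

N, C = 2, 10                                    # embedding size and number of classes, as used in the paper
classifier = torch.nn.Linear(N, C, bias=False)  # the rows of classifier.weight are the vectors W_c

x = torch.randn(1, N)                           # illustrative image embedding produced by the backbone
z = classifier(x)                               # logits z_c = sum_j W_{c,j} x_j   (Eq. 1, bias omitted)
probs = F.softmax(z, dim=1)                     # class probabilities [f(z)]_c     (Eq. 2)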

During training the network optimises a loss function through back-propagation and gradient descent. One of the most common objective functions is the cross-entropy loss, which measures the difference between the predicted distribution \(f({\varvec{x}})\) and the target distribution \(p({\varvec{x}})\), i.e. the one constructed from the ground truth. For a given instance i the cross-entropy is expressed as follows:

$$\begin{aligned} \mathcal {L}_i=- \sum _{c=1}^C \left[ p({\varvec{x}}^{(i)})\right] _c \log {\left[ f({\varvec{x}}^{(i)})\right] _c}, \end{aligned}$$
(3)

where C is the total number of classes.
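With a one-hot target distribution p, Eq. (3) reduces to the negative log-probability of the ground-truth class. The hedged sketch below, with illustrative values, shows this equivalence with the standard cross-entropy-from-logits routine.

import torch
import torch.nn.functional as F

z = torch.randn(1, 10)                         # logits for one instance, C = 10
target = torch.tensor([3])                     # ground-truth class index
loss = F.cross_entropy(z, target)              # -log [f(z)]_target, i.e. Eq. (3) with a one-hot p
manual = -F.log_softmax(z, dim=1)[0, target]   # the same quantity computed explicitly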

2.1 Geometric Interpretation

We can express Eq. (3) using the geometric notation of the dot product as follows

$$\begin{aligned} \mathcal {L}_i = - \sum _{c=1}^C \left[ p({\varvec{x}}^{(i)})\right] _c \log \left[ \frac{\exp \left( \Vert {\varvec{W}}_{c}\Vert \cdot \Vert {\varvec{x}}^{(i)}\Vert \cos \theta _{c}^{(i)} \right) }{\sum \nolimits _{c'=1}^C \exp \left( \Vert {\varvec{W}}_{c'}\Vert \cdot \Vert {\varvec{x}}^{(i)}\Vert \cos \theta _{c'}^{(i)} \right) } \right] , \end{aligned}$$
(4)

where \(\theta _{c}^{(i)}\) is the angle between the image encoding \({\varvec{x}}^{(i)}\) and the classifier vector \({\varvec{W}}_{c}\). Considering Eq. (4), there are two pathways to reduce the loss, corresponding to the two blocks of the network: updating the parameters of the backbone, i.e. the feature extraction block, or updating the parameters of the classifier.

  • Backbone update: This entails updating the parameters of the convolutional filters of the network, which eventually leads to different encoding vectors. From the geometric perspective, the network can reduce the loss by decreasing the angle \(\theta _c^{(i)}\) for the correct class c, i.e. by moving the encoding closer to its classifier vector.

  • Classifier update: Updating the classifier block entails updating the classifier vectors. The training process can increase \(\Vert {\varvec{W}}_{c}\Vert \) for the correct class and/or change the direction of this vector so that the angle \(\theta _c^{(i)}\) of the correct class is reduced. Likewise, it can reduce the norms of the remaining classifier vectors and/or increase their angles with the encoding by moving their directions away from it. A numerical sketch of this decomposition is given after this list.
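The decomposition of Eq. (4) can be inspected numerically. The sketch below, whose function name and tensor shapes are our own rather than the paper's, recovers the norms and the angles \(\theta _c^{(i)}\) from a batch of embeddings and the classifier matrix. Both update pathways act on this product: the backbone changes the encoding norms and the cosines, while the classifier changes its own norms and the cosines.

import torch
import torch.nn.functional as F

def logit_decomposition(W, x):
    # W: (C, N) classifier matrix, x: (B, N) batch of image embeddings
    w_norm = W.norm(dim=1)                                    # ||W_c||
    x_norm = x.norm(dim=1)                                    # ||x^(i)||
    cos = F.normalize(x, dim=1) @ F.normalize(W, dim=1).t()   # (B, C) cosines of theta_c^(i)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))        # angles theta_c^(i)
    logits = x_norm[:, None] * w_norm[None, :] * cos          # equals x @ W.t(), as in Eq. (4)
    return logits, theta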

3 Experiments and Results

In the following experiments we study the interplay between the image encodings and the classifier layer during training. In particular, we explore these dynamics on balanced and unbalanced datasets. We use the ResNet-101 architecture [4] as backbone, with an embedding of length 2, a batch size of 512, and an initial learning rate of \(5 \times 10^{-4}\) that is divided by 5 at epochs 15 and 80. We use a weight decay of \(5\times 10^{-3}\), the ADAM optimiser and Xavier initialisation of the parameters.
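A sketch of this configuration, assuming PyTorch and the torchvision ResNet-101 (the paper does not state its implementation, and the MNIST input-channel and data-loading details are omitted), could look as follows.

import torch
import torchvision

model = torchvision.models.resnet101(num_classes=2)    # backbone ending in a 2-D embedding
classifier = torch.nn.Linear(2, 10, bias=False)         # last-layer classifier vectors W

for m in list(model.modules()) + [classifier]:           # Xavier initialisation of the parameters
    if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.xavier_uniform_(m.weight)

params = list(model.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=5e-4, weight_decay=5e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 80], gamma=0.2)  # lr divided by 5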

3.1 Balanced Dataset

Using the previous configuration we conduct a standard training and visualise the evolution of the classifier vectors and image embeddings as training progresses. Figure 2a shows the state at the end of a training run with an accuracy of 0.98. Image embeddings from the same class group together, creating clusters that fall in their corresponding class region. Moreover, the classifier vectors span the angular domain, similarly to the hands of a clock. A closer look reveals that the classifier vectors do not traverse their clusters. This means that, although the accuracy is high, the dot product in the numerator of Eq. (4) is far from being optimally maximised.

Fig. 2. Embedded space representation for a CNN trained on MNIST. The dots and arrows represent the image encodings and the classifier vectors respectively, coloured by their ground-truth class. The coloured regions represent the locations in which the network predicts a particular class. (a) The embedded space for a training with no restrictions imposed; (b) the resulting space when the norms of the classifier vectors are constrained. (Color figure online)

Fig. 3. Accuracy over iterations for trainings using restricted and unrestricted norms. The training with the fixed norm converges 40% faster than the unfixed-norm case.

Fig. 4. Area vs. intra-class distance, where each dot represents a class.

With the goal of reducing the angle between the classifier vectors and their corresponding embeddings, we constrain the norm of the classifiers during training to a fixed value of 1. Hence, only the angle can change to improve the loss. The resulting embedded space is depicted in Fig. 2b: the classifier vectors now traverse their corresponding clusters. Furthermore, when we constrain the classifiers' norm, convergence is achieved faster. As shown in Fig. 3, the standard training reaches maximum accuracy at 10,000 iterations, while the constrained case achieves the same accuracy in 4,000 iterations, a reduction in time of 40%.
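One plausible way to impose the unit-norm constraint, since the paper does not detail the exact mechanism, is to re-project the classifier rows onto the unit sphere after every optimiser step:

import torch
import torch.nn.functional as F

def renormalize_classifier(layer):
    # Re-project every classifier vector W_c onto the unit sphere, so that ||W_c|| = 1
    with torch.no_grad():
        layer.weight.copy_(F.normalize(layer.weight, dim=1))

# inside the training loop (sketch):
#   loss.backward()
#   optimizer.step()
#   renormalize_classifier(classifier)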

Another interesting observation arises from the calculation of the classification area of each class region and of the intra-class distance. These areas give us a way to measure the relative coverage of each class within the embedded space. They are computed by integrating the surface region of each class within a circle of fixed radius, where the surface of a class is the region in which its softmax value is the highest among all classes. The intra-class distance is defined as the mean distance of each instance to the centroid of its corresponding cluster. As depicted in Fig. 4, the area appears to be inversely correlated with the intra-class distance. Whether we can manipulate the shape of the encoding distributions by imposing restrictions on the classifier norms is left for future investigation.
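Assuming that the area of a class is measured as the fraction of a circle of fixed radius assigned to that class, and the intra-class distance as the mean distance of its embeddings to the class centroid, both quantities can be estimated as in the sketch below (the sampling resolution is arbitrary).

import math
import torch

def class_areas(W, radius=1.0, samples=3600):
    # Fraction of a circle of fixed radius won by each class under the linear classifier
    angles = torch.linspace(0.0, 2.0 * math.pi, samples)
    points = radius * torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)  # (S, 2)
    winners = (points @ W.t()).argmax(dim=1)          # highest softmax = highest logit
    return torch.bincount(winners, minlength=W.shape[0]).float() / samples

def intra_class_distance(x, y, c):
    # Mean distance of the embeddings of class c to their centroid
    members = x[y == c]
    centroid = members.mean(dim=0)
    return (members - centroid).norm(dim=1).mean()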

Fig. 5. Classifier vectors during a training on balanced MNIST: (a) the configuration of the vectors at the beginning of the training and (b) at the end. No restriction has been imposed on the classifier vectors during this training.

Figures 5a and b correspond to the configuration of the vectors at the beginning and at the end of the training respectively. Comparing them, we see little variation in their directions, which confirms the importance of the classifier initialisation. It also suggests that this influence might be more accentuated in unbalanced datasets, where classes highly represented in the train set might overcome adjacent classifiers with lower presence.

Fig. 6. Number of instances per class in the dataset for (a) a balanced dataset and (b) an unbalanced dataset.

Fig. 7. (a) and (c): embedded space and classifier areas for each class, respectively, at the end of a training without restrictions for the unbalanced dataset. (b) and (d): the same as (a) and (c) but using the guided initialisation.

3.2 Unbalanced Dataset

Unbalanced datasets are of great interest due to their presence in real-world problems. A particular case of imbalance is the long-tail dataset. We have modified the MNIST dataset so that the instances are geometrically distributed across classes following the relation \(y^{LT}_c = y_c \cdot g_c\), where \(y_c\) is the number of instances of class c in the balanced dataset and the down-sampling factor \(g_c\) is given by

$$\begin{aligned} g_c = p \left( 1 - p\right) ^c \quad c=1,\ldots ,C. \end{aligned}$$
(5)

In this study we set \(p=\frac{1}{2}\). The resulting distribution is shown in Fig. 6b. Note that in these experiments we have unbalanced the train set whilst the test set remains balanced (Fig. 6a), as in Sect. 3.1.
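The construction of the long-tail train set can be sketched as follows; the per-class size of 6,000 is only an approximation of the MNIST train split, and the actual subsampling of image indices is not shown.

p = 0.5
C = 10
y = [6000] * C                                      # approximate per-class size of balanced MNIST train
g = [p * (1.0 - p) ** c for c in range(1, C + 1)]   # down-sampling factors g_c of Eq. (5), c = 1..C
y_lt = [int(y_c * g_c) for y_c, g_c in zip(y, g)]   # long-tail class sizes y_c^LT
print(y_lt)                                         # [1500, 750, 375, 187, 93, 46, 23, 11, 5, 2]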

Figures 7a and c show the embedded space resulting from a training on our unbalanced dataset. The single most striking observation is that some classes (namely 1, 5, 7 and 9) finish the training without any classification area. Interestingly, these are classes with low representation in the train set. We also observe that classes with higher presence in train tend to overcome the adjacent classes with less presence, up to the point where the minority classes are left without area.

The previous observation, together with the small variation of the classification vectors during training, evidences the importance of the initialisation, especially in long-tail datasets. The evolution of the accuracy for three experiments that have been randomly initialised reveals high variability among repetitions: each training leads to different accuracies, from 95% to 64% in train and from 50% to 33% in test, a difference of 31% and 17% in train and test respectively for trainings with the same set of hyperparameters and number of epochs.

To mitigate this effect, we propose a novel approach that consists of a guided initialisation in which the areas of classes with a similar number of instances are located next to each other. This reduces the competition between highly and lowly represented classes, so the areas of the former do not push the areas of the latter out of the embedded space.
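The exact assignment rule is not detailed in the text, so the following sketch, which sorts classes by training frequency and gives them consecutive, evenly spaced directions in the 2-D embedded space, should be read as one plausible implementation rather than the paper's method.

import math
import torch

def guided_init(class_counts, norm=1.0):
    # Assign evenly spaced 2-D directions so that classes with similar
    # numbers of training instances receive adjacent classifier vectors.
    C = len(class_counts)
    order = sorted(range(C), key=lambda c: class_counts[c])     # classes sorted by frequency
    angles = torch.linspace(0.0, 2.0 * math.pi, C + 1)[:-1]     # C evenly spaced angles
    W = torch.zeros(C, 2)
    for slot, c in enumerate(order):                            # neighbouring slots -> similar counts
        a = angles[slot]
        W[c] = norm * torch.stack([torch.cos(a), torch.sin(a)])
    return W

# e.g., to initialise the classifier with the long-tail counts of Sect. 3.2:
#   with torch.no_grad():
#       classifier.weight.copy_(guided_init(y_lt))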

Figures 7b and d show the results of the guided initialisation. The areas that were absent in the previous experiment are now present in the embedded space. In addition, the variability among trainings is reduced to 4% and the final test accuracies are 18% higher on average.

4 Conclusions and Future Work

In this work we have studied the dynamics of CNNs during training from a geometric perspective. Specifically, we have explored the interplay between the classifier vectors and the image encoding vectors in the last-layer space as training progresses.

A careful examination of this space revealed misalignment issues between the classifier and embedding vectors. We have conducted experiments showing that constraining the norms of the classifier vectors not only reduces this misalignment but also speeds up convergence.

Additionally, we have shown that unbalanced datasets are highly sensitive to the randomness of the parameter initialisation, reporting up to a 17% accuracy difference in test across repetitions. We proposed a novel approach to initialise the classification-layer parameters that reduces this variability to 4%. This method sets the initial directions of the vectors so that the competition for classification area happens between classes with a similar number of training instances, minimising the risk of absent areas for classes with less presence in train. Moreover, this method yielded accuracies 18% higher on average, suggesting that it sets more robust initial states that lead more frequently to good local minima.

Finally, we have observed an inverse correlation between the classification area of each category and the shape of its cluster. As future work we propose to investigate this relation further, together with the impact of the cluster shape on the performance of the network. In addition, we plan to extend this study to more complex datasets, such as ImageNet [8], in order to test the robustness of our proposal.