
1 Introduction

To this day, nobody doubts the potential of Deep Learning for addressing Artificial Intelligence (AI) challenges. The Computer Vision field in particular has experienced a revolution in which Deep Learning models have substantially outperformed the previous state of the art, not only in image classification and detection but also in other domains such as image processing, 3D modelling or Natural Language Processing. Despite this success, an important fraction of the community has strongly criticised the inability to provide a clear explanation of how CNNs work internally. Important research has been conducted on the visualisation of filters and activations [2, 10, 14]. These works provide tools that enable better diagnosis of issues and identification of failure modes. However, we are still far from a good understanding of neural networks, especially during training.

Apart from high-level metrics such as accuracy or the cross-entropy loss, we know very little about how the filters and classifiers of the model evolve during the course of training. For instance, it would be helpful to know how the image features are distributed in the latent space, or whether the parameters have been initialised to locations that ensure good convergence. By better understanding the behaviour of the model and how it evolves during training, we can expect to design better training strategies that improve accuracy.

Latent Features and Loss Strategies. To extend the understanding of CNNs it is common to divide the network into two blocks: the block in charge of extracting features of interest from the input image, whose output is known as the embedding or encoding, and the classifier block that receives the embedding and predicts the correct class. A well-trained network is expected to generate similar embeddings for images that belong to the same class and dissimilar embeddings for images of different classes. Generating representative and discriminative features is key to achieving good accuracy in image classification and image retrieval tasks. Training a CNN for image classification with the standard cross-entropy loss does not ensure good separability of classes in the embedding space [6, 7, 12]. A common approach is to train the embedding space directly using pairs of images [11] or triplets [9]. These losses tend to produce more discriminative embeddings than the standard cross-entropy loss. However, since the number of possible pairs/triplets explodes with the size of the dataset, these methods require a non-trivial mining process to generate the pairs or triplets of interest for training. A popular approach that avoids such mining is the center loss [13], yet it requires extra computation to re-calculate the class centres and intra-class distances at every iteration. Alternatively, some researchers propose variations of the cross-entropy loss that aim at reducing the intra-class distance and increasing the inter-class distance. The work of Liu et al. [6] adds a margin to the angle between the embedding vector and the correct classifier, much as the Hinge loss enforces maximum margins between embeddings and the classifier's boundaries. Close to that approach, Wan and coworkers [12] write the softmax loss as a cosine loss by renormalising the \(\ell ^2\)-norms of feature vectors and weights, again introducing a margin to maximise the separability between classes. What is common to all of these proposals and to our work is the geometric formalism used to describe the latent space. Furthermore, Ranjan et al. [7] noted that the \(\ell ^2\)-norm of a feature vector is a good indicator of how representative the image is of its class. They proposed the Crystal loss, which computes the cross-entropy loss over features constrained to the same norm. Another interesting approach was proposed by Wan et al. [12], where the embedding space is modelled as a mixture of Gaussian distributions and the loss function aims at increasing the probability of each instance under its class distribution.

Parameters Initialisation. A significant amount of work has been conducted on the initialisation of parameters and how it can help mitigate exploding or vanishing gradients, as well as avoid slow convergence [3, 5]. Ayinde et al. [1] studied how the initialisation method affects the number of redundant filters learned. These well-known techniques have a random component that makes each training run start from a different configuration, and therefore likely to reach a different training state. It is thus worth studying in more depth the variability across training repetitions and how this variability can be reduced.

Using the previous research as the seed of our study, we investigate how the backbone of the network and the final classification layer evolve during training. From a geometrical point of view, we treat the classifier's weights as vectors that live in the embedding space. This perspective allows us to focus on the geometrical evolution of both the classifier weight vectors and the image embeddings. We conduct a series of ablation studies to better understand the interplay between these vectors. Moreover, we explore the variability across initialisations in unbalanced datasets with a long-tail shape. Finally, we propose a novel initialisation of the classifier vectors based on the distribution of the train set; this method reduces the variability across initialisations by 12% on a long-tail version of MNIST. The paper is arranged as follows: Sect. 2 introduces the geometrical approach of this study. Sect. 3 identifies issues associated with standard training techniques and presents a method that mitigates them through a guided initialisation of the last-layer vectors. Lastly, Sect. 4 presents the conclusions and future work.

Fig. 1. (a) Schematic diagram of a convolutional neural network. The network is mainly composed of two parts: the backbone block and the classifier block. (b) Example of an embedded space for a subset of the MNIST dataset. (Color figure online)

2 Background

Convolutional Neural Networks (CNNs) can be divided into two blocks, as shown in Fig. 1a, namely the feature extraction block and the classification block:

  • The feature extraction block receives an input image and applies a series of convolutions and pooling operations with the goal of identifying discriminative features. The output of this block is a one-dimensional vector referred to as the image encoding or embedding. If this block is well trained, we expect encodings from the same class to be close together. Figure 1b depicts image encodings from a subset of the MNIST train set; note that we use 2D vectors for visualisation purposes. Each point in the plot represents a different image, and its colour corresponds to the ground-truth label. These vectors are the input to the classification block.

  • The classification block is a linear classifier followed by the softmax function. Although this block can in general be composed of several dense layers, throughout the paper we refer to the last layer of the CNN as the classification block. This layer calculates the class probabilities for the input image. Its linear classifier performs the linear transformation given by

    $$\begin{aligned} z_c = \sum _{j=1}^N W_{c,j} \cdot x_j + b_c, \end{aligned}$$
    (1)

    where \({\varvec{W}} \in \mathbb {R}^{C \times N}\) is the classifier weight matrix, with C the number of classes and N the size of the image encoding, \({\varvec{b}} \in \mathbb {R}^C\) is the bias term, \({\varvec{x}} \in \mathbb {R}^{N}\) is the image embedding, i.e. the outcome of the feature extraction block, and \({\varvec{z}} \in \mathbb {R}^{C}\) is the class prediction vector. We note that the bias term \({\varvec{b}}\) is omitted throughout for simplicity. The block also applies the softmax function, a non-linear transformation that produces a probability distribution across all classes. This function \(f({\varvec{z}})\) is defined as

    $$\begin{aligned} \left[ f({\varvec{z}})\right] _c = \frac{\exp {\left( z_c\right) }}{\sum \nolimits _{c'=1}^C \exp {\left( z_{c'}\right) }}, \end{aligned}$$
    (2)

    where \(\left[ f({\varvec{z}})\right] _c\) is the probability of the \(c^{\text {th}}\) class. Note that the performance of the classifier block is highly dependent on the quality of the features: the classifier benefits from well separated class-wise features. A minimal sketch of this block is given after the list.
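As an illustration, the following minimal sketch implements the classification block of Eqs. (1) and (2) as a bias-free linear layer followed by the softmax. The use of PyTorch and the variable names are our own assumptions; the paper does not specify an implementation.

import torch
import torch.nn.functional as F

N, C = 2, 10                                    # embedding size and number of classes, as used in the paper
classifier = torch.nn.Linear(N, C, bias=False)  # the rows of classifier.weight are the vectors W_c

x = torch.randn(1, N)                           # illustrative image embedding produced by the backbone
z = classifier(x)                               # logits z_c = sum_j W_{c,j} x_j   (Eq. 1, bias omitted)
probs = F.softmax(z, dim=1)                     # class probabilities [f(z)]_c     (Eq. 2)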

During training the network optimises a loss function through back-propagation and gradient descent. One of the most common objective functions is the cross-entropy loss, which measures the difference between the predicted distribution \(f({\varvec{x}})\) and the target distribution \(p({\varvec{x}})\), i.e. the one constructed from the ground truth. For a given instance i the cross-entropy is expressed as follows:

$$\begin{aligned} \mathcal {L}_i=- \sum _{c=1}^C \left[ p({\varvec{x}}^{(i)})\right] _c \log {\left[ f({\varvec{x}}^{(i)})\right] _c}, \end{aligned}$$
(3)

where C is the total number of classes.
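With a one-hot target distribution p, Eq. (3) reduces to the negative log-probability of the ground-truth class. The hedged sketch below, with illustrative values, shows this equivalence with the standard cross-entropy-from-logits routine.

import torch
import torch.nn.functional as F

z = torch.randn(1, 10)                         # logits for one instance, C = 10
target = torch.tensor([3])                     # ground-truth class index
loss = F.cross_entropy(z, target)              # -log [f(z)]_target, i.e. Eq. (3) with a one-hot p
manual = -F.log_softmax(z, dim=1)[0, target]   # the same quantity computed explicitly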

2.1 Geometric Interpretation

We can express Eq. (3) using the geometric notation of the dot product as follows

$$\begin{aligned} \mathcal {L}_i = - \sum _{c=1}^C \left[ p({\varvec{x}}^{(i)})\right] _c \log \left[ \frac{\exp \left( \Vert {\varvec{W}}_{c}\Vert \cdot \Vert {\varvec{x}}^{(i)}\Vert \cos \theta _{c}^{(i)} \right) }{\sum \nolimits _{c'=1}^C \exp \left( \Vert {\varvec{W}}_{c'}\Vert \cdot \Vert {\varvec{x}}^{(i)}\Vert \cos \theta _{c'}^{(i)} \right) } \right] , \end{aligned}$$
(4)

where \(\theta _{c}^{(i)}\) is the angle between the image encoding \({\varvec{x}}^{(i)}\) and the classifier vector \({\varvec{W}}_{c}\). Considering Eq. (4), there are two pathways to reduce the loss, corresponding to the two blocks of the network: updating the parameters of the backbone, i.e. the feature extraction block, or updating the parameters of the classifier.

  • Backbone update: This entails updating the parameters of the convolutional filters of the network, which eventually leads to different encoding vectors. From the geometric perspective, the network can reduce the loss by decreasing the angle \(\theta _c^{(i)}\) for the correct class c, i.e. by moving the encoding closer to its classifier vector.

  • Classifier update: Updating the classifier block entails updating the classifier vectors. The training process can increase \(\Vert {\varvec{W}}_{c}\Vert \) for the correct class and/or change the direction of this vector so that the angle \(\theta _c^{(i)}\) of the correct class is reduced. Likewise, it can reduce the norms of the remaining classifier vectors and/or increase their angles with the encoding by moving their directions away from it. A numerical sketch of this decomposition is given after this list.
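The decomposition of Eq. (4) can be inspected numerically. The sketch below, whose function name and tensor shapes are our own rather than the paper's, recovers the norms and the angles \(\theta _c^{(i)}\) from a batch of embeddings and the classifier matrix. Both update pathways act on this product: the backbone changes the encoding norms and the cosines, while the classifier changes its own norms and the cosines.

import torch
import torch.nn.functional as F

def logit_decomposition(W, x):
    # W: (C, N) classifier matrix, x: (B, N) batch of image embeddings
    w_norm = W.norm(dim=1)                                    # ||W_c||
    x_norm = x.norm(dim=1)                                    # ||x^(i)||
    cos = F.normalize(x, dim=1) @ F.normalize(W, dim=1).t()   # (B, C) cosines of theta_c^(i)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))        # angles theta_c^(i)
    logits = x_norm[:, None] * w_norm[None, :] * cos          # equals x @ W.t(), as in Eq. (4)
    return logits, theta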

3 Experiments and Results

In the following experiments we study the interplay between the image encodings and the classifier layer during training. In particular, we explore these dynamics on balanced and unbalanced datasets. We use the ResNet-101 architecture [4] as backbone, with an embedding of length 2, a batch size of 512, and an initial learning rate of \(5 \times 10^{-4}\) that is divided by 5 at epochs 15 and 80. We use a weight decay of \(5\times 10^{-3}\), the ADAM optimiser and Xavier initialisation of the parameters.
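A sketch of this configuration, assuming PyTorch and the torchvision ResNet-101 (the paper does not state its implementation, and the MNIST input-channel and data-loading details are omitted), could look as follows.

import torch
import torchvision

model = torchvision.models.resnet101(num_classes=2)    # backbone ending in a 2-D embedding
classifier = torch.nn.Linear(2, 10, bias=False)         # last-layer classifier vectors W

for m in list(model.modules()) + [classifier]:           # Xavier initialisation of the parameters
    if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.xavier_uniform_(m.weight)

params = list(model.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=5e-4, weight_decay=5e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 80], gamma=0.2)  # lr divided by 5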

3.1 Balanced Dataset

Using the previous configuration we conduct a standard training and visualise the evolution of the classifier vectors and image embeddings as training progresses. Figure 2a shows the state at the end of a training run with an accuracy of 0.98. Image embeddings from the same class group together, creating clusters that fall in their corresponding class region. Moreover, the classifier vectors span the angular domain, similarly to the hands of a clock. A closer look reveals that the classifier vectors do not traverse their clusters. This means that, although the accuracy is high, the dot product in the numerator of Eq. (4) is far from being optimally maximised.

Fig. 2. Embedded space representation for a CNN trained on MNIST. The dots and arrows represent the image encodings and the classifier vectors respectively, coloured by their ground-truth class. The coloured regions represent the locations in which the network predicts a particular class. (a) The embedded space for a training with no restrictions imposed; (b) the resulting space when the norms of the classifier vectors are constrained. (Color figure online)

Fig. 3. Accuracy over iterations for trainings using restricted and unrestricted norms. The training with the fixed norm converges 40% faster than the unfixed-norm case.

Fig. 4. Area vs. intra-class distance, where each dot represents a class.

With the goal of reducing the angle between the classifier vectors and their corresponding embeddings, we constrain the norm of the classifiers during training to a fixed value of 1. Hence, only the angle can change to improve the loss. The resulting embedded space is depicted in Fig. 2b: the classifier vectors now traverse their corresponding clusters. Furthermore, when we constrain the classifiers' norm, convergence is achieved faster. As shown in Fig. 3, the standard training reaches maximum accuracy at 10,000 iterations, while the constrained case achieves the same accuracy in 4,000 iterations, a reduction in time of 40%.
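One plausible way to impose the unit-norm constraint, since the paper does not detail the exact mechanism, is to re-project the classifier rows onto the unit sphere after every optimiser step:

import torch
import torch.nn.functional as F

def renormalize_classifier(layer):
    # Re-project every classifier vector W_c onto the unit sphere, so that ||W_c|| = 1
    with torch.no_grad():
        layer.weight.copy_(F.normalize(layer.weight, dim=1))

# inside the training loop (sketch):
#   loss.backward()
#   optimizer.step()
#   renormalize_classifier(classifier)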

Another interesting observation arises from the calculation of the classification area of each class region and of the intra-class distance. These areas give us a way to measure the relative coverage of each class within the embedded space. They are computed by integrating the surface region of each class within a circle of fixed radius, where the surface of a class is the region in which its softmax value is the highest among all classes. The intra-class distance is defined as the mean distance of each instance to the centroid of its corresponding cluster. As depicted in Fig. 4, the area appears to be inversely correlated with the intra-class distance. Whether we can manipulate the shape of the encoding distributions by imposing restrictions on the classifier norms is left for future investigation.
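Assuming that the area of a class is measured as the fraction of a circle of fixed radius assigned to that class, and the intra-class distance as the mean distance of its embeddings to the class centroid, both quantities can be estimated as in the sketch below (the sampling resolution is arbitrary).

import math
import torch

def class_areas(W, radius=1.0, samples=3600):
    # Fraction of a circle of fixed radius won by each class under the linear classifier
    angles = torch.linspace(0.0, 2.0 * math.pi, samples)
    points = radius * torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)  # (S, 2)
    winners = (points @ W.t()).argmax(dim=1)          # highest softmax = highest logit
    return torch.bincount(winners, minlength=W.shape[0]).float() / samples

def intra_class_distance(x, y, c):
    # Mean distance of the embeddings of class c to their centroid
    members = x[y == c]
    centroid = members.mean(dim=0)
    return (members - centroid).norm(dim=1).mean()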

Fig. 5. Classifier vectors during a training on balanced MNIST: (a) the configuration of the vectors at the beginning of the training and (b) at the end. No restriction has been imposed on the classifier vectors during this training.

Figures 5a and b correspond to the configuration of the vectors at the beginning and at the end of the training respectively. Comparing them, we see little variation in their directions, which confirms the importance of the classifier initialisation. It also suggests that this influence might be more accentuated in unbalanced datasets, where classes highly represented in the train set might overcome adjacent classifiers with lower presence.

Fig. 6. Number of instances per class in the dataset for (a) a balanced dataset and (b) an unbalanced dataset.

Fig. 7. (a) and (c): embedded space and classifier areas for each class, respectively, at the end of a training without restrictions for the unbalanced dataset. (b) and (d): the same as (a) and (c) but using the guided initialisation.

3.2 Unbalanced Dataset

Unbalanced datasets are of great interest due to their presence in real-world problems. A particular case of imbalance is the long-tail dataset. We have modified the MNIST dataset so that the instances are geometrically distributed across classes following the relation \(y^{LT}_c = y_c \cdot g_c\), where \(y_c\) is the number of instances of class c in the balanced dataset and the down-sampling factor \(g_c\) is given by

$$\begin{aligned} g_c = p \left( 1 - p\right) ^c \quad c=1,\ldots ,C. \end{aligned}$$
(5)

In this study we set \(p=\frac{1}{2}\). The resulting distribution is shown in Fig. 6b. Note that in these experiments we have unbalanced the train set whilst the test set remains balanced (Fig. 6a), as in Sect. 3.1.
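The construction of the long-tail train set can be sketched as follows; the per-class size of 6,000 is only an approximation of the MNIST train split, and the actual subsampling of image indices is not shown.

p = 0.5
C = 10
y = [6000] * C                                      # approximate per-class size of balanced MNIST train
g = [p * (1.0 - p) ** c for c in range(1, C + 1)]   # down-sampling factors g_c of Eq. (5), c = 1..C
y_lt = [int(y_c * g_c) for y_c, g_c in zip(y, g)]   # long-tail class sizes y_c^LT
print(y_lt)                                         # [1500, 750, 375, 187, 93, 46, 23, 11, 5, 2]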

Figures 7a and c show the embedded space resulting from a training on our unbalanced dataset. The single most striking observation is that some classes (namely 1, 5, 7 and 9) finish the training without any classification area. Interestingly, these are classes with low representation in the train set. We also observe that classes with higher presence in train tend to overcome the adjacent classes with less presence, up to the point where the minority classes are left without area.

The previous observation, together with the small variation of the classification vectors during training, evidences the importance of the initialisation, especially in long-tail datasets. The evolution of the accuracy for three experiments that have been randomly initialised reveals high variability among repetitions: each training leads to different accuracies, from 95% to 64% in train and from 50% to 33% in test, a difference of 31% and 17% in train and test respectively for trainings with the same set of hyperparameters and number of epochs.

To mitigate this effect, we propose a novel approach that consists of a guided initialisation in which the areas of classes with a similar number of instances are located next to each other. This reduces the competition between highly and lowly represented classes, so the areas of the former do not push the areas of the latter out of the embedded space.
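The exact assignment rule is not detailed in the text, so the following sketch, which sorts classes by training frequency and gives them consecutive, evenly spaced directions in the 2-D embedded space, should be read as one plausible implementation rather than the paper's method.

import math
import torch

def guided_init(class_counts, norm=1.0):
    # Assign evenly spaced 2-D directions so that classes with similar
    # numbers of training instances receive adjacent classifier vectors.
    C = len(class_counts)
    order = sorted(range(C), key=lambda c: class_counts[c])     # classes sorted by frequency
    angles = torch.linspace(0.0, 2.0 * math.pi, C + 1)[:-1]     # C evenly spaced angles
    W = torch.zeros(C, 2)
    for slot, c in enumerate(order):                            # neighbouring slots -> similar counts
        a = angles[slot]
        W[c] = norm * torch.stack([torch.cos(a), torch.sin(a)])
    return W

# e.g., to initialise the classifier with the long-tail counts of Sect. 3.2:
#   with torch.no_grad():
#       classifier.weight.copy_(guided_init(y_lt))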

Figures 7b and d show the results of the guided initialisation. The areas that were absent in the previous experiment are now present in the embedded space. In addition, the variability among trainings is reduced to 4% and the final test accuracies are 18% higher on average.

4 Conclusions and Future Work

In this work we have studied the dynamics of CNNs during training from a geometric perspective. Specifically, we have explored the interplay between the classifier vectors and the image encoding vectors in the last-layer space as training progresses.

A careful examination of this space revealed misalignment issues between the classifier and embedding vectors. We have conducted experiments showing that constraining the norms of the classifier vectors not only reduces this misalignment but also speeds up convergence.

Additionally, we have shown that unbalanced datasets are highly sensitive to the randomness of the parameter initialisation, reporting up to a 17% accuracy difference in test across repetitions. We proposed a novel approach to initialise the classification-layer parameters that reduces this variability to 4%. This method sets the initial directions of the vectors so that the competition for classification area happens between classes with a similar number of training instances, minimising the risk of absent areas for classes with less presence in train. Moreover, this method yielded accuracies 18% higher on average, suggesting that it sets more robust initial states that lead more frequently to good local minima.

Finally, we have observed an inverse correlation between the classification area of each category and the shape of its cluster. As future work we propose to investigate this relation further, together with the impact of the cluster shape on the performance of the network. In addition, we plan to extend this study to more complex datasets, such as ImageNet [8], in order to test the robustness of our proposal.