
1 Introduction

Deep Neural Networks (DNNs) have demonstrated their success in many computer vision and natural language processing tasks [1,2,3,4,5], but the theoretical reasons behind these successes have not been fully unveiled. Recently, information theory has proven valuable for understanding DNNs. Specifically, Tishby and Zaslavsky [6] note that layered neural networks can be represented as a Markov chain and analyze them via the information bottleneck. Shwartz-Ziv and Tishby [7] calculate the mutual information I(X; T) and I(T; Y) for each hidden layer, where X is the input data, Y is the label and T is the hidden layer output, and demonstrate the effectiveness of visualizing neural networks in the information plane. These works inspire us to leverage mutual information to evaluate the capability of DNNs.

Fig. 1.

This figure is adapted from [7]. The mutual information path is calculated based on a fully connected neural network. X is a 12-dimensional binary input and Y has 2 classes. Each hidden layer first reaches the green point (transition point), then converges at the yellow point. The leftmost path corresponds to the last hidden layer and the rightmost path corresponds to the first hidden layer. (best viewed in color) (Color figure online)

Figure 1 depicts the evolution of the mutual information over the training epochs in the information plane [7]. As can be seen, the green point in each mutual information path, referred to as the transition point, separates the learning process into two distinct phases: the ‘fitting phase’, which takes only a few hundred epochs and during which the layers’ information on the label, namely I(T; Y), increases; and the subsequent ‘compression phase’, which takes most of the training time and during which the layers’ information on the input, i.e. I(X; T), decreases (the layers remove irrelevant information until convergence).

The evolution of I(X; T) and I(T; Y) explains how DNNs work. However, the models used in [6, 7] are simple fully connected neural networks. In real applications, Convolutional Neural Networks (CNNs) are commonly used in computer vision. Pushing these works [6, 7] forward, in this paper we design an information-plane-based framework to study the capability of classical CNN structures for image classification, including AlexNet [2] and VGG [8]. The contributions of our work can be summarized as follows:

  • Our work reveals that I(X; T) also contributes to the training accuracy, and that this correlation grows stronger as the network gets deeper. We perform experiments to validate this claim.

  • An evaluation framework based on the information plane is proposed. The framework is more ‘informative’ than the loss curve and would facilitate a better understanding of DNNs.

  • We show that mutual information can be used to infer the DNN’s capability of recognizing objects of each class in the image classification task.

2 Related Work

The most related topic is the information bottleneck (IB) principle [9]. IB provides a technique for extracting the information in an input random variable that is relevant for predicting a different output random variable. [10] extends the original IB method to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. [11] introduces an alternative formulation called the deterministic IB (DIB), which replaces mutual information with entropy and better captures the notion of which features are relevant. [12] theoretically analyzes the IB method and its relation to learning algorithms and minimal sufficient statistics. [13] shows that K-means and deterministic annealing algorithms for geometric clustering can be derived from a more general IB approach.

Recently, we have seen some applications of IB in deep learning. [14] presents a variational approximation to the IB method; this variational approach parameterizes the IB model with a neural network and leverages the reparameterization trick for efficient training. [15] proposes a method that allows IB to be used in more general domains, such as discrete or continuous inputs and outputs and nonlinear encoding and decoding maps. [16] proposes a Parametric IB (PIB) framework to jointly optimize the compression and relevance of all layers in stochastic neural networks, better exploiting the networks’ representation capabilities. [17] introduces Information Dropout, an information-theoretic generalization of dropout that automatically adapts to the data and better exploits architectures with limited capacity.

[6, 7], which are most relevant to our work, visualize the mutual information between hidden layers and the input/output of a neural network in the information plane to understand the optimization process and the internal organization of DNNs. Different from these works, which study DNNs with fully connected layers, in this paper we study the behavior of the more commonly used CNNs in image classification.

3 Mutual Information and Deep Neural Networks

In this section, we first revisit the definition of mutual information and its properties relevant to DNN analysis; we then interpret representation learning in DNNs in terms of mutual information and show how to calculate mutual information in DNNs.

3.1 Mutual Information

Given two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y), the mutual information between the two variables, I(X; Y), is defined as:

$$\begin{aligned} I(X;Y) = \sum _{x, y }p(x,y)\log \frac{p(x,y)}{p(x)p(y)}. \end{aligned}$$
(1)

The entropy of X, H(X), can be defined using the mutual information:

$$\begin{aligned} H(X) = I(X;X) = -\sum _{x}p(x)\log p(x). \end{aligned}$$
(2)

In general, the mutual information of two random variables is a measurement of the mutual dependence between the two variables. More specifically, it quantifies the amount of information obtained about one random variable, through the other one.
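For concreteness, the following minimal Python/NumPy sketch computes I(X; Y) in bits directly from a joint probability table according to (1); the helper name `mutual_information` is ours and purely illustrative, and this is not the estimator used in our experiments (see Sect. 3.3).

```python
import numpy as np

def mutual_information(p_xy):
    """Compute I(X; Y) in bits from a joint probability table p_xy[x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    nz = p_xy > 0                            # terms with p(x, y) = 0 contribute 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

# Example: X uniform over {0, 1} and Y = X, so I(X; Y) = H(X) = 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # 1.0
```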

There are two properties of mutual information, (3) and (4), which are useful for analyzing DNNs:

  • function transformation:

    $$\begin{aligned} I(X;Y) = I(\psi (X);\phi (Y)) \end{aligned}$$
    (3)

    for any invertible functions \(\psi \) and \(\phi \).

  • Markov chain. Suppose \(X\rightarrow Y\rightarrow Z\) forms a Markov chain, then we have the data processing inequality:

    $$\begin{aligned} I(X;Y) \ge I(X;Z). \end{aligned}$$
    (4)

3.2 Optimal Representation of Learning Process

In representation learning, we want our model to learn an efficient representation of the original data X without losing the ability to predict the label Y; in other words, we want to learn a minimal sufficient statistic of X with respect to Y. A minimal sufficient statistic T(X) is the solution to the following optimization problem:

$$\begin{aligned} T(X) = \mathop {\arg \min }_{S(X):I(S(X);Y)=I(X;Y)} I(S(X);X) \end{aligned}$$
(5)

So, from the minimal sufficient statistics perspective, the goal of DNNs is to make I(X; S(X)) as small as possible, which means the representation is efficient, while keeping I(S(X); Y) equal to I(X; Y), which means no information on Y is lost. In practice, an explicit minimal sufficient statistic exists only for very special distributions. The actual learning process is a tradeoff between I(X; S(X)) and I(S(X); Y), which leads to the IB method [9]. IB can be seen as a special case of rate-distortion theory and provides a framework for finding approximate minimal sufficient statistics. The efficient representation is a tradeoff between the compression of X and the ability to predict Y.

Let x be an input point, and t be the corresponding model’s output, or the compressed representation of x. This representation is defined by the probabilistic mapping p(t|x). The information bottleneck tradeoff is formulated by the following optimization problem:

$$\begin{aligned} \min \limits _{p(t|x),Y\rightarrow X\rightarrow T}\{ I(X;T) - \beta I(T;Y) \}. \end{aligned}$$
(6)

The Lagrange multiplier \(\beta \) determines the level of relevant information captured by the representation T. So given a joint distribution p(x, y) and the parameter \(\beta \), minimizing (6) yields the optimal I(X; T) and I(T; Y) (see (31) in [9]).

3.3 Calculating Mutual Information in DNNs

From Sect. 3.2, we know that I(X; T) and I(T; Y) are essential for evaluating representation learning algorithms, including DNNs, but calculating them in DNNs is a difficult problem.

[7] uses the hyperbolic tangent as the hidden layers’ activation function and bins each neuron’s output activation into 30 equal intervals between −1 and 1. These discretized values of each t are then used to directly calculate the joint distributions p(x, t) and p(t, y) over the equally likely patterns of the input data for every hidden layer. But when the number of neurons in the hidden layer is large (as happens when we visualize CNN layers), I(X; T) and I(T; Y) barely change. The reason is that the sample space of T is huge even if we decrease the number of intervals, and the output of a particular input x falls into one cell of t with high probability. Thus p(x|t) and p(y|t) are approximately deterministic, and from (1) and (2) we get \(I(X;T)\approx H(X)\) and \(I(T;Y)\approx H(Y)\). This issue makes it hard to analyze general neural networks. Luckily our goal is to evaluate different network structures, so we only need to visualize the last hidden layer, since it directly reveals the relationship among the model output T, the input X and the label Y. Since the number of neurons in the last hidden layer of a DNN for image classification is precisely the number of classes, our method is only subject to the number of classes.
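For reference, a rough sketch of this binning estimator is given below (our own illustrative code, not the implementation of [7]; function and variable names are ours). I(T; Y) is obtained analogously by replacing the input identifiers with the labels.

```python
import numpy as np
from collections import Counter

def binned_mutual_information(x_ids, activations, n_bins=30, lo=-1.0, hi=1.0):
    """Estimate I(X; T) in bits by discretizing the layer activations into equal
    bins and counting empirical co-occurrences (a sketch of the procedure in [7]).
    x_ids: length-N sequence of hashable identifiers of the input patterns.
    activations: N x d array of (tanh) activations of one hidden layer."""
    edges = np.linspace(lo, hi, n_bins + 1)
    t_ids = [tuple(np.digitize(row, edges)) for row in np.asarray(activations)]

    n = len(t_ids)
    count_x, count_t = Counter(x_ids), Counter(t_ids)
    count_xt = Counter(zip(x_ids, t_ids))

    mi = 0.0
    for (x, t), c in count_xt.items():
        # p(x,t) / (p(x) p(t)) = c * n / (count_x[x] * count_t[t])
        mi += (c / n) * np.log2(c * n / (count_x[x] * count_t[t]))
    return mi
```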

Fig. 2.

This figure shows how we obtain T from the network for calculating I(X; T) and I(T; Y). \(Y\rightarrow X\rightarrow T\) forms a Markov chain. The output of the last layer (blue circles) is the softmax probability. (Color figure online)

Suppose there are C classes. The outputs of the last hidden layer are scores of the different classes and are unbounded. We use the normalized exponential function to squash a C-dimensional vector z of arbitrary real values into a C-dimensional vector \(\sigma (z)\) of real values in the range [0, 1] that sum to 1. The function is given by

$$\begin{aligned} \sigma (z)_{j} = \frac{e^{z_{j}}}{\sum _{c=1}^{C}e^{z_{c}}} \quad \text {for}~j=1,\dots ,C, \end{aligned}$$
(7)

which is exactly what the softmax function does in the neural network. We bin the neurons’ outputs \(\sigma (z)\) into 10 equal intervals between 0 and 1 to get our final model output T. Then we can calculate I(X; T) and I(T; Y) for any network architecture. An advantage of this calculation is that the sample space of T is somewhat smaller, since we enforce that the C-dimensional vector \(\sigma (z)\) sums to 1. This process is illustrated in Fig. 2.
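As an illustration, the discretization step could be implemented as follows (a minimal sketch; the name `discretize_softmax` is ours). The resulting discrete ids are then fed to a counting-based estimator such as the one sketched earlier in this section.

```python
import numpy as np

def discretize_softmax(logits, n_bins=10):
    """Squash the last-layer scores z (shape N x C) with softmax as in (7), then
    bin each probability into n_bins equal intervals on [0, 1]; each sample is
    mapped to a discrete id that serves as T when estimating I(X; T) and I(T; Y)."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)      # softmax
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)   # p == 1.0 goes to the top bin
    return [tuple(row) for row in bins]                       # hashable ids for counting
```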

4 Experiments

This section is organized as follows: in Sect. 4.1, we analyze the relationship among the model accuracy, I(X; T) and I(T; Y); in Sect. 4.2, we propose a framework that can be used to evaluate DNNs; in Sect. 4.3, we show that the evaluation framework is more informative than the loss curve when evaluating DNNs and how to use it to guide the efficient choice of networks; in Sect. 4.4, we show how to apply mutual information to infer a model’s capability of recognizing objects of each class in image classification tasks.

4.1 Relationship Among Classification Accuracy, I(X; T) and I(T; Y) in DNNs

In addition to developing the theory of deep learning, it is also important to validate it empirically. In the original IB theory [12], X, Y and T represent the training input, training label and model output, respectively; [12] states that I(T; Y) explains the training accuracy, while I(X; T) serves as a regularization term that controls generalization. Here we find that in DNNs, low I(X; T) also contributes to the training accuracy. In particular, when \(I(T_{1};Y)\) and \(I(T_{2};Y)\) are equal, the model with the smaller I(X; T) is more likely to achieve higher training accuracy.

To validate the hypothesis that low I(X; T) also contributes to the training accuracy, we train neural networks on the CIFAR-10 dataset and sample values of I(X; T), I(T; Y) and the training accuracy. During the training process, a sample is taken every fixed number of iterations. For the i-th sample, we use \(I(X;T_{i})\), \(I(T_{i};Y)\) and \(Acc_{i}\) to denote the mutual information values and the training accuracy, respectively. A direct way to examine our hypothesis is to find pairs (i, j) that satisfy \(I(T_{i};Y) = I(T_{j};Y)\) and then check the relationship between I(X; T) and the training accuracy.

Since I(T; Y) is a real number, it is hard to find a pair of samples with exactly the same value of I(T; Y). Instead, we examine the hypothesis by checking inversions. An inversion is a pair of samples (i, j) that satisfies \(I(T_{i};Y)<I(T_{j};Y)\) and \(Acc_{i}>Acc_{j}\). Among all these inversion pairs, we calculate the percentage of pairs that satisfy \(I(X;T_{i})<I(X;T_{j})\). This percentage is a proper indicator for our hypothesis: if it is near 0.5, then I(X; T) has almost no relation to the training accuracy; if it is high, then low I(X; T) also contributes to the training accuracy. In our experiments, we train neural networks under different training conditions. The percentages are listed in Table 1, and a sketch of the computation is given below.
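The inversion check can be written compactly; the sketch below (our own code, with illustrative names) returns the reported percentage given the sampled sequences of \(I(X;T_{i})\), \(I(T_{i};Y)\) and \(Acc_{i}\).

```python
from itertools import combinations

def inversion_percentage(i_ty, i_xt, acc):
    """Among all inversion pairs (i, j) with I(T_i;Y) < I(T_j;Y) and Acc_i > Acc_j,
    return the fraction that also satisfy I(X;T_i) < I(X;T_j)."""
    hits, total = 0, 0
    for i, j in combinations(range(len(acc)), 2):
        if i_ty[i] > i_ty[j]:        # orient the pair so i has the smaller I(T;Y)
            i, j = j, i
        if i_ty[i] < i_ty[j] and acc[i] > acc[j]:   # an inversion
            total += 1
            if i_xt[i] < i_xt[j]:
                hits += 1
    return hits / total if total else float('nan')
```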

Table 1. This table records the percentages computed from 600 samples for DNNs with different network structures and training methods on the training set. The percentage converges when we include 600 samples. CNN-9 is a deep convolutional neural network with 9 convolutional layers. The linear network is a feedforward network whose activation function is the identity function. SGD is short for Stochastic Gradient Descent, and BGD for Batch Gradient Descent. Due to computational limitations, we include 10,000 training samples when performing BGD; BGD and SGD use the same training set.

The results in Table 1 show that I(X; T) also contributes to training accuracy, since the percentages are over 0.5. Different network structures may end up with different percentages, and SGD yields higher percentages than BGD. We want to emphasize that the percentages may deviate slightly from the ground truth, since the mutual information in DNNs is calculated approximately by binning; this matters especially when the mutual information values do not vary much. We believe that more accurate mutual information estimates would make our hypothesis more convincing. Table 1 can be further interpreted as follows:

First, notice that I(T; Y) is not a monotonic function of the training accuracy. For example, suppose we have C classes in the dataset, and \(\mathcal C_{i}\) denotes the i-th class. Consider two cases. In the first case, \(T=\sigma (Y)\), where \(\sigma \) is the identity mapping, which means T always predicts the true class. In the second case, \(T = \varphi (Y)\), where \(\varphi \) is a shift mapping, which means that if the true class is \(\mathcal C_{i}\), the prediction of T is \(\mathcal C_{i+1}\). In both cases, since \(\sigma \) and \(\varphi \) are invertible functions, from (3) we have \(I(T;Y)= I(\sigma (Y);Y) = I(\varphi (Y);Y) = H(Y)\). But in case 1 the training accuracy is 1, whereas in case 2 it is 0.
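This invariance is easy to verify numerically. The toy snippet below (self-contained, with C = 4 balanced classes; it mirrors the mutual information sketch in Sect. 3.1) shows that the identity and shift mappings yield the same I(T; Y) = H(Y) but very different accuracies.

```python
import numpy as np

def mi_from_joint(p):
    """I(T; Y) in bits from a joint probability table p[t, y]."""
    pt, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (pt @ py)[nz])))

C = 4
y = np.repeat(np.arange(C), 100)     # balanced true labels, H(Y) = 2 bits
t_identity = y.copy()                # case 1: T = sigma(Y), identity mapping
t_shift = (y + 1) % C                # case 2: T = phi(Y), shift mapping

def joint(t, y):
    p = np.zeros((C, C))
    np.add.at(p, (t, y), 1)          # count co-occurrences of (t, y)
    return p / p.sum()

print(mi_from_joint(joint(t_identity, y)))               # 2.0 bits
print(mi_from_joint(joint(t_shift, y)))                  # 2.0 bits as well ...
print((t_identity == y).mean(), (t_shift == y).mean())   # ... but accuracy 1.0 vs 0.0
```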

Second, unlike for linear networks, the loss function of CNNs is highly non-convex. When training neural networks with SGD or BGD, the training loss with respect to all the training data does not decrease monotonically during training, which indicates that the network is sometimes learning in the wrong direction. Since SGD only uses a mini-batch of samples at each iteration, its loss curve is even more unstable. Only for the linear network (whose loss function is convex) trained by BGD with a proper learning rate does the training loss always decrease during training, which means the model always brings T closer to the true label Y (the model is most stable in this case). In that setting I(T; Y) can fully explain the training accuracy, and I(X; T) may not contribute much to the training accuracy.

Third, [18] defines a learning algorithm as stable if its output does not depend too much on any individual training example. So when \(I(T_{1};Y)\) and \(I(T_{2};Y)\) are equal, the model with lower I(X; T) has greater stability, which may lead to a higher training accuracy.

We also find that when trained by SGD, the percentages increase as more convolutional layers are added, which can be seen from the columns of Table 2. This interesting phenomenon may reveal some inherent properties of CNNs, which we will explore further in future work.

Table 2. This table records the percentages with 600 samples for DNNs with different network structures on the training set. CNN-i is a deep convolutional neural network with i convolutional layers.

We also validate our hypothesis on the validation data, where X and Y now represent the validation input and validation label, respectively. The percentages in Table 3 also show that low I(X; T) contributes to validation accuracy. This result will be useful in the next subsection for evaluating DNNs.

Table 3. The percentages for different numbers of samples on the validation set. The network is VGG-16 trained by SGD.

4.2 Evaluating DNNs in the Information Plane

Evaluating the capability of DNNs during the training process is important because it helps us understand the training phase better. Section 3.2 shows that an optimal representation (a minimal sufficient statistic of X with respect to Y) is a tradeoff between I(X; T) and I(T; Y). We validated the hypothesis in Sect. 4.1 that, in DNNs trained by SGD, not only I(T; Y) but also I(X; T) is an indicator of validation accuracy, where X and Y represent the validation input and validation label, respectively. So we use \(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\) (the slope of the curve) to represent the model’s learning capability at each moment in the information plane.

Figure 1 shows the two learning phases of the training process. The model begins to generalize in the second, compression phase, and the first, fitting phase takes very little time compared to the compression phase. So we use \(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\) in the compression phase to evaluate the model’s capability of generalization; we expect a good model to have a small (more negative) \(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\) in this phase. In the first, fitting phase, I(T; Y) and I(X; T) grow simultaneously (in order to fit the label, the model first needs to remember X), so we use I(T; Y) instead of \(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\) to represent the model’s capability of fitting the label. Based on the discussion above, we propose our evaluation framework in Fig. 3.
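A minimal computational sketch of the two indicators follows, assuming the mutual information path has already been sampled and the index of the transition point identified (both argument names are illustrative).

```python
def fitting_and_generalization(i_xt, i_ty, transition_idx):
    """Return (fitting, slope): the height of the transition point I(T;Y), and the
    average slope dI(T;Y)/dI(X;T) over the compression phase. A more negative slope
    indicates better generalization, since I(X;T) decreases during compression."""
    fitting = i_ty[transition_idx]
    d_ty = i_ty[-1] - i_ty[transition_idx]
    d_xt = i_xt[-1] - i_xt[transition_idx]
    slope = d_ty / d_xt if d_xt != 0 else float('nan')
    return fitting, slope
```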

Fig. 3.

Evaluation framework based on I(X; T) and I(T; Y). The height of the transition point (I(T; Y)) represents the model’s capability of fitting the label. The slope after the transition point (\(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\)) represents the model’s capability of generalization.

We are interested in how different neural networks behave under the framework we propose in Fig. 3, so we run different network structures on the MNIST and CIFAR-10 datasets (see Fig. 4). Note that in this and the subsequent experiments, X and Y represent the validation input and validation label, respectively. Mutual information curves are smoothed for better visualization, since smoothing does not change the trend of the curve. Also, each DNN is trained once until convergence, without data augmentation or retraining, so that the networks are compared on an equal footing. We also record the mutual information, training epochs and model validation accuracy at the transition point and convergence point in Table 4. Figure 4 and Table 4 show some interesting phenomena.

  • Convolutional neural networks (CNNs) may have a lower capability of fitting the label than fully connected networks (FCs) in the first, fitting phase, as can be seen by comparing I(T; Y) at the transition point (this may be attributed to the large number of parameters of FCs), but CNNs have a stronger capability of generalization (smaller \(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\)) in the compression phase, which leads to higher final validation accuracies.

  • Some models may not have a second, compression phase. For MNIST, all models have exactly two learning phases, but for CIFAR-10, the models with fewer layers do not show a second, compression phase (see CNN-2, CNN-4 and FC-3 for CIFAR-10 in Fig. 4). This reveals that when the dataset is harder to classify, neural networks with fewer layers cannot generalize well.

  • For CIFAR-10, I(X; T) and I(T; Y) of FC-6 and FC-9 both drop in the second phase, indicating that adding layers to FCs may lead to overfitting.

Fig. 4.

The figures depict mutual information paths with training epochs in the information plane. The left and right figures represent MNIST and CIFAR-10, respectively. Both datasets are trained by fully connected neural networks and convolutional neural networks. FC-i denotes a fully connected neural network which has i layers including the input and output layers. CNN-i denotes a convolutional neural network which has i convolutional layers.

Table 4. This table records the I(T; Y), I(X; T), training epochs and validation accuracy of every network at the transition point and the convergence point. For FC-3, CNN-2 and CNN-4 on CIFAR-10, the values at the transition point and the convergence point are the same, since these models do not show a compression phase.

This evaluation framework allows us to visualize any CNN or FC in the information plane. In the next subsection, we will show this framework is more informative than the loss curve when evaluating neural networks.

4.3 Informativeness and Guidance of Information Plane

Usually, for a particular problem, the network structure is determined by an exhaustive search over different DNNs on the validation set, which is time-consuming. Next, we show that our evaluation framework is more informative than the loss curve and facilitates the model selection of DNNs.

Specifically, by comparing the number of training epochs at the transition point and the convergence point in Table 4, we find that most of the training time is spent on the compression phase. So we can visualize the information plane while training the network and stop training once the model has crossed the transition point for several epochs. The height of the transition point (I(T; Y)) represents the model’s capability of fitting the label, and the slope (\(\frac{\varDelta I(T;Y)}{\varDelta I(X;T)}\)) after the transition point represents the model’s capability of generalization. These two indicators give us a general prediction of the model’s quality. Figure 5 shows the mutual information paths of different network structures on the CIFAR-10 dataset, and Table 5 records the model validation accuracies and the ‘percentages’ defined in Sect. 4.1.
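One possible stopping rule of this kind is sketched below (illustrative only; `patience` and `tol` are hypothetical knobs that would in practice be tuned on the validation set): declare the transition point crossed once I(X; T) has decreased for several consecutive measurements, i.e. once the compression phase has begun.

```python
def crossed_transition(i_xt_history, patience=5, tol=1e-3):
    """Return True once I(X; T) has decreased by more than `tol` for `patience`
    consecutive measurements, signalling the onset of the compression phase."""
    if len(i_xt_history) < patience + 1:
        return False
    recent = i_xt_history[-(patience + 1):]
    return all(recent[k + 1] < recent[k] - tol for k in range(patience))
```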

Fig. 5.

(a) Mutual information path of each model with SGD optimization on the training set of CIFAR-10. (b) Mutual information path of each model on the validation set. (c) Training loss of each model with training iterations.

Table 5. The percentages of each network are from Table 2.

From Fig. 5(c), we can see that the loss of each model continues to decrease with training iterations, while in the information plane each model behaves differently. In Fig. 5(a) and (b), the models with few layers do not have a clear second stage in their mutual information paths. In fact, we can visualize the information path of each model on the validation set to help us evaluate or select models efficiently. In Fig. 5(b), compared with CNN-9, the slope of the information path of CNN-16 in the second stage is smaller (more negative), which indicates a better generalization capability. The validation accuracy of each model in Table 5 is consistent with this analysis. Thus, the information plane is more ‘informative’ than the loss curve when evaluating DNN models. Since the first stage takes little time compared to the second stage, we can quickly choose a better model among different architectures by visualizing the information plane on the validation set.

It is worth noting that our prediction may not always be true, since the mutual information path may undergo a larger slope change later in training. So there is a trade-off between training time and the confidence of our prediction: the longer we train the network, the more confident a prediction we can make about the model. Still, it is an efficient way to guide the choice of neural network structure for a given task.

Figure 5(a), (b) and Table 5 also show that when CNNs have fewer layers, the information plane does not clearly show the second phase and the percentages are low, whereas for CNN-9 and CNN-16 the information plane clearly shows the second phase and the percentages are high. This experiment shows that I(X; T) contributes to training accuracy mostly in the second stage of the information paths. One possible reason is that at this stage the model begins to ‘compress’ the information of the training set and learns to generalize (extract common features from each mini-batch). The percentages show that this happens even when the I(T; Y) values remain the same. The correlation between accuracy and I(X; T) grows stronger as the number of layers of the DNN increases, since a DNN with more layers has a better generalization capability. We can view I(X; T) and I(T; Y) as follows: I(T; Y) determines how much knowledge T has about the label Y, and I(X; T) determines how easily this knowledge can be learned by the network.

4.4 Evaluating DNN’s Capability of Recognizing Objects from Different Classes

Furthermore, we also evaluate the model’s capability of recognizing objects from each class in the image classification task; the information plane provides an informative way to do this. Suppose there are C classes in the dataset and \(\mathcal C_{i}\) denotes the i-th class. To test the model’s capability of recognizing \(\mathcal C_{i}\), we relabel all other classes in the validation data as one class, so the label Y changes from \(\mathbf {R}^{C}\) to \(\mathbf {R}^{2}\). When calculating the mutual information, we balance the label Y so that H(Y) is equal to 1. Then I(X; T) and I(T; Y) can be calculated directly for a given neural network. Note that the structure of the neural network does not change and the output T is still \(\mathbf {R}^{C}\); we only alter how the data are evaluated. Repeating this process C times, the model’s capability of recognizing each class can be visualized in the information plane. This method is similar to one-vs-all classification [19]: it measures the model’s capability of distinguishing the true class from all the data.
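A sketch of the per-class relabeling is given below, assuming NumPy arrays of validation inputs and integer labels (the helper name `one_vs_all_subset` is ours); the returned subset is then passed through the trained network and the binned estimator of Sect. 3.3.

```python
import numpy as np

def one_vs_all_subset(x_val, y_val, target_class, seed=0):
    """Relabel `target_class` as 1 and every other class as 0, then subsample the
    negatives so the binary label is balanced and H(Y) = 1 bit. The network itself
    is unchanged; only the evaluation data changes."""
    rng = np.random.default_rng(seed)
    pos = np.where(y_val == target_class)[0]
    neg = rng.choice(np.where(y_val != target_class)[0], size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return x_val[idx], (y_val[idx] == target_class).astype(int)
```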

Fig. 6.

Models’ capabilities of recognizing objects from each class on the CIFAR-10 dataset. The models are well-trained AlexNet and VGG-16 networks. For each class, we show its I(X; T), I(T; Y) and validation accuracy. The validation accuracy of each class is the fraction of samples belonging to that class that are correctly predicted. Note that since I(T; Y) is bounded by H(Y), which is 1, the accuracy is also bounded by 1. To facilitate the visualization, we divide I(X; T) by its upper bound H(X) so that I(X; T), I(T; Y) and the validation accuracy have the same magnitude.

For better visualization, we select the first 3 classes of CIFAR-10 (airplane, automobile, bird). Figure 7 shows how the network learns to recognize objects from each class during training in the information plane. Figure 6 compares different networks’ recognizing capabilities for each class at the end of training.

As shown in Fig. 7, automobile has almost the same I(T; Y) as airplane at the transition point, but automobile has a smaller (more negative) slope after that point, so we conclude that the VGG-16 model has a higher classification accuracy on automobile than on airplane. For airplane and bird, the model has almost equal generalization capability, but its capability of fitting the label is better for airplane than for bird, so we conclude that the model classifies airplane better than bird. The final classification accuracies for these three classes are 0.921, 0.961 and 0.825, which is consistent with our analysis.

Fig. 7.

Mutual information paths of different classes on CIFAR-10 dataset during the training phase for VGG-16.

Figure 6 shows that VGG-16 has a stronger recognizing capability than AlexNet on every class. Within each model, we can still use I(X; T) and I(T; Y) to compare the classes. For example, in AlexNet, comparing I(X; T) and I(T; Y) for automobile and bird, we can conclude that the model has a stronger recognizing capability on automobile than on bird, since automobile has a higher I(T; Y) and a lower I(X; T).

Of course, the model accuracy can still be used to evaluate the model’s recognizing capability for each class, but I(X; T) and I(T; Y) provide more information about the model’s properties in an informative way. Moreover, in problems where the distribution of samples is unbalanced, we can use the information plane to test how many samples we need to train a neural network with a balanced classification capability for each class.

5 Discussion

In this paper, we apply mutual information to evaluate the capability of DNNs for image classification tasks. We explore the relationship among model accuracy, I(X; T) and I(T; Y) in DNNs through extensive experiments; the results show that I(X; T) also contributes to accuracy. We propose a general framework for evaluating DNNs in the information plane. This framework is more informative than the loss curve and can guide the choice of network structures. We also apply mutual information to assess the network’s recognizing capability for each class in image classification tasks.

The datasets we use in this paper are MNIST and CIFAR-10. The difficulty of validating IB on a large dataset like ImageNet is that ImageNet has 1000 classes: the sample space of T is huge, and we cannot calculate I(X; T) and I(T; Y) accurately by binning. Estimating mutual information accurately in high-dimensional spaces is still an open problem. Future work includes developing more efficient ways to calculate mutual information and further exploring the relationship between accuracy and I(X; T) to better understand neural networks.