Existing filter pruning methods mostly rely on specific data-driven paradigms and lack interpretability. Moreover, they usually assign layer-wise compression ratios either automatically, under a FLOPs budget only, via neural architecture search, or manually, both of which are inefficient. In this paper, we propose a novel interpretable, task-inspired adaptive filter pruning method for neural networks to address these problems. First, we treat filters as semantic detectors and develop task-inspired importance criteria by evaluating correlations between input tasks and feature maps and by observing the information flow through filters between adjacent layers. Second, we draw on the human neurobiological mechanism for better interpretability, where the retained first-layer filters act as individual information receivers. Third, inspired by the phenomenon that each filter has a deterministic impact on FLOPs and network parameters, we provide an efficient adaptive compression ratio allocation strategy based on a differentiable pruning approximation under multiple budget constraints, while also considering the performance objective. The proposed method is validated with extensive experiments on state-of-the-art neural networks, where it significantly outperforms existing filter pruning methods and achieves the best trade-off between network compression and task performance. With ResNet-50 on ImageNet, our approach reduces parameters by 75.49% and FLOPs by 70.90%, with only 2.31% performance degradation.
Data Availability Statement
The datasets generated during and/or analysed during the current study are available in CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar.html and ImageNet (ILSVRC2012) at https://www.image-net.org/challenges/LSVRC/index.php.
This work was supported by Natural Science Foundation of China (62271013, 62031013), Shenzhen Fundamental Research Program (GXWD20201231165807007-20200806163656003), and Shenzhen Science and Technology Plan Basic Research Project (JCYJ20230807120808017).
Appendix A: Proof of the Phenomenon Mentioned in Sect. 3.4
To prove that filters in the same layer have the same, deterministic influence on the total parameters and FLOPs of the network, we assume that the weight parameters in the t-th layer can be described as a four-dimensional (4D) matrix \(\textbf{W}_{t}\in \mathbb {R}^{n_{t} \times n_{t-1} \times k_{t} \times k_{t}}\), where \(k_t\) is the kernel size, and \(n_{t-1}\) and \(n_{t}\) are the numbers of input and output channels, respectively.
As for the total parameters, as shown in Fig. 3, if a filter in the t-th layer is pruned, we have:
\[ \Delta P_{total} = n_{t-1} \cdot k_{t} \cdot k_{t} + n_{t+1} \cdot k_{t+1} \cdot k_{t+1}, \]
where \(n_{t-1} \cdot k_{t} \cdot k_{t}\) is the number of parameters of the pruned filter in the t-th layer and \(n_{t+1} \cdot k_{t+1} \cdot k_{t+1}\) is the number of parameters removed because every filter in the \((t+1)\)-th layer loses one dimension. Since it depends only on fixed layer dimensions, \(\Delta P_{total}\) is a deterministic value.
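Then, according to (Molchanov et al., 2017), FLOPs in the t-th layer can be described as:
\[ F_{t} = 2 \cdot H_{t} \cdot W_{t} \cdot \left( n_{t-1} \cdot k_{t} \cdot k_{t} + 1 \right) \cdot n_{t}, \]
where \(H_{t}\) and \(W_{t}\) are the height and width of the output feature maps.
If a filter in the t-th layer is pruned, the number of input channels in the \((t+1)\)-th layer will be reduced by one; that is, every filter in the \((t+1)\)-th layer will also lose one dimension. Therefore, the changes in FLOPs of the t-th and the \((t+1)\)-th layers are as follows:
For the t-th layer:
\[ \Delta F_{t} = 2 \cdot H_{t} \cdot W_{t} \cdot \left( n_{t-1} \cdot k_{t} \cdot k_{t} + 1 \right). \]
For the \((t+1)\)-th layer:
\[ \Delta F_{t+1} = 2 \cdot H_{t+1} \cdot W_{t+1} \cdot k_{t+1} \cdot k_{t+1} \cdot n_{t+1}. \]
Hence, the total FLOPs will be changed by:
\[ \Delta F_{total} = \Delta F_{t} + \Delta F_{t+1} = 2 \cdot H_{t} \cdot W_{t} \cdot \left( n_{t-1} \cdot k_{t} \cdot k_{t} + 1 \right) + 2 \cdot H_{t+1} \cdot W_{t+1} \cdot k_{t+1} \cdot k_{t+1} \cdot n_{t+1}, \]
where \(\Delta F_{total}\) depends only on fixed layer dimensions and is therefore also deterministic. That is, the total FLOPs change by a deterministic value when a filter in the t-th layer is pruned.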
In summary, since \(\Delta P_{total}\) and \(\Delta F_{total}\) are deterministic, filters in the same layer have the same and deterministic influence on total parameters and FLOPs of the network.
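The closed-form savings in this appendix can be checked numerically. The sketch below is only an illustration under stated assumptions: a plain chain of convolutional layers (no residual connections, biases excluded from the parameter count) and the per-layer FLOPs count \(2HW(C_{in}K^2+1)C_{out}\) of Molchanov et al. (2017). All helper names and layer sizes are hypothetical. It compares a brute-force recount after removing one filter with the closed-form expressions.

```python
def conv_params(n_in, n_out, k):
    """Weight count of a conv layer with n_out filters of shape n_in x k x k (no bias)."""
    return n_out * n_in * k * k


def conv_flops(n_in, n_out, k, h_out, w_out):
    """FLOPs of a conv layer, assuming the count 2*H*W*(C_in*K^2 + 1)*C_out."""
    return 2 * h_out * w_out * (n_in * k * k + 1) * n_out


def totals(layers, n0):
    """Sum parameters and FLOPs over a chain of conv layers.

    layers: list of dicts with keys n (filters), k (kernel size), h, w (output size).
    n0: number of input channels of the first layer.
    """
    params = flops = 0
    n_in = n0
    for layer in layers:
        params += conv_params(n_in, layer["n"], layer["k"])
        flops += conv_flops(n_in, layer["n"], layer["k"], layer["h"], layer["w"])
        n_in = layer["n"]
    return params, flops


def prune_one_filter(layers, t):
    """Return a copy of layers with one filter removed from layer index t."""
    pruned = [dict(layer) for layer in layers]
    pruned[t]["n"] -= 1
    return pruned


def closed_form_delta(layers, n0, t):
    """Closed-form savings for pruning one filter in layer t (t must not be the last layer)."""
    n_prev = layers[t - 1]["n"] if t > 0 else n0
    cur, nxt = layers[t], layers[t + 1]
    d_params = n_prev * cur["k"] ** 2 + nxt["n"] * nxt["k"] ** 2
    d_flops = (2 * cur["h"] * cur["w"] * (n_prev * cur["k"] ** 2 + 1)
               + 2 * nxt["h"] * nxt["w"] * nxt["k"] ** 2 * nxt["n"])
    return d_params, d_flops


# Hypothetical three-layer network; all sizes are illustrative only.
layers = [
    {"n": 16, "k": 3, "h": 32, "w": 32},
    {"n": 32, "k": 3, "h": 16, "w": 16},
    {"n": 64, "k": 3, "h": 8, "w": 8},
]
base = totals(layers, n0=3)
after = totals(prune_one_filter(layers, 1), n0=3)
brute_force = (base[0] - after[0], base[1] - after[1])
closed_form = closed_form_delta(layers, 3, 1)
print(brute_force, closed_form)
```

Because the recount never asks which filter of the layer was removed, the saving is the same for every filter in that layer, which is the property the proof relies on.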
Appendix B: Proof of the Rationality of Eq. (13)
\({\varvec{\Phi }}_t^i \in (0,1)\) in Eq. (11) represents the probability that there are i remaining filters in the t-th layer.
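We know:
\[ \sum _{i=1}^{n_t} {\varvec{\Phi }}_t^i = 1. \]
Therefore, we have:
\[ 1 \le \sum _{i=1}^{n_t} i \cdot {\varvec{\Phi }}_t^i \le n_t. \]
Since \(\textbf{R}_t \in \mathbb {R}^{n_t}\) and at least one filter must be reserved in each layer (\(\Vert \textbf{R}_t\Vert _0\ge 1\)), we have:
\[ 1 \le \Vert \textbf{R}_t\Vert _0 \le n_t. \]
Therefore, the value ranges of \(\sum _{i=1}^{n_t} i \cdot {\varvec{\Phi }}_t^i\) and \(\Vert \textbf{R}_t\Vert _0\) are similar (\(\sum _{i=1}^{n_t} i \cdot {\varvec{\Phi }}_t^i\) is a relaxation of \(\Vert \textbf{R}_t\Vert _0\)). Thus, we make such a reasonable definition in Eq. (13) in the full paper as:
\[ \Vert \textbf{R}_t\Vert _0 \approx \sum _{i=1}^{n_t} i \cdot {\varvec{\Phi }}_t^i. \]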
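As a small numerical illustration of this range argument, the sketch below computes the expected number of remaining filters from a probability vector over the counts \(1,\dots ,n_t\) and checks that it lies between 1 and \(n_t\). How \({\varvec{\Phi }}_t\) is obtained from the trainable parameters is not specified in this appendix; the softmax over hypothetical logits `theta` used here is an assumption made only for the example.

```python
import math


def softmax(theta):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_remaining(phi):
    """Expected filter count sum_i i * phi[i-1], where phi[i-1] = P(i filters remain)."""
    return sum(i * p for i, p in enumerate(phi, start=1))


# Hypothetical logits for a layer with n_t = 5 filters (assumption for illustration).
theta = [0.2, -1.0, 0.5, 1.5, 0.0]
phi = softmax(theta)
soft_count = expected_remaining(phi)
print(sum(phi), soft_count)
```

Unlike \(\Vert \textbf{R}_t\Vert _0\), which only takes integer values, this expected count varies smoothly with the logits, which is what makes it usable inside a gradient-based budget term.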
Appendix C: Proof of the Differentiability of the Loss Function in Eq. (16)
Since the trainable parameters are \(\textbf{W}_t\) and \({\varvec{\Theta }}_t\) in each layer, we need to prove that each term of the loss function \(\mathcal {L}\) is differentiable with respect to the parameters it involves, \(\textbf{W}_t\) and \({\varvec{\Theta }}_t\). Firstly, \({\mathcal {\widetilde{L}}_{FLOPs}}\) and \(\mathcal {\widetilde{L}}_{Params}\) are obviously differentiable with respect to \({\varvec{\Theta }}_t\). Meanwhile, \(\mathcal {L}_{task}\) is naturally differentiable with respect to \(\textbf{W}_t\). However, the \(\mathcal {L}_{task}\) term must not only find the optimal weights \(\textbf{W}_t\), but also update \({\varvec{\Theta }}_t\). Therefore, the key problem is how to guarantee the differentiability of \(\mathcal {L}_{task}\) with respect to \({\varvec{\Theta }}_t\).
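If \(\mathcal {L}_{task}\) is the cross-entropy loss for a classification model with C classes, we have:
\[ \mathcal {L}_{task} = -\sum _{c=1}^{C} y_c \log \left( \hat{y_c}\right) , \]
where \(y_c\) is the ground truth and \(\hat{y_c}\) is the prediction for class c.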
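We design a mixture of all the possible pruning masks weighted by \({\varvec{\Phi }}_t\) in Eq. (18) as:
\[ {\varvec{\mathcal {R}}}_t = \sum _{i=1}^{n_t} {\varvec{\Phi }}_t^i \cdot determine\left( i,{\varvec{\Gamma }}_t\right) . \]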
Once \({\varvec{\Gamma }}_t\) is calculated, \(determine\left( i,{\varvec{\Gamma }}_t\right) \in \mathbb {R}^{n_t}\) is fixed. For example, if \(max\left( {\varvec{\Gamma }}_t\right) ={\varvec{\Gamma }}_t^2\), then \(determine\left( 1,{\varvec{\Gamma }}_t\right) =(0,1,0,\cdots ,0)\) (only the second element is 1, and the other elements are 0). Then \({\varvec{\mathcal {R}}}_t \in \mathbb {R}^{n_t}\) depends only on \({\varvec{\Theta }}_t\), since all the \(determine\left( i,{\varvec{\Gamma }}_t\right) , 1\le i\le n_t\) are fixed. We consider \({\varvec{\mathcal {R}}}_t\) as the approximation of \(\textbf{R}_t\).
The original feature maps \(\textbf{O}_t\) in the t-th layer (\(\textbf{O}_t^i\) is the feature map generated by the i-th filter) are re-weighted by \({\varvec{\mathcal {R}}}_t\) as shown in Eq. (19), where \({\varvec{\mathcal {R}}}_t\) executes a dot product with each feature map in \(\textbf{O}_t\), and \({\varvec{\mathcal {O}}}_t\) is the differentiable approximation of the pruned feature maps of the t-th layer.
We know \(\hat{y_c}\) is originally generated by the feature maps \(\textbf{O}_t\) layer by layer. We now let \(\hat{y_c}\) be generated by \({\varvec{\mathcal {O}}}_t\) instead, so that differentiability with respect to \({\varvec{\Theta }}_t\) is preserved layer by layer. In this way, the differentiability of \(\mathcal {L}_{task}\) with respect to \({\varvec{\Theta }}_t\) is guaranteed. In summary, this completes the proof that each term of our loss function is differentiable with respect to the parameters it involves.
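The argument of this appendix can be illustrated with a toy example. The sketch below is not the authors' implementation; it rests on several assumptions that are labeled in the code: \({\varvec{\Phi }}_t\) is taken as a softmax over hypothetical logits `theta`, `determine(i, gamma)` is read as the binary mask that keeps the i filters with the largest importance scores (consistent with the one-hot example given above for i = 1), each feature map is reduced to a single scalar, and a squared-error toy loss stands in for \(\mathcal {L}_{task}\). A central finite difference then shows that the loss responds smoothly to `theta` through the soft mask.

```python
import math


def softmax(theta):
    """Numerically stable softmax (assumed mapping from theta to phi)."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def determine(i, gamma):
    """Binary mask keeping the i filters with the largest importance scores (assumed reading)."""
    top = sorted(range(len(gamma)), key=lambda j: gamma[j], reverse=True)[:i]
    keep = set(top)
    return [1.0 if j in keep else 0.0 for j in range(len(gamma))]


def soft_mask(theta, gamma):
    """Mixture of all top-i masks, weighted by phi = softmax(theta)."""
    phi = softmax(theta)
    n = len(gamma)
    mask = [0.0] * n
    for i in range(1, n + 1):
        d = determine(i, gamma)
        for j in range(n):
            mask[j] += phi[i - 1] * d[j]
    return mask


def toy_loss(theta, gamma, features, target):
    """Squared error of a toy head that sums the re-weighted feature values."""
    mask = soft_mask(theta, gamma)
    pred = sum(m * f for m, f in zip(mask, features))
    return (pred - target) ** 2


def finite_diff_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f with respect to theta."""
    grad = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        down = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


gamma = [0.3, 0.9, 0.1, 0.5]     # hypothetical importance scores
theta = [0.0, 0.4, -0.2, 0.1]    # hypothetical trainable logits
features = [1.0, 2.0, 3.0, 4.0]  # one scalar per feature map, for brevity
mask = soft_mask(theta, gamma)
grad = finite_diff_grad(lambda th: toy_loss(th, gamma, features, 0.0), theta)
print(mask, grad)
```

In this sketch the most important filter always receives weight 1 because it appears in every top-i mask, while less important filters receive the probability mass of the counts that still include them. The hard masks are constants once `gamma` is fixed, so the only path from the loss to `theta` is through the mixture weights.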
