LightweightNet: Toward fast and lightweight convolutional neural networks via architecture distillation
Introduction
Deep Convolutional Neural Networks (CNNs) have dramatically improved performance in a wide range of pattern recognition tasks by learning discriminative representations that replace traditional handcrafted features [1], [2], [3], [4]. To solve diverse recognition tasks, a variety of network architectures have been carefully designed. For instance, LeCun et al. [5], [6] proposed the famous LeNet-5 architecture, which contains multiple convolutional layers and fully-connected layers, for 10-class handwritten digit recognition (MNIST dataset). Krizhevsky et al. [1] designed the well-known deep CNN AlexNet for the 1000-class image recognition task on the ImageNet dataset [7], where its performance far surpassed previous methods. Later, a series of deeper and more complicated CNNs such as ZF-Net [8], VGG-Net [9], GoogLeNet [10], Inception-v4 [11], ResNet [2], and DenseNet [12] were proposed to further improve accuracy. Notably, ResNet and DenseNet have reached astounding depths of more than 150 layers and achieved about 97% top-5 accuracy in the large-scale ImageNet recognition challenge.
A specific challenging recognition task that has benefited from deep neural networks is offline handwritten Chinese character recognition (HCCR) [13], [14], [15], [16]. It has the following properties: a large number of character categories (3755 classes or more), diverse writing styles [17], confusion between similar characters, etc. CNNs have gradually come to dominate state-of-the-art performance on this task [18], [19]. However, typical network architectures for natural image classification (e.g., AlexNet [1], VGG-Net [9], ResNet [2]) are not the optimal choice for offline HCCR in terms of performance, storage, and computational cost. Thus, a set of specialized HCCR deep networks have been proposed [19], [20], [21] to replace traditional methods [22], [23], [24], [25], [26]. Recently, Zhang et al. [27] reported an accuracy of 96.95% using a single network, while Xiao et al. [28] achieved accuracies of 97.30% and 97.59% using HCCR-CNN9Layer and HCCR-CNN12Layer, respectively.
Although CNN-based methods outperform traditional models by a large margin, they come with high computational cost and large memory overhead. For example, to recognize a 224×224 image, AlexNet [1] needs 725 million FLoating-point OPerations (FLOPs) with 232 MB of storage, VGG-16 [9] requires 15.3 billion FLOPs with 527 MB, and even the more recent ResNet-18 [2] still involves 1.8 billion FLOPs with 44.6 MB. For offline HCCR, HCCR-CNN12Layer [28] requires 1.2 billion FLOPs with 48.7 MB of storage to classify a character image of size 96×96. Such complicated CNNs are therefore very difficult to deploy on resource-limited devices such as smartphones and embedded chips. To address this problem, many deep network acceleration and compression schemes have been designed; they can be categorized into the two groups described below.
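Cost figures of this kind follow from standard per-layer counting. As a minimal sketch (counting one multiply-add as two FLOPs, a common convention that the cited papers may not all share, and ignoring biases), the per-layer costs can be computed as:

```python
def conv_cost(h_out, w_out, c_in, c_out, k):
    """FLOPs and parameters of a k x k convolution producing an
    h_out x w_out x c_out map from c_in input channels."""
    flops = 2 * h_out * w_out * c_out * c_in * k * k  # 2x for multiply-add
    params = c_out * c_in * k * k
    return flops, params

def fc_cost(n_in, n_out):
    """FLOPs and parameters of a fully-connected layer."""
    return 2 * n_in * n_out, n_in * n_out

# A single 4096-to-4096 fc layer already holds roughly 16.8M parameters,
# while conv layers dominate FLOPs because their kernels are applied at
# every spatial position of the feature map.
print(fc_cost(4096, 4096))
print(conv_cost(56, 56, 64, 64, 3))
```

The asymmetry visible here (fc layers dominate storage, conv layers dominate computation) is the observation that motivates the two complementary framework groups discussed next.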
One framework mainly adopts weight compression techniques (e.g., low-rank decomposition [29], [30], [31], [32], weight pruning [33], [34], [35], weight quantization [36], [37], [38]) to compress and accelerate pre-trained AlexNet [1], ZF-Net [8], and VGG-Net [9] models. For offline HCCR, Xiao et al. [28] integrate four separate steps to compress and accelerate pre-trained models. However, these methods have some drawbacks: (1) the training procedure is complicated and time-consuming due to multiple separate training processes; (2) direct compression usually degrades performance to some extent, and subsequent fine-tuning can rarely recover the original accuracy due to the highly non-convex objective of deep CNNs.
The other framework exploits a single compact unit (e.g., the Fire module [39], Inception module [10], or residual block [2]) to stack the corresponding network (e.g., SqueezeNet [39], GoogLeNet [10], ResNet [2]). This can obtain competitive accuracy under constrained storage. However, this framework also suffers from obvious disadvantages: (1) the computational cost of these networks is still considerable; (2) a single compact unit lacks generality and thus cannot be directly used to compress conventional networks (e.g., LeNet-5 [6], AlexNet [1], VGG-Net [9]) via layer-wise replacement. Many practical tasks already have excellent, specially designed baseline network architectures, so it is desirable to design a universal strategy to accelerate and compress them. To overcome the above problems, we propose a new general framework that simultaneously accelerates and compresses deep convolutional neural networks. For an arbitrary baseline network architecture, the corresponding lightweight network can be constructed via an architecture distillation framework (called LightweightNet), as shown in Fig. 1. The framework consists of three major operations: network parameter compression, network structure acceleration, and non-tensor layer improvement. On one hand, the lightweight network effectively compresses a baseline network without depending on a pre-trained model. On the other hand, the new network preserves the baseline network layout, in which each component has been carefully designed for a particular task.
For conventional networks, the majority of parameters reside in the fully-connected (fc) layers, while the computational cost is dominated by the convolutional (conv) layers. To remove the massive parameters of fc layers, we propose a novel strategy of learning low-dimensional features for substantial memory savings. To accelerate the time-consuming conv layers, we design multiple effective acceleration blocks for significant time savings, consisting of the CReLU block [40], the basic speedup module (BSM), BSM with CReLU, and the accuracy compact block. Based on the observation that deeper layers, which generate stronger discriminative features, are more sensitive to performance, we propose an accuracy-sensitive distillation rule for a better trade-off between speed and accuracy.
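The CReLU block mentioned above builds on Concatenated ReLU [40], which keeps both the positive and negative phases of a conv response. Because it doubles the channel count, the preceding conv layer can use half as many filters for the same output width. A minimal NumPy sketch (the paper's exact block design around this activation may differ):

```python
import numpy as np

def crelu(x, axis=1):
    """Concatenated ReLU: concat(ReLU(x), ReLU(-x)) along the channel
    axis. Output has twice as many channels as the input, so the conv
    layer feeding it only needs half the filters."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

# A response of -2.0 is discarded by plain ReLU but preserved (as +2.0
# in the mirrored half) by CReLU.
x = np.array([[1.0, -2.0]])
print(crelu(x, axis=1))  # [[1. 0. 0. 2.]]
```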
To demonstrate the effectiveness of the proposed framework, we perform extensive experiments on the MNIST, CIFAR-10, and ImageNet datasets. Without losing performance, we achieve both acceleration and compression by constructing lightweight networks for multiple baseline architectures such as LeNet-5, VGG-9Layer, and AlexNet. Furthermore, the newly proposed ResNet-18 architecture can also be accelerated and compressed by 2.65×.
For offline HCCR, the corresponding lightweight networks effectively accelerate the inference process and reduce model size with negligible accuracy loss. Furthermore, we integrate pooling merge and separable convolutions [41] into the lightweight network to obtain an actual compression rate of 8× and a theoretical speedup of 9.7× with slight accuracy loss. To better guide practical applications, we compare the actual running time of different methods. The lightweight HCCR-CNN9Layer with pooling merge achieves a 7× actual speedup and requires only 3.1 ms to recognize a character image on a single-threaded CPU. After integrating the separable convolutions, the lightweight network achieves a faster speed of 2.8 ms per character image with only 5.2 MB of storage. To the best of our knowledge, this is the first report of CNN-based methods surpassing traditional approaches in speed.
The remainder of this paper is organized as follows. Section 2 reviews related work on deep network acceleration and compression. Section 3 analyzes the computational cost and parameter storage of conventional network architectures. Section 4 describes the details of the architecture distillation framework. Section 5 introduces within-channel and across-channel separable convolutions. Section 6 presents experimental results with discussions. Concluding remarks are drawn in Section 7.
Related works
Most CNN architectures are heavily over-parameterized and computationally complex [42]. Thus, a key problem is how to remove these redundant parameters and computations without incurring accuracy loss. Many approaches have been proposed recently to address this problem, and we briefly review them in the following.
Analysis of the conventional network architecture
To obtain better guidance on how to effectively remove redundant computations and parameters from conventional network architectures (regular conv layers + fc layers) such as LeNet-5, AlexNet (in Fig. 2), VGG-Net, and HCCR-CNN9Layer, we aim to extract characteristics shared across these architectures. Thus, we investigate the number of parameters and the computational cost of each layer of the baseline network. Here, we take the HCCR-CNN9Layer in Fig. 3 (left) as an example. Each weight
The architecture distillation framework
To alleviate the shortcomings and limitations of existing acceleration and compression approaches, we propose a new framework of deep CNN architecture distillation, namely LightweightNet, which distills a given network architecture for network acceleration and compression.
The proposed framework (type C in Table 2) is a new compression/acceleration manner that is able to effectively reduce parameters and computations from the perspective of network architecture. This manner is obviously different
Within-channel and across-channel separation
Although the constructed lightweight network attains satisfying accuracy with a large acceleration, some practical applications may require even lower storage and fewer computations, even at the cost of some accuracy. Through careful analysis, we find that accuracy compact blocks still occupy a large portion of the parameters and computations of the lightweight network. In the following, we show how to further compress this block while sacrificing only a little accuracy.
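As a hedged sketch of why such a separation helps, assuming the within-channel/across-channel split follows the usual depthwise (spatial, per-channel) plus pointwise (1×1, cross-channel) factorization [41] (the paper's exact formulation may differ, and the dimensions below are illustrative):

```python
def standard_conv_flops(h, w, c_in, c_out, k):
    """FLOPs of a regular k x k conv (multiply-add counted as 2 FLOPs)."""
    return 2 * h * w * c_in * c_out * k * k

def separable_conv_flops(h, w, c_in, c_out, k):
    """FLOPs of the factorized version: a per-channel (within-channel)
    k x k conv followed by a 1x1 (across-channel) conv."""
    depthwise = 2 * h * w * c_in * k * k   # spatial filtering per channel
    pointwise = 2 * h * w * c_in * c_out   # channel mixing via 1x1 conv
    return depthwise + pointwise

# Cost ratio is roughly 1/c_out + 1/k^2; for k=3 and c_out=128 the
# separable form needs about 8x fewer FLOPs.
std = standard_conv_flops(96, 96, 96, 128, 3)
sep = separable_conv_flops(96, 96, 96, 128, 3)
print(std / sep)
```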
To achieve this goal, we integrate the
Experiments and analysis
We conduct a series of experiments for different recognition tasks to evaluate the effectiveness of the constructed lightweight networks. It is demonstrated that the proposed architecture distillation framework can successfully accelerate and compress various network architectures on MNIST, CIFAR-10, ImageNet and offline HCCR datasets. The training and test processes of all experiments are performed with the high-efficiency Caffe [62] framework. In the following, we report the results on
Conclusion
In this paper, we propose a new framework of architecture distillation, namely LightweightNet, to accelerate and compress state-of-the-art deep CNNs. Instead of the existing multi-stage training protocols, we directly train the fast and efficient lightweight network, which is constructed by distilling a given network architecture rather than compressing a pre-trained model. The LightweightNet is based on a comprehensive analysis of the network architecture and composed of network parameter compression, network structure acceleration, and non-tensor layer improvement.
Acknowledgments
This work has been supported by the National Natural Science Foundation of China (NSFC) grants 61721004 and 61633021. We thank Guangliang Cheng, Jie Yang and Guo-Sen Xie for helpful discussions.
Ting-Bing Xu received the BS degree in automation from China University of Petroleum, Qingdao, China, in 2014. He is currently pursuing the PhD degree at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning, machine learning, pattern recognition, and handwriting recognition.
References (81)
- Forty years of research in character and document recognition: an industrial perspective. Pattern Recognit. (2008)
- Online and offline handwritten Chinese character recognition: benchmarking on new databases. Pattern Recognit. (2013)
- Handwritten digit recognition: investigation of normalization and feature extraction techniques. Pattern Recognit. (2004)
- Pseudo two-dimensional shape normalization methods for handwritten Chinese character recognition. Pattern Recognit. (2005)
- Regularized margin-based conditional log-likelihood loss for prototype learning. Pattern Recognit. (2010)
- Building compact MQDF classifier for large character set recognition by subspace distribution sharing. Pattern Recognit. (2008)
- Online and offline handwritten Chinese character recognition: a comprehensive study and new benchmark. Pattern Recognit. (2017)
- Building fast and compact convolutional neural networks for offline handwritten Chinese character recognition. Pattern Recognit. (2017)
- Pruning filters for efficient ConvNets. International Conference on Learning Representations (ICLR) (2017)
- ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS) (2012)
- Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.
- Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell.
- Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems (NIPS)
- Gradient-based learning applied to document recognition. Proceedings of the IEEE
- ImageNet large scale visual recognition challenge. Int. J. Comput. Vision
- Visualizing and understanding convolutional networks. European Conference on Computer Vision (ECCV)
- Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR)
- Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Inception-v4, Inception-ResNet and the impact of residual connections on learning. AAAI Conference on Artificial Intelligence
- Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Trans. Pattern Anal. Mach. Intell.
- Chinese character recognition: history, status and prospects. Front. Comput. Sci. China
- Writer adaptation with style transfer mapping. IEEE Trans. Pattern Anal. Mach. Intell.
- ICDAR 2011 Chinese handwriting recognition competition. International Conference on Document Analysis and Recognition (ICDAR)
- ICDAR 2013 Chinese handwriting recognition competition. International Conference on Document Analysis and Recognition (ICDAR)
- High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. International Conference on Document Analysis and Recognition (ICDAR)
- Normalization-cooperated gradient feature extraction for handwritten character recognition. IEEE Trans. Pattern Anal. Mach. Intell.
- Exploiting linear structure within convolutional networks for efficient evaluation. Advances in Neural Information Processing Systems (NIPS)
- Speeding up convolutional neural networks with low rank expansions. British Machine Vision Conference (BMVC)
- Efficient and accurate approximations of nonlinear convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell.
- Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems (NIPS)
- Deep compression: compressing deep neural network with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR)
- Dynamic network surgery for efficient DNNs. Advances in Neural Information Processing Systems (NIPS)
- Compressing neural networks with the hashing trick. International Conference on Machine Learning (ICML)
- Quantized convolutional neural networks for mobile devices. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Peipei Yang received the BS degree in automation from Zhejiang University, Hangzhou, China, in 2007, the MS degree in control science and engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2009, and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2013. He is an associate professor with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences. His current research interests include machine learning, pattern recognition, and computer vision.
Xu-Yao Zhang received the BS degree in computational mathematics from Wuhan University, Wuhan, China, in 2008 and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor in the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He was a visiting researcher at CENPARMI of Concordia University, in 2012. From March 2015 to March 2016, he was a visiting scholar in the Montreal Institute for Learning Algorithms (MILA), University of Montreal, Canada. His research interests include machine learning, pattern recognition, handwriting recognition, and deep learning.
Cheng-Lin Liu received the BS degree in electronic engineering from Wuhan University, Wuhan, China, the ME degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the PhD degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. From 2005, he has been a professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 200 technical papers at prestigious international journals and conferences. He is an Associate Editor-in-Chief of Pattern Recognition, an Associate Editor of Image and Vision Computing, International Journal on Document Analysis and Recognition, and Cognitive Computation. He is a fellow of the IAPR and the IEEE.