
Pattern Recognition

Volume 88, April 2019, Pages 272-284

LightweightNet: Toward fast and lightweight convolutional neural networks via architecture distillation

https://doi.org/10.1016/j.patcog.2018.10.029

Highlights

  • We present a new framework of deep convolutional neural network architecture distillation, namely LightweightNet, for acceleration and compression.

  • We exploit the prior knowledge of the pre-defined network architecture to guide the efficient design of acceleration/compression strategies, without using a pre-trained model.

  • The proposed framework consists of network parameter compression, network structure acceleration, and non-tensor layer improvement.

  • In experiments, the proposed framework demonstrates higher acceleration/compression rates than previous methods, including on a large-category handwritten Chinese character recognition task with state-of-the-art performance.

Abstract

In recent years, deep neural networks have achieved remarkable success in many pattern recognition tasks. However, their high computational cost and large memory overhead hinder their deployment on resource-limited devices. To address this problem, many deep network acceleration and compression methods have been proposed. One group of methods adopts decomposition and pruning techniques to accelerate and compress a pre-trained model. Another group designs a single compact unit and stacks it to build a dedicated network. These methods suffer from complicated training processes or from a lack of generality and extensibility. In this paper, we propose a general framework of architecture distillation, namely LightweightNet, to accelerate and compress convolutional neural networks. Rather than compressing a pre-trained model, we directly construct the lightweight network based on a baseline network architecture. The LightweightNet, designed based on a comprehensive analysis of the network architecture, consists of network parameter compression, network structure acceleration, and non-tensor layer improvement. Specifically, we propose a strategy of learning low-dimensional features in the fully-connected layers for substantial memory saving, and design multiple efficient compact blocks, applied with an accuracy-sensitive distillation rule, to distill the convolutional layers of the baseline network for notable time saving. Overall, the framework can reduce the computational cost and the model size by >4× with negligible accuracy loss. Benchmarks on the MNIST, CIFAR-10, ImageNet and HCCR (handwritten Chinese character recognition) datasets demonstrate the advantages of the proposed framework in terms of speed, performance, storage and training process. On HCCR, our method even outperforms traditional classifiers based on handcrafted features in terms of speed and storage while maintaining state-of-the-art recognition performance.

Introduction

Deep Convolutional Neural Networks (CNNs) have dramatically improved performance in a wide range of pattern recognition tasks by learning discriminative representations that replace traditional handcrafted features [1], [2], [3], [4]. To solve diverse recognition tasks, a variety of network architectures have been carefully designed. For instance, LeCun et al. [5], [6] proposed the famous LeNet-5 architecture, which contains multiple convolutional and fully-connected layers, for 10-class handwritten digit recognition (the MNIST dataset). Krizhevsky et al. [1] designed the well-known deep CNN AlexNet for the 1000-class image recognition task on the ImageNet dataset [7], where it far outperformed previous methods. Later, a series of deeper and more complicated CNNs such as ZF-Net [8], VGG-Net [9], GoogLeNet [10], Inception-v4 [11], ResNet [2], and DenseNet [12] were proposed to further improve accuracy. In particular, ResNet and DenseNet reach depths of more than 150 layers and achieve about 97% top-5 accuracy in the large-scale ImageNet recognition challenge.

A specific challenging recognition task that has benefited from deep neural networks is offline handwritten Chinese character recognition (HCCR) [13], [14], [15], [16]. It has the following properties: a large number of character categories (3755 classes or more), diverse writing styles [17], confusion between similar characters, etc. In recent years, CNNs have gradually come to dominate state-of-the-art performance on this task [18], [19]. However, typical network architectures for natural image classification (e.g., AlexNet [1], VGG-Net [9], ResNet [2]) are not the optimal choice for offline HCCR in terms of performance, storage and computational cost. Thus, a set of specialized HCCR deep networks have been proposed [19], [20], [21] to replace traditional methods [22], [23], [24], [25], [26]. Recently, Zhang et al. [27] reported an accuracy of 96.95% using a single network, while Xiao et al. [28] achieved accuracies of 97.30% and 97.59% using HCCR-CNN9Layer and HCCR-CNN12Layer, respectively.

Although CNN-based methods outperform traditional models by a large margin, they come with high computational cost and large memory overhead. For example, to recognize a 224×224 image, AlexNet [1] needs 725 million FLoating-point OPerations (FLOPs) with 232 MB of storage, VGG-16 [9] requires 15.3 billion FLOPs with 527 MB of storage, and even the more recent ResNet-18 [2] still involves 1.8 billion FLOPs with 44.6 MB of storage. For offline HCCR, HCCR-CNN12Layer [28] also requires 1.2 billion FLOPs with 48.7 MB of storage to classify a character image of size 96×96. It is therefore very difficult to deploy these complicated CNNs on resource-limited devices such as smartphones and embedded chips. To address this problem, many deep network acceleration and compression schemes have been designed, and they can be categorized into the following two groups.

The first group mainly adopts weight compression techniques (e.g., low-rank decomposition [29], [30], [31], [32], weight pruning [33], [34], [35], weight quantization [36], [37], [38]) to compress and accelerate pre-trained models such as AlexNet [1], ZF-Net [8] and VGG-Net [9]. For offline HCCR, Xiao et al. [28] integrate four separate steps to compress and accelerate pre-trained models. However, these methods have some drawbacks: (1) the training procedure is complicated and time-consuming because of multiple separate training stages; (2) performance usually degrades to some extent after direct compression, and subsequent fine-tuning can rarely recover the original accuracy due to the highly non-convex objective of deep CNNs.

The second group exploits a single compact unit (e.g., the Fire module [39], the Inception module [10], the Residual block [2]) to stack a corresponding network (e.g., SqueezeNet [39], GoogLeNet [10], ResNet [2]). This can achieve competitive accuracy under constrained storage. However, this framework also suffers from obvious disadvantages: (1) the computational cost of these networks is still considerable; (2) a single compact unit lacks generality and thus cannot be directly used to compress conventional networks (e.g., LeNet-5 [6], AlexNet [1], VGG-Net [9]) via layer-wise replacement. Many practical tasks already have excellent, specially designed baseline network architectures, so it is desirable to design a universal strategy to accelerate and compress these baselines.

To overcome the above problems, we propose a new general framework that simultaneously accelerates and compresses deep convolutional neural networks. For an arbitrary baseline network architecture, the corresponding lightweight network can be constructed via the architecture distillation framework (called LightweightNet) shown in Fig. 1. The framework consists of three major operations: network parameter compression, network structure acceleration, and non-tensor layer improvement. On one hand, the lightweight network effectively compresses the baseline network without depending on a pre-trained model. On the other hand, the new network preserves the baseline network layout, in which each component has been carefully designed for the particular task.

In conventional networks, the majority of the parameters reside in the fully-connected (fc) layers, while the computational cost is dominated by the convolutional (conv) layers. To remove the massive parameters of fc layers, we propose a novel strategy of learning low-dimensional features for substantial memory saving. To accelerate the time-consuming conv layers, we design multiple effective acceleration blocks for significant time saving, consisting of the CReLU block [40], the basic speedup module (BSM), BSM with CReLU, and the accuracy compact block. Based on the observation that deeper layers, which generate more discriminative features, are more sensitive to performance, we propose an accuracy-sensitive distillation rule for a better trade-off between speed and accuracy.
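As a rough illustration of the low-dimensional fc feature idea, the sketch below contrasts a conventional wide fc head with a low-dimensional one. It is a hedged sketch only: the layer sizes (1024-dimensional conv features, a 128-dimensional feature layer, 3755 classes) are hypothetical choices for illustration, and PyTorch is used for brevity rather than the paper's Caffe models; it is not the exact LightweightNet configuration of Section 4.

    # Illustrative sketch only: hypothetical sizes, PyTorch instead of the
    # paper's Caffe models; it shows why a low-dimensional feature layer
    # shrinks the fc parameter count, not the exact LightweightNet design.
    import torch.nn as nn

    def count_params(module):
        return sum(p.numel() for p in module.parameters())

    num_classes = 3755        # e.g., the HCCR category count
    conv_feat_dim = 1024      # flattened conv feature size (hypothetical)

    # Baseline head: a wide hidden fc layer feeding the classifier.
    baseline_head = nn.Sequential(
        nn.Linear(conv_feat_dim, 1024),
        nn.ReLU(inplace=True),
        nn.Linear(1024, num_classes),
    )

    # Lightweight head: learn a low-dimensional feature before the classifier.
    lowdim_head = nn.Sequential(
        nn.Linear(conv_feat_dim, 128),   # low-dimensional feature layer
        nn.ReLU(inplace=True),
        nn.Linear(128, num_classes),
    )

    print(count_params(baseline_head))   # ~4.9M parameters
    print(count_params(lowdim_head))     # ~0.6M parameters, roughly 8x fewer

With 3755 output classes, most of the saving comes from the classifier weight matrix, whose size scales with the feature dimension; this is why shrinking the fc feature dimension is especially effective for large-category tasks such as HCCR.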

To demonstrate the effectiveness of the proposed framework, we perform extensive experiments on the MNIST, CIFAR-10 and ImageNet datasets. Without loss of performance, we achieve 4× acceleration and compression by constructing lightweight networks for multiple baseline architectures such as LeNet-5, VGG-9Layer, and AlexNet. Furthermore, the more recent ResNet-18 architecture can also be accelerated and compressed by 2.65×.

For offline HCCR, the corresponding lightweight networks effectively accelerate the inference process and reduce the model sizes by >4× with negligible accuracy loss. Furthermore, we integrate pooling merge and separable convolutions [41] into the lightweight network to obtain an actual compression rate of 8× and a theoretical speedup of 9.7× with only a slight accuracy loss. To better guide practical applications, we also compare the actual running times of different methods. The lightweight HCCR-CNN9Layer with pooling merge achieves a 7× actual speedup and requires only 3.1 ms to classify a character image on a single-threaded CPU. After integrating separable convolutions, the lightweight network reaches an even faster speed of 2.8 ms per character image with only 5.2 MB of storage. To the best of our knowledge, this is the first report of CNN-based methods surpassing traditional approaches in speed.
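For readers who want to reproduce this kind of single-threaded latency measurement, the sketch below shows a minimal timing loop. It is only an illustration: it uses PyTorch and a tiny stand-in model, whereas the paper's measurements are based on Caffe and the actual HCCR networks.

    # Minimal single-threaded CPU timing harness (illustrative; the paper's
    # measurements use Caffe and the real HCCR networks, not this stand-in).
    import time
    import torch
    import torch.nn as nn

    torch.set_num_threads(1)             # single-threaded CPU measurement

    model = nn.Sequential(               # hypothetical small CNN stand-in
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 3755),
    ).eval()

    x = torch.randn(1, 1, 96, 96)        # one 96x96 character image
    with torch.no_grad():
        for _ in range(10):              # warm-up runs
            model(x)
        start = time.perf_counter()
        for _ in range(100):
            model(x)
        elapsed = time.perf_counter() - start
    print(f"{elapsed / 100 * 1000:.2f} ms per image")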

The remainder of this paper is organized as follows. Section 2 reviews related work on deep network acceleration and compression. Section 3 analyzes the computational cost and parameter storage of conventional network architectures. Section 4 describes the details of the architecture distillation framework. Section 5 introduces within-channel and across-channel separable convolutions. Section 6 presents experimental results with discussions. Concluding remarks are drawn in Section 7.

Section snippets

Related works

Most CNN architectures are heavily over-parameterized and computationally complex [42]. Thus, a key problem is how to remove these redundant parameters and computations without incurring accuracy loss. To address the problem, many approaches have been proposed recently and we briefly review them in the following.

Analysis of the conventional network architecture

To obtain better guidance on how to effectively remove redundant computations and parameters from conventional network architectures (regular conv layers + fc layers) such as LeNet-5, AlexNet (Fig. 2), VGG-Net, and HCCR-CNN9Layer, we extract some characteristics shared by these architectures. Thus, we investigate the number of parameters and the computational cost of each layer of the baseline network. Here, we take the HCCR-CNN9Layer shown on the left of Fig. 3 as an example. Each weight …
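As a back-of-the-envelope companion to this per-layer analysis, the small sketch below counts parameters and multiply-accumulate operations (the quantity usually quoted as FLOPs in this context) for a conv layer and an fc layer. The formulas and example sizes are standard estimates under common counting conventions, not the paper's exact tables.

    # Rough per-layer cost estimates: parameters and multiply-accumulate
    # operations (MACs, often quoted as FLOPs). Conventions vary, so treat
    # these as estimates, not the paper's reported figures.

    def conv_layer_cost(c_in, c_out, k, h_out, w_out):
        params = c_out * (c_in * k * k + 1)            # weights + biases
        macs = c_out * c_in * k * k * h_out * w_out    # one MAC per weight per output position
        return params, macs

    def fc_layer_cost(d_in, d_out):
        params = d_out * (d_in + 1)                    # weights + biases
        macs = d_out * d_in                            # one MAC per weight
        return params, macs

    # Hypothetical example: a 3x3 conv with 64->128 channels on a 48x48 map
    # versus a 1024->3755 fully-connected layer.
    print(conv_layer_cost(64, 128, 3, 48, 48))   # few parameters, ~170M MACs
    print(fc_layer_cost(1024, 3755))             # ~3.8M parameters, ~3.8M MACs

Even at these toy sizes, the pattern described in this section is visible: conv layers dominate the computation while fc layers dominate the parameter count.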

The architecture distillation framework

To alleviate the shortcomings and limitations of existing acceleration and compression approaches, we propose a new deep CNN architecture distillation framework, namely LightweightNet, which distills a given network architecture for network acceleration and compression.

The proposed framework (type C in Table 2) is a new compression/acceleration approach that effectively reduces parameters and computations from the perspective of the network architecture. This approach is clearly different …

Within-channel and across-channel separation

Although the constructed lightweight network attains satisfactory accuracy with a large acceleration, some practical applications may require even lower storage and fewer computations, even at the cost of some accuracy. Careful analysis shows that the accuracy compact blocks still account for a large share of the parameters and computations of the lightweight network. In the following, we show how to further compress these blocks while sacrificing only a little accuracy.

To achieve this goal, we integrate the …
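A minimal sketch of the standard within-channel (depthwise) plus across-channel (pointwise) factorization, in the spirit of [41], is given below. It is written in PyTorch for brevity, the channel sizes are hypothetical, and it is not a reproduction of the paper's accuracy compact block.

    # Minimal within-channel (depthwise) + across-channel (pointwise)
    # separable convolution, sketched in PyTorch; hypothetical sizes, not
    # the paper's exact block.
    import torch
    import torch.nn as nn

    class SeparableConv2d(nn.Module):
        def __init__(self, c_in, c_out, k=3, padding=1):
            super().__init__()
            # Within-channel: one k x k filter per input channel (groups=c_in).
            self.depthwise = nn.Conv2d(c_in, c_in, k, padding=padding, groups=c_in)
            # Across-channel: 1x1 convolution that mixes channels.
            self.pointwise = nn.Conv2d(c_in, c_out, 1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    x = torch.randn(1, 64, 48, 48)
    sep = SeparableConv2d(64, 128)
    full = nn.Conv2d(64, 128, 3, padding=1)
    print(sep(x).shape, full(x).shape)                # same output size
    print(sum(p.numel() for p in sep.parameters()))   # ~9k parameters
    print(sum(p.numel() for p in full.parameters()))  # ~74k parameters

In this toy case the parameter and MAC counts drop by roughly a factor of k×k (about 8× here), which is the kind of further compression this section trades a little accuracy for.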

Experiments and analysis

We conduct a series of experiments on different recognition tasks to evaluate the effectiveness of the constructed lightweight networks. The proposed architecture distillation framework successfully accelerates and compresses various network architectures on the MNIST, CIFAR-10, ImageNet and offline HCCR datasets. The training and testing processes of all experiments are performed with the high-efficiency Caffe [62] framework. In the following, we report the results on …

Conclusion

In this paper, we propose a new framework of architecture distillation, namely LightweightNet, to accelerate and compress state-of-the-art deep CNNs. Instead of the existing multi-stage training protocols, we directly train a fast and efficient lightweight network, which is constructed by distilling a given network architecture rather than compressing a pre-trained model. The LightweightNet is based on a comprehensive analysis of the network architecture and composed of network parameter …

Acknowledgments

This work has been supported by the National Natural Science Foundation of China (NSFC) grants 61721004 and 61633021. We thank Guangliang Cheng, Jie Yang and Guo-Sen Xie for helpful discussions.

References (81)

  • K. He et al., Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • E. Shelhamer et al., Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • Y. LeCun et al., Handwritten digit recognition with a back-propagation network, Advances in Neural Information Processing Systems (NIPS), 1990.
  • Y. LeCun et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE, 1998.
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vision, 2015.
  • M.D. Zeiler et al., Visualizing and understanding convolutional networks, European Conference on Computer Vision (ECCV), 2014.
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.
  • C. Szegedy et al., Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • C. Szegedy et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, AAAI Conference on Artificial Intelligence, 2017.
  • G. Huang et al., Densely connected convolutional networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • F. Kimura et al., Modified quadratic discriminant functions and the application to Chinese character recognition, IEEE Trans. Pattern Anal. Mach. Intell., 1987.
  • R.-W. Dai et al., Chinese character recognition: history, status and prospects, Front. Comput. Sci. China, 2007.
  • X.-Y. Zhang et al., Writer adaptation with style transfer mapping, IEEE Trans. Pattern Anal. Mach. Intell., 2013.
  • C.-L. Liu et al., ICDAR 2011 Chinese handwriting recognition competition, International Conference on Document Analysis and Recognition (ICDAR), 2011.
  • F. Yin et al., ICDAR 2013 Chinese handwriting recognition competition, International Conference on Document Analysis and Recognition (ICDAR), 2013.
  • D.C. Ciresan, J. Schmidhuber, Multi-column deep neural networks for offline handwritten Chinese character...
  • Z. Zhong et al., High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps, International Conference on Document Analysis and Recognition (ICDAR), 2015.
  • C.-L. Liu, Normalization-cooperated gradient feature extraction for handwritten character recognition, IEEE Trans. Pattern Anal. Mach. Intell., 2007.
  • E.L. Denton et al., Exploiting linear structure within convolutional networks for efficient evaluation, Advances in Neural Information Processing Systems (NIPS), 2014.
  • M. Jaderberg et al., Speeding up convolutional neural networks with low rank expansions, British Machine Vision Conference (BMVC), 2014.
  • X. Zhang et al., Efficient and accurate approximations of nonlinear convolutional networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • X. Zhang et al., Accelerating very deep convolutional networks for classification and detection, IEEE Trans. Pattern Anal. Mach. Intell., 2016.
  • S. Han et al., Learning both weights and connections for efficient neural network, Advances in Neural Information Processing Systems (NIPS), 2015.
  • S. Han et al., Deep compression: compressing deep neural network with pruning, trained quantization and Huffman coding, International Conference on Learning Representations (ICLR), 2016.
  • Y. Guo et al., Dynamic network surgery for efficient DNNs, Advances in Neural Information Processing Systems (NIPS), 2016.
  • Y. Gong, L. Liu, M. Yang, L.D. Bourdev, Compressing deep convolutional networks using vector quantization, in:...
  • W. Chen et al., Compressing neural networks with the hashing trick, International Conference on Machine Learning (ICML), 2015.
  • J. Wu et al., Quantized convolutional neural networks for mobile devices, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x...

    Ting-Bing Xu received the BS degree in automation from China University of Petroleum, Qingdao, China, in 2014. He is currently pursuing the PhD degree at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include deep learning, machine learning, pattern recognition, and handwriting recognition.

    Peipei Yang received the BS degree in automation from Zhejiang University, Hangzhou, China, in 2007, the MS degree in control science and engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2009, and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2013. He is an associate professor with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences. His current research interests include machine learning, pattern recognition, and computer vision.

    Xu-Yao Zhang received the BS degree in computational mathematics from Wuhan University, Wuhan, China, in 2008 and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor in the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He was a visiting researcher at CENPARMI of Concordia University, in 2012. From March 2015 to March 2016, he was a visiting scholar in the Montreal Institute for Learning Algorithms (MILA), University of Montreal, Canada. His research interests include machine learning, pattern recognition, handwriting recognition, and deep learning.

    Cheng-Lin Liu received the BS degree in electronic engineering from Wuhan University, Wuhan, China, the ME degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the PhD degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. From 2005, he has been a professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 200 technical papers at prestigious international journals and conferences. He is an Associate Editor-in-Chief of Pattern Recognition, an Associate Editor of Image and Vision Computing, International Journal on Document Analysis and Recognition, and Cognitive Computation. He is a fellow of the IAPR and the IEEE.
