T1000: Mitigating the memory footprint of convolution neural networks with decomposition and re-fusion

https://doi.org/10.1016/j.future.2018.02.024

Highlights

  • We identify the memory problem when applying CP-decomposition to CNNs.

  • We propose a decomposition and re-fusion approach to mitigate the memory problem.

  • We demonstrate the effectiveness of our approach on two state-of-the-art CNNs.

  • Our experiments with AlexNet and VGG-19 show great memory reduction and speedup.

Abstract

In recent years, convolution neural networks (CNNs) have significantly advanced the frontier of computer vision and other intelligent applications due to their promising accuracy. However, the improved accuracy comes with formidable computational complexity as convolution layers grow deeper, which prevents the adoption of CNNs on resource-constrained systems such as embedded and mobile devices. Although research efforts have been devoted to reducing the computational complexity of convolution neural networks through tensor decomposition, the volume of intermediate data generated by the tensor decomposition grows dramatically, which consumes more memory resources and has not been addressed by existing work.

In this work, we propose T1000, which re-fuses the convolutions across tensors after applying canonical polyadic decomposition to conventional convolution layers, so that we retain the benefit of reduced computational complexity while mitigating the memory occupancy of the intermediate data. We demonstrate the effectiveness of our approach by applying canonical polyadic decomposition and re-fusion to the convolution layers of two well-known convolution neural networks, AlexNet and VGG-19, implemented with Caffe. Compared to the default canonical polyadic decomposition, our approach reduces the memory occupancy of the intermediate data by 84.6% and 77.4% for AlexNet and VGG-19 respectively. In addition, our approach improves the performance of AlexNet and VGG-19 by 1.77× and 1.4× respectively.

Introduction

In recent years, the advancements in image classification [[1], [2]] and object detection [[3], [4]] achieved by convolution neural networks (CNNs) have demonstrated that deep learning is an effective approach to developing intelligent computer vision applications, such as self-driving cars, personal assistants and intelligent robots. However, as the depth of neural networks grows, the computation demand of CNNs is becoming a major obstacle preventing their pervasive adoption. Even though training tasks can be done on high-end servers with powerful accelerators (e.g., GPUs and FPGAs), there is rising interest from both industry and academia in deploying inference tasks in resource-constrained settings such as embedded systems and mobile devices [[5], [6], [7]]. Due to the limited computation and memory capacity, it is critical to mitigate the resource consumption of CNNs for their successful adoption in such resource-constrained settings.

To address the above challenges from different perspectives, there has been a growing body of research, such as developing smaller networks with negligible precision loss [[8], [9], [10], [11]], advancing the mathematical computation methods [[12], [13]] and modifying existing networks to adapt to the hardware architecture [[5], [14], [15], [16]] (e.g., transforming the convolution layers to reduce computation complexity). Among these studies, convolution dimensionality reduction (e.g., tensor decomposition [17]) is considered an effective way to mitigate the computation complexity. However, existing work [[18], [5], [16]] fails to consider the intermediate data generated after the transformation, which consumes a significant amount of memory.

Among the convolution dimensionality reduction approaches [[18], [17], [5], [16]], canonical polyadic decomposition (CP-decomposition) [5] has been widely used by researchers to optimize the convolution operations. CP-decomposition involves two steps: kernel decomposition and convolution composition. Decomposition breaks the high-dimensional convolution kernel tensor into a series of low-rank ones, whereas composition replaces the original convolution layer with a composition of four convolution layers built from the decomposed kernel tensors. The fundamental idea is similar to service composition in the field of cloud computing [[19], [20]]. By approximating a multidimensional convolution with the sum of several low-rank tensors, CP-decomposition can effectively cut down the number of convolution operations with negligible precision loss (detailed discussion in Section 2.2). However, when applying CP-decomposition to convolution layers, a large amount of intermediate data is generated by the low-rank tensors, exacerbating the problem of memory consumption. To illustrate, we apply CP-decomposition to the most time-consuming convolution layers (e.g., conv2) of AlexNet [21] and VGG-19 [22], and measure the memory footprint of each convolution layer after CP-decomposition. Note that the precision loss of both networks is less than 1% after applying the CP-decomposition. The experimental details are given in Section 4.1.
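
As a sketch of the formulation cited above (our notation; the paper's own derivation appears in Section 2.2), a rank-$R$ CP-decomposition approximates a $d \times d$ convolution kernel $K \in \mathbb{R}^{T \times S \times d \times d}$, with $S$ input and $T$ output channels, as

$$K(t, s, i, j) \approx \sum_{r=1}^{R} K^{t}(t, r)\, K^{s}(s, r)\, K^{y}(i, r)\, K^{x}(j, r),$$

so that the original layer can be replaced by four cheaper convolutions applied in sequence: a $1 \times 1$ convolution mapping $S$ channels to $R$, a $d \times 1$ and a $1 \times d$ convolution applied independently within the $R$ channels, and a final $1 \times 1$ convolution mapping $R$ channels back to $T$. The per-pixel cost thus drops from roughly $d^{2} S T$ multiply-accumulate operations to roughly $R(S + 2d + T)$, which is the source of the speedup, while each of the three inner stages now materializes its own $R$-channel intermediate feature map.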

As shown in Fig. 1, the left two bars represent the results of AlexNet, while the right two bars represent the results of VGG-19. The memory footprint of AlexNet and VGG-19 is shown on the left y-axis and right y-axis respectively. The bar labeled Traditional-Conv shows the results of the original convolution layers, while the bar labeled CP-Conv shows the results after applying CP-decomposition. Compared to the original convolution layer, CP-decomposition generates a large amount of intermediate data while reducing the computational complexity. For instance, the size of the intermediate data for AlexNet and VGG-19 increases by more than 26× and 7× respectively. The reason for this dramatic increase is that CP-decomposition utilizes three cascaded tensors (much smaller than the original convolution kernels) to reduce the number of convolution operations. The intermediate data generated by one tensor is passed to the next tensor, which requires additional memory space to store. Moreover, we observe that the volume of the intermediate data is closely related to the batch size chosen by the network. Therefore, it is critical to mitigate the memory footprint of the intermediate data from CP-decomposition for its successful adoption in resource-constrained fields.
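
To make the batch-size dependence concrete, the following back-of-the-envelope calculation (with illustrative shapes and rank chosen by us, not the paper's measured configuration) compares the single output buffer of a traditional convolution layer with the three cascaded intermediates kept alive by the default CP composition:

    # Illustrative sketch; batch, channel, spatial and rank values are assumptions.
    batch, S, T, R = 128, 96, 256, 448     # images per batch, in/out channels, CP rank
    H, W = 27, 27                          # spatial size of the feature maps
    bytes_per_float = 4

    traditional = batch * T * H * W * bytes_per_float      # one output tensor
    cp_default = 3 * batch * R * H * W * bytes_per_float   # three cascaded intermediates

    print(traditional / 2**20, "MiB vs", cp_default / 2**20, "MiB")

Both terms scale linearly with the batch size, but the default CP path keeps three R-channel tensors resident at once, so its footprint grows several times faster; Fig. 1 quantifies this effect for the real networks.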

The idea of CNN layer re-fusion is explored by [14]. However, there are several challenges to be addressed in order to re-fuse the convolutions across tensors from CP-decomposition. (1) Since the original convolution is decomposed across several tensors, it remains unclear which convolutions should be re-fused and how the decision affects the memory occupancy as well as the accuracy. (2) Convolution re-fusion itself consumes memory to store temporary data, therefore it is important to optimize the memory usage during re-fusion. (3) The intermediate data generated by one tensor is the input for the subsequent tensor, so the efficiency of the convolution operations across tensors determines the performance of the layer after applying re-fusion.

To address the above challenges, we propose a decomposition and re-fusion approach, T1000. It leverages the advantage of CP-decomposition to reduce the computation complexity of convolution operations, and then uses re-fusion to mitigate the volume of intermediate data introduced by CP-decomposition. We evaluate the effectiveness of T1000 from several aspects by applying CP-decomposition and re-fusion to two state-of-the-art CNNs (AlexNet and VGG-19). The experimental results demonstrate that T1000 can effectively reduce the size of intermediate data while improving the performance of the original CNN models. Specifically, this paper makes the following contributions:

  • We identify and comprehensively analyze the memory problem caused by the large amount of intermediate data introduced by CP-decomposition when it is applied to CNNs.

  • To overcome the memory problem, we propose a decomposition and re-fusion approach (T1000) that mitigates the volume of intermediate data by combining separate convolution operations across tensors into an integrated computation process.

  • We demonstrate the effectiveness of T1000 on two state-of-the-art CNNs (AlexNet and VGG-19), showing that it significantly reduces the size of intermediate data and improves the performance of the convolution layers.

The rest of the paper is organized as follows. Section 2 describes the background of CNNs and CP-decomposition that motivates our study. Section 3 proposes our decomposition and re-fusion approach. Section 4 details the evaluation. Section 5 discusses the related work. Section 6 concludes our work.

Section snippets

Convolution neural networks

Neural networks have demonstrated success in many fields such as speech signal prediction [23] and medical data classification [24]. In particular, the Convolution Neural Network (CNN) [21] has received dramatic research attention recently with its extraordinary performance on ImageNet [25]. A CNN is a multiple-layer neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex. In general, CNNs are composed of a series of

Decomposition and re-fusion

In this section, we first describe the design overview of our approach to re-fuse the convolutions across tensors after CP-decomposition. Then we illustrate the implementation details of convolution re-fusion. In addition, we give a quantitative analysis of the reduction in memory occupancy as well as computation complexity achieved by our approach.
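
To make the re-fusion idea concrete, the following minimal NumPy sketch (not the paper's Caffe implementation; the factor names ks, ky, kx, kt and the single-image, stride-1, no-padding setting are our simplifying assumptions) evaluates the four convolutions of the CP composition one output row at a time, so only a handful of rows of each intermediate tensor are ever live instead of the full R-channel feature maps:

    import numpy as np

    def cp_conv_refused(inp, ks, ky, kx, kt):
        """Re-fused CP convolution sketch (single image, stride 1, no padding).

        inp: (S, H, W) input feature map
        ks:  (R, S)    1x1 projection from S input channels to R components
        ky:  (R, d)    vertical d x 1 filter per component
        kx:  (R, d)    horizontal 1 x d filter per component
        kt:  (T, R)    1x1 projection from R components to T output channels
        """
        S, H, W = inp.shape
        R, d = ky.shape
        T = kt.shape[0]
        Hout, Wout = H - d + 1, W - d + 1
        out = np.zeros((T, Hout, Wout))

        for y in range(Hout):                           # one output row at a time
            rows = inp[:, y:y + d, :]                   # the d input rows this row needs
            z1 = np.einsum('rs,sdw->rdw', ks, rows)     # step 1: 1x1 conv, S -> R
            z2 = np.einsum('rd,rdw->rw', ky, z1)        # step 2: vertical d x 1 conv
            z3 = np.zeros((R, Wout))
            for j in range(d):                          # step 3: horizontal 1 x d conv
                z3 += kx[:, j:j + 1] * z2[:, j:j + Wout]
            out[:, y, :] = kt @ z3                      # step 4: 1x1 conv, R -> T
        return out

Only the d input rows needed by the current output row and a single row of each intermediate are resident at any time, so the full intermediate tensors of the default CP composition are never materialized; processing a batch simply wraps this loop per image, and fusing several rows into a tile trades a slightly larger working set for better cache reuse, in the spirit of the locality optimization discussed later.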

Experimental setup

We evaluate the effectiveness of our approach with AlexNet and VGG-19 implemented on Caffe [26]. In our experiments, we use the original Caffe implementation as the baseline and compare it with the CP-decomposition and our approach. We choose the top-1 error metric to measure the accuracy of the CNN models. Our experiments are conducted on an SMP server with six Intel Xeon E5-2620 cores; each core can run at a maximum frequency of 2.4 GHz. The memory capacity is 48 GB. The operating system is

Related work

Recently, improving the performance of CNNs has attracted tremendous research attention. Most of the research on improving the performance of CNNs focuses on compressing the network and advancing the computation method.

For network compression, Jimmy Ba et al. [9] debate whether deeper neural networks deliver better performance. Their experiments show that compressed neural networks can maintain the same expressiveness as deeper ones. Adriana Romero et al. [10] propose an approach to

Conclusion and future work

In this paper, we propose a decomposition and re-fusion approach to mitigate the memory occupancy of the intermediate data generated by the default CP-decomposition. By re-fusing the convolutions across tensors, our approach reduces the size of the intermediate data by 84.6% and 77.4% for AlexNet and VGG-19 respectively. Meanwhile, we optimize the cache locality of the convolution operations after re-fusion, which in turn improves the performance of AlexNet and VGG-19 by 1.77× and 1.4× respectively.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. This work is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000304) and the National Natural Science Foundation of China (Grant No. 61502019).


References (41)

  • T. Baker et al., An energy-aware service composition algorithm for multiple cloud-based IoT applications, J. Netw. Comput. Appl. (2017)
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference...
  • C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on...
  • P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and...
  • J. Li et al., Attentive contexts for object detection, IEEE Trans. Multimedia (2016)
  • V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, V. Lempitsky, Speeding-up convolutional neural networks using...
  • J. Wu, C. Leng, Y. Wang, Q. Hu, J. Cheng, Quantized convolutional neural networks for mobile devices, in: Proceedings...
  • Z. Zheng, J. Weng, Mobile device based outdoor navigation with on-line learning neural network: A comparison with...
  • M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural...
  • J. Ba et al., Do deep nets really need to be deep?
  • A. Romero, N. Ballas, S.E. Kahou, A. Chassang, C. Gatta, Y. Bengio, Fitnets: hints for thin deep nets, 2014, arXiv...
  • F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x...
  • M. Mathieu, M. Henaff, Y. LeCun, Fast training of convolutional networks through ffts, 2013, arXiv preprint...
  • Y. Cheng, F.X. Yu, R.S. Feris, S. Kumar, A. Choudhary, S.-F. Chang, An exploration of parameter redundancy in deep...
  • M. Alwani et al., Fused-layer CNN accelerators
  • J. Xue, J. Li, Y. Gong, Restructuring of deep neural network acoustic models with singular value decomposition, in:...
  • E.L. Denton et al., Exploiting linear structure within convolutional networks for efficient evaluation
  • R. Rigamonti, A. Sironi, V. Lepetit, P. Fua, Learning separable filters, in: Proceedings of the IEEE Conference on...
  • M. Jaderberg, A. Vedaldi, A. Zisserman, Speeding up convolutional neural networks with low rank expansions, 2014, arXiv...
  • T. Baker et al., Understanding elasticity of cloud services compositions

    Changxi Liu is a Master student in School of Computer Science and Engineering, Beihang University. He is currently working on identifying performance opportunities for both scientific and AI applications. His research interests include HPC, performance optimization and deep learning.

    Hailong Yang is an assistant professor in School of Computer Science and Engineering, Beihang University. He received the Ph.D. degree in the School of Computer Science and Engineering, Beihang University in 2014. He has been involved in several scientific projects such as performance analysis for big data systems and performance optimization for large scale applications. His research interests include parallel and distributed computing, HPC, performance optimization and energy efficiency. He is also a member of IEEE and China Computer Federation (CCF).

    Rui Wang is an assistant professor of School of Computer Science and Engineering, Beihang University. He received his BS and MS degree in computer science from Xi’an Jiaotong University in 2000 and 2003, respectively; and his Ph.D. in computer science from Beihang University in 2009. His research interests include computer architecture and computer networks. He is a member of IEEE and China Computer Federation(CCF).

    Zhongzhi Luan is an Associate Professor of School of Computer Science and Engineering, and Assistant Director of the Sino-German Joint Software Institute (JSI) at Beihang University, China. In 2003, he completed Ph.D. in Department of Computer Science of Xi’an Jiaotong University. He has been involved into more than 15 scientific projects mostly as project leader or the backbone of the researchers. He is now in charge of the international data placement testbed project which is funded by international cooperation program of National Science Foundation of China. His research interests include distributed computing, parallel computing, grid computing, HPC and new generation of network technology.

    Depei Qian is a professor at the Department of Computer Science and Engineering, Beihang University, China. He received his master degree from University of North Texas in 1984. He is currently serving as the chief scientist of China National High Technology Program (863 Program) on high productivity computer and service environment. He is also a fellow of China Computer Federation (CCF). His research interests include innovative technologies in distributed computing, high performance computing and computer architecture.
