Abstract
Recent advances in deep learning have made it possible to deploy artificial intelligence on mobile devices, and considerable effort has gone into developing lightweight deep learning models optimized for them. To overcome the performance limitations of manually designed models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, the effect of the mobile device's hardware architecture on NAS performance remains less explored. In this article, we show the importance of optimizing a hardware architecture, namely the NPU dataflow, when searching for a more accurate yet fast deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, that generates the best possible NPU dataflow for a given deep learning operator. We then use this framework during latency-aware NAS to find the model with the highest accuracy that satisfies the latency constraint. As a result, the model searched with FlowOptimizer improves performance by 87.1% and 92.3% on average over the models searched with the NVDLA and Eyeriss dataflows, respectively, while achieving better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy than MobileNetV2-1.0 with 3.6× lower latency.
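To make the search procedure concrete, the sketch below illustrates, in Python, how a per-operator dataflow optimizer can be embedded in a latency-aware NAS loop. It is a minimal illustration under stated assumptions, not the authors' implementation: the cost model in dataflow_latency, the tiling candidates, the random-sampling search, and the accuracy_proxy stand-in are all hypothetical simplifications of FlowOptimizer and the actual NAS algorithm.

import itertools
import random

def dataflow_latency(op, tile_h, tile_c):
    # Toy cost model: estimated cycles for one operator (h, w, c, k) under a
    # (tile_h, tile_c) dataflow. A real framework would use an analytical NPU
    # model (e.g., in the spirit of MAESTRO or Timeloop).
    h, w, c, k = op
    tiles = -(-h // tile_h) * -(-c // tile_c)   # ceil-divided tile count
    per_tile = tile_h * w * tile_c * k * k      # MACs per tile, fully serialized
    return tiles * per_tile + 100 * tiles       # plus a fixed refill cost per tile

def best_dataflow_latency(op):
    # FlowOptimizer-style step: evaluate candidate dataflows, keep the fastest.
    candidates = itertools.product([2, 4, 8, 16], repeat=2)
    return min(dataflow_latency(op, th, tc) for th, tc in candidates)

def model_latency(ops):
    # Latency of a network whose layers run back to back, each under its
    # individually optimized dataflow.
    return sum(best_dataflow_latency(op) for op in ops)

def latency_aware_nas(search_space, accuracy_proxy, latency_budget, trials=200):
    # Keep the highest-(proxy-)accuracy sampled architecture that fits the budget.
    best_arch, best_acc = None, -1.0
    for _ in range(trials):
        arch = [random.choice(search_space) for _ in range(4)]  # 4-layer toy net
        if model_latency(arch) > latency_budget:
            continue                     # violates the latency constraint
        acc = accuracy_proxy(arch)
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc

# Usage with made-up operators (h, w, c, k) and a stand-in accuracy score:
space = [(32, 32, 16, 3), (32, 32, 32, 3), (16, 16, 64, 5)]
proxy = lambda arch: float(sum(c * k for _, _, c, k in arch))
arch, acc = latency_aware_nas(space, proxy, latency_budget=5_000_000)

The point the sketch mirrors is that each candidate's latency is estimated under the best dataflow found for its operators, rather than under one fixed accelerator configuration, so the constraint check rewards models that map well onto the NPU.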