
Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices

Published: 06 June 2022

Abstract

Recent advances in deep learning have made it possible to implement artificial intelligence on mobile devices. Many studies have focused on developing lightweight deep learning models optimized for mobile devices. To overcome the performance limitations of manually designed models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, the effect of the mobile device's hardware architecture on the performance of NAS has received far less attention. In this article, we show the importance of optimizing the hardware architecture, specifically the NPU dataflow, when searching for a more accurate yet faster deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, that generates the best possible NPU dataflow for a given deep learning operator. We then use this framework during latency-aware NAS to find the model with the highest accuracy that satisfies a latency constraint. As a result, the model searched with FlowOptimizer improves performance by 87.1% and 92.3% on average over the models searched with the NVDLA and Eyeriss dataflows, respectively, while achieving better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy than MobileNetV2-1.0 with 3.6× lower latency.
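The abstract describes a two-stage flow: FlowOptimizer derives an optimized NPU dataflow, and hence a latency estimate, for each operator, and a latency-aware NAS then selects the most accurate candidate that stays within the latency budget. The sketch below only illustrates that control flow; all names (Operator, Candidate, optimized_latency, search) and the toy cost model are hypothetical and do not reflect the authors' implementation.

```python
# A minimal, self-contained sketch (not the authors' code) of how a per-operator
# dataflow-optimizing latency estimator can plug into a latency-constrained NAS loop.
# All names and the toy cost model are hypothetical; the paper's FlowOptimizer
# searches real NPU dataflows instead.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Operator:
    kind: str       # e.g., "conv3x3" or "dwconv5x5"
    channels: int

@dataclass
class Candidate:
    ops: List[Operator]
    accuracy: float = 0.0   # would come from (proxy) training in a real NAS

def optimized_latency(op: Operator) -> float:
    """Stand-in for the dataflow optimizer: latency (ms) of the best dataflow
    found for this operator on the target NPU (here just a toy cost model)."""
    base = {"conv3x3": 0.30, "dwconv5x5": 0.12}[op.kind]
    return base * op.channels / 32

def model_latency(cand: Candidate) -> float:
    # End-to-end latency is approximated as the sum of per-operator latencies.
    return sum(optimized_latency(op) for op in cand.ops)

def search(pool: List[Candidate], latency_budget_ms: float) -> Candidate:
    """Latency-aware selection: keep candidates under the budget, then pick the
    most accurate one, mirroring the constraint described in the abstract."""
    feasible = [c for c in pool if model_latency(c) <= latency_budget_ms]
    if not feasible:                         # nothing meets the budget;
        return min(pool, key=model_latency)  # fall back to the fastest candidate
    return max(feasible, key=lambda c: c.accuracy)

if __name__ == "__main__":
    random.seed(0)
    pool = [Candidate(ops=[Operator(random.choice(["conv3x3", "dwconv5x5"]),
                                    random.choice([32, 64, 96])) for _ in range(8)],
                      accuracy=random.uniform(0.6, 0.8))
            for _ in range(20)]
    best = search(pool, latency_budget_ms=3.0)
    print(f"best accuracy {best.accuracy:.3f} at {model_latency(best):.2f} ms")
```

In the paper, the accuracy would come from the NAS training procedure and the latency from FlowOptimizer's dataflow search for the target NPU; both are stubbed here purely to show where the dataflow-aware latency estimate enters the search loop.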


Cited By

  • (2024) Estimating Power, Performance, and Area for On-Sensor Deployment of AR/VR Workloads Using an Analytical Framework. ACM Transactions on Design Automation of Electronic Systems. DOI: 10.1145/3670404. Online publication date: 7 June 2024.
  • (2023) Hardware-aware NAS by Genetic Optimisation with a Design Space Exploration Simulator. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2275–2283. DOI: 10.1109/CVPRW59228.2023.00222. Online publication date: June 2023.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5
September 2022
274 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/3540253

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2022
Online AM: 24 February 2022
Accepted: 01 January 2022
Revised: 01 January 2022
Received: 01 June 2021
Published in TODAES Volume 27, Issue 5

Author Tags

  1. Dataflow optimization
  2. neural networks
  3. neural architecture search
  4. neural processing unit

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Samsung Research Funding Incubation Center of Samsung Electronics

Article Metrics

  • Downloads (last 12 months): 401
  • Downloads (last 6 weeks): 40
Reflects downloads up to 20 Jan 2025
