
Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices

Published: 06 June 2022

Abstract

Recent advances in deep learning have made it possible to implement artificial intelligence on mobile devices. Many studies have focused on developing lightweight deep learning models optimized for mobile devices. To overcome the performance limitations of manually designed models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, the effect of the mobile device's hardware architecture on the performance of NAS has received far less attention. In this article, we show the importance of optimizing the hardware architecture, specifically the NPU dataflow, when searching for a more accurate yet faster deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, that generates the best possible NPU dataflow for a given deep learning operator. We then use this framework during latency-aware NAS to find the model with the highest accuracy that satisfies a latency constraint. As a result, the model searched with FlowOptimizer improves performance by 87.1% and 92.3% on average over the models searched with the NVDLA and Eyeriss dataflows, respectively, while achieving better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy than MobileNetV2-1.0 with 3.6× lower latency.
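The abstract describes a two-stage flow: FlowOptimizer derives an optimized NPU dataflow, and hence a latency estimate, for each operator, and a latency-aware NAS then selects the most accurate candidate that stays within the latency budget. The sketch below only illustrates that control flow; all names (Operator, Candidate, optimized_latency, search) and the toy cost model are hypothetical and do not reflect the authors' implementation.

```python
# A minimal, self-contained sketch (not the authors' code) of how a per-operator
# dataflow-optimizing latency estimator can plug into a latency-constrained NAS loop.
# All names and the toy cost model are hypothetical; the paper's FlowOptimizer
# searches real NPU dataflows instead.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Operator:
    kind: str       # e.g., "conv3x3" or "dwconv5x5"
    channels: int

@dataclass
class Candidate:
    ops: List[Operator]
    accuracy: float = 0.0   # would come from (proxy) training in a real NAS

def optimized_latency(op: Operator) -> float:
    """Stand-in for the dataflow optimizer: latency (ms) of the best dataflow
    found for this operator on the target NPU (here just a toy cost model)."""
    base = {"conv3x3": 0.30, "dwconv5x5": 0.12}[op.kind]
    return base * op.channels / 32

def model_latency(cand: Candidate) -> float:
    # End-to-end latency is approximated as the sum of per-operator latencies.
    return sum(optimized_latency(op) for op in cand.ops)

def search(pool: List[Candidate], latency_budget_ms: float) -> Candidate:
    """Latency-aware selection: keep candidates under the budget, then pick the
    most accurate one, mirroring the constraint described in the abstract."""
    feasible = [c for c in pool if model_latency(c) <= latency_budget_ms]
    if not feasible:                         # nothing meets the budget;
        return min(pool, key=model_latency)  # fall back to the fastest candidate
    return max(feasible, key=lambda c: c.accuracy)

if __name__ == "__main__":
    random.seed(0)
    pool = [Candidate(ops=[Operator(random.choice(["conv3x3", "dwconv5x5"]),
                                    random.choice([32, 64, 96])) for _ in range(8)],
                      accuracy=random.uniform(0.6, 0.8))
            for _ in range(20)]
    best = search(pool, latency_budget_ms=3.0)
    print(f"best accuracy {best.accuracy:.3f} at {model_latency(best):.2f} ms")
```

In the paper, the accuracy would come from the NAS training procedure and the latency from FlowOptimizer's dataflow search for the target NPU; both are stubbed here purely to show where the dataflow-aware latency estimate enters the search loop.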


Cited By

  • (2024) Estimating Power, Performance, and Area for On-Sensor Deployment of AR/VR Workloads Using an Analytical Framework. ACM Transactions on Design Automation of Electronic Systems. DOI: 10.1145/3670404. Online publication date: 7 June 2024.
  • (2023) Hardware-aware NAS by Genetic Optimisation with a Design Space Exploration Simulator. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2275–2283. DOI: 10.1109/CVPRW59228.2023.00222. Online publication date: June 2023.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5
September 2022
274 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/3540253

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2022
Online AM: 24 February 2022
Accepted: 01 January 2022
Revised: 01 January 2022
Received: 01 June 2021
Published in TODAES Volume 27, Issue 5

Author Tags

  1. Dataflow optimization
  2. neural networks
  3. neural architecture search
  4. neural processing unit

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Samsung Research Funding Incubation Center of Samsung Electronics

Article Metrics

  • Downloads (last 12 months): 401
  • Downloads (last 6 weeks): 40
Reflects downloads up to 20 Jan 2025
