
Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices

Published: 06 June 2022

Abstract

Recent advances in deep learning have made it possible to implement artificial intelligence in mobile devices. Many studies have devoted considerable effort to developing lightweight deep learning models optimized for mobile devices. To overcome the performance limitations of manually designed deep learning models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, the effect of the mobile device's hardware architecture on the performance of NAS has been less explored. In this article, we show the importance of optimizing a hardware architecture, namely, the NPU dataflow, when searching for a more accurate yet fast deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, that generates the best possible NPU dataflow for a given deep learning operator. We then utilize this framework during latency-aware NAS to find the model with the highest accuracy that satisfies the latency constraint. As a result, we show that the model searched with FlowOptimizer improves performance by 87.1% and 92.3% on average compared to the models searched with NVDLA and Eyeriss, respectively, with better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy than MobileNetV2-1.0 with 3.6× lower latency.
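
To make the two-step procedure described above concrete, the following is a minimal Python sketch of how a per-operator dataflow optimizer can be plugged into latency-aware NAS. It is not the authors' implementation: the analytical latency model, the tiling/unrolling choices, the random-sampling search, and all names (conv_latency, flow_optimizer, latency_aware_nas, the accuracy proxy) are illustrative assumptions.

    import itertools
    import random

    def conv_latency(op, tile_m, tile_k, unroll, pe_count=64):
        # Toy analytical cycle count for one conv operator under one dataflow choice.
        macs = op["out_ch"] * op["in_ch"] * op["kernel"] ** 2 * op["out_hw"] ** 2
        utilization = min(1.0, (tile_m * tile_k * unroll) / pe_count)
        return macs / (pe_count * utilization)

    def flow_optimizer(op):
        # Exhaustively pick the tiling/unrolling (the "dataflow") that minimizes latency.
        best_lat, best_flow = float("inf"), None
        for tile_m, tile_k, unroll in itertools.product([2, 4, 8], [2, 4, 8], [1, 2, 4]):
            lat = conv_latency(op, tile_m, tile_k, unroll)
            if lat < best_lat:
                best_lat, best_flow = lat, (tile_m, tile_k, unroll)
        return best_lat, best_flow

    def model_latency(layers):
        # Whole-model latency: each operator runs under its own best dataflow.
        return sum(flow_optimizer(op)[0] for op in layers)

    def latency_aware_nas(search_space, latency_budget, estimate_accuracy, trials=200):
        # Sample candidate architectures; keep the most accurate one within the budget.
        best_acc, best_arch = None, None
        for _ in range(trials):
            arch = [random.choice(options) for options in search_space]
            if model_latency(arch) > latency_budget:
                continue  # violates the latency constraint
            acc = estimate_accuracy(arch)
            if best_acc is None or acc > best_acc:
                best_acc, best_arch = acc, arch
        return best_acc, best_arch

    # Example: four layers, each choosing among three channel widths; the accuracy
    # "estimate" is a stand-in (wider layers score higher), not a trained predictor.
    space = [[{"in_ch": 16, "out_ch": c, "kernel": 3, "out_hw": 32} for c in (16, 32, 64)]
             for _ in range(4)]
    print(latency_aware_nas(space, latency_budget=4e5,
                            estimate_accuracy=lambda a: sum(op["out_ch"] for op in a)))

In practice, the toy cycle-count model would be replaced by an accurate NPU dataflow cost model and the random sampler by a stronger search strategy; the structure being illustrated is that the dataflow optimization sits inside the latency estimate used by the NAS loop.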



        • Published in

          ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5 (September 2022), 274 pages
          ISSN: 1084-4309
          EISSN: 1557-7309
          DOI: 10.1145/3540253

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 June 2022
          • Online AM: 24 February 2022
          • Revised: 1 January 2022
          • Accepted: 1 January 2022
          • Received: 1 June 2021
          Published in TODAES Volume 27, Issue 5
