ABSTRACT
Deep neural networks (DNNs) are widely used in various applications. The accurate and latency feedback is essential for model design and deployment. In this work, we attempt to alleviate the cost of model latency acquisition from two aspects: latency query and latency prediction. To ease the difficulty of acquiring model latency on multi-platform, our latency query system can automatically convert DNN model into the corresponding executable format, and measure latency on the target hardware. Powered by this, latency queries can be fulfilled with a simple interface calling. For the efficient utilization of previous latency knowledge, we employ a MySQL database to store numerous models and the corresponding latencies. In our system, the efficiency of latency query can be boosted by 1.8 ×. For latency prediction, we first represent neural networks with the unified GNN-based graph embedding. With the help of the evolving database, our model-based latency predictor achieves better performance, which realizes 12.31% accuracy improvement compared with existing methods. Our codes are open-sourced at https://github.com/ModelTC/NNLQP.
- Junjie Bai, Fang Lu, Ke Zhang, 2019. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx.Google Scholar
- Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once-for-All: Train One Network and Specialize it for Efficient Deployment. arxiv:1908.09791 [cs.LG]Google Scholar
- Han Cai, Ligeng Zhu, and Song Han. 2018. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332(2018).Google Scholar
- Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.Google Scholar
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv:1810.04805 [cs.CL]Google Scholar
- TensorRT Documentation. 2021. Optimizing for Tensor Cores. https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#optimize-tensor-cores.Google Scholar
- Łukasz Dudziak, Thomas Chau, Mohamed S Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas D Lane. 2020. Brp-nas: Prediction-based nas using gcns. arXiv preprint arXiv:2007.08668(2020).Google Scholar
- Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. 2019. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. In The IEEE International Conference on Computer Vision (ICCV).Google Scholar
- Miguel Grinberg. 2018. Flask web development: developing web applications with python. ” O’Reilly Media, Inc.”.Google Scholar
- Aric Hagberg, Pieter Swart, and Daniel S Chult. 2008. Exploring network structure, dynamics, and function using NetworkX. Technical Report. Los Alamos National Lab.(LANL), Los Alamos, NM (United States).Google Scholar
- William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.Google Scholar
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arxiv:1512.03385 [cs.CV]Google Scholar
- Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, 2019. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision. 1314–1324.Google ScholarCross Ref
- Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv preprint arXiv:1602.07360(2016).Google Scholar
- Samuel Kaufman, Phitchaya Mangpo Phothilimthana, and Mike Burrows. 2019. Learned TPU cost model for XLA tensor programs. In Proc. Workshop ML Syst. NeurIPS. 1–6.Google Scholar
- Samuel J Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, and Mike Burrows. 2020. A Learned Performance Model for Tensor Processing Units. arXiv preprint arXiv:2008.01040(2020).Google Scholar
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980(2014).Google Scholar
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).Google Scholar
- Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue Wang, and Yingyan Lin. 2021. Hw-nas-bench: Hardware-aware neural architecture search benchmark. arXiv preprint arXiv:2103.10584(2021).Google Scholar
- Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. 2021. BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. In International Conference on Learning Representations. https://openreview.net/forum?id=POWv6hDd9XHGoogle Scholar
- Yuhang Li, Mingzhu Shen, Jian Ma, Yan Ren, Mingxin Zhao, Qi Zhang, Ruihao Gong, Fengwei Yu, and Junjie Yan. 2021. MQBench: Towards Reproducible and Deployable Model Quantization Benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https://openreview.net/forum?id=TUplOmF8DsMGoogle Scholar
- Heng Liao, Jiajin Tu, Jing Xia, and Xiping Zhou. 2019. DaVinci: A Scalable Architecture for Neural Network Computing.. In Hot Chips Symposium. 1–44.Google ScholarCross Ref
- Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, 2020. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems 33 (2020), 11711–11722.Google Scholar
- Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2018. Focal Loss for Dense Object Detection. arxiv:1708.02002 [cs.CV]Google Scholar
- Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. 2018. Nvidia tensor core programmability, performance & precision. In 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW). IEEE, 522–531.Google Scholar
- Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021. The Design Process for Google’s Training Chips: TPUv2 and TPUv3. IEEE Micro 41, 2 (2021), 56–63. https://doi.org/10.1109/MM.2021.3058217Google ScholarCross Ref
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026–8037.Google Scholar
- Evgeny Ponomarev, Sergey Matveev, and Ivan Oseledets. 2020. LETI: Latency Estimation Tool and Investigation of Neural Networks inference on Mobile GPU. arXiv preprint arXiv:2010.02871(2020).Google Scholar
- Haotong Qin, Ruihao Gong, Xianglong Liu, Xiao Bai, Jingkuan Song, and Nicu Sebe. 2020. Binary neural networks: A survey. Pattern Recognition (2020), 107281. https://doi.org/10.1016/j.patcog.2020.107281Google Scholar
- Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10428–10436.Google ScholarCross Ref
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arxiv:1506.01497 [cs.CV]Google Scholar
- Jaehun Ryu and Hyojin Sung. 2021. MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks. arxiv:2102.04199 [cs.LG]Google Scholar
- Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510–4520.Google ScholarCross Ref
- Mingzhu Shen, Kai Han, Chunjing Xu, and Yunhe Wang. 2019. Searching for accurate binary neural architectures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 0–0.Google ScholarCross Ref
- Mingzhu Shen, Feng Liang, Ruihao Gong, Yuhang Li, Chuming Li, Chen Lin, Fengwei Yu, Junjie Yan, and Wanli Ouyang. 2021. Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search. arxiv:2010.04354 [cs.CV]Google Scholar
- Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556(2014).Google Scholar
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.Google ScholarCross Ref
- Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2820–2828.Google ScholarCross Ref
- Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning. PMLR, 6105–6114.Google Scholar
- Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. 2022. QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization. In International Conference on Learning Representations. https://openreview.net/forum?id=ySQH0oDyp7Google Scholar
- Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. 81–93.Google ScholarDigital Library
- Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E. Gonzalez, Ion Stoica, and Ameer Haj Ali. 2021. TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https://openreview.net/forum?id=aIfp8kLuvc9Google Scholar
- Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and Junjie Yan. 2020. Towards Unified INT8 Training for Convolutional Neural Network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Index Terms
- NNLQP: A Multi-Platform Neural Network Latency Query and Prediction System with An Evolving Database
Recommendations
A pattern-based prediction: An empirical approach to predict end-to-end network latency
Understanding latency in network-based applications has received considerable attention to provide consistent and acceptable levels of services. This paper presents an empirical approach, a pattern-based prediction method, to predict end-to-end network ...
Network latency prediction using high accuracy prediction tree
ICUIMC '13: Proceedings of the 7th International Conference on Ubiquitous Information Management and CommunicationNetwork latency is often used as an optimization parameter for network path construction over the Internet for various real-time applications. This paper proposes a high accuracy prediction tree method for latency estimation minimizing the need for ...
Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
AbstractDeep learning applications have been widely adopted on edge devices, to mitigate the privacy and latency issues of accessing cloud servers. Deciding the number of neurons during the design of a deep neural network to maximize ...
Highlights- Optimize DNN to fulfil the latency constraint and maintain high accuracy.
- A one-...
Comments