ABSTRACT
Executing deep-learning inference on cloud servers enables mobile devices with limited resources to use high-complexity models. However, pre-execution time (the time it takes to prepare and transfer data to the cloud) is variable and can take orders of magnitude longer to complete than inference execution itself. This pre-execution time can be reduced by dynamically deciding the order of two essential steps, preprocessing and data transfer, to better take advantage of on-device resources and network conditions. In this work, we present PieSlicer, a system that uses linear regression models to make dynamic preprocessing decisions and improve cloud inference performance. PieSlicer leverages these models to select the appropriate preprocessing location for each input. We show that for image classification applications PieSlicer reduces median and 99th percentile pre-execution time by up to 50.2 ms and 217.2 ms, respectively, when compared to static preprocessing methods.
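The core decision the abstract describes can be sketched as follows: predict pre-execution time for each preprocessing location with linear latency models, then pick the cheaper option. This is a minimal illustrative sketch, not PieSlicer's actual implementation; the model coefficients, function names, and sizes below are hypothetical stand-ins for values that would be fit by regression on measured timings.

```python
# Hypothetical sketch of PieSlicer-style preprocessing placement.
# All coefficients are illustrative; a real system would fit them
# with linear regression over measured device, network, and server timings.

def linear_model(coef, intercept):
    """Return a latency predictor of the form t(x) = coef * x + intercept (ms)."""
    return lambda x: coef * x + intercept

# Assumed latency models (slope in ms per KB of input, intercept in ms).
device_preprocess = linear_model(0.05, 2.0)   # on-device resize/encode
transfer = linear_model(0.8, 10.0)            # network transfer of x KB
cloud_preprocess = linear_model(0.01, 1.0)    # server-side preprocessing

def choose_location(raw_kb, preprocessed_kb):
    """Pick where to preprocess by comparing predicted pre-execution times."""
    # Option A: preprocess on device, then transfer the (usually smaller) result.
    t_device = device_preprocess(raw_kb) + transfer(preprocessed_kb)
    # Option B: transfer the raw input and preprocess on the server.
    t_cloud = transfer(raw_kb) + cloud_preprocess(raw_kb)
    return ("device", t_device) if t_device <= t_cloud else ("cloud", t_cloud)
```

Under these assumed coefficients, a large raw image that shrinks substantially during preprocessing favors on-device preprocessing, while an input that preprocessing does not shrink favors sending it raw; the crossover point shifts as network conditions change the transfer model.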
PieSlicer: Dynamically Improving Response Time for Cloud-based CNN Inference