ABSTRACT
Deep learning is a subfield of machine learning based on artificial neural networks. Thanks to increased data availability and computational power, such as that provided by Graphics Processing Units (GPUs), training deep networks, a time-consuming process, became possible. Cloud computing is an attractive way to acquire the computational power needed to train these models, since it provides elastic products with a pay-per-use model. Amazon Web Services (AWS), for instance, has GPU-based virtual machine instances in its catalog, which differ in GPU type, number of GPUs, and price per hour. The challenge is to determine which instance is best for a specific deep learning problem. This paper presents the implications, in terms of runtime and cost, of running two different deep learning problems on AWS GPU-based instances, and it proposes a methodology, based on these case studies, that analyzes instances for deep learning algorithms using the information provided by the Keras framework. Our experimental results indicate that, despite having a higher price per hour, the instances that contain NVIDIA V100 GPUs (p3) are faster and usually less expensive to use than the instances that contain NVIDIA K80 GPUs (p2) for the problems we analyzed. The results also indicate that the performance of both applications did not scale well with the number of GPUs and that increasing the batch size to improve scalability may affect the final model accuracy. Finally, the proposed methodology provides accurate cost and runtime estimates for the tested applications on different AWS instances at a small cost.
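The cost argument in the abstract comes down to simple arithmetic: total cost is runtime multiplied by price per hour, so a faster instance can be cheaper overall even at a higher hourly rate. The sketch below illustrates this under stated assumptions and is not the paper's methodology. The instance labels, prices, and per-epoch times are hypothetical placeholders (not AWS prices or measured results), and it assumes runtime scales linearly with the number of epochs, with a per-epoch time of the kind Keras reports during training.

```python
from dataclasses import dataclass


@dataclass
class InstanceProfile:
    name: str                 # hypothetical label, not a real AWS instance name
    price_per_hour: float     # USD per hour (placeholder value)
    seconds_per_epoch: float  # taken from a short trial run (placeholder value)


def estimate_runtime_hours(profile: InstanceProfile, epochs: int) -> float:
    """Extrapolate total training time, assuming every epoch takes the same time."""
    return profile.seconds_per_epoch * epochs / 3600.0


def estimate_cost(profile: InstanceProfile, epochs: int) -> float:
    """Estimated cost = estimated runtime (hours) x price per hour."""
    return estimate_runtime_hours(profile, epochs) * profile.price_per_hour


# Hypothetical profiles: one cheap but slow, one expensive but fast.
slow_cheap = InstanceProfile("slow-cheap", price_per_hour=1.0, seconds_per_epoch=360.0)
fast_pricey = InstanceProfile("fast-pricey", price_per_hour=3.0, seconds_per_epoch=72.0)

EPOCHS = 100
for profile in (slow_cheap, fast_pricey):
    hours = estimate_runtime_hours(profile, EPOCHS)
    cost = estimate_cost(profile, EPOCHS)
    print(f"{profile.name}: {hours:.1f} h, ${cost:.2f}")
```

With these placeholder numbers, the instance that costs three times as much per hour finishes five times faster and therefore costs less in total. The sketch ignores instance startup time, billing granularity, and any variation between epochs.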
Index Terms
- Exploring the Cost-benefit of AWS EC2 GPU Instances for Deep Learning Applications