
Predicting statistics of asynchronous SGD parameters for a large-scale distributed deep learning system on GPU supercomputers


Abstract:

Many studies have shown that Deep Convolutional Neural Networks (DCNNs) achieve high accuracy on image recognition tasks when trained on large datasets. An optimization technique known as asynchronous mini-batch Stochastic Gradient Descent (SGD) is widely used for deep learning because it offers fast training speed and good recognition accuracy, but it may increase generalization error if the training parameters fall into inappropriate ranges. We propose a performance model of a distributed DCNN training system called SPRINT that uses asynchronous GPU processing based on mini-batch SGD. The model considers the probability distributions of mini-batch size and gradient staleness, which are the core parameters of asynchronous SGD training. Our performance model takes the DCNN architecture and machine specifications as input parameters, and predicts the time to sweep the entire dataset, the mini-batch size, and the staleness with 5%, 9%, and 19% average error, respectively, on several supercomputers with up to thousands of GPUs. Experimental results on two different supercomputers show that our model consistently chooses the fastest machine configuration that nearly meets a target mini-batch size.
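The notion of gradient staleness that the model predicts can be illustrated with a minimal simulation. The sketch below is an illustrative assumption about how asynchronous mini-batch SGD with stale gradients behaves in general; it does not reproduce the SPRINT system or the paper's performance model, and all names, the synthetic least-squares task, and the worker/server scheme are hypothetical.

```python
# Minimal sketch (assumption): asynchronous SGD where each worker computes a
# gradient on a possibly stale snapshot of the parameters. Staleness is the
# number of server updates that happened between reading the snapshot and
# applying the resulting gradient.
import numpy as np

def loss_grad(w, x, y):
    # Gradient of the least-squares loss 0.5 * (w.x - y)^2 for one sample.
    return (w @ x - y) * x

def async_sgd(steps=2000, lr=0.05, n_workers=4, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)             # target parameters for synthetic data
    w = np.zeros(dim)                         # parameters held by the server
    snapshots = [(w.copy(), 0)] * n_workers   # (params seen by worker, step when read)
    staleness = []
    for step in range(steps):
        k = int(rng.integers(n_workers))      # a random worker finishes next
        w_seen, read_step = snapshots[k]
        x = rng.normal(size=dim)
        y = w_true @ x
        g = loss_grad(w_seen, x, y)           # gradient computed on a stale snapshot
        w -= lr * g                           # applied to the current parameters
        staleness.append(step - read_step)
        snapshots[k] = (w.copy(), step)       # worker re-reads the parameters
    return w, w_true, float(np.mean(staleness))

if __name__ == "__main__":
    w, w_true, mean_staleness = async_sgd()
    print("mean gradient staleness:", mean_staleness)
    print("parameter error:", float(np.linalg.norm(w - w_true)))
```

With more simulated workers the average staleness grows, which is the kind of statistic the paper's model aims to predict from the DCNN architecture and machine specification.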
Date of Conference: 05-08 December 2016
Date Added to IEEE Xplore: 06 February 2017
Conference Location: Washington, DC, USA
