
1 Introduction

Deep learning trains deep neural networks on huge volumes of data. The training process is compute-intensive and can take weeks or months on a single modern GPU, so many researchers employ distributed training on a server cluster to accelerate it [1].

Model parallelism and data parallelism are the two commonly adopted paradigms for distributed training. Model parallelism splits the model into parts and allocates each part to one GPU [2]. Although model parallelism can speed up training through parallel computing, it has two drawbacks that limit its application. The first is scalability: it is hard to build a generic model-parallelism solution that splits an arbitrary model into balanced parts, assigns them to an appropriate number of GPUs, and still scales well. The second is that model parallelism has a high communication-to-computation ratio, so the communication overhead may cancel out the performance gain. Data parallelism is more widely adopted because of its simplicity and generality: the training dataset is usually large and easy to split into sub-datasets, and each GPU hosts a replica of the model and trains it on its own sub-dataset concurrently.

Various architectures have been proposed for data parallelism, e.g., the parameter server (PS) [3], peer-to-peer, and ring-based structures [4]. The PS architecture has proved effective and is widely adopted [5, 6]. It defines two entities: parameter servers and workers. Parameter servers collect gradient updates from workers and calculate new model parameters from the received gradients. Workers pull the latest parameters from the parameter servers, train their model replicas on their sub-datasets, calculate gradients, and push the gradients to the parameter servers. The gradient update methods between parameter servers and workers can be roughly classified into synchronous and asynchronous methods. In the synchronous method, all workers push gradients to the parameter servers in every training iteration. This method is robust and fast, and has been proved equivalent to standard stochastic gradient descent (SGD) on a single GPU. But it has two issues. One is the traffic burst that occurs when all workers push gradients at roughly the same time. The other is that, if the workers are not homogeneous, the slowest one slows down the overall training process. Asynchronous methods have been proposed to overcome these issues [7]. However, asynchronous methods may suffer from slower convergence or even divergence due to stale gradients [8].
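To make the two roles concrete, the following is a minimal sketch of one synchronous data-parallel training loop in plain Python/NumPy; the toy least-squares objective, worker count, and learning rate are illustrative assumptions, not the setup used later in this paper.

import numpy as np

# Toy synchronous PS loop: each "worker" owns a private data shard (illustrative values).
rng = np.random.default_rng(0)
dim, n_workers, lr = 8, 4, 0.1
shards = [(rng.normal(size=(16, dim)), rng.normal(size=16)) for _ in range(n_workers)]
x = np.zeros(dim)  # model parameters held by the parameter server

def local_gradient(params, shard):
    # Worker-side computation: gradient of 0.5 * ||A p - b||^2 on the local shard.
    A, b = shard
    return A.T @ (A @ params - b) / len(b)

for step in range(100):
    # Every worker pulls the latest parameters and pushes its gradient.
    grads = [local_gradient(x, shard) for shard in shards]
    # Synchronous update: the PS waits for all gradients and averages them.
    x -= lr * np.mean(grads, axis=0)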

In this paper, we propose a method to delay the gradient updates between parameter servers and workers dynamically. Experimental results show that our method increases the distributed training throughput, reduces the network bandwidth requirement, and achieves almost the same accuracy as the synchronous method.

2 Related Works

Many previous works aim at reducing the communication overhead in distributed training. Chen et al. [9] propose a double buffering technique which shows that delayed updates work well. Seide et al. [10] and Strom et al. [11] use a 1-bit SGD method which adds delay to gradient updates. Lin et al. [12] propose a gradient threshold algorithm, which throttles small gradient updates and accumulates them locally. These gradient sparsification techniques can reduce the communication volume and have been validated experimentally, but the convergence of these implicitly delayed methods has not been proved in theory. Agarwal et al. [7] propose explicitly delayed gradient update methods to reduce the communication frequency. For convex optimization problems, it has been proved theoretically that the effect of the delayed gradient update is asymptotically negligible and that the convergence rate scales as \(\mathcal {O} \left( 1/\sqrt{nT} \right) \) for an n-node cluster after T iterations. However, this cyclic delayed method suffers when the computing power of the workers is unbalanced. In the next section, we introduce a dynamic delay based algorithm to overcome these problems and improve the performance.

3 Dynamic Delay Based Cyclic Gradient Update Method

We propose a dynamic delay based cyclic gradient update method, which extends the previous cyclic delayed method. A dynamic delay is applied to the gradient update of each worker. The delay is calculated from the real-time global gradient updating status. This method decouples the cyclic period from the actual delays of the workers.

The conventional cyclic delayed architecture computes the stochastic gradients in parallel and updates the model parameters in sequence. Worker i computes the gradient \(g_{i} \left( t - \tau \right) = \nabla F \left[ x \left( t - \tau \right) \right] \) from the stale parameters \(x \left( t - \tau \right) \) pulled \(\tau \) updates earlier. The central parameter server obtains \(g_{i} \left( t - \tau \right) \) from worker i, computes the updated model \(x \left( t + 1 \right) \) and pushes it back only to worker i. Meanwhile, the other workers do their computations on stale parameters rather than the latest \(x \left( t + 1 \right) \). The delay \(\tau \) comes from the sequential updating of the parameters among the workers, where \(\tau = n - 1\) for an n-worker cluster in the simplest case. The error introduced by \(\tau \) is a second-order effect, which makes the penalty of the delay asymptotically negligible [7].
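As a hedged illustration of this schedule (not the authors' implementation), the toy loop below applies one worker's gradient per step in round-robin order on a shared least-squares objective; each gradient is computed from a parameter copy that is \(\tau = n - 1\) updates old.

import numpy as np

rng = np.random.default_rng(1)
dim, n_workers, lr = 8, 4, 0.05               # tau = n_workers - 1 in the simplest case
A, b = rng.normal(size=(64, dim)), rng.normal(size=64)
grad = lambda p: A.T @ (A @ p - b) / len(b)   # shared toy objective

x = np.zeros(dim)                             # parameters on the PS
local = [x.copy() for _ in range(n_workers)]  # stale copies held by the workers

for t in range(200):
    i = t % n_workers          # the worker whose turn it is at global step t
    g = grad(local[i])         # gradient computed from parameters pulled tau steps ago
    x = x - lr * g             # the PS applies the (stale) gradients sequentially
    local[i] = x.copy()        # only worker i receives the freshly updated parameters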

Fig. 1. Runtime illustration with two workers (Color figure online)

The behavior of each worker can be roughly classified into two phases: a communication phase for gradient synchronization, and a computation phase for gradient calculation and accumulation. Although it is possible to overlap part of the backward computation phase with the communication phase for a single worker, the cyclic method focuses more on overlapping the computation phase of a worker with the communication phases of other workers.

In our proposed method, an additional delay is introduced to improve throughput. Each worker maintains an independent local training pool. While a worker is in its computation phase, it keeps doing local training and gradient accumulation, continuously fetching mini-batches of the dataset to train the local model replica. The following communication phase is dynamically postponed until the worker obtains the token. The delay is adaptive, which helps to maintain a good load balance: a powerful worker does more training (and hence processes more training examples) in its computation phase, and a weak worker does less.

Figure 1 illustrates the runtime behavior of the different gradient update methods. Blue blocks denote communication phases, and green blocks denote training operations. The width denotes the duration of each operation at runtime, which varies because of workload imbalance. In the synchronous method (sync), a strong worker has to wait for a weak worker in every communication phase. The delayed synchronous method (dsync) postpones the synchronizations by a fixed amount of local computation. This additional delay alleviates the load imbalance and reduces the communication volume. In the cyclic method, the round-robin communication phases prevent network traffic bursts. However, the computation phases are not fully utilized if the workers are heterogeneous. Our dynamic delay cyclic method (dcyclic) introduces additional computations on the strong workers, whose computation phases are prolonged to overlap the communication phases of the other workers. The dynamic delay makes full use of every worker's computation power while minimizing the network traffic.

Algorithm 1.1. The dynamic delay based cyclic gradient update method

The dynamic delay cyclic method is described in Algorithm 1.1, where the local variables on the workers are decorated with a tilde. N denotes the number of workers, T denotes the maximum global step, and D denotes the maximum number of accumulations, i.e. the limit on the delay of the communication phase in every training iteration. x denotes the weights, g denotes the gradients, and G denotes the accumulated gradients. The global step t serves as the token for the synchronization communication and is maintained by the PS. The subscript \(-1\) of x is introduced for notational convenience and is unnecessary in the implementation. Each worker implements two operations: the communication operation (remote push to / pull from the PS) and the local computation operation (computing and accumulating gradients).

The communication operation is based on the cyclic delayed method [7]. All workers cooperate in a round-robin order. The worker holding the communication token performs the communication operation, i.e. pushes its gradients and pulls the updated weights. The global step t is then incremented by one, which relays the token to the next worker.
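A minimal sketch of the PS-side bookkeeping as we read it from this description (not the released code): the gradient pushed by the token-holding worker is applied, and incrementing the global step t is what relays the token to the next worker.

import numpy as np

class ParameterServer:
    # Toy PS for the cyclic schedule: worker (t mod N) holds the token at step t.
    def __init__(self, dim, n_workers, lr):
        self.x = np.zeros(dim)     # global model weights
        self.t = 0                 # global step; doubles as the communication token
        self.n = n_workers
        self.lr = lr

    def token_holder(self):
        return self.t % self.n

    def push_pull(self, worker_id, accumulated_grad):
        # Only the worker holding the token may synchronize.
        assert worker_id == self.token_holder()
        self.x -= self.lr * accumulated_grad   # apply the pushed (accumulated) gradient
        self.t += 1                            # relays the token to the next worker
        return self.x.copy(), self.t           # the worker pulls the updated weights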

The dynamic delay occurs in the computation phase of the worker. Compared to the single gradient computation in the conventional cyclic architecture, our dynamic cyclic method enables additional gradient computations and accumulations before the worker obtains the token. The number of accumulations is adaptive at runtime and is bounded by the predefined maximum delay D. When \(D = 1\), the method falls back to the conventional cyclic delayed method [7]. When \(D > 1\), local updates and gradient accumulations are activated. During the computation phase, while processing new mini-batches, the worker keeps monitoring the global step t. As soon as it obtains the communication token, the worker aborts the remnant local operations so that the communication operation can start as early as possible.
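The worker-side computation phase can then be sketched as below, reusing the toy ParameterServer above and assuming the PS and the other workers run concurrently in other processes; grad_fn and next_batch are hypothetical stand-ins for the model's gradient computation and data pipeline, and breaking after a whole mini-batch is a simplification of aborting the remnant local operations.

import numpy as np

def computation_phase(worker_id, ps, local_x, grad_fn, next_batch, D):
    # Accumulate between 1 and D local gradients, stopping early once the token arrives.
    G = np.zeros_like(local_x)                 # gradient accumulator pushed to the PS
    for _ in range(D):                         # D bounds the dynamic delay
        g = grad_fn(local_x, next_batch())     # one more local mini-batch
        local_x = local_x - ps.lr * g          # local update of the model replica
        G += g                                 # accumulate for the eventual push
        if ps.token_holder() == worker_id:     # token obtained: abort remaining work
            break
    while ps.token_holder() != worker_id:      # communication phase: wait for the token
        pass                                   # (in practice, poll or block on the PS)
    local_x, _ = ps.push_pull(worker_id, G)    # push accumulated gradients, pull weights
    return local_x

With D = 1 this reduces to the conventional cyclic method; with D > 1 a fast worker simply runs more iterations of the local loop before its token arrives.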

The dynamic delay cyclic method brings two benefits. One is improved throughput (e.g. in examples/second) due to the gradient accumulations. By doing as much training as possible in the computation phase, device utilization is improved, so the total processing time for the same number of examples is decreased. The other benefit is convergence preservation. Being able to abort the computation helps to bound the actual delay and the staleness of the gradients, even when the predefined D is large, which helps the training reach convergence.

4 Experimental Results

In this section, the dynamic delay cyclic method is evaluated with two large-scale datasets.

4.1 Datasets and Experiment Setup

Two datasets are selected for the evaluations. One is the ILSVRC2012 dataset [13], which focuses on image object classification. The training set contains 1.2 million images and the validation set contains 150 thousand images, both labeled with the presence or absence of 1000 object categories. The ResNet-V2-50 model [14] is adopted for the classification task. The other dataset is from WMT'15 [15]: the \(10^{9}\) Word Parallel Corpus is used for training and the updated News Crawl development set for validation. The training corpus consists of over 22 million sentences, and the validation corpus consists of 3 thousand sentences. Both focus on the machine translation task for the French–English pair. The Seq2Seq model [16] is adopted for the translation task.

Fig. 2. Experiment setup. Workers are bound to different GPUs inside one node. All workers connect directly to the PS on the other node. The traffic goes over the network in the same manner as in a distributed cluster

Two computing nodes are utilized for all experiments. Both nodes are equipped with dual Intel Xeon E5-2600v4 CPUs, 512 GB of memory and a Mellanox 40 Gbit/s network adapter. One node has 4 NVIDIA Tesla P100 GPUs, and the other node has no GPU. The distributed computing environment is simulated with these two nodes by making use of GPU affinity, as illustrated in Fig. 2. The worker processes are bound to different GPUs, while the PS process is launched on the other node. The PS and the workers communicate through the network adapter, in the same way as in a real distributed cluster.
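For completeness, one hedged way to realize this binding is to restrict each worker process to a single device through CUDA_VISIBLE_DEVICES; the script name and flag below are hypothetical.

import os
import subprocess

# Hypothetical launcher on the GPU node: pin each worker process to one GPU.
# The PS process is started separately on the CPU-only node.
for gpu_id in range(4):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    subprocess.Popen(["python", "train_worker.py",        # hypothetical worker script
                      "--task_index", str(gpu_id)], env=env)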

4.2 Algorithm and Implementation

We compare our method with the cyclic and the delayed synchronous methods, and take the vanilla synchronous method as the baseline. Workers in the cyclic method update the parameters in a round-robin order [7]. In the vanilla synchronous method, the weights on the PS are updated with the gradients received from all workers at around the same time. In the delayed synchronous method, the gradients are first accumulated and applied to the local model replicas; the gradient update to the PS then works in the same way as in the vanilla synchronous method.

We implement these four methods with the PS architecture [17], where the server maintains the parameters and the workers do the computations. Data movement is managed automatically by TensorFlow [6] through the implicit insertion of nodes into the computation graph.
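As an illustration only (the host names and ports are hypothetical, and the paper does not publish its configuration), a TensorFlow 1.x-style PS/worker cluster can be wired roughly as follows.

import tensorflow as tf  # TensorFlow 1.x style API

# One PS process on the CPU node, four worker processes on the GPU node (hypothetical).
cluster = tf.train.ClusterSpec({
    "ps":     ["ps-node:2222"],
    "worker": ["gpu-node:2230", "gpu-node:2231", "gpu-node:2232", "gpu-node:2233"],
})
job_name, task_index = "worker", 0      # set per process via flags or environment
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                       # the PS process only serves the variables
else:
    # Variables are placed on the PS and ops on the local GPU by the device setter.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d/gpu:0" % task_index,
            cluster=cluster)):
        pass  # build the model graph here; the chosen gradient update method is applied on top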

The ResNet-V2-50 model is trained with the Nesterov accelerated gradient (NAG) method [18, 19] for 80 epochs with a batch size of 32, a momentum of 0.9 and an initial learning rate of 0.005. The learning rate is exponentially decayed by a factor of 0.1 every 20 epochs. Learning rate warmup [20] is applied in the synchronous methods in order to accelerate convergence. Vanilla SGD is used to train the Seq2Seq model for one epoch with a batch size of 64. The learning rate starts at 0.02 and decays by a factor of 0.99 every 0.01 epoch. Learning rate warmup is not used for the Seq2Seq model.
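For reference, the two decay schedules described above can be written as small helpers (our own illustration, assuming a staircase-style decay; not the training script).

def resnet_lr(epoch, base_lr=0.005, decay=0.1, step_epochs=20):
    # Decay the ResNet learning rate by a factor of 0.1 every 20 epochs.
    return base_lr * decay ** (epoch // step_epochs)

def seq2seq_lr(epoch_fraction, base_lr=0.02, decay=0.99, step=0.01):
    # Decay the Seq2Seq learning rate by a factor of 0.99 every 0.01 epoch.
    return base_lr * decay ** int(epoch_fraction / step)

print(resnet_lr(45))      # 0.005 * 0.1**2 = 5e-05
print(seq2seq_lr(0.5))    # 0.02 * 0.99**50, roughly 0.0121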

4.3 Results

We first investigate the convergence behavior of the different methods. The convergence curves for training (dashed) and validation (solid) are plotted in Fig. 3. The columns from left to right show the synchronous (blue), delayed synchronous (green), cyclic (red) and dynamic delay cyclic (cyan) methods. The top-5 error of the ResNet model is on the top, and the perplexity of the Seq2Seq model is at the bottom. The actual number of gradient accumulations is tuned to be the same during the training of each model.

Fig. 3. The convergence after a fixed number of epochs. The top row presents the top-5 error of the ResNet model, and the bottom row shows the perplexity of the Seq2Seq model. The columns indicate, from left to right, the synchronous (blue), delayed synchronous (green), cyclic (red) and dynamic delay cyclic (cyan) methods. The dashed lines denote training and the solid lines denote validation. (Color figure online)

Fig. 4. The wall-clock time of the different methods. The left panel shows the top-5 error of ResNet, and the right panel shows the perplexity of Seq2Seq. The vanilla synchronous SGD (blue) is taken as the baseline. The cyclic method (red) finishes after a long time due to low device utilization. The delayed synchronous SGD (green) converges slowly within the limited number of epochs. The dynamic delay cyclic method (cyan) converges in less wall-clock time because of its high throughput. (Color figure online)

On the ResNet model, the cyclic methods achieve the same performance as the synchronous method. The rate of convergence is not impacted by the inherent gradient staleness of the round-robin order. The additional gradient accumulations limit the rate of convergence of the delayed synchronous method, yet they have little effect on the dynamic delay cyclic method.

On the Seq2Seq model, the cyclic methods perform better than the synchronous methods: the perplexity converges quickly to a lower value within the limited number of epochs. The additional accumulations impact the convergence rate negatively in the delayed synchronous method. These results show that the dynamic delay cyclic method is more robust to the staleness of the gradient information than the delayed synchronous method.

The gradient accumulations improve the throughput significantly. In the delayed methods, the synchronizations are postponed by the local operations on the workers. This delay reduces the communication-to-computation ratio and increases the utilization of the computing devices, which leads to a higher throughput, as illustrated in Fig. 4. Large datasets are thus trained to the same state of convergence in less time with the dynamic delay cyclic method.

Fig. 5. The actual network bandwidth consumption under the different update methods. To achieve the same state of convergence, the dynamic delay cyclic method requires less network traffic than the vanilla synchronous and cyclic methods. (Color figure online)

The network traffic is reduced by the dynamic delay cyclic method. The delay reduces both the communication frequency and the total communication volume. In the cyclic methods, the PS responds to only one worker at a time, and the rolling of the communication token prevents the traffic bursts seen in the synchronous methods, which reduces the network requirements. In our experiments, the delayed and cyclic methods significantly reduce the network traffic, as shown in Fig. 5. The dynamic delay cyclic method preserves the convergence and requires less network traffic than the synchronous and the cyclic methods.

5 Conclusions and Discussions

We propose a dynamic delay based cyclic gradient update method, which benefits from the cyclic gradient update architecture and local gradient accumulations. The network traffic burst is relieved by the round-robin update order, and the communication volume and frequency are reduced by the explicit delay of the gradient updates. The method preserves the rate of convergence by restricting the duration between synchronizations, and improves throughput by dynamically extending the actual delay. The wall-clock time of training on large datasets is reduced.

The cyclic methods make full use of the gradients computed from every mini-batch of examples. The gradients are not only employed to update the local model replicas, but are also accumulated to update the global model on the PS. A fixed learning rate (perhaps with decay) is more suitable for these aggressive methods.

The actual delay is bounded to prevent the convergence problems arising from gradient staleness. The PS cycles the refresh of the local replicas among all workers in the cluster, so the duration of the computation phase scales linearly with the number of workers, which follows from the round-robin nature of the cyclic methods. An oversized delay may limit the convergence rate because of gradient staleness, so the delay bound should be chosen to accelerate training and preserve convergence simultaneously.