
1 Introduction

In recent years, Deep Neural Networks (DNN) [19] have been successfully applied in many fields, including image recognition [3], texture classification [7], speech recognition [11] and so on. Deeper networks learn from larger training data sets and achieve a significant boost in performance, but the growth of network parameters and training data leads to longer training times. Parallel training [1] therefore exploits multiple Graphics Processing Unit (GPU) cores to reduce the training time. Data parallel training and model parallel training are the two main types of parallel training. In data parallel training, the whole training data set is divided into many mini-batches and each GPU uses different mini-batches as input to train the same DNN model. Model parallel training splits the DNN network model into several layered training processes on different GPUs. Both approaches reduce the training workload on a single GPU and thus shorten the training time of DNN.
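As a purely illustrative sketch (not the training code used in this paper), the following Python snippet shows the two partitioning ideas on a toy model reduced to a list of layer names and a list of mini-batch ids; all names in it are assumptions of ours.

```python
# A toy illustration of the two partitioning schemes; the model is reduced
# to a list of layer names and the data to a list of mini-batch ids. These
# helpers are illustrative assumptions, not the training code of this paper.

def data_parallel_assignment(mini_batches, gpus):
    """Data parallelism: every GPU holds the full model and receives a
    disjoint subset of mini-batches (round-robin here)."""
    assignment = {gpu: [] for gpu in gpus}
    for i, batch in enumerate(mini_batches):
        assignment[gpus[i % len(gpus)]].append(batch)
    return assignment

def model_parallel_assignment(layers, gpus):
    """Model parallelism: the layer sequence is cut into consecutive
    phases and each phase is placed on a different GPU."""
    phase_size = (len(layers) + len(gpus) - 1) // len(gpus)
    return {gpu: layers[i * phase_size:(i + 1) * phase_size]
            for i, gpu in enumerate(gpus)}

if __name__ == "__main__":
    gpus = ["gpu0", "gpu1"]
    print(data_parallel_assignment(list(range(6)), gpus))                    # batches per GPU
    print(model_parallel_assignment(["conv1", "conv2", "fc1", "fc2"], gpus)) # layers per GPU
```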

Fig. 1. An example of distributed DNN training in MapReduce.

Obviously, there is a need to construct GPU resource cloud services to support parallel training. MapReduce is one way to form such a GPU resource cloud service. GPUs are scattered over distributed nodes in MapReduce, so that parallel training of DNN becomes distributed DNN training. Because computing services and storage services are separated in MapReduce, data transmission occurs whenever the data-storing node is not the same as the node running the training in distributed DNN training. Figure 1 illustrates distributed DNN training in a GPU resource cloud. We split the DNN network model into four training phases. Mini-batches are stored on different storing nodes, and different training phases also launch on different computing nodes. When training phase 1 of the first model starts, node 1 needs to fetch mini-batch data 1 from node 3. Then the successive training phase 2 launches on node 3, and node 3 needs to obtain the computing result of training phase 1 from node 1. The parameter update summarizes the results from all parallel training models on different nodes. In this scenario, fast data transmission is key to each training phase in every training model. More importantly, any delayed training phase will prolong the time of updating the parameters of the DNN network model. If the training phase in each model is referred to as an application in MapReduce, there are concurrent and successive applications in distributed DNN training. Thus GPUs need to be assigned to applications reasonably, balancing the delay of successive applications against the transferring time of all concurrent applications. These are two optimization problems with orthogonal dimensions, and there is little research on this complex issue for distributed DNN training.

In this paper, we design a distributed DNN scheduler (\(D^2\text {S}\)) to accelerate the training of DNN in a GPU resource cloud. \(D^2\text {S}\) uses a graph to combine the two orthogonal optimization problems. Within the graph, different costs are attached to different optimized operations. In this way, the solution is given by the minimum cost flows algorithm. Our contribution contains three aspects. (1) We first implement distributed DNN training in MapReduce. GPUs are assigned to parallel DNN training phases as a cloud service. (2) Data transmission is considered to reduce the training time of distributed DNN training. \(D^2\text {S}\) maximizes the ratio of data locality to achieve the shortest transmitting time. (3) A synchronizing mechanism ensures actual parallel training on distributed nodes for DNN. The assignment of GPU resources is adjusted dynamically according to training progress. Finally, we use three types of DNN models to test \(D^2\text {S}\) in the experiments. Compared with the original schedulers, \(D^2\text {S}\) guarantees that applications complete in a stably shorter time. In addition, \(D^2\text {S}\) consumes the least network traffic by avoiding long-distance data transmission.

The rest of the paper is organized as follows. Section 2 describes related work on improving DNN training and data transmission in MapReduce. Section 3 gives the model design in detail. Finally, experimental results and analysis are presented in Sect. 4, and Sect. 5 concludes the work.

2 Related Work

Much research focuses on improving parallel training for DNN. A general framework is used to optimize a social cost function for parallel training of neural network models [14]. Some works focus on enabling parallel training on different systems such as IBM Blue Gene/Q [6], Spark [17] and so on. A software package called ZNN [23] is introduced to support training convolutional networks on multiple Central Processing Unit (CPU) machines. There are also works that increase the utilization of GPUs in cloud computing. The remote GPU virtualization technique enables a single GPU to be accessed by different virtual machines [13] in cloud computing. Even GPU spot instances in AWS EC2 make it possible to resume interrupted training tasks [9]. But an efficient solution to accelerate distributed DNN training remains largely unexplored.

The optimization of data transmission is common in MapReduce. Most schedulers take reducing task-level data transmission as their only goal. The delay scheduler [20] makes tasks wait for data-storing nodes to become free computing nodes. Next-k-node scheduling [21] and the minimum network resource consumption model [15] select computing nodes with the shortest delay to run tasks instead of a greedy waiting policy. These schedulers give the best performance to each task, while the makespan of applications stays unstable.

Application-aware scheduling policies take the makespan of applications into account when assigning tasks. All free resources belong to the most urgent applications [12]. Some schedulers pick nodes for prior applications to achieve their best performance [16]. Further, Software Defined Networking (SDN) [22] and OpenFlow [10] are deployed to construct specific network topologies and program routing paths for some applications to avoid network congestion. There is no doubt that an application that monopolizes the entire cluster resources can achieve the shortest makespan. But once successive applications are taken into account, not all applications get good performance.

In order to serve concurrent applications, coarse-grained schedulers like the FAIR Scheduler and Capacity Scheduler [4] split resources into small groups to serve different types of applications. Fine-grained schedulers like Quincy [8] match map tasks in different applications to appropriate nodes. A receding horizon control policy [18] optimizes reduce tasks for all applications. Based on deadlines, the scheduler of [5] distributes resources among applications via a graph model. All of these applications are independent, while DNN training applications are successive. Applications therefore need to keep up with each other without a defined deadline in distributed DNN training.

In a word, no existing scheduling policy is suitable for distributed DNN training. Applications in distributed DNN training are concurrent and successive. Task-level greedy optimization of data transmission drags down some applications, because the optimized tasks are unevenly distributed among applications. Application-aware schedulers overshoot, since the optimized tasks belong to a single application and applications are optimized in a sorted order. There are also schedulers oriented to multiple applications, but most of these methods lose scalability across all types of tasks and the ability to balance progress among applications. In response to this situation, \(D^2\text {S}\) is designed to improve data transmission without affecting the performance of the overall set of applications.

3 Model Design

This section introduces the design of \(D^2\text {S}\) in MapReduce. The scheduling of distributed DNN training is formulated as an optimization problem. The optimization model is then mapped into a graph model, and the minimum cost flows algorithm is used to find the optimal assignment.

3.1 Optimization Model

The implementation of distributed DNN training needs to be mapped into applications and tasks in MapReduce. Based on the training process, DNN training has three types of applications to complete parallel training. The first is the distributing application, which splits the training set into many mini-batch data sets for each training model. The second is the training application, which completes the training of each model or one phase of layered training. The last is the updating application, which focuses on updating the model based on all training results. Thus applications are concurrent and successive in distributed DNN training. Within an application, many map tasks collaborate to process the mini-batch data, and reduce tasks summarize the training results from the map tasks.

Suppose there is a MapReduce cluster with many nodes, numbered 1, 2, ..., C. In addition, applications with labels 1, 2, ..., A are ready to be scheduled in the cluster. The node set is denoted \(N^C\) and the collection of applications is \(P^A\). Each application with label p contains \(p^m\) map tasks and \(p^r\) reduce tasks in MapReduce. To classify tasks within applications, \(mt^p\) is used for map tasks and \(rt^p\) for reduce tasks, with the superscript p specifying the label of its application. Specially, the input data of a map task is denoted idm and the output data of a map task is denoted odm. Furthermore, the available network bandwidth is band.
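To make the notation concrete, the sketch below mirrors these symbols as plain Python structures; the class and field names are ours and only approximate the formulation, they are not taken from the paper.

```python
# A plain-Python mirror of the notation above: nodes 1..C, applications
# 1..A with p^m map tasks and p^r reduce tasks, input data sizes (idm) and
# pairwise available bandwidths (band). The class and field names are ours.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Task:
    app: int                                    # label p of the owning application
    kind: str                                   # "map" or "reduce"
    input_size: Dict[int, float] = field(default_factory=dict)  # storing node -> bytes

@dataclass
class Application:
    label: int                                  # p
    map_tasks: List[Task]                       # p^m entries (mt^p)
    reduce_tasks: List[Task]                    # p^r entries (rt^p)

@dataclass
class Cluster:
    nodes: List[int]                            # N^C = [1, ..., C]
    bandwidth: Dict[Tuple[int, int], float]     # band[(source, destination)] in bytes/s
```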

\(D^2\text {S}\) is designed to reduce the training time of parallel training in MapReduce. Data transmission is key to reducing the training time. There are two main factors affecting the transmission of data: data size and network bandwidth. The input data size is known for map tasks, while the input of a reduce task depends on the output results of its related map tasks in MapReduce. Once tasks are assigned to nodes, the available network bandwidth between data-storing nodes and task-running nodes is determined, and so is the time of transmitting data. Equations (1) and (2) give the transmitting time for map tasks and reduce tasks separately.

$$\begin{aligned} tt^{mt^p_m}=\frac{idm^{mt^p_m}_d}{band^d_m} \end{aligned}$$
(1)
$$\begin{aligned} tt^{rt^p_r}=\max \limits _{1\le {mt^p}\le {p^m}}\{\frac{odm^{mt^p_m}_m}{band^m_r}\} \end{aligned}$$
(2)

The accompanying subscripts in \(mt^p_m, rt^p_r, idm^{mt^p_m}_d\), and \(odm^{mt^p_m}_m\) give the number of the node that launches the task or stores the data. The superscripts of \(idm^{mt^p_m}_d\) and \(odm^{mt^p_m}_m\) indicate that the data belongs to the map task \(mt^p\) running on the node numbered m. Note that the storing node of the output data is the same as the task-running node for map tasks, while the location of the input data of a map task is arbitrary. For the available bandwidths \(band^d_m\) and \(band^m_r\), the superscript is the data-storing node and the subscript is the destination node. Then the data-transmitting time \(tt^{mt^p_m}\) is the ratio of \(idm^{mt^p_m}_d\) to \(band^d_m\) for the map task \(mt^p_m\). Because the input data of a reduce task comes from its related map tasks, the transmitting time \(tt^{rt^p_r}\) is the longest of the parallel fetching times from all map tasks.
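A minimal sketch of Eqs. (1) and (2) on top of the structures above; measuring sizes in bytes and bandwidths in bytes per second, and treating a co-located transfer as taking zero time, are assumptions of ours.

```python
# Sketch of Eqs. (1) and (2) on the structures above. Sizes in bytes and
# bandwidths in bytes per second are our assumptions; a co-located transfer
# is treated as taking zero time.

def map_transmit_time(cluster, data_node, run_node, input_size):
    """Eq. (1): tt = idm / band(d -> m) for a map task."""
    if data_node == run_node:
        return 0.0
    return input_size / cluster.bandwidth[(data_node, run_node)]

def reduce_transmit_time(cluster, map_outputs, run_node):
    """Eq. (2): a reduce task waits for its slowest parallel fetch;
    map_outputs is a list of (map-running node, output size) pairs."""
    times = [0.0]
    for map_node, out_size in map_outputs:
        if map_node != run_node:
            times.append(out_size / cluster.bandwidth[(map_node, run_node)])
    return max(times)
```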

Obviously, the time of data transmission depends on the available bandwidth between the data-storing node and the task-running node. But the data-storing nodes are fixed for each task. Thus minimizing the transmitting time amounts to selecting task-running nodes with the best available bandwidth. Equations (3) and (4) give the minimization of the estimated transmitting time for tasks.

$$\begin{aligned} tt^{mt^p}=\min \limits _{m\in {N^C}}(tt^{mt^p_m}+nt^m) \end{aligned}$$
(3)
$$\begin{aligned} tt^{rt^p}=\min \limits _{r\in {N^C}}(tt^{rt^p_r}+nt^r) \end{aligned}$$
(4)

The estimation is composed of the transmitting time tt and the waiting time nt. If the node is free, the waiting time is 0; if the node is busy, the waiting time is the remaining time of the task that will finish soonest. Specifically, m is used for map tasks and r for reduce tasks.
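Equations (3) and (4) then reduce to a node-selection rule. The sketch below assumes a waiting_time(node) helper that returns 0 for a free node and the remaining time of the soonest-finishing task on a busy node; the helper name is ours.

```python
# Sketch of Eqs. (3) and (4): choose the node that minimizes the estimated
# transmitting time plus the waiting time nt. waiting_time(node) is assumed
# to return 0 for a free node and the remaining time of the task that will
# finish soonest on a busy node.

def best_node_for_map(cluster, data_node, input_size, waiting_time):
    """Eq. (3): argmin over nodes m of tt^{mt^p_m} + nt^m."""
    return min(cluster.nodes,
               key=lambda m: map_transmit_time(cluster, data_node, m, input_size)
                             + waiting_time(m))

def best_node_for_reduce(cluster, map_outputs, waiting_time):
    """Eq. (4): argmin over nodes r of tt^{rt^p_r} + nt^r."""
    return min(cluster.nodes,
               key=lambda r: reduce_transmit_time(cluster, map_outputs, r)
                             + waiting_time(r))
```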

However, tasks come from different applications, and applications are successive in parallel training. It is necessary to synchronize concurrent training applications so that updating applications are not delayed. \(D^2\text {S}\) therefore takes the progress of applications into consideration when optimizing data transmission for tasks. Based on the progress, the residual running time of an application is denoted \(ft^p\) in Eq. (5).

$$\begin{aligned} ft^p=\left\{ \begin{aligned}&ut^p\times \frac{1-pr^p}{pr^p}&{0<pr^p\le {1}}\\&ut^p&{pr^p=0} \end{aligned} \right. \end{aligned}$$
(5)

Here \(ut^p\) refers to the used time and \(pr^p\) to the progress. In particular, the progress of an application is 0 when no task has been completed; to handle this case, applications with a progress of 0 are differentiated by \(ut^p\). Then \(D^2\text {S}\) minimizes the sum \(at^p\) of the residual running time and the transmitting time for each application, as in Eq. (6).

$$\begin{aligned} at^p=ft^p+\min \limits _{{m,r}\in {N^C}}{(\sum _{mt^p=1}^{p^m} tt^{mt^p}+\sum _{rt^p=1}^{p^r} tt^{rt^p})} \end{aligned}$$
(6)

The superscript p is the label of the application. Equation (7) shows the ultimate goal over all applications, where t is the sum of \(at^p\) over all applications.

$$\begin{aligned} t=\min {(\sum _{p=1}^{A} at^{p})} \end{aligned}$$
(7)
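Putting Eqs. (5)-(7) together, a hedged sketch of the objective could look as follows; the per-task greedy node choice and the reuse of Task.input_size to hold the map-output sizes of reduce tasks are simplifications of ours.

```python
# Sketch of Eqs. (5)-(7): residual time from progress, plus the minimized
# (transmitting + waiting) time of the application's unscheduled tasks,
# summed over all applications. Choosing nodes greedily per task and reusing
# Task.input_size for the map-output sizes of reduce tasks are our
# simplifications.

def residual_time(used_time, progress):
    """Eq. (5): ft^p = ut^p * (1 - pr^p) / pr^p, or ut^p when pr^p == 0."""
    if progress == 0:
        return used_time
    return used_time * (1.0 - progress) / progress

def application_cost(cluster, app, used_time, progress, waiting_time):
    """Eq. (6): at^p = ft^p + sum of minimized tt over the app's tasks."""
    total_tt = 0.0
    for task in app.map_tasks:
        (data_node, size), = task.input_size.items()      # one storing node assumed
        m = best_node_for_map(cluster, data_node, size, waiting_time)
        total_tt += map_transmit_time(cluster, data_node, m, size) + waiting_time(m)
    for task in app.reduce_tasks:
        outputs = list(task.input_size.items())            # (map node, output size)
        r = best_node_for_reduce(cluster, outputs, waiting_time)
        total_tt += reduce_transmit_time(cluster, outputs, r) + waiting_time(r)
    return residual_time(used_time, progress) + total_tt

def overall_objective(cluster, apps, states, waiting_time):
    """Eq. (7): t = sum over applications of at^p; states maps p -> (ut^p, pr^p)."""
    return sum(application_cost(cluster, app, *states[app.label], waiting_time)
               for app in apps)
```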

3.2 Minimum Cost Flows Algorithm

In fact, there are two minimizations in the scheduling model, and tasks are the joint objects of both. Thus a graph is used to connect the two minimizations. Figure 2 shows an example of this graph model with two applications and two nodes. Application 1 has two tasks to be scheduled and application 2 has only one unscheduled task. If there is an arrow between a task and a node, the node is able to satisfy the requirements of the task. Each directed arrow is accompanied by its capacity. If Capacity(Source, App1) is 1, then application 1 can select at most 1 task to be scheduled. Thus the capacity of an arrow is also referred to as the number of tasks it can distribute. In addition, the Source and Sink objects are used to limit the number of scheduled tasks. Each object in the graph, such as Source, App1 and Task3, carries a potential. Together, potentials define the number of tasks that can be scheduled for each object. Taking Fig. 2 as an example, the potential of Source is 3, assuming that the cluster has sufficient computing capacity. But the potential is not always positive. The negative potentials of nodes and Sink indicate their ability to run tasks, while negative potentials on the remaining objects indicate that the assigned tasks exceed the object's limit. Finally, every directed arrow is assigned a specific cost. According to Eqs. (1) and (2), the time of data transmission is attached to the arrows between tasks and nodes. The estimated residual time is attached to the arrows between Source and applications, and the arrows from nodes to Sink use the waiting time as cost.

Fig. 2. An example of graph model.
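The following is a minimal sketch of how such a flow graph could be assembled and solved with the networkx library (our choice of tooling; the paper does not name one). The paper's potentials map onto node supplies and demands, arrow capacities onto edge capacities, and the times from Sect. 3.1, rounded to integer milliseconds, onto edge weights; charging the residual time once per routed task is a simplification of ours.

```python
# A sketch of the scheduling graph built with networkx. Potentials map onto
# node supplies/demands, arrow capacities onto edge capacities, and the
# times of Sect. 3.1 (rounded to integer milliseconds) onto edge weights.
# Charging the residual time once per routed task is our simplification.
import networkx as nx

def build_schedule_graph(apps, nodes, residual_ms, transmit_ms, wait_ms, slots):
    """apps: {app: [task, ...]}, residual_ms: {app: int},
    transmit_ms: {(task, node): int}, wait_ms: {node: int},
    slots: {node: int} -- how many tasks each node can still accept."""
    g = nx.DiGraph()
    total = sum(len(tasks) for tasks in apps.values())
    g.add_node("source", demand=-total)                  # supplies all unscheduled tasks
    g.add_node("sink", demand=total)                     # absorbs them
    for app, tasks in apps.items():
        # Source -> application: cost is the estimated residual running time
        g.add_edge("source", app, capacity=len(tasks), weight=residual_ms[app])
        for task in tasks:
            g.add_edge(app, task, capacity=1, weight=0)
            for node in nodes:
                # Task -> node: cost is the data-transmission time, Eqs. (1)-(2)
                if (task, node) in transmit_ms:
                    g.add_edge(task, node, capacity=1,
                               weight=transmit_ms[(task, node)])
    for node in nodes:
        # Node -> sink: cost is the waiting time, capacity is the free slots
        g.add_edge(node, "sink", capacity=slots[node], weight=wait_ms[node])
    return g

def schedule(graph, nodes):
    """Solve a minimum cost flow and read off task -> node placements
    (the only edges entering a cluster node come from tasks)."""
    flow = nx.min_cost_flow(graph)
    cluster_nodes = set(nodes)
    return {u: v for u, outs in flow.items()
            for v, f in outs.items() if f > 0 and v in cluster_nodes}
```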

With the initialized graph, the minimum cost flows algorithm is used to assign tasks to nodes. The algorithm adopts the residual network [2], so the original graph is converted into a residual graph. In the residual graph, new potentials \(P^r\), capacities \(Cap^r\) and costs \(Cost^r\) are dynamic parameters describing the limits of unscheduled tasks and the abilities of nodes. More details are shown in Algorithm 1. In each iteration we look for an object with a positive potential and store it in a set. Based on the updated set, we count the number of unscheduled tasks as Ust. Similarly, the number of tasks that can be distributed along minimum-cost arcs is assigned to Res. The repeated operations are shown in lines 4-9. Once the number of unscheduled tasks exceeds the number of distributable tasks, the unscheduled tasks are distributed to all other objects along all minimum-cost arcs without distinction; this corresponds to line 16 of the algorithm. It is also possible that some object has a negative potential, as in lines 10-14. In that case, an object with available capacity requests tasks to be assigned, so we look for a minimum-cost path to distribute tasks to that object. It is worth emphasizing that each distribution step yields a new residual graph with new parameters. The above procedure repeats until no unscheduled task remains.

Algorithm 1.
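The pseudocode of Algorithm 1 is not reproduced here; as a complement, the toy run below exercises the graph sketch above at the scale of Fig. 2, with invented numbers.

```python
# A toy run of the graph sketch above at the scale of Fig. 2: two
# applications, two nodes, three unscheduled tasks. All numbers are
# invented to exercise the code, not measurements from the paper.
apps = {"app1": ["t1", "t2"], "app2": ["t3"]}
nodes = ["node1", "node2"]
residual_ms = {"app1": 400, "app2": 100}                 # Eq. (5) estimates
transmit_ms = {("t1", "node1"): 0,   ("t1", "node2"): 300,
               ("t2", "node1"): 250, ("t2", "node2"): 0,
               ("t3", "node1"): 50,  ("t3", "node2"): 500}
wait_ms = {"node1": 0, "node2": 120}
slots = {"node1": 2, "node2": 2}

g = build_schedule_graph(apps, nodes, residual_ms, transmit_ms, wait_ms, slots)
print(schedule(g, nodes))   # e.g. {'t1': 'node1', 't2': 'node2', 't3': 'node1'}
```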

4 Experiments

4.1 Set Up

We set up a MapReduce cluster with seven nodes, six of which are working nodes. Each node is configured with 2 GPUs. As for the network, there are two racks connected by a router. Three of the working nodes belong to the same rack, and the bandwidth is the same for all nodes in a rack. Following a realistic network environment, the bandwidth is 10 M/s between racks and 100 M/s within a rack. Three types of DNN models are tested in MapReduce: Image Classification (IC), Food Recognition (FR) and Video Classification (VC). The AlexNet network is applied to IC, and the network model contains 5 convolutional layers and 2 fully connected layers. There are 50000 images in the CIFAR-10 dataset to be used for training. FR uses 20000 photos to train the Inception-V3 network with 22 layers, and 230 categories of food are recognized. The training set of VC is a quarter of YouTube-8M, and only a 4-layer Long Short-Term Memory (LSTM) network is trained, using 2 million extracted features. What's more, contention among the same models and among mixed models is considered in the experiments.

The data transmission problem is our concern, so the scheduler of [16], which optimizes tasks for each application (OTEA), is one of the baseline schedulers. We also employ two default MapReduce schedulers, FIFO and FAIR, to compare with \(D^2\text {S}\). Four metrics are used to evaluate the performance of \(D^2\text {S}\). The first is the makespan of applications: a shorter makespan is better, and a smaller lag among all makespans indicates better synchronized progress of applications. The second is the makespan of tasks: tasks with a shorter makespan save the most data transmission. The distribution of running tasks is then given as a supplement to provide details. The last is the network traffic, which shows the optimized data transmission in a quantitative manner.

Fig. 3. The makespan of IC.

Fig. 4. The makespan of FR.

Fig. 5. The makespan of VC.

Fig. 6. The makespan of mixed DNN models.

4.2 Results Analysis

Our experiments cover both the same and mixed DNN models. Each model has 15 applications, while the numbers of tasks in the applications are not the same. Figures 3, 4, 5 and 6 show the makespan for the different types of models; the makespan of tasks is shown as a cumulative distribution. It is obvious that \(D^2\text {S}\) synchronizes applications best. Compared with the other schedulers, the makespan of applications fluctuates in the smallest range: the worst ratio of maximum to minimum makespan is 6.2, which is smaller than 19.3 for FIFO, 9.8 for FAIR and 8.1 for OTEA across all DNN models. But the makespan under \(D^2\text {S}\) is not always the minimum, because the others offer some applications a shorter makespan, such as IC application 4 in Fig. 3 and FR application 7 in Fig. 4. The small delay is mainly caused by the overhead of considering all applications at once in \(D^2\text {S}\), while the baseline schedulers introduce less scheduling time because of sequential scheduling. FIFO assigns tasks whenever a node issues a request for tasks. In a smaller cluster, FAIR applies the same rule to assign tasks as FIFO. Such policies lead to the least scheduling time, and their random manner occasionally yields the optimal assignment. OTEA optimizes data transmission for a single application, and the time spent balancing applications is zero. It is worthwhile to increase overhead to improve the makespan: the averages of the 45 makespans are 1528400 ms for \(D^2\text {S}\), 2737100 ms for OTEA, 3264200 ms for FIFO and 3593200 ms for FAIR. Thus the makespan decreases by 44.2%, 53.2% and 57.5% respectively. In short, \(D^2\text {S}\) is the best at improving the makespan of applications.

In most cases, \(D^2\text {S}\) offers map tasks a shorter time. The makespan is less than 1000000 ms for 90% of map tasks in IC and FR applications, while the other schedulers increase it by at least 3 times; VC applications show a similar situation. \(D^2\text {S}\) also improves the makespan of reduce tasks, and the maximum makespan is reduced by 4.5 times compared with the other schedulers. But \(D^2\text {S}\) prolongs some map tasks. For example, under \(D^2\text {S}\) only 13% of tasks complete within 750000 ms, while the other baseline schedulers finish more than 50% of tasks within 750000 ms for IC applications. Such delay of map tasks is mainly due to the same reason as the delay of applications. Overall, \(D^2\text {S}\) gives the best performance for both map tasks and reduce tasks.

Table 1. Locality in the same DNN models.

In addition to the makespan, Table 1 shows the distribution of tasks for the same DNN model. There are three types of tasks: data-local, rack-local and off-switch. If a task runs on the node that stores its input data, it is a data-local task and avoids data transmission. Similarly, rack-local tasks are those whose task-running node and data-storing node are located in the same rack. The remaining tasks are off-switch, because their task-running nodes and data-storing nodes belong to different racks. The transmitting time of rack-local tasks is shorter than that of off-switch tasks because the available network bandwidth within a rack is better. FIFO and FAIR are the worst performers as a result of the highest average fraction of off-switch tasks, which forces half of the tasks to transmit data across racks. \(D^2\text {S}\) is the best, with a maximum average of 61.8% data-local tasks. This average is almost two times more than the 35.5% of FAIR and the 27.3% of OTEA. Even the minimum fraction of data-local tasks increases to 33.3% under \(D^2\text {S}\), while it is 8.3% for OTEA and 0 for the other two schedulers. Besides the increase in data-local tasks, 97% of the remaining tasks are rack-local and only 3% are off-switch under \(D^2\text {S}\). The percentage of rack-local tasks ranks second, lower than the 47.8% of OTEA; but those 47.8% rack-local tasks account for only 65.7% of all non-data-local tasks in OTEA. Table 2 gives the percentages of locality for tasks in the mixed DNN models. \(D^2\text {S}\) still performs best, with the most data-local tasks.
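For reference, the three locality classes can be expressed as a small helper; the rack mapping in the example is an illustrative assumption, not the paper's topology code.

```python
# The three locality classes as a small helper; the rack mapping in the
# example is an illustrative assumption, not the paper's topology code.

def locality(task_node, data_node, rack_of):
    """Classify a placement as data-local, rack-local or off-switch,
    given a mapping rack_of: node -> rack id."""
    if task_node == data_node:
        return "data-local"       # no data transmission at all
    if rack_of[task_node] == rack_of[data_node]:
        return "rack-local"       # the transfer stays inside one rack
    return "off-switch"           # the transfer crosses the router between racks

rack_of = {"n1": 0, "n2": 0, "n3": 0, "n4": 1, "n5": 1, "n6": 1}
print(locality("n1", "n2", rack_of))   # rack-local
print(locality("n1", "n5", rack_of))   # off-switch
```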

Table 2. Locality in the mixed DNN models.
Fig. 7. The network traffic of IC and FR.

Fig. 8. The network traffic of VC and mixed DNN models.

Figures 7 and 8 describe the network traffic used for moving data. The average network traffic of each experiment is shown for the different schedulers and applications. It is obvious that \(D^2\text {S}\) uses the minimum network traffic to move data, and its variance is also the smallest. The value remains at around 5 GB for all workload types under \(D^2\text {S}\), while the other schedulers fluctuate between 7 GB and 15 GB. There are some exceptions of \(D^2\text {S}\) in VC applications, where a large variance occurs, but the maximum network traffic used is still lower than the minimum of the other schedulers. The trend in the right part of Fig. 8 demonstrates that \(D^2\text {S}\) is also suitable for mixed DNN models; its usage of network traffic is always the minimum.

5 Conclusions

This paper proposes a new scheduler named \(D^2\text {S}\) to accelerate distributed DNN training. The biggest challenge is caused by distributed storage in the distributed cluster used for parallel training. Such distribution leads to the requirement for timely data transmission for applications. At the same time, applications need to be synchronized so that successive applications are not delayed in parallel training. Thus \(D^2\text {S}\) is designed to optimize the time of data transmission for tasks without neglecting the makespan of applications. A scheduling graph combines the two factors to accelerate distributed DNN training. We test \(D^2\text {S}\) with the same and mixed workload applications in MapReduce. The experimental results demonstrate that \(D^2\text {S}\) offers applications a shorter and more stable makespan; in this way, all applications are synchronized by \(D^2\text {S}\). In addition, less data transmission also helps to accelerate training. In the future, we will implement more deep network models in the distributed cluster to accelerate training.