
1 Introduction

In recent years, Deep Neural Networks (DNN) [19] have been successfully applied in many fields, including image recognition [3], texture classification [7], speech recognition [11] and so on. Deeper networks learn from larger training data sets and achieve a significant boost in performance, but the growth of network parameters and training data leads to longer training times. Parallel training [1] therefore exploits multiple Graphics Processing Unit (GPU) cores to reduce the training time. Data parallel training and model parallel training are the two main types of parallel training. In data parallel training, the whole training data set is divided into many mini-batches and each GPU uses different mini-batches as input to train the same DNN model. Model parallel training splits the DNN network model into several layered training processes on different GPUs. Both approaches reduce the training workload on a single GPU and thus shorten the training time of DNN.
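As a purely illustrative sketch (not the training code used in this paper), the following Python snippet shows the two partitioning ideas on a toy model reduced to a list of layer names and a list of mini-batch ids; all names in it are assumptions of ours.

```python
# A toy illustration of the two partitioning schemes; the model is reduced
# to a list of layer names and the data to a list of mini-batch ids. These
# helpers are illustrative assumptions, not the training code of this paper.

def data_parallel_assignment(mini_batches, gpus):
    """Data parallelism: every GPU holds the full model and receives a
    disjoint subset of mini-batches (round-robin here)."""
    assignment = {gpu: [] for gpu in gpus}
    for i, batch in enumerate(mini_batches):
        assignment[gpus[i % len(gpus)]].append(batch)
    return assignment

def model_parallel_assignment(layers, gpus):
    """Model parallelism: the layer sequence is cut into consecutive
    phases and each phase is placed on a different GPU."""
    phase_size = (len(layers) + len(gpus) - 1) // len(gpus)
    return {gpu: layers[i * phase_size:(i + 1) * phase_size]
            for i, gpu in enumerate(gpus)}

if __name__ == "__main__":
    gpus = ["gpu0", "gpu1"]
    print(data_parallel_assignment(list(range(6)), gpus))                    # batches per GPU
    print(model_parallel_assignment(["conv1", "conv2", "fc1", "fc2"], gpus)) # layers per GPU
```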

Fig. 1. An example of distributed DNN training in MapReduce.

Obviously, there is a need to construct GPU resource cloud services to support parallel training. MapReduce is one way to form such a GPU resource cloud service. GPUs are scattered over distributed nodes in MapReduce, so that parallel training of DNN becomes distributed DNN training. Because computing services and storage services are separated in MapReduce, data transmission occurs whenever the data-storing node is not the same as the node running the training in distributed DNN training. Figure 1 illustrates distributed DNN training in a GPU resource cloud. We split the DNN network model into four training phases. Mini-batches are stored on different storing nodes, and different training phases also launch on different computing nodes. When training phase 1 of the first model starts, node 1 needs to fetch mini-batch data 1 from node 3. Then the successive training phase 2 launches on node 3, and node 3 needs to obtain the computing result of training phase 1 from node 1. The parameter update summarizes the results from all parallel training models on different nodes. In this scenario, fast data transmission is key to each training phase in every training model. More importantly, any delayed training phase will prolong the time of updating the parameters of the DNN network model. If the training phase in each model is referred to as an application in MapReduce, there are concurrent and successive applications in distributed DNN training. Thus GPUs need to be assigned to applications reasonably, balancing the delay of successive applications against the transferring time of all concurrent applications. These are two optimization problems with orthogonal dimensions, and there is little research on this complex issue for distributed DNN training.

In this paper, we design a distributed DNN scheduler (\(D^2\text {S}\)) to accelerate the training of DNN in a GPU resource cloud. \(D^2\text {S}\) uses a graph to combine the two orthogonal optimization problems. Within the graph, different costs are attached to different optimized operations. In this way, the solution is given by the minimum cost flows algorithm. Our contribution contains three aspects. (1) We first implement distributed DNN training in MapReduce. GPUs are assigned to parallel DNN training phases as a cloud service. (2) Data transmission is considered to reduce the training time of distributed DNN training. \(D^2\text {S}\) maximizes the ratio of data locality to achieve the shortest transmitting time. (3) A synchronizing mechanism ensures actual parallel training on distributed nodes for DNN. The assignment of GPU resources is adjusted dynamically according to training progress. Finally, we use three types of DNN models to test \(D^2\text {S}\) in the experiments. Compared with the original schedulers, \(D^2\text {S}\) guarantees that applications complete in a stably shorter time. In addition, \(D^2\text {S}\) consumes the least network traffic by avoiding long-distance data transmission.

The rest of the paper is organized as follows. Section 2 describes related work on improving DNN training and data transmission in MapReduce. Section 3 gives the model design in detail. Finally, experimental results and analysis are presented in Sect. 4, and Sect. 5 concludes the work.

2 Related Work

Much research focuses on improving parallel training for DNN. A general framework is used to optimize a social cost function for parallel training of neural network models [14]. Some works focus on enabling parallel training on different systems such as IBM Blue Gene/Q [6], Spark [17] and so on. A software package called ZNN [23] is introduced to support training convolutional networks on multiple Central Processing Unit (CPU) machines. There are also works that increase the utilization of GPUs in cloud computing. The remote GPU virtualization technique enables a single GPU to be accessed by different virtual machines [13] in cloud computing. Even GPU spot instances in AWS EC2 make it possible to resume interrupted training tasks [9]. But an efficient solution to accelerate distributed DNN training remains largely unexplored.

The optimization of data transmission is common in MapReduce. Most schedulers take reducing task-level data transmission as their only goal. The delay scheduler [20] makes tasks wait for data-storing nodes to become free computing nodes. Next-k-node scheduling [21] and the minimum network resource consumption model [15] select computing nodes with the shortest delay to run tasks instead of a greedy waiting policy. These schedulers give the best performance to each task, while the makespan of applications stays unstable.

Application-aware scheduling policies take the makespan of applications into account when assigning tasks. All free resources belong to the most urgent applications [12]. Some schedulers pick nodes for prior applications to achieve their best performance [16]. Further, Software Defined Networking (SDN) [22] and OpenFlow [10] are deployed to construct specific network topologies and program routing paths for some applications to avoid network congestion. There is no doubt that an application that monopolizes the entire cluster resources can achieve the shortest makespan. But once successive applications are taken into account, not all applications get good performance.

In order to serve concurrent applications, coarse-grained schedulers like the FAIR Scheduler and Capacity Scheduler [4] split resources into small groups to serve different types of applications. Fine-grained schedulers like Quincy [8] match map tasks in different applications to appropriate nodes. A receding horizon control policy [18] optimizes reduce tasks for all applications. Based on deadlines, the scheduler of [5] distributes resources among applications via a graph model. All of these applications are independent, while DNN training applications are successive. Applications therefore need to keep up with each other without a defined deadline in distributed DNN training.

In a word, no existing scheduling policy is suitable for distributed DNN training. Applications in distributed DNN training are concurrent and successive. Task-level greedy optimization of data transmission drags down some applications, because the optimized tasks are unevenly distributed among applications. Application-aware schedulers overshoot, since the optimized tasks belong to a single application and applications are optimized in a sorted order. There are also schedulers oriented to multiple applications, but most of these methods lose scalability across all types of tasks and the ability to balance progress among applications. In response to this situation, \(D^2\text {S}\) is designed to improve data transmission without affecting the performance of the overall set of applications.

3 Model Design

This section introduces the design of \(D^2\text {S}\) in MapReduce. The scheduling of distributed DNN training is formulated as an optimization problem. The optimization model is then mapped into a graph model, and the minimum cost flows algorithm is used to find the optimal assignment.

3.1 Optimization Model

The implementation of distributed DNN training needs to be mapped into applications and tasks in MapReduce. Based on the training process, DNN training has three types of applications to complete parallel training. The first is the distributing application, which splits the training set into many mini-batch data sets for each training model. The second is the training application, which completes the training of each model or one phase of layered training. The last is the updating application, which focuses on updating the model based on all training results. Thus applications are concurrent and successive in distributed DNN training. Within an application, many map tasks collaborate to process the mini-batch data, and reduce tasks summarize the training results from the map tasks.

Suppose there is a MapReduce cluster with many nodes, numbered 1, 2, ..., C. In addition, applications with labels 1, 2, ..., A are ready to be scheduled in the cluster. The node set is denoted \(N^C\) and the collection of applications is \(P^A\). Each application with label p contains \(p^m\) map tasks and \(p^r\) reduce tasks in MapReduce. To classify tasks within applications, \(mt^p\) is used for map tasks and \(rt^p\) for reduce tasks, with the superscript p specifying the label of its application. Specially, the input data of a map task is denoted idm and the output data of a map task is denoted odm. Furthermore, the available network bandwidth is band.
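To make the notation concrete, the sketch below mirrors these symbols as plain Python structures; the class and field names are ours and only approximate the formulation, they are not taken from the paper.

```python
# A plain-Python mirror of the notation above: nodes 1..C, applications
# 1..A with p^m map tasks and p^r reduce tasks, input data sizes (idm) and
# pairwise available bandwidths (band). The class and field names are ours.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Task:
    app: int                                    # label p of the owning application
    kind: str                                   # "map" or "reduce"
    input_size: Dict[int, float] = field(default_factory=dict)  # storing node -> bytes

@dataclass
class Application:
    label: int                                  # p
    map_tasks: List[Task]                       # p^m entries (mt^p)
    reduce_tasks: List[Task]                    # p^r entries (rt^p)

@dataclass
class Cluster:
    nodes: List[int]                            # N^C = [1, ..., C]
    bandwidth: Dict[Tuple[int, int], float]     # band[(source, destination)] in bytes/s
```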

\(D^2\text {S}\) is designed to reduce the training time of parallel training in MapReduce. Data transmission is key to reducing the training time. There are two main factors affecting the transmission of data: data size and network bandwidth. The input data size is known for map tasks, while the input of a reduce task depends on the output results of its related map tasks in MapReduce. Once tasks are assigned to nodes, the available network bandwidth between data-storing nodes and task-running nodes is determined, and so is the time of transmitting data. Equations (1) and (2) give the transmitting time for map tasks and reduce tasks separately.

$$\begin{aligned} tt^{mt^p_m}=\frac{idm^{mt^p_m}_d}{band^d_m} \end{aligned}$$
(1)
$$\begin{aligned} tt^{rt^p_r}=\max \limits _{1\le {mt^p}\le {p^m}}\{\frac{odm^{mt^p_m}_m}{band^m_r}\} \end{aligned}$$
(2)

The accompanying subscripts in \(mt^p_m, rt^p_r, idm^{mt^p_m}_d\), and \(odm^{mt^p_m}_m\) give the number of the node that launches the task or stores the data. The superscripts of \(idm^{mt^p_m}_d\) and \(odm^{mt^p_m}_m\) indicate that the data belongs to the map task \(mt^p\) running on the node numbered m. Note that the storing node of the output data is the same as the task-running node for map tasks, while the location of the input data of a map task is arbitrary. For the available bandwidths \(band^d_m\) and \(band^m_r\), the superscript is the data-storing node and the subscript is the destination node. Then the data-transmitting time \(tt^{mt^p_m}\) is the ratio of \(idm^{mt^p_m}_d\) to \(band^d_m\) for the map task \(mt^p_m\). Because the input data of a reduce task comes from its related map tasks, the transmitting time \(tt^{rt^p_r}\) is the longest of the parallel fetching times from all map tasks.
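A minimal sketch of Eqs. (1) and (2) on top of the structures above; measuring sizes in bytes and bandwidths in bytes per second, and treating a co-located transfer as taking zero time, are assumptions of ours.

```python
# Sketch of Eqs. (1) and (2) on the structures above. Sizes in bytes and
# bandwidths in bytes per second are our assumptions; a co-located transfer
# is treated as taking zero time.

def map_transmit_time(cluster, data_node, run_node, input_size):
    """Eq. (1): tt = idm / band(d -> m) for a map task."""
    if data_node == run_node:
        return 0.0
    return input_size / cluster.bandwidth[(data_node, run_node)]

def reduce_transmit_time(cluster, map_outputs, run_node):
    """Eq. (2): a reduce task waits for its slowest parallel fetch;
    map_outputs is a list of (map-running node, output size) pairs."""
    times = [0.0]
    for map_node, out_size in map_outputs:
        if map_node != run_node:
            times.append(out_size / cluster.bandwidth[(map_node, run_node)])
    return max(times)
```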

Obviously, the time of data transmission depends on the available bandwidth between the data-storing node and the task-running node. But the data-storing nodes are fixed for each task. Thus minimizing the transmitting time amounts to selecting task-running nodes with the best available bandwidth. Equations (3) and (4) give the minimization of the estimated transmitting time for tasks.

$$\begin{aligned} tt^{mt^p}=\min \limits _{m\in {N^C}}(tt^{mt^p_m}+nt^m) \end{aligned}$$
(3)
$$\begin{aligned} tt^{rt^p}=\min \limits _{r\in {N^C}}(tt^{rt^p_r}+nt^r) \end{aligned}$$
(4)

The estimation is composed of the transmitting time tt and the waiting time nt. If the node is free, the waiting time is 0; if the node is busy, the waiting time is the remaining time of the task that will finish soonest. Specifically, m is used for map tasks and r for reduce tasks.
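Equations (3) and (4) then reduce to a node-selection rule. The sketch below assumes a waiting_time(node) helper that returns 0 for a free node and the remaining time of the soonest-finishing task on a busy node; the helper name is ours.

```python
# Sketch of Eqs. (3) and (4): choose the node that minimizes the estimated
# transmitting time plus the waiting time nt. waiting_time(node) is assumed
# to return 0 for a free node and the remaining time of the task that will
# finish soonest on a busy node.

def best_node_for_map(cluster, data_node, input_size, waiting_time):
    """Eq. (3): argmin over nodes m of tt^{mt^p_m} + nt^m."""
    return min(cluster.nodes,
               key=lambda m: map_transmit_time(cluster, data_node, m, input_size)
                             + waiting_time(m))

def best_node_for_reduce(cluster, map_outputs, waiting_time):
    """Eq. (4): argmin over nodes r of tt^{rt^p_r} + nt^r."""
    return min(cluster.nodes,
               key=lambda r: reduce_transmit_time(cluster, map_outputs, r)
                             + waiting_time(r))
```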

However, tasks come from different applications, and applications are successive in parallel training. It is necessary to synchronize concurrent training applications so that updating applications are not delayed. \(D^2\text {S}\) therefore takes the progress of applications into consideration when optimizing data transmission for tasks. Based on the progress, the residual running time of an application is denoted \(ft^p\) in Eq. (5).

$$\begin{aligned} ft^p=\left\{ \begin{aligned}&ut^p\times \frac{1-pr^p}{pr^p}&{0<pr^p\le {1}}\\&ut^p&{pr^p=0} \end{aligned} \right. \end{aligned}$$
(5)

Here \(ut^p\) refers to the used time and \(pr^p\) to the progress. In particular, the progress of an application is 0 when no task has been completed; to handle this case, applications with a progress of 0 are differentiated by \(ut^p\). Then \(D^2\text {S}\) minimizes the sum \(at^p\) of the residual running time and the transmitting time for each application, as in Eq. (6).

$$\begin{aligned} at^p=ft^p+\min \limits _{{m,r}\in {N^C}}{(\sum _{mt^p=1}^{p^m} tt^{mt^p}+\sum _{rt^p=1}^{p^r} tt^{rt^p})} \end{aligned}$$
(6)

The superscript p is the label of the application. Equation (7) shows the ultimate goal over all applications, where t is the sum of \(at^p\) over all applications.

$$\begin{aligned} t=\min {(\sum _{p=1}^{A} at^{p})} \end{aligned}$$
(7)
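Putting Eqs. (5)-(7) together, a hedged sketch of the objective could look as follows; the per-task greedy node choice and the reuse of Task.input_size to hold the map-output sizes of reduce tasks are simplifications of ours.

```python
# Sketch of Eqs. (5)-(7): residual time from progress, plus the minimized
# (transmitting + waiting) time of the application's unscheduled tasks,
# summed over all applications. Choosing nodes greedily per task and reusing
# Task.input_size for the map-output sizes of reduce tasks are our
# simplifications.

def residual_time(used_time, progress):
    """Eq. (5): ft^p = ut^p * (1 - pr^p) / pr^p, or ut^p when pr^p == 0."""
    if progress == 0:
        return used_time
    return used_time * (1.0 - progress) / progress

def application_cost(cluster, app, used_time, progress, waiting_time):
    """Eq. (6): at^p = ft^p + sum of minimized tt over the app's tasks."""
    total_tt = 0.0
    for task in app.map_tasks:
        (data_node, size), = task.input_size.items()      # one storing node assumed
        m = best_node_for_map(cluster, data_node, size, waiting_time)
        total_tt += map_transmit_time(cluster, data_node, m, size) + waiting_time(m)
    for task in app.reduce_tasks:
        outputs = list(task.input_size.items())            # (map node, output size)
        r = best_node_for_reduce(cluster, outputs, waiting_time)
        total_tt += reduce_transmit_time(cluster, outputs, r) + waiting_time(r)
    return residual_time(used_time, progress) + total_tt

def overall_objective(cluster, apps, states, waiting_time):
    """Eq. (7): t = sum over applications of at^p; states maps p -> (ut^p, pr^p)."""
    return sum(application_cost(cluster, app, *states[app.label], waiting_time)
               for app in apps)
```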

3.2 Minimum Cost Flows Algorithm

In fact, there are two minimizations in the scheduling model, and tasks are the joint objects of both. Thus a graph is used to connect the two minimizations. Figure 2 shows an example of this graph model with two applications and two nodes. Application 1 has two tasks to be scheduled and application 2 has only one unscheduled task. If there is an arrow between a task and a node, the node is able to satisfy the requirements of the task. Each directed arrow is accompanied by its capacity. If Capacity(Source, App1) is 1, then application 1 can select at most 1 task to be scheduled. Thus the capacity of an arrow is also referred to as the number of tasks it can distribute. In addition, the Source and Sink objects are used to limit the number of scheduled tasks. Each object in the graph, such as Source, App1 and Task3, carries a potential. Together, potentials define the number of tasks that can be scheduled for each object. Taking Fig. 2 as an example, the potential of Source is 3, assuming that the cluster has sufficient computing capacity. But the potential is not always positive. The negative potentials of nodes and Sink indicate their ability to run tasks, while negative potentials on the remaining objects indicate that the assigned tasks exceed the object's limit. Finally, every directed arrow is assigned a specific cost. According to Eqs. (1) and (2), the time of data transmission is attached to the arrows between tasks and nodes. The estimated residual time is attached to the arrows between Source and applications, and the arrows from nodes to Sink use the waiting time as cost.

Fig. 2. An example of graph model.
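The following is a minimal sketch of how such a flow graph could be assembled and solved with the networkx library (our choice of tooling; the paper does not name one). The paper's potentials map onto node supplies and demands, arrow capacities onto edge capacities, and the times from Sect. 3.1, rounded to integer milliseconds, onto edge weights; charging the residual time once per routed task is a simplification of ours.

```python
# A sketch of the scheduling graph built with networkx. Potentials map onto
# node supplies/demands, arrow capacities onto edge capacities, and the
# times of Sect. 3.1 (rounded to integer milliseconds) onto edge weights.
# Charging the residual time once per routed task is our simplification.
import networkx as nx

def build_schedule_graph(apps, nodes, residual_ms, transmit_ms, wait_ms, slots):
    """apps: {app: [task, ...]}, residual_ms: {app: int},
    transmit_ms: {(task, node): int}, wait_ms: {node: int},
    slots: {node: int} -- how many tasks each node can still accept."""
    g = nx.DiGraph()
    total = sum(len(tasks) for tasks in apps.values())
    g.add_node("source", demand=-total)                  # supplies all unscheduled tasks
    g.add_node("sink", demand=total)                     # absorbs them
    for app, tasks in apps.items():
        # Source -> application: cost is the estimated residual running time
        g.add_edge("source", app, capacity=len(tasks), weight=residual_ms[app])
        for task in tasks:
            g.add_edge(app, task, capacity=1, weight=0)
            for node in nodes:
                # Task -> node: cost is the data-transmission time, Eqs. (1)-(2)
                if (task, node) in transmit_ms:
                    g.add_edge(task, node, capacity=1,
                               weight=transmit_ms[(task, node)])
    for node in nodes:
        # Node -> sink: cost is the waiting time, capacity is the free slots
        g.add_edge(node, "sink", capacity=slots[node], weight=wait_ms[node])
    return g

def schedule(graph, nodes):
    """Solve a minimum cost flow and read off task -> node placements
    (the only edges entering a cluster node come from tasks)."""
    flow = nx.min_cost_flow(graph)
    cluster_nodes = set(nodes)
    return {u: v for u, outs in flow.items()
            for v, f in outs.items() if f > 0 and v in cluster_nodes}
```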

With the initialized graph, the minimum cost flows algorithm is used to assign tasks to nodes. The algorithm adopts the residual network [2], so the original graph is converted into a residual graph. In the residual graph, new potentials \(P^r\), capacities \(Cap^r\) and costs \(Cost^r\) are dynamic parameters describing the limits of unscheduled tasks and the abilities of nodes. More details are shown in Algorithm 1. In each iteration we look for an object with a positive potential and store it in a set. Based on the updated set, we count the number of unscheduled tasks as Ust. Similarly, the number of tasks that can be distributed along minimum-cost arcs is assigned to Res. The repeated operations are shown in lines 4-9. Once the number of unscheduled tasks exceeds the number of distributable tasks, the unscheduled tasks are distributed to all other objects along all minimum-cost arcs without distinction; this corresponds to line 16 of the algorithm. It is also possible that some object has a negative potential, as in lines 10-14. In that case, an object with available capacity requests tasks to be assigned, so we look for a minimum-cost path to distribute tasks to that object. It is worth emphasizing that each distribution step yields a new residual graph with new parameters. The above procedure repeats until no unscheduled task remains.

Algorithm 1.
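The pseudocode of Algorithm 1 is not reproduced here; as a complement, the toy run below exercises the graph sketch above at the scale of Fig. 2, with invented numbers.

```python
# A toy run of the graph sketch above at the scale of Fig. 2: two
# applications, two nodes, three unscheduled tasks. All numbers are
# invented to exercise the code, not measurements from the paper.
apps = {"app1": ["t1", "t2"], "app2": ["t3"]}
nodes = ["node1", "node2"]
residual_ms = {"app1": 400, "app2": 100}                 # Eq. (5) estimates
transmit_ms = {("t1", "node1"): 0,   ("t1", "node2"): 300,
               ("t2", "node1"): 250, ("t2", "node2"): 0,
               ("t3", "node1"): 50,  ("t3", "node2"): 500}
wait_ms = {"node1": 0, "node2": 120}
slots = {"node1": 2, "node2": 2}

g = build_schedule_graph(apps, nodes, residual_ms, transmit_ms, wait_ms, slots)
print(schedule(g, nodes))   # e.g. {'t1': 'node1', 't2': 'node2', 't3': 'node1'}
```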

4 Experiments

4.1 Set Up

We set up a MapReduce cluster with seven nodes, six of which are working nodes. Each node is configured with 2 GPUs. As for the network, there are two racks connected by a router. Three of the working nodes belong to the same rack, and the bandwidth is the same for all nodes in a rack. Following a realistic network environment, the bandwidth is 10 M/s between racks and 100 M/s within a rack. Three types of DNN models are tested in MapReduce: Image Classification (IC), Food Recognition (FR) and Video Classification (VC). The AlexNet network is applied to IC, and the network model contains 5 convolutional layers and 2 fully connected layers. There are 50000 images in the CIFAR-10 dataset to be used for training. FR uses 20000 photos to train the Inception-V3 network with 22 layers, and 230 categories of food are recognized. The training set of VC is a quarter of YouTube-8M, and only a 4-layer Long Short-Term Memory (LSTM) network is trained, using 2 million extracted features. What's more, contention among the same models and among mixed models is considered in the experiments.

The data transmission problem is our concern, so the scheduler of [16], which optimizes tasks for each application (OTEA), is one of the baseline schedulers. We also employ two default MapReduce schedulers, FIFO and FAIR, to compare with \(D^2\text {S}\). Four metrics are used to evaluate the performance of \(D^2\text {S}\). The first is the makespan of applications: a shorter makespan is better, and a smaller lag among all makespans indicates better synchronized progress of applications. The second is the makespan of tasks: tasks with a shorter makespan save the most data transmission. The distribution of running tasks is then given as a supplement to provide details. The last is the network traffic, which shows the optimized data transmission in a quantitative manner.

Fig. 3. The makespan of IC.

Fig. 4. The makespan of FR.

Fig. 5. The makespan of VC.

Fig. 6. The makespan of mixed DNN models.

4.2 Results Analysis

Our experiments cover both the same and mixed DNN models. Each model has 15 applications, while the numbers of tasks in the applications are not the same. Figures 3, 4, 5 and 6 show the makespan for the different types of models; the makespan of tasks is shown as a cumulative distribution. It is obvious that \(D^2\text {S}\) synchronizes applications best. Compared with the other schedulers, the makespan of applications fluctuates in the smallest range: the worst ratio of maximum to minimum makespan is 6.2, which is smaller than 19.3 for FIFO, 9.8 for FAIR and 8.1 for OTEA across all DNN models. But the makespan under \(D^2\text {S}\) is not always the minimum, because the others offer some applications a shorter makespan, such as IC application 4 in Fig. 3 and FR application 7 in Fig. 4. The small delay is mainly caused by the overhead of considering all applications at once in \(D^2\text {S}\), while the baseline schedulers introduce less scheduling time because of sequential scheduling. FIFO assigns tasks whenever a node issues a request for tasks. In a smaller cluster, FAIR applies the same rule to assign tasks as FIFO. Such policies lead to the least scheduling time, and their random manner occasionally yields the optimal assignment. OTEA optimizes data transmission for a single application, and the time spent balancing applications is zero. It is worthwhile to increase overhead to improve the makespan: the averages of the 45 makespans are 1528400 ms for \(D^2\text {S}\), 2737100 ms for OTEA, 3264200 ms for FIFO and 3593200 ms for FAIR. Thus the makespan decreases by 44.2%, 53.2% and 57.5% respectively. In short, \(D^2\text {S}\) is the best at improving the makespan of applications.

In most cases, \(D^2\text {S}\) offers map tasks a shorter time. The makespan is less than 1000000 ms for 90% of map tasks in IC and FR applications, while the other schedulers increase it by at least 3 times; VC applications show a similar situation. \(D^2\text {S}\) also improves the makespan of reduce tasks, and the maximum makespan is reduced by 4.5 times compared with the other schedulers. But \(D^2\text {S}\) prolongs some map tasks. For example, under \(D^2\text {S}\) only 13% of tasks complete within 750000 ms, while the other baseline schedulers finish more than 50% of tasks within 750000 ms for IC applications. Such delay of map tasks is mainly due to the same reason as the delay of applications. Overall, \(D^2\text {S}\) gives the best performance for both map tasks and reduce tasks.

Table 1. Locality in the same DNN models.

In addition to the makespan, Table 1 shows the distribution of tasks for the same DNN model. There are three types of tasks: data-local, rack-local and off-switch. If a task runs on the node that stores its input data, it is a data-local task and avoids data transmission. Similarly, rack-local tasks are those whose task-running node and data-storing node are located in the same rack. The remaining tasks are off-switch, because their task-running nodes and data-storing nodes belong to different racks. The transmitting time of rack-local tasks is shorter than that of off-switch tasks because the available network bandwidth within a rack is better. FIFO and FAIR are the worst performers as a result of the highest average fraction of off-switch tasks, which forces half of the tasks to transmit data across racks. \(D^2\text {S}\) is the best, with a maximum average of 61.8% data-local tasks. This average is almost two times more than the 35.5% of FAIR and the 27.3% of OTEA. Even the minimum fraction of data-local tasks increases to 33.3% under \(D^2\text {S}\), while it is 8.3% for OTEA and 0 for the other two schedulers. Besides the increase in data-local tasks, 97% of the remaining tasks are rack-local and only 3% are off-switch under \(D^2\text {S}\). The percentage of rack-local tasks ranks second, lower than the 47.8% of OTEA; but those 47.8% rack-local tasks account for only 65.7% of all non-data-local tasks in OTEA. Table 2 gives the percentages of locality for tasks in the mixed DNN models. \(D^2\text {S}\) still performs best, with the most data-local tasks.
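For reference, the three locality classes can be expressed as a small helper; the rack mapping in the example is an illustrative assumption, not the paper's topology code.

```python
# The three locality classes as a small helper; the rack mapping in the
# example is an illustrative assumption, not the paper's topology code.

def locality(task_node, data_node, rack_of):
    """Classify a placement as data-local, rack-local or off-switch,
    given a mapping rack_of: node -> rack id."""
    if task_node == data_node:
        return "data-local"       # no data transmission at all
    if rack_of[task_node] == rack_of[data_node]:
        return "rack-local"       # the transfer stays inside one rack
    return "off-switch"           # the transfer crosses the router between racks

rack_of = {"n1": 0, "n2": 0, "n3": 0, "n4": 1, "n5": 1, "n6": 1}
print(locality("n1", "n2", rack_of))   # rack-local
print(locality("n1", "n5", rack_of))   # off-switch
```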

Table 2. Locality in the mixed DNN models.
Fig. 7. The network traffic of IC and FR.

Fig. 8. The network traffic of VC and mixed DNN models.

Figures 7 and 8 describe the network traffic used for moving data. The average network traffic of each experiment is shown for the different schedulers and applications. It is obvious that \(D^2\text {S}\) uses the minimum network traffic to move data, and its variance is also the smallest. The value remains at around 5 GB for all workload types under \(D^2\text {S}\), while the other schedulers fluctuate between 7 GB and 15 GB. There are some exceptions of \(D^2\text {S}\) in VC applications, where a large variance occurs, but the maximum network traffic used is still lower than the minimum of the other schedulers. The trend in the right part of Fig. 8 demonstrates that \(D^2\text {S}\) is also suitable for mixed DNN models; its usage of network traffic is always the minimum.

5 Conclusions

This paper proposes a new scheduler named \(D^2\text {S}\) to accelerate distributed DNN training. The biggest challenge is caused by distributed storage in the distributed cluster used for parallel training. Such distribution leads to the requirement for timely data transmission for applications. At the same time, applications need to be synchronized so that successive applications are not delayed in parallel training. Thus \(D^2\text {S}\) is designed to optimize the time of data transmission for tasks without neglecting the makespan of applications. A scheduling graph combines the two factors to accelerate distributed DNN training. We test \(D^2\text {S}\) with the same and mixed workload applications in MapReduce. The experimental results demonstrate that \(D^2\text {S}\) offers applications a shorter and more stable makespan; in this way, all applications are synchronized by \(D^2\text {S}\). In addition, less data transmission also helps to accelerate training. In the future, we will implement more deep network models in the distributed cluster to accelerate training.