Abstract
As the scale of supercomputers grows, so does the size of the interconnect network. Topology-aware task mapping, which maps the processes of a parallel application onto processors so as to reduce communication cost, becomes increasingly important. Previous work focuses mainly on task mapping between compute nodes (i.e., inter-node mapping) and ignores mapping within a node (i.e., intra-node mapping). In this paper, we propose a hierarchical task mapping strategy that performs both inter-node and intra-node mapping. We consider supercomputers with the popular fat-tree and torus network topologies, and introduce two mapping algorithms: (1) a generic recursive tree mapping algorithm, which can handle both inter-node and intra-node mapping; and (2) a recursive bipartitioning mapping algorithm for torus topologies, which efficiently partitions the compute nodes according to their coordinates. We also develop a hierarchical task mapping library. Experimental results show that the proposed approach improves communication performance by up to 77% with low runtime overhead.
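To illustrate the coordinate-based idea behind algorithm (2), the following is a minimal sketch and not the paper's algorithm. It assumes that each compute node is identified by an integer coordinate tuple, that each split is made along the dimension with the widest extent, and that torus wraparound links and the application's communication graph can be ignored. The function names `bipartition_by_coords` and `recursive_order` are hypothetical.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, ...]


def bipartition_by_coords(nodes: Sequence[Coord]) -> Tuple[List[Coord], List[Coord]]:
    """Split nodes into two halves along the dimension with the widest extent.

    Assumption: wraparound links are ignored, so extent is simply max - min.
    """
    dims = len(nodes[0])
    extents = [
        max(n[d] for n in nodes) - min(n[d] for n in nodes) for d in range(dims)
    ]
    split_dim = extents.index(max(extents))
    # Sort by the split coordinate; the full tuple breaks ties deterministically.
    ordered = sorted(nodes, key=lambda n: (n[split_dim], n))
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def recursive_order(nodes: Sequence[Coord]) -> List[Coord]:
    """Recursively bisect the node set and return the nodes in leaf order,
    so that nodes close in the ordering tend to be close in coordinates."""
    if len(nodes) <= 1:
        return list(nodes)
    left, right = bipartition_by_coords(nodes)
    return recursive_order(left) + recursive_order(right)


# Hypothetical 4 x 2 allocation of compute nodes.
grid = [(x, y) for x in range(4) for y in range(2)]
left_half, right_half = bipartition_by_coords(grid)
order = recursive_order(grid)
```

In a full mapper, each node bisection would be paired with a matching bisection of the task communication graph; this sketch covers only the node side.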












Notes
We use “task mapping” and “topology mapping” interchangeably.
Note that other graph partitioning algorithms may have different time complexities.
On Blue Gene/P systems, if the number of allocated nodes is less than 512, then their network topology is a mesh.
LibTopoMap failed to derive mapping solutions when the number of processes was larger than 1024.
Acknowledgments
The work at Illinois Institute of Technology is supported in part by US National Science Foundation grant CNS-1320125. This work is also supported in part by the National Natural Science Foundation of China grant 61402083. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin (http://www.tacc.utexas.edu) for providing HPC resources that have contributed to the research results reported within this paper. This material is based upon work supported by the National Science Foundation under Grant numbers 0711134, 0933959, 1041709, and 1041710 and the University of Tennessee through the use of the Kraken computing resource at the National Institute for Computational Sciences (http://www.nics.tennessee.edu). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575.
Additional information
The majority of this work was done when Jingjin and Xuanxing were Ph.D. students at Illinois Institute of Technology.
Cite this article
Wu, J., Xiong, X. & Lan, Z. Hierarchical task mapping for parallel applications on supercomputers. J Supercomput 71, 1776–1802 (2015). https://doi.org/10.1007/s11227-014-1324-5
DOI: https://doi.org/10.1007/s11227-014-1324-5