Hierarchical task mapping for parallel applications on supercomputers

Wu, Jingjin; Xiong, Xuanxing; Lan, Zhiling

doi:10.1007/s11227-014-1324-5

Hierarchical task mapping for parallel applications on supercomputers

Published: 15 November 2014

Volume 71, pages 1776–1802, (2015)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Jingjin Wu¹,
Xuanxing Xiong² &
Zhiling Lan³

493 Accesses
16 Citations
1 Altmetric
Explore all metrics

Abstract

As the scale of supercomputers grows, so does the size of the interconnect network. Topology-aware task mapping, which maps parallel application processes onto processors to reduce communication cost, becomes increasingly important. Previous works mainly focus on the task mapping between compute nodes (i.e., inter-node mapping), while ignoring the mapping within a node (i.e., intra-node mapping). In this paper, we propose a hierarchical task mapping strategy, which performs both inter-node and intra-node mapping. We consider supercomputers with popular fat-tree and torus network topologies, and introduce two mapping algorithms: (1) a generic recursive tree mapping algorithm, which can handle both inter-node mapping and intra-node mapping; (2) a recursive bipartitioning mapping algorithm for torus topology, which efficiently partitions the compute nodes according to their coordinates. Moreover, a hierarchical task mapping library is developed. Experimental results show that the proposed approach significantly improves the communication performance by up to 77 % with low runtime overhead.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Topology mapping of irregular parallel applications on torus-connected supercomputers

Article 26 October 2016

TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

Algorithms for tree-shaped task partition and allocation on heterogeneous multiprocessors

Article 22 March 2023

Notes

We use “task mapping” and “topology mapping” interchangeably.
Note that other graph partitioning algorithms may have different time complexities.
On Blue Gene/P systems, if the number of allocated nodes is less than $512$, then their network topology is a mesh.
LibTopMap failed to derive mapping solutions when the number of processes is larger than 1024.

References

LibTopoMap: A Generic Topology Mapping Library. http://www.unixer.de/research/mpitopo/libtopomap/
MPI: A message-passing interface standard. version 3.0. http://www.mpi-forum.org/
ALCF Intrepid. https://www.alcf.anl.gov/intrepid
METIS: Graph Partitioning Tool. http://glaros.dtc.umn.edu/gkhome/views/metis
NICS Kraken User Guide. https://www.xsede.org/web/guest/nics-kraken
TACC Stampede User Guide. http://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide
Top 500 Supercomputer Sites. http://www.top500.org/
TOPOMap. http://bluesky.cs.iit.edu/topomap/
Abts D (2011) The Cray XT4 and Seastar 3-D Torus interconnect. Encycl Parallel Comput 4:470–477.
Agarwal T, Sharma A, Laxmikant A, Kale LV (2006) Topology-aware task mapping for reducing communication contention on large parallel machines. In: Proceedings of IEEE international symposium on parallel and distributed processing (IPDPS)
Aleliunas R, Rosenberg AL (1982) On embedding rectangular grids in square grids. IEEE Trans Comput C-31(9):907–913. doi:10.1109/TC.1982.1676109
Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Int J Eurogr Assoc (Computer Graphics Forum) 8(1):3–12
Article Google Scholar
Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box. In: Proceedings of the 7th annual international high performance computing conference. The 1993 high performance computing: new horizons supercomputing symposium, pp 349–357.
Arimilli B, Arimilli R, Chung V, Clark S, Denzel W, Drerup B, Hoefler T, Joyner J, Lewis J, Li J, Ni N, Rajamony R (2010) The PERCS high-performance interconnect. In: Proceedings of the 18th IEEE symposium on high performance interconnects, pp 75–82
Berman F, Snyder L (1987) On mapping parallel algorithms into parallel architectures. J Parallel Distrib Comput 4(5):439–458
Article Google Scholar
Bhatele A (2010) Automating topology aware mapping for supercomputers. PhD thesis, University of Illinois at Urbana-Champaign, Urbana.
Bhatele A, Kale LV (2008) Application-specific topology-aware mapping for three dimensional topologies. In: Proceedings of IEEE international symposium on parallel and distributed processing (IPDPS), pp 1–8
Bokhari SH (1981) On the mapping problem. IEEE Trans Comput 30(3):207–214
Article MathSciNet Google Scholar
Broquedis F, Clet-Ortega J, Moreaud S, Furmento N, Goglin B, Mercier G, Thibault S, Namyst R (2010) hwloc: a generic framework for managing hardware affinities in hpc applications. In: Proceedings of the 18th euromicro international conference on parallel, distributed and network-based processing (PDP), pp 180–186. doi:10.1109/PDP.2010.67
Chockalingam T, Arunkumar S (1992) A randomized heuristics for the mapping problem: the genetic approach. Parallel Comput 18(10):1157–1165
Article MATH Google Scholar
Chung IH, Lee CR, Zhou J, Chung YC (2011) Hierarchical mapping for HPC applications. In: Proceedings of IEEE international symposium on parallel and distributed processing workshops and Phd forum (IPDPSW), pp 1815–1823
Cuthill E, McKee J (1969) Reducing the bandwidth of sparse symmetric matrices. In: Proceedings of the 24th National Conference, pp 157–172.
Davis TA, Hu Y (2011) The university of florida sparse matrix collection. ACM Trans Math Softw 38(1):1–25
MathSciNet Google Scholar
Deveci M, Rajamanickam S, Leung VJ, Pedretti K, Olivier SL, Bunde DP, Çatalyürek UV, Devine K (2014) Exploiting geometric partitioning in task mapping for parallel computers. In: Proceedings of the 2014 IEEE 28th international parallel and distributed processing symposium, pp 27–36
Ercal F, Ramanujam J, Sadayappan P (1988) Task allocation onto a hypercube by recursive mincut bipartitioning. In: Proceedings of the third conference on hypercube concurrent computers and applications: architecture, software, computer systems, and general issues, vol 1, no C3P, pp 210–221
Hoefler T, Snir M (2011) Generic topology mapping strategies for large-scale parallel architectures. In: Proceedings of the international conference on supercomputing (ICS), pp 75–84
Jeannot E, Mercier G (2010) Near-optimal placement of MPI processes on hierarchical NUMA architectures. In: Proceedings of the 16th International euro-par conference on parallel processing: part II, pp 199–210
Karypis G, Kumar V (1998) Multilevel k-way partitioning scheme for irregular graphs. J Parallel Distrib Comput 48(1):96–129
Article MathSciNet Google Scholar
Kravtsov AV, Klypin AA, Khokhlov AM (1997) Adaptive refinement tree: a new high-resolution N-body code for cosmological simulations. Astrophys J Suppl Ser 111:73–94
Article Google Scholar
Lee C, Bic L (1989) On the mapping problem using simulated annealing. In: Proceedings of international phoenix conference on computers and communications, pp 40–44. doi:10.1109/PCCC.1989.37357
Leiserson CE (1985) Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Trans Comput 34(10):892–901
Article Google Scholar
Mercier G, Jeannot E (2011) Improving MPI applications performance on multicore clusters with rank reordering. In: Proceedings of the 18th European MPI users’ group conference on recent advances in the message passing interface, pp 39–49
Pellegrini F (1994) Static mapping by dual recursive bipartitioning of process architecture graphs. In: Proceedings of the scalable high-performance computing conference, pp 486–493
pellegrini F, Roman J (1996) Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. High-performance computing and networking. Lecture notes in computer science, vol 1067, pp 493–498
Plewa T, Linde T, Weirs VG (2005) Adaptive mesh refinement: theory and applications. Springer, Berlin
Book Google Scholar
Rashti MJ, Green J, Balaji P, Afsahi A, Gropp W (2011) Multi-core and network aware MPI topology functions. In: Proceedings of the 18th European MPI users’ group conference on recent advances in the message passing interface, pp 50–60
Salman A, Ahmad I, Al-Madani S (2002) Particle swarm optimization for task assignment problem. Microprocess Microsyst 26(8):363–371
Article Google Scholar
Subramoni H, Potluri S, Kandalla K, Barth B, Vienne J, Keasler J, Tomko K, Schulz K, Moody A, Panda D (2012) Design of a scalable infiniband topology service to enable network-topology-aware placement of processes. In: Proceedings of international conference on high performance computing, networking, storage and analysis, pp 1–12. doi:10.1109/SC.2012.47
Tang W, Lan Z, Desai N, Buettner D, Yu Y (2011) Reducing fragmentation on torus-connected supercomputers. In: Proceedings of IEEE international symposium on parallel and distributed processing (IPDPS), pp 828–839
Tr$\ddot{a}$ff JL (2002) Implementing the MPI process topology mechanism. In: Proceedings of ACM/IEEE conference on supercomputing, pp 28:1–28:14
Wu J, Gonzalez RE, Lan Z, Gnedin NY, Kravtsov AV, Rudd DH, Yu Y (2011) Performance emulation of cell-based AMR cosmology simulations. In: Proceedings of IEEE International conference on cluster computing (CLUSTER), pp 8–16
Wu J, Lan Z, Xiong X, Gnedin NY, Kravtsov AV (2012) Hierarchical task mapping of cell-based AMR cosmology simulations. In: Proceedings of international conference on high performance computing, networking, storage and analysis, SC ’12, pp 75:1–75:10
Yu H, Chung IH, Moreira J (2006) Topology mapping for Blue Gene/L supercomputer. In: Proceedings of ACM/IEEE conference on supercomputing, p 52
Yu Y, Rudd DH, Lan Z, Gnedin NY, Kravtsov AV, Wu J (2012) Improving parallel IO performance of cell-based AMR cosmology applications. In: Proceedings of IEEE international symposium on parallel and distributed processing (IPDPS), pp 933–944
Zou H, Yu Y, Tang W, Chen HWM (2014) Flexanalytics: a flexible data analytics framework for big data applications with I/O performance improvement. Big Data Res 1(0): 4–13 (Special Issue on Scalable Computing for Big Data)

Download references

Acknowledgments

The work at Illinois Institute of Technology is supported in part by US National Science Foundation grant CNS-1320125. This work is also supported in part by the National Natural Science Foundation of China grant 61402083. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu This material is based upon work supported by the National Science Foundation under Grant numbers 0711134, 0933959, 1041709, and 1041710 and the University of Tennessee through the use of the Kraken computing resource at the National Institute for Computational Sciences (http://www.nics.tennessee.edu). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575.

Author information

Authors and Affiliations

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
Jingjin Wu
Design Group, Synopsys, Inc., Mountain View, CA, 94043, USA
Xuanxing Xiong
Department of Computer Science, Illinois Institute of Technology, Chicago, IL, 60616, USA
Zhiling Lan

Authors

Jingjin Wu
View author publications
You can also search for this author inPubMed Google Scholar
Xuanxing Xiong
View author publications
You can also search for this author inPubMed Google Scholar
Zhiling Lan
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Jingjin Wu.

Additional information

The majority of this work was done when Jingjin and Xuanxing were Ph.D. students at Illinois Institute of Technology.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wu, J., Xiong, X. & Lan, Z. Hierarchical task mapping for parallel applications on supercomputers. J Supercomput 71, 1776–1802 (2015). https://doi.org/10.1007/s11227-014-1324-5

Download citation

Published: 15 November 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s11227-014-1324-5

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Hierarchical task mapping for parallel applications on supercomputers

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Topology mapping of irregular parallel applications on torus-connected supercomputers

TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

Algorithms for tree-shaped task partition and allocation on heterogeneous multiprocessors

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now