Topology mapping of irregular parallel applications on torus-connected supercomputers

Wu, Jingjin; Xiong, Xuanxing; Berrocal, Eduardo; Wang, Jia; Lan, Zhiling

doi:10.1007/s11227-016-1876-7

Published: 26 October 2016

Volume 73, pages 1691–1714 (2017)

The Journal of Supercomputing

Jingjin Wu ORCID: orcid.org/0000-0002-9615-4260¹,
Xuanxing Xiong²,
Eduardo Berrocal³,
Jia Wang⁴ &
…
Zhiling Lan³

Abstract

Supercomputers with ever-increasing computing power are being built for scientific applications. As system size scales up, so does the size of the interconnect network. As a result, communication in supercomputers becomes increasingly expensive due to the long distances between nodes and network contention. Topology mapping, which maps parallel application processes onto compute nodes by considering the network topology and the application's communication pattern, is an essential technique for communication optimization. In this paper, we study the topology mapping problem for torus-connected supercomputers and present an analytical topology mapping algorithm for parallel applications with irregular communication patterns. We formulate the problem as a discrete optimization problem in the geometric domain of a torus topology and design an analytical mapping algorithm that uses numerical solvers to compute the mapping. Experimental results show that our algorithm provides high-quality mappings on 3-dimensional tori, reducing communication time by up to 72%.
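
To make the notion of mapping quality concrete, the following is a minimal sketch of how one might score a mapping on a 3-dimensional torus. It assumes a common cost model (traffic volume multiplied by shortest hop distance with wrap-around), which may differ from the objective used in the paper. The function names, the communication graph, and both candidate mappings are hypothetical.

```python
def torus_distance(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus of size dims."""
    # In each dimension a message can travel either way around the ring.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def mapping_cost(comm, mapping, dims):
    """Sum of traffic volume times hop distance over all communicating pairs."""
    return sum(w * torus_distance(mapping[i], mapping[j], dims)
               for (i, j), w in comm.items())


dims = (4, 4, 4)  # a 4 x 4 x 4 torus

# Hypothetical communication graph: (process i, process j) -> traffic volume
comm = {(0, 1): 10.0, (1, 2): 5.0, (0, 3): 1.0}

# Two hypothetical mappings of processes 0..3 onto torus coordinates
scattered = {0: (0, 0, 0), 1: (2, 2, 2), 2: (0, 2, 0), 3: (1, 0, 0)}
compact = {0: (0, 0, 0), 1: (1, 0, 0), 2: (1, 1, 0), 3: (3, 0, 0)}

print(mapping_cost(comm, scattered, dims))  # 81.0
print(mapping_cost(comm, compact, dims))    # 16.0
```

In the compact mapping, process 3 sits at coordinate (3, 0, 0), which is one hop from (0, 0, 0) because of the wrap-around link. Under this assumed cost model, placing heavily communicating processes on nearby nodes lowers the score.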

Notes

The physical meaning of Eq. (2) is as follows. The communication graph of the application is modeled as a spring system in which each edge $(i,j)\in E_c$ is represented as a spring with spring constant $c(i,j)$. The total energy of the springs is a quadratic function of their lengths. A mapping solution is obtained by minimizing the total energy, which corresponds to finding a force equilibrium state.
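
The spring analogy can be illustrated with a small sketch. It assumes the energy has the form of a sum of $c(i,j)$ times the squared spring length, and that a few nodes are pinned at fixed coordinates so the system does not collapse to a single point. Both are assumptions for illustration, since Eq. (2) itself is not reproduced here. The sketch also ignores torus wrap-around and the final assignment to discrete nodes, and it uses plain Gauss-Seidel sweeps rather than the numerical solvers used in the paper. All names and values are hypothetical.

```python
def spring_energy(pos, springs):
    """Assumed energy: sum of c(i, j) times squared Euclidean spring length."""
    return sum(c * sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
               for (i, j), c in springs.items())


def relax(pos, springs, fixed, sweeps=200):
    """Gauss-Seidel sweeps toward force equilibrium.

    Setting the energy gradient at a free node to zero places that node
    at the spring-constant-weighted mean of its neighbors.
    """
    pos = dict(pos)
    nbrs = {}
    for (i, j), c in springs.items():
        nbrs.setdefault(i, []).append((j, c))
        nbrs.setdefault(j, []).append((i, c))
    for _ in range(sweeps):
        for i, adj in nbrs.items():
            if i in fixed:
                continue  # pinned nodes keep their coordinates
            total = sum(c for _, c in adj)
            dim = len(pos[i])
            pos[i] = tuple(sum(c * pos[j][k] for j, c in adj) / total
                           for k in range(dim))
    return pos


# Hypothetical chain 0 - 1 - 2 - 3 with unit spring constants;
# nodes 0 and 3 are pinned, nodes 1 and 2 are free.
springs = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
start = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (5.0, 5.0), 3: (6.0, 3.0)}

result = relax(start, springs, fixed={0, 3})
print(spring_energy(start, springs), spring_energy(result, springs))
print(result[1], result[2])
```

With equal spring constants the free nodes settle at evenly spaced points on the segment between the two pinned nodes, which is the equilibrium the energy minimization predicts for this chain.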

Acknowledgments

This work is supported in part by US National Science Foundation Grants OCI-0904670 and CNS-1320125. This work is also supported in part by the National Natural Science Foundation of China Grant 61402083. The authors thank the Argonne Leadership Computing Facility for the use of their supercomputers.

Author information

Authors and Affiliations

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
Jingjin Wu
Design Group, Synopsys, Inc., Mountain View, CA, 94043, USA
Xuanxing Xiong
Department of Computer Science, Illinois Institute of Technology, Chicago, IL, 60616, USA
Eduardo Berrocal & Zhiling Lan
Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, 60616, USA
Jia Wang

Corresponding author

Correspondence to Jingjin Wu.

About this article

Cite this article

Wu, J., Xiong, X., Berrocal, E. et al. Topology mapping of irregular parallel applications on torus-connected supercomputers. J Supercomput 73, 1691–1714 (2017). https://doi.org/10.1007/s11227-016-1876-7

Published: 26 October 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s11227-016-1876-7
