
TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2018)

Abstract

As high performance computing systems grow in scale, communication overhead between processors has become a key performance bottleneck. Default process-to-processor mapping strategies, however, do not take the topology of the interconnection network into account, so communication messages may travel unnecessarily long distances. To enhance communication locality, we propose a new topology-aware mapping method called TAMM. Starting from an accurate description of the application's communication pattern and the network topology, TAMM employs a two-step optimization strategy to obtain an efficient mapping for a variety of parallel applications: it first extracts an appropriate subset of the idle computing resources on the underlying system and then constructs an optimized one-to-one mapping with a refined iterative algorithm. Experimental results demonstrate that TAMM effectively improves communication performance on the Tianhe-2A supercomputer.
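The paper itself details TAMM's two-step strategy; the sketch below is only a rough illustration of the general idea described in the abstract, not the authors' algorithm. It pairs a compact-subset selection step with a greedy pairwise-swap refinement that reduces weighted hop-bytes. The topology model (a grid/torus with Manhattan-distance hop counts), the cost function, and all names (`hop_distance`, `select_compact_subset`, `refine_mapping`) are assumptions made for this example.

```python
import itertools

def hop_distance(a, b):
    """Manhattan distance between two node coordinates -- an assumed
    stand-in for the real hop count on the interconnect."""
    return sum(abs(x - y) for x, y in zip(a, b))

def select_compact_subset(idle_nodes, k):
    """Step 1 (sketch): from all idle nodes, pick a k-node subset whose
    members lie close together, approximating a compact allocation."""
    best = None
    for seed in idle_nodes:
        subset = sorted(idle_nodes, key=lambda n: hop_distance(seed, n))[:k]
        cost = sum(hop_distance(a, b)
                   for a, b in itertools.combinations(subset, 2))
        if best is None or cost < best[0]:
            best = (cost, subset)
    return best[1]

def mapping_cost(mapping, comm, nodes):
    """Weighted hop-bytes: communication volume times hop distance."""
    return sum(vol * hop_distance(nodes[mapping[p]], nodes[mapping[q]])
               for (p, q), vol in comm.items())

def refine_mapping(comm, nodes, max_sweeps=10):
    """Step 2 (sketch): start from the identity mapping and greedily swap
    process pairs while any swap lowers the weighted hop-bytes."""
    n = len(nodes)
    mapping = list(range(n))              # process i -> node mapping[i]
    cost = mapping_cost(mapping, comm, nodes)
    for _ in range(max_sweeps):
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                mapping[i], mapping[j] = mapping[j], mapping[i]   # try swap
                new_cost = mapping_cost(mapping, comm, nodes)
                if new_cost < cost:
                    cost = new_cost
                    improved = True
                else:
                    mapping[i], mapping[j] = mapping[j], mapping[i]  # undo
        if not improved:
            break
    return mapping

if __name__ == "__main__":
    # Hypothetical 3-D mesh coordinates of idle nodes and a 4-process ring pattern.
    idle = [(x, y, z) for x in range(4) for y in range(4) for z in range(2)]
    nodes = select_compact_subset(idle, k=4)
    comm = {(0, 1): 10.0, (1, 2): 10.0, (2, 3): 10.0, (3, 0): 10.0}
    mapping = refine_mapping(comm, nodes)
    print({p: nodes[nid] for p, nid in enumerate(mapping)})
```

A real implementation would use the measured communication matrix of the application and the actual routing and hop information of the Tianhe-2A interconnect rather than these toy stand-ins.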

Acknowledgment

This research was supported in part by the National Key Research and Development Program of China (2017YFB0202104), the National Natural Science Foundation of China under Grants No. 91530324 and No. 91430218, the China Postdoctoral Science Foundation (CPSF) under Grant No. 2014M562570, and a Special Financial Grant from the CPSF under Grant No. 2015T81127.

Author information

Corresponding author

Correspondence to Jie Liu.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, X., Liu, J., Li, S., Xie, P., Chi, L., Wang, Q. (2018). TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer. In: Vaidya, J., Li, J. (eds.) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11334. Springer, Cham. https://doi.org/10.1007/978-3-030-05051-1_17

  • DOI: https://doi.org/10.1007/978-3-030-05051-1_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05050-4

  • Online ISBN: 978-3-030-05051-1

  • eBook Packages: Computer Science, Computer Science (R0)
