Abstract
Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with number of hops traveled. Yet, we, and others have recently shown that in presence of contention, message latencies can grow substantially large. Hence task mapping strategies should take the topology of the machine into account on large machines. In this paper, we present topology aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance.
Our methodology is facilitated by the idea of object-based decomposition used in Charm++ which separates the processes of decomposition from mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom to improve overall performance and scaling. OpenAtom presents complex communication scenarios of interaction involving multiple groups of objects and makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Download to read the full chapter text
Chapter PDF
References
Greenberg, R.I., Oh, H.C.: Universal wormhole routing. IEEE Transactions on Parallel and Distributed Systems 08(3), 254–262 (1997)
Ni, L.M., McKinley, P.K.: A survey of wormhole routing techniques in direct networks. Computer 26(2), 62–76 (1993)
Bhatele, A., Kale, L.V.: An Evaluation of the Effect of Interconnect Topologies on Message Latencies in Large Supercomputers. In: Proceedings of Workshop on Large-Scale Parallel Processing (IPDPS 2009) (May 2009)
Kalé, L., Krishnan, S.: CHARM++: A Portable Concurrent Object Oriented System Based on C++. In: Paepcke, A. (ed.) Proceedings of OOPSLA 1993, September 1993, pp. 91–108. ACM Press, New York (1993)
Bhandarkar, M., Kale, L.V., de Sturler, E., Hoeflinger, J.: Object-Based Adaptive Load Balancing for MPI Programs. In: Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.) ICCS-ComputSci 2001. LNCS, vol. 2074, pp. 108–117. Springer, Heidelberg (2001)
Pasquarello, A., Hybertsen, M.S., Car, R.: Interface structure between silicon and its oxide by first-principles molecular dynamics. Nature 396, 58 (1998)
De Santis, L., Carloni, P.: Serine proteases: An ab initio molecular dynamics study. Proteins 37, 611 (1999)
Saitta, A.M., Soper, P.D., Wasserman, E., Klein, M.L.: Influence of a knot on the strength of a polymer strand. Nature 399, 46 (1999)
Rothlisberger, U., Carloni, P., Doclo, K., Parinello, M.: A comparative study of galactose oxidase and active site analogs based on QM/MM Car Parrinello simulations. J. Biol. Inorg. Chem. 5, 236 (2000)
Bokhari, S.H.: On the mapping problem. IEEE Trans. Computers 30(3), 207–214 (1981)
Lee, S.Y., Aggarwal, J.K.: A mapping strategy for parallel processing. IEEE Trans. Computers 36(4), 433–442 (1987)
Ercal, F., Ramanujam, J., Sadayappan, P.: Task allocation onto a hypercube by recursive mincut bipartitioning. In: Proceedings of the 3rd conference on Hypercube concurrent computers and applications, pp. 210–221. ACM Press, New York (1988)
Berman, F., Snyder, L.: On mapping parallel algorithms into parallel architectures. Journal of Parallel and Distributed Computing 4(5), 439–458 (1987)
Bollinger, S.W., Midkiff, S.F.: Processor and link assignment in multicomputers using simulated annealing. In: ICPP (1), pp. 1–7 (1988)
Arunkumar, S., Chockalingam, T.: Randomized heuristics for the mapping problem. International Journal of High Speed Computing (IJHSC) 4(4), 289–300 (1992)
Bhanot, G., Gara, A., Heidelberger, P., Lawless, E., Sexton, J.C., Walkup, R.: Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development 49(2/3), 489–500 (2005)
Gygi, F., Draeger, E.W., Schulz, M., Supinski, B.R.D., Gunnels, J.A., Austel, V., Sexton, J.C., Franchetti, F., Kral, S., Ueberhuber, C., Lorenz, J.: Large-Scale Electronic Structure Calculations of High-Z Metals on the Blue Gene/L Platform. In: Proceedings of the International Conference in Supercomputing. ACM Press, New York (2006)
Bhatelé, A., Kalé, L.V., Kumar, S.: Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications. In: 23rd ACM International Conference on Supercomputing (2009)
Smith, B.E., Bode, B.: Performance Effects of Node Mappings on the IBM Blue Gene/L Machine. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005. LNCS, vol. 3648, pp. 1005–1013. Springer, Heidelberg (2005)
Yu, H., Chung, I.H., Moreira, J.: Topology mapping for Blue Gene/L supercomputer. In: SC 2006: Proceedings of the, ACM/IEEE conference on Supercomputing, p. 116. ACM, New York (2006)
Weisser, D., Nystrom, N., Vizino, C., Brown, S.T., Urbanic, J.: Optimizing Job Placement on the Cray XT3. In: 48th Cray User Group Proceedings (2006)
Bhatelé, A., Kalé, L.V.: Benefits of Topology Aware Mapping for Mesh Interconnects. Parallel Processing Letters (Special issue on Large-Scale Parallel Processing) 18(4), 549–566 (2008)
Bohm, E., Bhatele, A., Kale, L.V., Tuckerman, M.E., Kumar, S., Gunnels, J.A., Martyna, G.J.: Fine Grained Parallelization of the Car-Parrinello ab initio MD Method on Blue Gene/L. IBM Journal of Research and Development: Applications of Massively Parallel Systems 52(1/2), 159–174 (2008)
IBM Blue Gene Team: Overview of the IBM Blue Gene/P project. IBM Journal of Research and Development 52(1/2) (2008)
Tuckerman, M.E.: Ab initio molecular dynamics: Basic concepts, current trends and novel applications. J. Phys. Condensed Matter 14, R1297 (2002)
Dongarra, J., Luszczek, P.: Introduction to the HPC Challenge Benchmark Suite. Technical Report UT-CS-05-544, University of Tennessee, Dept. of Computer Science (2005)
Salapura, V., Ganesan, K., Gara, A., Gschwind, M., Sexton, J., Walkup, R.: Next-Generation Performance Counters: Towards Monitoring Over Thousand Concurrent Events. In: IEEE International Symposium on Performance Analysis of Systems and Software, April 2008, pp. 139–146 (2008)
Catlett, C., et al.: TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications. In: Grandinetti, L. (ed.) HPC and Grids in Action. IOS Press, Amsterdam (2007)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bhatelé, A., Bohm, E., Kalé, L.V. (2009). A Case Study of Communication Optimizations on 3D Mesh Interconnects. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_94
Download citation
DOI: https://doi.org/10.1007/978-3-642-03869-3_94
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer ScienceComputer Science (R0)