ABSTRACT
Efficiently mapping application communication patterns onto the network topology is critical for optimizing the performance of communication-bound applications on parallel computing systems. The problem has been extensively studied, but most prior works formulate it as finding an isomorphic mapping between two static graphs whose edges are annotated with traffic volume and network bandwidth. In practice, however, network performance is difficult to estimate accurately, and communication patterns often change over time and are not easily obtained. This work therefore proposes a deep reinforcement learning (DRL) approach that explores better task mappings by using performance predictions and runtime communication behaviors provided by a simulator to learn an efficient task mapping algorithm. We extensively evaluated our approach using both synthetic and real applications with varied communication patterns on Torus and Dragonfly networks. Compared with several existing approaches from the literature and software libraries, our approach found task mappings that consistently achieved comparable or better application performance. In particular, for a real application, the average improvements of our approach on Torus and Dragonfly networks are 11% and 16%, respectively, while the average improvements of the other approaches are all below 6%.
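The core loop the abstract describes can be illustrated with a minimal sketch: a policy samples a task-to-node mapping, a simulator scores it, and the score drives a policy-gradient update. Everything below is an illustrative assumption, not the paper's actual method or scale: a 4-task ring communication pattern, a 2x2 torus, a traffic-weighted hop count standing in for the performance simulator, and a tabular REINFORCE policy.

```python
import math
import random

random.seed(0)

# Toy sizes for illustration (not the paper's setup): 4 tasks with a ring
# communication pattern, mapped one-to-one onto a 2x2 torus.
TASKS, NODES, SIDE = 4, 4, 2
traffic = {(0, 1): 5, (1, 2): 5, (2, 3): 5, (3, 0): 5}  # volume per task pair

def hops(a, b):
    # Shortest hop count between nodes a and b on a SIDE x SIDE torus.
    ax, ay, bx, by = a % SIDE, a // SIDE, b % SIDE, b // SIDE
    dx = min(abs(ax - bx), SIDE - abs(ax - bx))
    dy = min(abs(ay - by), SIDE - abs(ay - by))
    return dx + dy

def simulate(mapping):
    # Stand-in for the network simulator: traffic-weighted hop count.
    return sum(v * hops(mapping[i], mapping[j]) for (i, j), v in traffic.items())

# Tabular softmax policy: one logit per (task, node) placement decision.
logits = [[0.0] * NODES for _ in range(TASKS)]

def sample_mapping():
    mapping, steps = [], []
    for t in range(TASKS):
        free = [n for n in range(NODES) if n not in mapping]
        z = sum(math.exp(logits[t][m]) for m in free)
        dist = {m: math.exp(logits[t][m]) / z for m in free}
        choice = random.choices(list(dist), weights=list(dist.values()))[0]
        mapping.append(choice)
        steps.append((t, choice, dist))
    return mapping, steps

baseline, lr, best_cost = None, 0.5, float("inf")
for _ in range(300):
    mapping, steps = sample_mapping()
    cost = simulate(mapping)
    best_cost = min(best_cost, cost)
    reward = -cost
    # Moving-average baseline reduces gradient variance (REINFORCE).
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
    adv = reward - baseline
    for t, choice, dist in steps:
        for m, p in dist.items():
            # Gradient of log softmax w.r.t. logit m: 1{m == choice} - p_m
            logits[t][m] += lr * adv * ((1.0 if m == choice else 0.0) - p)

print("best mapping cost found:", best_cost)
```

The sketch converges toward placing communicating tasks on adjacent torus nodes; the paper's actual method uses a learned DRL policy (e.g., a neural network trained with a modern policy-gradient algorithm) and a full network simulator rather than this hop-count proxy.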
A Deep Reinforcement Learning Method for Solving Task Mapping Problems with Dynamic Traffic on Parallel Systems