ABSTRACT
With the rapid growth of machine learning applications, the workloads of future HPC systems are expected to be a mix of scientific simulation, big data analytics, and machine learning. Simulation is a valuable research vehicle for understanding the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this work, we propose a scalable workload manager that provides an automated framework for hybrid workload simulation. We investigate various hybrid workloads and explore a range of application-system configurations to gain a deeper understanding of the performance implications of diverse workload mixes on current and future supercomputers.
Index Terms
- A Study of Simulating Heterogeneous Workloads on Large-scale Interconnect Network