The Effect of Asymmetric Performance on Asynchronous Task Based Runtimes

ABSTRACT
It is generally accepted that future supercomputing workloads will consist of application compositions made up of coupled simulations and in-situ analytics. While these components have commonly been deployed in a space-shared configuration to minimize cross-workload interference, it is likely that not all workload components will require the full processing capacity of the CPU cores they run on. For instance, an analytics workload often does not need to run continuously and is generally not given the same priority as the simulation codes. In a space-shared configuration, this arrangement leads to wasted resources in the form of periodically idle CPUs, which traditional bulk synchronous parallel (BSP) applications are generally unable to exploit. As a result, many have started to reconsider task-based runtimes owing to their ability to dynamically utilize available CPU resources. While the dynamic behavior of task-based runtimes has historically targeted application-induced load imbalances, the same basic situation arises from the asymmetric performance that results when a CPU is time shared with other workloads. Many have assumed that task-based runtimes would adapt easily to these new environments without significant modification. In this paper, we present a preliminary set of experiments measuring how well asynchronous task-based runtimes respond to load imbalances caused by the asymmetric performance of time-shared CPUs. Our work focuses on a set of experiments using benchmarks running on both Charm++ and HPX-5 in the presence of a competing workload. The results show that while these runtimes handle such scenarios better than traditional runtimes, they can so far tolerate only a fairly minimal level of CPU contention effectively.
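The core intuition behind the abstract can be illustrated with a toy scheduling model (not taken from the paper; the function names, task counts, and speed values below are illustrative assumptions). A BSP-style step with a static equal partition is gated by the slowest core at every barrier, while a task pool lets faster cores absorb extra work when one core loses capacity to a time-shared competing workload:

```python
import heapq

def bsp_makespan(n_tasks, speeds):
    # Static equal partition with a barrier: the step ends only when the
    # slowest (most contended) worker finishes its fixed chunk.
    chunk = n_tasks / len(speeds)
    return max(chunk / s for s in speeds)

def taskpool_makespan(n_tasks, speeds):
    # Greedy dynamic scheduling: each idle worker pulls the next unit task.
    # Simulated with a heap keyed on each worker's next-free time.
    heap = [(0.0, s) for s in speeds]
    heapq.heapify(heap)
    finish = 0.0
    for _ in range(n_tasks):
        t, s = heapq.heappop(heap)
        t += 1.0 / s  # one unit of work at this worker's effective speed
        finish = max(finish, t)
        heapq.heappush(heap, (t, s))
    return finish

# Four cores, one at half speed because it time-shares with an analytics task.
speeds = [1.0, 1.0, 1.0, 0.5]
print(bsp_makespan(400, speeds))       # 200.0: the contended core dominates
print(taskpool_makespan(400, speeds))  # ~114.5: work flows to the fast cores
```

In this idealized model the task pool approaches the aggregate-throughput bound (400 / 3.5 ≈ 114.3). The paper's point is that real runtimes such as Charm++ and HPX-5 fall short of this ideal once contention grows, because scheduling, migration, and synchronization costs are not free as they are in this sketch.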