Abstract
Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm in high-end computing. Furthermore, as processor counts grow on the most capable systems, the activity needed to manage the system becomes an increasing burden. Making a general-purpose operating system scale to such levels requires new technology for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems, and discuss an approach to scaling such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.
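The operating-system interference the abstract alludes to is commonly observed with a fixed-work-quantum microbenchmark: a loop that performs an identical amount of work repeatedly, where any spread between the fastest and slowest repetitions reflects OS activity (daemons, scheduler preemption, interrupts) rather than the work itself. The sketch below is illustrative only (iteration and sample counts are arbitrary) and is not taken from the paper:

```python
import time

def fixed_work_quantum(iters=200_000):
    """Perform a fixed amount of work; return elapsed wall-clock time."""
    x = 0
    t0 = time.perf_counter()
    for i in range(iters):
        x += i
    return time.perf_counter() - t0

def noise_profile(samples=200):
    """Run the quantum repeatedly; the minimum approximates pure compute
    time, while samples above it indicate OS interference ("noise")."""
    times = [fixed_work_quantum() for _ in range(samples)]
    return min(times), sum(times) / len(times), max(times)

best, mean, worst = noise_profile()
print(f"min {best*1e3:.3f} ms  mean {mean*1e3:.3f} ms  max {worst*1e3:.3f} ms")
```

On a bulk-synchronous parallel job, the slowest such quantum across all processors gates every timestep, which is why even small per-node noise degrades scalability at large processor counts.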
HPC-Colony: services and interfaces for very large systems