Abstract
Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm in high-end computing. Furthermore, as processor counts grow on the most capable systems, the activity needed to manage the system becomes an increasing burden. Making a general-purpose operating system scale to such levels requires new technology for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems, and discuss an approach to scaling such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.
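The operating-system interference the abstract alludes to is commonly observed with a fixed-work-quantum microbenchmark: a loop that performs an identical amount of work repeatedly, where any spread between the fastest and slowest repetitions reflects OS activity (daemons, scheduler preemption, interrupts) rather than the work itself. The sketch below is illustrative only (iteration and sample counts are arbitrary) and is not taken from the paper:

```python
import time

def fixed_work_quantum(iters=200_000):
    """Perform a fixed amount of work; return elapsed wall-clock time."""
    x = 0
    t0 = time.perf_counter()
    for i in range(iters):
        x += i
    return time.perf_counter() - t0

def noise_profile(samples=200):
    """Run the quantum repeatedly; the minimum approximates pure compute
    time, while samples above it indicate OS interference ("noise")."""
    times = [fixed_work_quantum() for _ in range(samples)]
    return min(times), sum(times) / len(times), max(times)

best, mean, worst = noise_profile()
print(f"min {best*1e3:.3f} ms  mean {mean*1e3:.3f} ms  max {worst*1e3:.3f} ms")
```

On a bulk-synchronous parallel job, the slowest such quantum across all processors gates every timestep, which is why even small per-node noise degrades scalability at large processor counts.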
HPC-Colony: services and interfaces for very large systems