HPC-Colony: services and interfaces for very large systems

Published: 01 April 2006

Abstract

Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm in high-end computing. Furthermore, as processor counts on the most capable systems increase, the activity required to manage the system becomes an ever greater burden. Making a general-purpose operating system scale to such levels requires new technology for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems and discuss an approach to scaling such systems to one hundred thousand processors while providing both scalable parallel application performance and efficient system management.
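One well-known mechanism behind the scalability limits described above is uncoordinated operating-system activity ("noise") on bulk-synchronous applications: a barrier cannot complete until the slowest rank arrives, so the chance that some rank was interrupted by local system activity grows with processor count. The sketch below is purely illustrative and not taken from the paper; the function name and the noise parameters (noise_prob, noise_len) are assumptions chosen only to show the trend.

```python
# Hypothetical sketch (not from the paper): expected per-timestep cost of a
# bulk-synchronous code when each rank may be interrupted by local OS activity.
def expected_step_time(nprocs, compute=1.0, noise_prob=0.01, noise_len=0.05):
    """Every rank must reach the barrier before any may proceed, so a single
    interrupted rank delays all of them. Parameters are illustrative only."""
    # Probability that at least one of nprocs ranks is interrupted this step.
    p_any_interrupted = 1.0 - (1.0 - noise_prob) ** nprocs
    return compute + noise_len * p_any_interrupted

if __name__ == "__main__":
    for n in (1, 64, 1024, 16384, 100000):
        print(f"{n:>6} ranks: expected step time {expected_step_time(n):.4f} "
              f"(ideal {1.0:.4f})")
```

With these illustrative numbers, a single rank almost never pays the interruption penalty, while at one hundred thousand processors essentially every timestep does, which is why coordinated scheduling and lightweight, parallel-aware system services matter at this scale.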

Published in

ACM SIGOPS Operating Systems Review, Volume 40, Issue 2, April 2006, 107 pages
ISSN: 0163-5980
DOI: 10.1145/1131322
Copyright © 2006 Authors
Publisher: Association for Computing Machinery, New York, NY, United States
Published: 1 April 2006
