How to build a fast and reliable 1024 node cluster with only one disk

The Journal of Supercomputing

Abstract

In the last year, Los Alamos National Laboratory (LANL) has constructed a 1408-node AMD Opteron cluster, a 1024-node Intel P4 Xeon cluster, a 256-node AMD Opteron cluster, and two 128-node Intel P4 Xeon clusters. Each of these clusters is controlled by a single front-end node, and each cluster needs only one disk, located in that front-end node, for production operations. In this paper we describe the software architecture that boots and manages these clusters. This architecture represents a clean break from the way clusters have been set up for the last 14 years. We show how it has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.

Additional information

Los Alamos National Laboratory is operated by the University of California for the National Nuclear Security Administration of the United States Department of Energy under contract W-7405-ENG-36. LANL LA-UR-03-9081.

About this article

Cite this article

Hendriks, E.A., Minnich, R.G. How to build a fast and reliable 1024 node cluster with only one disk. J Supercomput 36, 171–181 (2006). https://doi.org/10.1007/s11227-006-7956-3

  • DOI: https://doi.org/10.1007/s11227-006-7956-3