Abstract
In the last year, LANL has constructed a 1408-node AMD Opteron cluster, a 1024-node Intel P4 Xeon cluster, a 256-node AMD Opteron cluster, and two 128-node Intel P4 Xeon clusters. Each of these clusters is controlled by one front-end node, and each cluster needs only one disk, in the front-end node, for production operations. In this paper we describe the software architecture that boots and manages these clusters. This software architecture represents a clean break from the way that clusters have been set up for the last 14 years. We show the ways that this architecture has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.
Additional information
Los Alamos National Laboratory is operated by the University of California for the National Nuclear Security Administration of the United States Department of Energy under contract W-7405-ENG-36. LANL LA-UR-03-9081.
Cite this article
Hendriks, E.A., Minnich, R.G. How to build a fast and reliable 1024 node cluster with only one disk. J Supercomput 36, 171–181 (2006). https://doi.org/10.1007/s11227-006-7956-3
DOI: https://doi.org/10.1007/s11227-006-7956-3