Abstract
Regardless of their level of built-in resilience, large distributed computer infrastructures become unreliable once they are scaled to a sufficiently large size. Fault tolerance is therefore a central issue in large GRID systems: because of their distributed nature, neither the nodes, nor their interconnects, nor even their data repositories can be assumed to be reliable. Advanced fault tolerance and maintenance techniques are required to ensure the operation of large-scale GRID systems. This paper discusses key issues such as the reliable distribution of information in a data-driven application in the presence of failures, and the highly reliable distributed storage of data. Specific developments are currently being pursued to provide the required resilience for distributed processing and mass storage. Benchmark results of applicable prototypes are discussed. A number of issues remain open research topics for future generation grids; these are discussed in the outlook.
Copyright information
© 2006 Springer Science+Business Media, Inc.
About this chapter
Cite this chapter
Lindenstruth, V., Panse, R., Steinbeck, T., Tilsner, H., Wiebalck, A. (2006). Remote Administration and Fault Tolerance in Distributed Computer Infrastructures. In: Getov, V., Laforenza, D., Reinefeld, A. (eds) Future Generation Grids. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-29445-2_4
DOI: https://doi.org/10.1007/978-0-387-29445-2_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-27935-0
Online ISBN: 978-0-387-29445-2
eBook Packages: Computer Science, Computer Science (R0)