Remote Administration and Fault Tolerance in Distributed Computer Infrastructures

Lindenstruth, Volker; Panse, Ralph; Steinbeck, Timm; Tilsner, Heinz; Wiebalck, Arne

doi:10.1007/978-0-387-29445-2_4

Abstract

Independent of the level of built-in resilience, large distributed computer infrastructures become unreliable once scaled to a sufficiently large size. Fault tolerance is an extremely important issue in large GRID systems, especially since neither the nodes, nor their interconnects, nor even their data repositories can be assumed to be reliable, owing to the distributed nature of the GRID system. Advanced fault tolerance and maintenance techniques are required to ensure the operation of large-scale GRID systems. This paper discusses key issues such as the reliable distribution of information in a data-driven application in the presence of failures, and the highly reliable distributed storage of data. Particular developments are currently being pursued to provide the required resilience for distributed processing and mass storage. Benchmark results of applicable prototypes are discussed. A number of issues remain open research topics for future generation grids; these are discussed in the outlook.

Author information

Authors and Affiliations

Kirchhoff Institute for Physics, University of Heidelberg, Heidelberg, Germany
Volker Lindenstruth, Ralph Panse, Timm Steinbeck, Heinz Tilsner & Arne Wiebalck

Editor information

Editors and Affiliations

University of Westminster, London, UK
Vladimir Getov
Information Science and Technologies Institute, Pisa, Italy
Domenico Laforenza
Zuse-Institut Berlin and Humboldt-Universität zu Berlin, Germany
Alexander Reinefeld

About this chapter

Cite this chapter

Lindenstruth, V., Panse, R., Steinbeck, T., Tilsner, H., Wiebalck, A. (2006). Remote Administration and Fault Tolerance in Distributed Computer Infrastructures. In: Getov, V., Laforenza, D., Reinefeld, A. (eds) Future Generation Grids. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-29445-2_4

DOI: https://doi.org/10.1007/978-0-387-29445-2_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-27935-0
Online ISBN: 978-0-387-29445-2
eBook Packages: Computer Science, Computer Science (R0)
