
Remote Administration and Fault Tolerance in Distributed Computer Infrastructures

Chapter in: Future Generation Grids

Abstract

Independent of the level of built-in resilience, large distributed computer infrastructures become unreliable once they are scaled to sufficient size. Fault tolerance is therefore an extremely important issue in large Grid systems, especially since neither the nodes, nor their interconnects, nor even their data repositories can be assumed to be reliable, owing to the distributed nature of the Grid. Advanced fault-tolerance and maintenance techniques are required to ensure the operation of large-scale Grid systems. This paper discusses key issues such as the reliable distribution of information in a data-driven application in the presence of failures, and the highly reliable distributed storage of data. Particular developments are currently being pursued to provide the required resilience for distributed processing and mass storage. Benchmark results of applicable prototypes are discussed. A number of issues remain open research topics for future generation Grids; these are discussed in the outlook.
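For context on the storage side: the simplest mechanism providing this kind of resilience is single-parity erasure coding, as used in RAID arrays, where one parity block allows any single lost data block to be rebuilt from the survivors. The sketch below is purely illustrative and assumes fixed-size byte blocks distributed over separate nodes; it is not the prototype developed in this chapter.

    # Illustrative sketch: single-parity (RAID-4/5 style) redundancy over
    # equally sized byte blocks; not the chapter's actual prototype.

    def xor_parity(blocks):
        """Compute the XOR parity block over equally sized data blocks."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    def recover(surviving_blocks, parity):
        """Rebuild a single lost data block from the survivors and the parity."""
        return xor_parity(list(surviving_blocks) + [parity])

    # Three data blocks stored on three nodes, plus one parity block.
    data = [b"node0blk", b"node1blk", b"node2blk"]
    parity = xor_parity(data)

    # Node 1 fails; its block is reconstructed from the remaining blocks.
    rebuilt = recover([data[0], data[2]], parity)
    assert rebuilt == data[1]

Tolerating more than one simultaneous failure requires stronger codes such as Reed-Solomon erasure coding, at the cost of additional computation.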




Copyright information

© 2006 Springer Science+Business Media, Inc.

About this chapter

Cite this chapter

Lindenstruth, V., Panse, R., Steinbeck, T., Tilsner, H., Wiebalck, A. (2006). Remote Administration and Fault Tolerance in Distributed Computer Infrastructures. In: Getov, V., Laforenza, D., Reinefeld, A. (eds) Future Generation Grids. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-29445-2_4


  • DOI: https://doi.org/10.1007/978-0-387-29445-2_4

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-27935-0

  • Online ISBN: 978-0-387-29445-2

  • eBook Packages: Computer Science, Computer Science (R0)
