
FPGA based distributed self healing architecture for reusable systems

Cluster Computing

Abstract

Creating an environment of “no doubt” for computing systems is critical for supporting next-generation science, engineering, and commercial applications. Reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) give designers an attractive basis for sophisticated but highly reliable platforms. Through partial and dynamic reconfiguration, reconfigurable computing platforms can potentially improve reliability and recover from catastrophic failures while eliminating the redundant hardware resources that existing fault-tolerant systems typically rely on. We propose a two-level self-healing methodology to offer 100% availability for mission-critical systems with comparatively less hardware overhead and performance degradation. The proposed system first attempts healing at the node level. If node-level healing fails to restore the system, network-level healing is undertaken. We have designed a system based on Xilinx Virtex-5 FPGAs and Cirronet wireless mesh nodes to demonstrate autonomous wireless healing among networked nodes. Our prototype is a proof of concept that demonstrates the feasibility of using FPGAs to provide maximum computational availability in a critical self-healing distributed architecture.
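
The two-level escalation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `repair_locally` callback, and the `heal` function are hypothetical names, and modelling network-level healing as handing a failed node's tasks to a healthy peer is an assumption, since the abstract does not say how that level works.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical model of one networked FPGA node."""
    name: str
    # Stand-in for node-level repair (for example, partial reconfiguration);
    # returns True when the node recovers on its own.
    repair_locally: Callable[[], bool]
    healthy: bool = True
    tasks: List[str] = field(default_factory=list)


def heal(failed: Node, peers: List[Node]) -> str:
    """Try node-level healing first; escalate to the network level on failure."""
    # Level 1: node-level healing.
    if failed.repair_locally():
        failed.healthy = True
        return "node-level"

    # Level 2: network-level healing. Assumption: the failed node's tasks
    # are handed to the first healthy peer in the list.
    failed.healthy = False
    for peer in peers:
        if peer.healthy:
            peer.tasks.extend(failed.tasks)
            failed.tasks = []
            return "network-level"
    return "unrecovered"
```

The point of the sketch is the ordering: the network is involved only after the local attempt has failed, which keeps the common case cheap.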



Author information


Corresponding author

Correspondence to Ali Akoglu.

Additional information

This work was supported in part by NASA JPL through the strategic university partnership (SURP 2007) program under Agreement No. 1315980.


About this article

Cite this article

Akoglu, A., Sreeramareddy, A. & Josiah, J.G. FPGA based distributed self healing architecture for reusable systems. Cluster Comput 12, 269–284 (2009). https://doi.org/10.1007/s10586-009-0082-2


  • DOI: https://doi.org/10.1007/s10586-009-0082-2
