Abstract
Creating an environment of “no doubt” for computing systems is critical for supporting next-generation science, engineering, and commercial applications. Reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) give designers an attractive basis for sophisticated yet highly reliable platforms. Through partial and dynamic reconfiguration, reconfigurable computing platforms can potentially improve reliability and recover from catastrophic failures, and they can eliminate the need for the redundant hardware resources typically used by existing fault-tolerant systems. We propose a two-level self-healing methodology that aims to offer 100% availability for mission-critical systems with comparatively less hardware overhead and performance degradation. The proposed system first attempts healing at the node level; if the fault cannot be rectified there, it escalates to network-level healing. We have designed a system based on Xilinx Virtex-5 FPGAs and Cirronet wireless mesh nodes to demonstrate autonomous wireless healing among networked node devices. Our prototype is a proof of concept that demonstrates the feasibility of using FPGAs to provide maximum computational availability in a critical self-healing distributed architecture.
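The two-level escalation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `heal_locally` hook (standing in for partial reconfiguration on the FPGA), and the choice to model network-level healing as migrating tasks to a healthy peer are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical model of one FPGA node in the wireless mesh."""
    name: str
    # Hypothetical hook: attempts node-level repair, returns True on success.
    heal_locally: Callable[[], bool]
    healthy: bool = True
    tasks: List[str] = field(default_factory=list)


def heal(failed: Node, peers: List[Node]) -> str:
    """Try node-level healing first; escalate to the network only if it fails."""
    if failed.heal_locally():
        failed.healthy = True
        return "node-level"
    # Node-level healing failed. Assumption: network-level healing means
    # handing the failed node's tasks to the first healthy peer.
    failed.healthy = False
    target = next((p for p in peers if p.healthy), None)
    if target is None:
        return "unrecovered"
    target.tasks.extend(failed.tasks)
    failed.tasks.clear()
    return "network-level"
```

The point of the ordering is that the cheaper, local repair is always attempted before any other node is involved, so the network level is used only as a fallback.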
Additional information
This work was supported in part by NASA JPL through the Strategic University Partnership (SURP 2007) program under Agreement No. 1315980.
Cite this article
Akoglu, A., Sreeramareddy, A. & Josiah, J.G. FPGA based distributed self healing architecture for reusable systems. Cluster Comput 12, 269–284 (2009). https://doi.org/10.1007/s10586-009-0082-2
DOI: https://doi.org/10.1007/s10586-009-0082-2