Abstract
In this article, we describe RQNoC, a service-oriented Network-on-Chip (NoC) that is resilient to permanent faults. We characterize network resources by the particular service they support and, when a resource is faulty, bypass it and redirect the respective traffic class. We propose two alternatives for service redirection, each with different advantages and disadvantages. The first, Service Detour, bypasses faulty network parts using longer alternative paths through resources of the same service, keeping traffic classes isolated. The second, Service Merge, uses resources of other services, providing shorter paths but allowing traffic classes to interfere with each other. The remaining network resources, which are common to all services, employ additional fault-tolerance mechanisms: links tolerate faults using spare wires combined with a flit-shifting mechanism, and the router control is protected with Triple Modular Redundancy (TMR). The proposed RQNoC network designs are implemented in 65nm technology and evaluated in terms of performance, area, power consumption, and fault tolerance. Service Detour requires 9% more area and consumes 7.3% more power than a baseline network that is not fault tolerant. Its packet latency and throughput are close to fault-free performance at low fault densities, but fault tolerance and performance drop substantially for 8 or more network faults. Service Merge requires 22% more area and 27% more power than the baseline and has a 9% slower clock. Compared to a fault-free network, a Service Merge RQNoC with up to 32 faults has packet latency increased up to 1.5 to 2.4× and throughput reduced to 70% or 50%. However, it delivers substantially better fault tolerance, with a mean network connectivity above 90% even with 32 network faults, versus 41% for a Service Detour network. Combining Service Merge and Service Detour further improves fault tolerance, sustaining a higher number of network faults with reduced packet latency.
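The two mechanisms that protect the shared resources can be illustrated with a minimal Python sketch. It is a generic illustration and not the RQNoC hardware design: the function names, the single spare wire per link, and the representation of a flit as a list of bits are all assumptions made for this example. `tmr_vote` takes a bitwise majority over three redundant copies of a control word, and `send_over_link` / `receive_from_link` shift flit bits past one known-faulty wire onto the spare.

```python
from typing import List, Optional


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a control word."""
    return (a & b) | (a & c) | (b & c)


def _skipped_wire(num_wires: int, faulty_wire: Optional[int]) -> int:
    # Assumption: with no fault, the last wire is the unused spare.
    return faulty_wire if faulty_wire is not None else num_wires - 1


def send_over_link(flit_bits: List[int],
                   faulty_wire: Optional[int]) -> List[Optional[int]]:
    """Place flit bits on len(flit_bits) + 1 wires, skipping the faulty one.

    Bits at or after the faulty position are shifted by one wire.
    The skipped wire carries None (its value is ignored).
    """
    num_wires = len(flit_bits) + 1  # hypothetical: exactly one spare wire
    skip = _skipped_wire(num_wires, faulty_wire)
    wires: List[Optional[int]] = [None] * num_wires
    bits = iter(flit_bits)
    for w in range(num_wires):
        if w != skip:
            wires[w] = next(bits)
    return wires


def receive_from_link(wires: List[Optional[int]],
                      faulty_wire: Optional[int]) -> List[int]:
    """Undo the shift: read every wire except the skipped one."""
    skip = _skipped_wire(len(wires), faulty_wire)
    return [wires[w] for w in range(len(wires)) if w != skip]
```

In this sketch the sender and receiver must agree on which wire is faulty; how that fault is detected and communicated is outside the scope of the example.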
Index Terms
- RQNoC: A Resilient Quality-of-Service Network-on-Chip with Service Redirection