ABSTRACT
As silicon continues to scale, transistor reliability is becoming a major concern. At the same time, increasing transistor counts are causing a rapid shift towards large chip multi-processors (CMP) and system-on-chip (SoC) designs, comprising several cores and IPs communicating via a network-on-chip (NoC). As the sole medium of on-chip communication, a NoC should gracefully tolerate many permanent faults.
We propose uDIREC, a unified framework for permanent fault diagnosis and subsequent reconfiguration in NoCs that provides graceful performance degradation as the number of faults increases. Upon in-field transistor failures, uDIREC leverages a fine-resolution diagnosis mechanism to disable faulty components very sparingly. At its core, uDIREC employs a novel routing algorithm to find reliable and deadlock-free routes that utilize the still-functional links in the NoC. uDIREC places no restriction on topology, router architecture, or the number and location of faults. Experimental results show that uDIREC, implemented in a 64-node NoC, drops 3x fewer nodes and provides 25% higher throughput (beyond 15 faults) when compared to other state-of-the-art fault-tolerance solutions. uDIREC's improvement over prior art grows with more faults, making it a suitable NoC reliability solution for a wide range of fault rates.
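To make the idea of deadlock-free routing over still-functional links concrete, the following is a minimal Python sketch of classic up*/down* routing on a mesh with faulty links removed. It is not uDIREC's algorithm, which the abstract does not detail. All function names are hypothetical, and for simplicity the sketch assumes a link is usable only when both of its directions are healthy, which is coarser than the fine-resolution diagnosis described above.

```python
from collections import deque


def mesh_links(n):
    """Bidirectional links of an n x n mesh, each a frozenset of two node ids."""
    links = set()
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                links.add(frozenset((v, v + 1)))  # east neighbour
            if r + 1 < n:
                links.add(frozenset((v, v + n)))  # south neighbour
    return links


def build_adjacency(n, links):
    """Adjacency lists restricted to the surviving links."""
    adj = {v: [] for v in range(n * n)}
    for link in links:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    return adj


def bfs_levels(root, adj):
    """Breadth-first distance from the root for every reachable node."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level


def up_down_route(src, dst, adj, level):
    """Shortest path made of 'up' hops followed by 'down' hops, or None.

    A hop u -> w is 'up' when w is closer to the root (ties broken by id).
    Forbidding every down-then-up turn removes cyclic channel dependencies,
    which is the classic up*/down* argument for deadlock freedom.
    """
    if src not in level or dst not in level:
        return None  # at least one endpoint is cut off from the root

    def is_up(u, w):
        return (level[w], w) < (level[u], u)

    start = (src, False)  # state: (node, has a down hop been taken?)
    prev = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        node, gone_down = state
        if node == dst:
            path = []
            while state is not None:
                path.append(state[0])
                state = prev[state]
            return path[::-1]
        for w in adj[node]:
            up = is_up(node, w)
            if gone_down and up:
                continue  # forbidden down -> up turn
            nxt = (w, gone_down or not up)
            if nxt not in prev:
                prev[nxt] = state
                queue.append(nxt)
    return None


if __name__ == "__main__":
    n = 3
    faulty = {frozenset((0, 1))}  # assumed fault: the link between nodes 0 and 1
    adj = build_adjacency(n, mesh_links(n) - faulty)
    level = bfs_levels(0, adj)
    print(up_down_route(0, 1, adj, level))
```

In this 3x3 example the link between nodes 0 and 1 is marked faulty, so the route from 0 to 1 detours through nodes 3 and 4 while every node stays connected. Rooting the breadth-first tree at node 0 is an arbitrary choice made for the sketch.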
Index Terms
- uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults