DOI: 10.1145/2540708.2540722
research-article

uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults

Published: 07 December 2013

ABSTRACT

As silicon continues to scale, transistor reliability is becoming a major concern. At the same time, increasing transistor counts are causing a rapid shift towards large chip multi-processors (CMP) and system-on-chip (SoC) designs, comprising several cores and IPs communicating via a network-on-chip (NoC). As the sole medium of on-chip communication, a NoC should gracefully tolerate many permanent faults.

We propose uDIREC, a unified framework for permanent-fault diagnosis and subsequent reconfiguration in NoCs that degrades performance gracefully as the number of faults increases. Upon in-field transistor failures, uDIREC leverages a fine-resolution diagnosis mechanism to disable faulty components very sparingly. At its core, uDIREC employs a novel routing algorithm to find reliable and deadlock-free routes that utilize the still-functional links in the NoC. uDIREC places no restriction on topology, router architecture, or the number and location of faults. Experimental results show that uDIREC, implemented in a 64-node NoC, drops 3x fewer nodes and provides 25% higher throughput (beyond 15 faults) than other state-of-the-art fault-tolerance solutions. uDIREC's improvement over prior art grows with more faults, making it a suitable NoC reliability solution for a wide range of fault rates.
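
To make the idea of deadlock-free reconfiguration around faulty links more concrete, here is a minimal sketch. It is not uDIREC's routing algorithm, which the abstract does not describe; it uses classic up*/down* routing as a stand-in. The function names (`mesh_links`, `build_adjacency`, `bfs_levels`, `route`), the 2D-mesh topology, and the assumption that each fault disables a whole bidirectional link are all illustrative choices, not details taken from the paper.

```python
from collections import deque


def mesh_links(width, height):
    """Bidirectional links of a width x height mesh, each as a frozenset {a, b}."""
    links = set()
    for y in range(height):
        for x in range(width):
            node = y * width + x
            if x + 1 < width:
                links.add(frozenset((node, node + 1)))
            if y + 1 < height:
                links.add(frozenset((node, node + width)))
    return links


def build_adjacency(num_nodes, links, faulty):
    """Adjacency lists that use only the still-functional links."""
    adj = {n: [] for n in range(num_nodes)}
    for link in links - faulty:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    for neighbors in adj.values():
        neighbors.sort()  # deterministic exploration order
    return adj


def bfs_levels(adj, root):
    """Breadth-first distance from the root; unreachable nodes are absent."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def route(adj, level, src, dst):
    """Shortest up*/down* path from src to dst, or None if either is cut off.

    A hop u -> v is "up" when v precedes u in the total order (level, id).
    A legal path takes zero or more up hops, then zero or more down hops.
    Forbidding down -> up turns keeps channel dependencies acyclic, which is
    what makes up*/down* routing deadlock-free.
    """
    if src not in level or dst not in level:
        return None
    start = (src, False)  # state: (node, has a down hop been taken yet?)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        u, gone_down = state
        if u == dst:
            path = []
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return path[::-1]
        for v in adj[u]:
            up = (level[v], v) < (level[u], u)
            if up and gone_down:
                continue  # forbidden down -> up turn
            nxt = (v, gone_down or not up)
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None
```

For example, on a 4x4 mesh built with `mesh_links(4, 4)`, marking the links {0, 1} and {5, 6} as faulty and rooting the levels at node 0 still leaves every pair of nodes connected by a legal path. If both links of corner node 3 fail, `route` returns `None` for that node, which corresponds to the node being dropped from the network.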

          • Published in

            MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
            December 2013
            498 pages
            ISBN:9781450326384
            DOI:10.1145/2540708

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Acceptance Rates

            MICRO-46 Paper Acceptance Rate: 39 of 239 submissions, 16%
            Overall Acceptance Rate: 484 of 2,242 submissions, 22%