DOI: 10.1145/2656075.2656100
research-article

Tackling QoS-induced aging in exascale systems through agile path selection

Published: 12 October 2014

Abstract

Networks-on-Chip (NoCs) have become the standard communication platform for future massively parallel systems because of their performance, flexibility and scalability advantages. However, reliability issues brought about by scaling in the sub-20nm era threaten to undermine the benefits offered by NoCs. In this paper, we show that QoS policies degrade the reliability profile of an exascale system. To mitigate this challenge, we propose Dynamic Wearout Resilient Routing (DWRR) algorithms for QoS-enabled exascale NoCs. Our proposal includes two novel DWRR algorithms enabled by a critical-path monitor and a broadcast-based routing configuration. On PARSEC benchmarks, our best algorithm improves QoS and long-term sustainability (Mean Time To Failure, MTTF) of the system by an average of 16% and 25%, respectively, compared to a state-of-the-art fault-tolerant technique.
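To make the idea of wearout-aware path selection concrete, here is a minimal, hypothetical sketch; it is not the DWRR algorithms described in the paper. It assumes a 2D mesh in which every router reports a single wear metric, taken here to be the remaining timing slack (in ps) from a critical-path monitor, and it assumes the router picks, among the minimal-path output directions, the neighbour with the most remaining slack. The names `minimal_directions`, `select_next_hop`, `route` and the `slack` map are illustrative assumptions.

```python
# Hypothetical sketch of wearout-aware minimal routing in a 2D mesh.
# ASSUMPTION: each router (x, y) exposes one wear metric, its remaining timing
# slack in ps as reported by a critical-path monitor; lower slack = more aged.
# This is NOT the paper's DWRR algorithm, only an illustration of the idea.

STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def minimal_directions(cur, dst):
    """Output directions that keep a packet on a minimal path to dst."""
    (cx, cy), (dx, dy) = cur, dst
    dirs = []
    if dx > cx:
        dirs.append("E")
    elif dx < cx:
        dirs.append("W")
    if dy > cy:
        dirs.append("N")
    elif dy < cy:
        dirs.append("S")
    return dirs


def neighbour(cur, direction):
    """Coordinates of the adjacent router in the given direction."""
    sx, sy = STEP[direction]
    return (cur[0] + sx, cur[1] + sy)


def select_next_hop(cur, dst, slack):
    """Among minimal-path neighbours, choose the one with the most slack.

    Ties are broken in favour of the X direction because max() keeps the
    first maximum and X directions are listed first.
    """
    candidates = minimal_directions(cur, dst)
    if not candidates:
        return None  # already at the destination
    return max(candidates, key=lambda d: slack[neighbour(cur, d)])


def route(src, dst, slack):
    """Hop-by-hop path from src to dst using select_next_hop at each router."""
    path = [src]
    cur = src
    while cur != dst:
        cur = neighbour(cur, select_next_hop(cur, dst, slack))
        path.append(cur)
    return path


if __name__ == "__main__":
    # 3x3 mesh where router (1, 0) is heavily aged (little slack left).
    slack = {(x, y): 50 for x in range(3) for y in range(3)}
    slack[(1, 0)] = 10
    print(route((0, 0), (2, 2), slack))
```

In the usage example the packet leaves (0, 0) northwards instead of eastwards, so the aged router (1, 0) is bypassed while the path stays minimal. A fully adaptive minimal scheme like this one still needs a deadlock-avoidance mechanism (for example, virtual channels or a turn-model restriction) in a real NoC; the sketch ignores that, as well as how slack readings would be distributed to routers.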

Cited By

  • (2019) A Non-Minimal Routing Algorithm for Aging Mitigation in 2D-Mesh NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38:7, pp. 1373-1377. DOI: 10.1109/TCAD.2018.2855149. Online publication date: Jul-2019.
  • (2019) Probabilistic Verification for Reliable Network-on-Chip System Design. Formal Methods for Industrial Critical Systems, pp. 110-126. DOI: 10.1007/978-3-030-27008-7_7. Online publication date: 25-Jul-2019.

Published In

CODES '14: Proceedings of the 2014 International Conference on Hardware/Software Codesign and System Synthesis
October 2014
331 pages
ISBN: 9781450330510
DOI: 10.1145/2656075

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ESWEEK'14
ESWEEK'14: TENTH EMBEDDED SYSTEM WEEK
October 12 - 17, 2014
New Delhi, India

Acceptance Rates

Overall Acceptance Rate 280 of 864 submissions, 32%
