Architectural Support for Fault Tolerance in a Teradevice Dataflow System

Weis, Sebastian; Garbade, Arne; Fechner, Bernhard; Mendelson, Avi; Giorgi, Roberto; Ungerer, Theo

doi:10.1007/s10766-014-0312-y

Published: 29 May 2014

International Journal of Parallel Programming, Volume 44, pages 208–232 (2016)

Sebastian Weis¹, Arne Garbade¹, Bernhard Fechner¹, Avi Mendelson², Roberto Giorgi³ & Theo Ungerer¹

Abstract

The high parallelism of future Teradevices, which are expected to contain more than 1,000 complex cores on a single die, calls for new execution paradigms. Coarse-grained dataflow execution models can exploit such parallelism, since they combine side-effect-free execution with reduced synchronization overhead. However, the terascale transistor integration of such future chips makes them orders of magnitude more vulnerable to voltage fluctuations, radiation, and process variations. Dynamic fault-tolerance mechanisms therefore have to be an essential part of such future systems. In this paper, we present a fault-tolerant architecture for a coarse-grained dataflow system that leverages the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults at runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results show that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery incurs only moderate overhead, even at high fault rates.
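
As a rough illustration of why side-effect-free dataflow threads lend themselves to redundant execution and thread-level recovery, the following minimal Python sketch runs one thread on two "cores", compares the results, and simply re-executes the thread when they disagree. It is not the architecture described in the paper: the names `run_redundant`, `healthy_core`, `FlakyCore`, and `vector_sum`, the tuple-based input frame, and the retry limit are all assumptions made for this example.

```python
from typing import Callable, Tuple

Frame = Tuple[int, ...]            # immutable thread inputs (hypothetical model)
Thread = Callable[[Frame], Frame]  # a side-effect-free dataflow thread
Core = Callable[[Thread, Frame], Frame]


def healthy_core(thread: Thread, frame: Frame) -> Frame:
    """A core that executes the thread correctly."""
    return thread(frame)


class FlakyCore:
    """A core that corrupts its result on selected calls (simulated faults)."""

    def __init__(self, faulty_calls):
        self.calls = 0
        self.faulty_calls = set(faulty_calls)

    def __call__(self, thread: Thread, frame: Frame) -> Frame:
        self.calls += 1
        result = thread(frame)
        if self.calls in self.faulty_calls:
            # Flip the lowest bit of the first output word.
            return (result[0] ^ 1,) + result[1:]
        return result


def run_redundant(thread: Thread, frame: Frame,
                  core_a: Core, core_b: Core,
                  max_retries: int = 2) -> Frame:
    """Execute the thread on both cores and commit only matching results.

    Because the thread has no side effects and its input frame is
    immutable, recovery is just a restart from the same inputs.
    """
    for _ in range(max_retries + 1):
        first = core_a(thread, frame)
        second = core_b(thread, frame)
        if first == second:
            return first  # results agree: safe to commit
    raise RuntimeError("results still disagree after all retries")


def vector_sum(frame: Frame) -> Frame:
    """Example thread: sums its inputs into a single output word."""
    return (sum(frame),)
```

With a core that fails only on its first call, for example `run_redundant(vector_sum, (1, 2, 3), healthy_core, FlakyCore({1}))`, the mismatch is detected and the second attempt commits. A core that fails on every attempt exhausts the retries and raises an error, which loosely mirrors the distinction between transient and permanent faults drawn in the abstract.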

Acknowledgments

This work was partly funded by the European FP7 Projects TERAFLUX (id. 249013) and HiPEAC (IST-217068). The authors wish to thank N. Puzovic and Z. Popovic for their initial studies on the DTA-C architecture and P. Faraboschi of HP for his valuable suggestions and support on the COTSon simulator.

Author information

Authors and Affiliations

University of Augsburg, Universitaetsstr. 6a, 86159 Augsburg, Germany
Sebastian Weis, Arne Garbade, Bernhard Fechner & Theo Ungerer
Technion, Technion City, 32000 Haifa, Israel
Avi Mendelson
University of Siena, Via Roma 56, 53100 Siena, Italy
Roberto Giorgi

Corresponding author

Correspondence to Sebastian Weis.

About this article

Cite this article

Weis, S., Garbade, A., Fechner, B. et al. Architectural Support for Fault Tolerance in a Teradevice Dataflow System. Int J Parallel Prog 44, 208–232 (2016). https://doi.org/10.1007/s10766-014-0312-y

Received: 28 February 2013
Accepted: 07 May 2014
Published: 29 May 2014
Issue Date: April 2016
DOI: https://doi.org/10.1007/s10766-014-0312-y
