Skip to main content
Log in

Architectural Support for Fault Tolerance in a Teradevice Dataflow System

  • Published:
International Journal of Parallel Programming Aims and scope Submit manuscript

Abstract

The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, requests new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect free execution and reduced synchronization overhead. However, the terascale transistor integration of such future chips make them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault-tolerance mechanisms have to be an essential part of such future system. In this paper, we present a fault tolerant architecture for a coarse-grained dataflow system, leveraging the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults during runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results showed that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery suffered from moderate overhead, even in the case of high fault rates.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

References

  1. International Technology Roadmap for Semiconductors 2011 Edition. Website. http://www.itrs.net

  2. Agarwal, R., Garg, P., Torrellas, J.: Rebound: scalable checkpointing for coherent shared memory. In: International Symposium on Computer Architecture (ISCA), pp. 153–164. IEEE (2011)

  3. AMD Inc.: AMD64 Architecture Programmer’s Manual Volume 2: System Programming (2006)

  4. Arandi, S., Kyriacou, C., George, M., George, M., Masrujeh, N., Trancoso, P., Evripidou, S., Giorgi, R., Zhibin, Y., Collange, S., Scionti, A., Khan, B., Khan, S., Lujan, M., Watson, I., Etsion, Y., Ungerer, T., Fechner, B., Garbade, A., Weis, S.: D6.2-advanced teraflux architecture. Public deliverable, The TERAFLUX Project (FP7/2007-2013 Grant Agreement No. 249013) (2011)

  5. Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., Ortega, D.: COTSon: infrastructure for full system simulation. ACM SIGOPS Oper. Syst. Rev. 43(1), 52–61 (2009)

  6. Austin, T.: DIVA: a reliable substrate for deep submicron microarchitecture design. In: International Symposium on Microarchitecture (MICRO), pp. 196–207. IEEE (1999)

  7. Bell, S., et al.: TILE64-processor: a 64-core soc with mesh interconnect. In: International Solid-State Circuits Conference (ISSCC). Digest of Technical Papers, pp. 88–89. IEEE (2008)

  8. Bernick, D., Bruckert, B., Vigna, P., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstop advanced architecture. In: International Conference on Dependable Systems and Networks (DSN), pp. 12–21. IEEE (2005)

  9. Borkar, S.: Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6), 10–16 (2005)

    Article  Google Scholar 

  10. Borkar, S.: Thousand core chips: a technology perspective. In: Annual Design Automation Conference (DAC), pp. 746–749. ACM (2007)

  11. Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)

  12. Etsion, Y., Cabarcas, F., Rico, A., Ramirez, A., Badia, R. M., Ayguade, E., Labarta, J., Valero, M.: Task superscalar: an out-of-order task pipeline. In: International Symposium on Microarchitecture (MICRO), pp. 89–100. IEEE (2010)

  13. Gautier, T., Besseron, X., Pigeon, L.: KAAPI: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: International Workshop on Parallel Symbolic Computation (PASCO), pp. 15–23. ACM (2007)

  14. Giorgi, R.: TERAFLUX: exploiting dataflow parallelism in teradevices. In: International Conference on Computing Frontiers (CF), pp. 303–304. ACM (2012)

  15. Giorgi, R., Popovic, Z., Puzovic, N.: DTA-C: a decoupled multi-threaded architecture for CMP systems. In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 263–270. IEEE (2007)

  16. Giorgi, R., Popovic, Z., Puzovic, N.: Implementing fine/medium grained TLP support in a many-core architecture. In: Bertels, K., Dimopoulos, N., Silvano, C., Wong, S. (eds.) Embedded Computer Systems: Architectures, Modeling, and Simulation, Lecture Notes in Computer Science (LNCS), vol. 5657, pp. 78–87. Springer (2009)

  17. Gupta, G., Sohi, G.S.: Dataflow execution of sequential imperative programs on multicore architectures. In: International Symposium on Microarchitecture (MICRO), pp. 59–70. ACM (2011)

  18. Hammond, L., Wong, V., Chen, M., Carlstrom, B.D., Davis, J.D., Hertzberg, B., Prabhu, M.K., Wijaya, H., Kozyrakis, C., Olukotun, K.: Transactional memory coherence and consistency. In: International Symposium on Computer Architecture (ISCA), pp. 102–113. IEEE (2004)

  19. Howard, J., et al.: A 48-core ia-32 message-passing processor with dvfs in 45nm CMOS. In: International Solid-State Circuits Conference (ISSCC). Digest of Technical Papers, pp. 108–109. IEEE (2010)

  20. Hum, H.H.J., Maquelin, O., Theobald, K.B., Tian, X., Tang, X., Gao, G.R., Cupryk, P., Elmasri, N., Hendren, L.J., Jimenez, A., Krishnan, S., Marquez, A., Merali, S., Nemawarkar, S.S., Panangaden, P., Xue, X., Zhu, Y.: A design study of the EARTH multiprocessor. In: International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 59–68. IFIP Working Group (1995)

  21. Iyer, R., Nakka, N., Kalbarczyk, Z., Mitra, S.: Recent advances and new avenues in hardware-level reliability support. IEEE Micro 25(6), 18–29 (2005)

    Article  Google Scholar 

  22. Jafar, S., Gautier, T., Krings, A., louis Roch, J.: A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing. In: Cunha, J.C., Medeiros P.D. (eds.) Euro-Par 2005 Parallel Processing, Lecture Notes in Computer Science (LNCS), vol. 3648, pp. 675–684. Springer, Berlin, Heidelberg (2005)

  23. Kelm, J.H., Johnson, D.R., Johnson, M.R., Crago, N.C., Tuohy, W., Mahesri, A., Lumetta, S.S., Frank, M.I., Patel, S.J.: Rigel: an architecture and scalable programming interface for a 1000-core accelerator. In: International Symposium on Computer Architecture (ISCA), pp. 140–151. IEEE (2009)

  24. Kephart, J., Chess, D.: The vision of autonomic computing. Computer 36(1), 41–50 (2003)

    Article  MathSciNet  Google Scholar 

  25. LaFrieda, C., Ipek, E., Martinez, J., Manohar, R.: Utilizing dynamically coupled cores to form a resilient chip multiprocessor. In: International Conference on Dependable Systems and Networks (DSN), pp. 317–326. IEEE (2007)

  26. Lee, B., Hurson, A.R.: Dataflow architectures and multithreading. Computer 27(8), 27–39 (1994)

    Article  Google Scholar 

  27. Li, F., Pop, A., Cohen, A.: Automatic extraction of coarse-grained data-flow threads from imperative programs. IEEE Micro 32(4), 19–31 (2012)

  28. Mukherjee, S.S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives In: International Symposium on Computer Architecture (ISCA), pp. 99–110. IEEE (2002)

  29. Nguyen-tuong, A., Grimshaw, A.S., Hyett, M.: Exploiting data-flow for fault-tolerance in a wide-area parallel system. In: International Symposium on Reliable and Distributed Systems, pp. 1–11 (1996)

  30. Prvulovic, M., Zhang, Z., Torrellas, J.: Revive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In: International Symposium on Computer Architecture (ISCA), pp. 111–122. IEEE (2002)

  31. Rashid, M., Huang, M.: Supporting highly-decoupled thread-level redundancy for parallel programs. In: International Symposium on High Performance Computer Architecture (HPCA), pp. 393–404. IEEE (2008)

  32. Ray, J., Hoe, J.C., Falsafi, B.: Dual use of superscalar datapath for transient-fault detection and recovery. In: International Symposium on Microarchitecture (MICRO), pp. 214–224. IEEE (2001)

  33. Reinhardt, S.K., Mukherjee, S.S.: Transient fault detection via simultaneous multithreading. In: International Symposium on Computer Architecture (ISCA), pp. 25–36. ACM (2000)

  34. Rotenberg, E.: AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers, pp. 84-91 (1999)

  35. Sánchez, D., Aragón, J., García, J.: Evaluating dynamic core coupling in a scalable tiled-cmp architecture. In: International Workshop on Duplicating, Deconstructing, and Debunking (WDDD) (2008)

  36. Sánchez, D., Aragón, J., García, J.: Extending SRT for parallel applications in tiled-CMP architectures. In: International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–8. IEEE (2009)

  37. Sánchez, D., Aragón, J.L., García, J.M.: REPAS: Reliable Execution for Parallel Applications in Tiled-CMPs. In: Sips, H., Epema, D., Lin, H.X. (eds.) International Euro-Par Conference on Parallel Processing, Lecture Notes in Computer Science (LNCS), vol. 5704, pp. 321–333. Springer, Berlin, Heidelberg (2009)

  38. Sánchez, D., Aragón, J. L., García, J.M.: A log-based redundant architecture for reliable parallel computation. In: International Conference on High Performance Computing (HiPC), pp. 1–10. IEEE (2010)

  39. Slegel, T., Averill, R.M.I., Check, M., Giamei, B., Krumm, B., Krygowski, C., Li, W., Liptay, J., Macdougall, J., McPherson, T., Navarro, J., Schwarz, E., Shum, K., Webb, C.: IBM’s S/390 G5 microprocessor design. IEEE Micro 19(2), 12–23 (1999)

  40. Smolens, J.C., Gold, B.T., Falsafi, B., Hoe, J.C.: Reunion: complexity-effective multicore redundancy. In: International Symposium on Microarchitecture (MICRO), pp. 223–234. IEEE (2006)

  41. Smolens, J.C., Gold, B.T., Kim, J., Falsafi, B., Hoe, J.C., Nowatzyk, A.G.: Fingerprinting: bounding soft-error detection latency and bandwidth. In: International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 224–234. IEEE (2004)

  42. Sorin, D.J., Martin, M.M.K., Hill, M.D., Wood, D.A.: Safetynet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In: International Symposium on Computer Architecture (ISCA), pp. 123–134. IEEE (2002)

  43. Srinivasan, J., Adve, S.V., Bose, P., Rivers, J.A.: The impact of technology scaling on lifetime reliability. In: International Conference on Dependable Systems and Networks (DSN), pp. 177–186. IEEE (2004)

  44. Stavrou, K., Evripidou, P., Trancoso, P.: DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor. In: Hämäläinen, T.D., Pimentel, A.D., Takala, J., Vassiliadis, S. (eds.) Embedded Computer Systems: Architectures, Modeling, and Simulation, Lecture Notes in Computer Science (LNCS), vol. 3553, pp. 364–373. Springer, Berlin, Heidelberg (2005)

  45. Weis, S., Garbade, A., Schlingmann, S., Ungerer, T.: Towards fault detection units as an autonomous fault detection approach for future many-cores. In: ARCS 2011 Workshop Proceedings, pp. 20–23. VDE (2011)

  46. Weis, S., Garbade, A., Wolf, J., Fechner, B., Mendelson, A., Giorgi, R., Ungerer, T.: A fault detection and recovery architecture for a teradevice dataflow system. In: International Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM), pp. 38–44. IEEE (2011)

  47. Wittenbrink, C., Kilgariff, E., Prabhu, A.: Fermi GF100 GPU architecture. IEEE Micro 31(2), 50–59 (2011)

  48. Yeh, Y.: Triple-triple redundant 777 primary flight computer. In: Proceedings of the Aerospace Applications Conference, pp. 293–307. IEEE (1996)

  49. Zuckerman, S., Suetterlein, J., Knauerhase, R., Gao, G.R.: Using a “codelet” program execution model for exascale machines: position paper. In: Proceedings of the International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (EXADAPT), pp. 64–69. ACM (2011)

Download references

Acknowledgments

This work was partly funded by the European FP7 Projects TERAFLUX (id. 249013) and HiPEAC (IST-217068). The authors wish to thank N. Puzovic and Z. Popovic for their initial studies on the DTA-C architecture and P. Faraboschi of HP for his precious suggestions and support on the COTSon simulator.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sebastian Weis.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Weis, S., Garbade, A., Fechner, B. et al. Architectural Support for Fault Tolerance in a Teradevice Dataflow System. Int J Parallel Prog 44, 208–232 (2016). https://doi.org/10.1007/s10766-014-0312-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10766-014-0312-y

Keywords

Navigation