DOI: 10.1145/3229631.3239368
research-article

Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems

Published: 15 July 2018

ABSTRACT

The transition to Exascale computing will be characterised by a wider range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging, such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is opening access to its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, widening the range of options for acceleration: from the traditional CPU-only organization, to the CPU-plus-GPU organization that currently dominates the Green500, to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].

References

  1. B. Koller, N. Struckmann, J. Buchholz, and M. Gienger, "Towards an environment to deliver high performance computing to small and medium enterprises," in Sustained Simulation Performance 2015. Cham: Springer International Publishing, 2015, pp. 41--50.
  2. J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, A. Cilardo, W. Fornaciari, M. Kovac, F. Roudet, and D. Zoni, "The MANGO FET-HPC Project: An overview," in IEEE 18th Int'l Conf on Computational Science and Engineering (CSE). IEEE, 2015, pp. 351--354.
  3. J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, A. Cilardo, W. Fornaciari, Y. Hoornenborg, M. Kovac, B. Maitre, G. Massari, H. Mlinaric, E. Papastefanakis, F. Roudet, R. Tornero, and D. Zoni, "Enabling HPC for QoS-sensitive applications: The MANGO approach," in 2016 Design, Automation Test in Europe Conference Exhibition (DATE), March 2016, pp. 702--707.
  4. G. Agosta, W. Fornaciari, G. Massari, A. Pupykina, F. Reghenzani, and M. Zanella, "Managing Heterogeneous Resources in HPC Systems," in Proc. of PARMA-DITAM '18. ACM, 2018, pp. 7--12.
  5. A. Pupykina and G. Agosta, "Optimizing Memory Management in Deeply Heterogeneous HPC Accelerators," in 2017 46th Int'l Conf on Parallel Processing Workshops (ICPPW), Aug 2017, pp. 291--300.
  6. J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, E. Cappe, A. Cilardo, L. Dragic, A. Dray, A. Duspara, W. Fornaciari, E. Fusella, M. Gagliardi, G. Guillaume, D. Hofman, Y. Hoornenborg, A. Iranfar, M. Kovac, S. Libutti, B. Maitre, J. M. Martínez, G. Massari, K. Meinds, H. Mlinaric, E. Papastefanakis, T. Picornell, I. Piljic, A. Pupykina, F. Reghenzani, I. Staub, R. Tornero, M. Zanella, M. Zapater, and D. Zoni, "Exploring manycore architectures for next-generation HPC systems through the MANGO approach," Microprocessors and Microsystems, vol. 61, pp. 154--170, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933118300243
  7. L. Huang and Q. Xu, "Characterizing the lifetime reliability of manycore processors with core-level redundancy," in 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2010, pp. 680--685.
  8. C. L. Chou and R. Marculescu, "Farm: Fault-aware resource management in noc-based multiprocessor platforms," in 2011 Design, Automation Test in Europe, March 2011, pp. 1--6.
  9. P. Mercati, F. Paterna, A. Bartolini, L. Benini, and T. Rosing, "Warm: Workload-aware reliability management in linux/android," IEEE Trans on CAD of Integrated Circuits and Systems, 2016.
  10. M. H. Haghbayan, A. Miele, A. M. Rahmani, P. Liljeberg, and H. Tenhunen, "A lifetime-aware runtime mapping approach for many-core systems in the dark silicon era," in 2016 Design, Automation Test in Europe Conference Exhibition (DATE), March 2016, pp. 854--857.
  11. P. Bellasi, G. Massari, and W. Fornaciari, "Effective runtime resource management using linux control groups with the barbequertrm framework," ACM Trans. Embed. Comput. Syst., vol. 14, no. 2, pp. 39:1--39:17, Mar. 2015.
  12. A. Iranfar, F. Terraneo, W. A. Simon, L. Dragic, I. Piljic, M. Zapater, W. Fornaciari, M. Kovac, and D. Atienza Alonso, "Thermal characterization of next-generation workloads on heterogeneous mpsocs," in International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), 2017, pp. 1--6.
  13. F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, "Toward exascale resilience," Int. J. High Perform. Comput. Appl., vol. 23, no. 4, pp. 374--388, Nov. 2009.
  14. C. Curtsinger and E. D. Berger, "Stabilizer: Statistically sound performance evaluation," SIGARCH Comput. Archit. News, vol. 41, no. 1, pp. 219--228, Mar. 2013.
  15. F. J. Cazorla, J. Abella, J. Andersson, T. Vardanega, F. Vatrinet, I. Bate, I. Broster, M. Azkarate-askasua, F. Wartel, L. Cucu, F. Cros, G. Farrall, A. Gogonel, A. Gianarro, B. Triquet, C. Hernández, C. Lo, C. Maxim, D. Morales, E. Quiñones, E. Mezzetti, L. Kosmidis, I. Agirre, M. Fernández, M. Slijepcevic, P. Conmy, and W. Talaboulma, "PROXIMA: improving measurement-based timing analysis through randomisation and probabilistic analysis," in 2016 Euromicro DSD, 2016, pp. 276--285.
  16. F. J. Cazorla, T. Vardanega, E. Quiñones, and J. Abella, "Upper-bounding Program Execution Time with Extreme Value Theory," in 13th Int'l Workshop on Worst-Case Execution Time Analysis, ser. OASIcs, vol. 30, Germany, 2013, pp. 64--76. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2013/4123
  17. A. K. Coskun, T. S. Rosing, K. Mihic, G. De Micheli, and Y. Leblebici, "Analysis and optimization of mpsoc reliability," Journal of Low Power Electronics, vol. 2, no. 1, pp. 56--69, 2006. [Online]. Available: https://www.ingentaconnect.com/content/asp/jolpe/2006/00000002/00000001/art0008
  18. A. K. Coskun, T. S. Rosing, and K. C. Gross, "Temperature management in multiprocessor socs using online learning," in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 890--893.
  19. W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy, "Compact thermal modeling for temperature-aware design," in Proceedings. 41st Design Automation Conference, 2004., July 2004, pp. 878--883.
  20. J. K. M. Stansberry, "Uptime institute 2013 data center industry survey," 2013.
  21. A. Seuret, A. Iranfar, M. Zapater, J. R. Thome, and D. Atienza, "Design of a two-phase gravity-driven micro-scale thermosyphon cooling system for high-performance computing data centers," in Intersociety Conf on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), 2018.
  22. A. Sridhar, M. M. S. Aly, and D. Atienza Alonso, "A semi-analytical thermal modeling framework for liquid-cooled ics," IEEE T Comput Aid D, vol. 33, no. 8, pp. 1145--1158, 2014.
  23. W. Piatek, A. Oleksiak, M. vor dem Berge, J. Hagemeyer, and E. Senechal, "Intelligent thermal management in M2DC system," in Proc. 8th Int'l Conf on Future Energy Systems, 2017, pp. 309--315.
  24. W. Piatek, A. Oleksiak, and G. Da Costa, "Energy and thermal models for simulation of workload and resource management in computing systems," Simul Model Pract Th, vol. 58, pp. 40--54, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1569190X15000684
  25. A. Sridhar, A. Vincenzi, M. Ruggiero, and D. Atienza, "Neural network-based thermal simulation of integrated circuits on gpus," IEEE T Comput Aid D, vol. 31, no. 1, pp. 23--36, Jan 2012.
  26. S. Raghav, M. Ruggiero, A. Marongiu, C. Pinto, D. Atienza, and L. Benini, "Gpu acceleration for simulating massively parallel many-core platforms," IEEE T Parall Distr, vol. 26, no. 5, pp. 1336--1349, May 2015.
  27. M. M. Sabry, D. Atienza Alonso, and F. Catthoor, "Ocean: An optimized hw/sw reliability mitigation approach for scratchpad memories in real-time socs," ACM T Embed Comput S, vol. 13, pp. 138.1--138.26, 2014.
  28. D. Zoni, L. Cremona, and W. Fornaciari, "Powerprobe: Run-time power modeling through automatic RTL instrumentation," in 2018 Design, Automation & Test in Europe Conference & Exhibition, DATE 2018, Dresden, Germany, March 19-23, 2018, pp. 743--748.
  29. D. Zoni, L. Colombo, and W. Fornaciari, "Darkcache: Energy-performance optimization of tiled multi-cores by adaptively power-gating llc banks," ACM Trans. Archit. Code Optim., vol. 15, no. 2, pp. 21:1--21:26, May 2018.
  30. S. Libutti, G. Massari, and W. Fornaciari, "Co-scheduling tasks on multi-core heterogeneous systems: An energy-aware perspective," IET Computers Digital Techniques, vol. 10, no. 2, pp. 77--84, 2016.
  31. D. Zoni, A. Barenghi, G. Pelosi, and W. Fornaciari, "A comprehensive side channel information leakage analysis of an in-order risc cpu microarchitecture," ACM TODAES, vol. 23, no. 5, Sep. 2018.
  32. R. Rabenseifner, G. Hager, and G. Jost, "Hybrid mpi/openmp parallel programming on clusters of multi-core smp nodes," in 17th Euromicro PDP, Feb 2009, pp. 427--436.
  33. J. Diaz, C. Muñoz-Caro, and A. Niño, "A survey of parallel programming models and tools in the multi and many-core era," IEEE T Parall Distr, vol. 23, no. 8, pp. 1369--1386, Aug 2012.
  34. J. L. Reyes-Ortiz, L. Oneto, and D. Anguita, "Big data analytics in the cloud: Spark on hadoop vs mpi/openmp on beowulf," Procedia Computer Science, vol. 53, pp. 121--130, 2015, INNS Conference on Big Data 2015, San Francisco, CA, USA, 8--10 August 2015.
  35. M. Jarus and A. Oleksiak, "Top-down characterization approximation based on performance counters architecture for amd processors," Simul Model Pract Th, vol. 68, pp. 146--162, 2016.

Published in

SAMOS '18: Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation
July 2018
263 pages
ISBN: 9781450364942
DOI: 10.1145/3229631

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States
