research-article

Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems

SAMOS '18: Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation

Pages 187 - 194

https://doi.org/10.1145/3229631.3239368

Published: 15 July 2018

Abstract

The transition to Exascale computing will be characterised by a broader range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging, such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is moving towards opening its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, which widens the range of acceleration options: from the traditional CPU-only organization, to the CPU-plus-GPU organization that currently dominates the Green500¹, to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].

References

[1]

B. Koller, N. Struckmann, J. Buchholz, and M. Gienger, "Towards an environment to deliver high performance computing to small and medium enterprises," in Sustained Simulation Performance 2015. Cham: Springer International Publishing, 2015, pp. 41--50.

[2]

J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, A. Cilardo, W. Fornaciari, M. Kovac, F. Roudet, and D. Zoni, "The MANGO FET-HPC Project: An overview," in IEEE 18th Int'l Conf on Computational Science and Engineering (CSE). IEEE, 2015, pp. 351--354.

[3]

J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, A. Cilardo, W. Fornaciari, Y. Hoornenborg, M. Kovac, B. Maitre, G. Massari, H. Mlinaric, E. Papastefanakis, F. Roudet, R. Tornero, and D. Zoni, "Enabling HPC for QoS-sensitive applications: The MANGO approach," in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2016, pp. 702--707.

[4]

G. Agosta, W. Fornaciari, G. Massari, A. Pupykina, F. Reghenzani, and M. Zanella, "Managing Heterogeneous Resources in HPC Systems," in Proc. of PARMA-DITAM '18. ACM, 2018, pp. 7--12.

[5]

A. Pupykina and G. Agosta, "Optimizing Memory Management in Deeply Heterogeneous HPC Accelerators," in 2017 46th Int'l Conf on Parallel Processing Workshops (ICPPW), Aug 2017, pp. 291--300.

[6]

J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, E. Cappe, A. Cilardo, L. Dragic, A. Dray, A. Duspara, W. Fornaciari, E. Fusella, M. Gagliardi, G. Guillaume, D. Hofman, Y. Hoornenborg, A. Iranfar, M. Kovac, S. Libutti, B. Maitre, J. M. Martínez, G. Massari, K. Meinds, H. Mlinaric, E. Papastefanakis, T. Picornell, I. Piljic, A. Pupykina, F. Reghenzani, I. Staub, R. Tornero, M. Zanella, M. Zapater, and D. Zoni, "Exploring manycore architectures for next-generation HPC systems through the MANGO approach," Microprocessors and Microsystems, vol. 61, pp. 154--170, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933118300243

[7]

L. Huang and Q. Xu, "Characterizing the lifetime reliability of manycore processors with core-level redundancy," in 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2010, pp. 680--685.

[8]

C. L. Chou and R. Marculescu, "FARM: Fault-aware resource management in NoC-based multiprocessor platforms," in 2011 Design, Automation & Test in Europe, March 2011, pp. 1--6.

[9]

P. Mercati, F. Paterna, A. Bartolini, L. Benini, and T. Rosing, "WARM: Workload-aware reliability management in Linux/Android," IEEE Trans on CAD of Integrated Circuits and Systems, 2016.

[10]

M. H. Haghbayan, A. Miele, A. M. Rahmani, P. Liljeberg, and H. Tenhunen, "A lifetime-aware runtime mapping approach for many-core systems in the dark silicon era," in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2016, pp. 854--857.

[11]

P. Bellasi, G. Massari, and W. Fornaciari, "Effective runtime resource management using Linux control groups with the BarbequeRTM framework," ACM Trans. Embed. Comput. Syst., vol. 14, no. 2, pp. 39:1--39:17, Mar. 2015.

[12]

A. Iranfar, F. Terraneo, W. A. Simon, L. Dragic, I. Piljic, M. Zapater, W. Fornaciari, M. Kovac, and D. Atienza Alonso, "Thermal characterization of next-generation workloads on heterogeneous MPSoCs," in International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), 2017, pp. 1--6.

[13]

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, "Toward exascale resilience," Int. J. High Perform. Comput. Appl., vol. 23, no. 4, pp. 374--388, Nov. 2009.

[14]

C. Curtsinger and E. D. Berger, "Stabilizer: Statistically sound performance evaluation," SIGARCH Comput. Archit. News, vol. 41, no. 1, pp. 219--228, Mar. 2013.

[15]

F. J. Cazorla, J. Abella, J. Andersson, T. Vardanega, F. Vatrinet, I. Bate, I. Broster, M. Azkarate-askasua, F. Wartel, L. Cucu, F. Cros, G. Farrall, A. Gogonel, A. Gianarro, B. Triquet, C. Hernández, C. Lo, C. Maxim, D. Morales, E. Quiñones, E. Mezzetti, L. Kosmidis, I. Agirre, M. Fernández, M. Slijepcevic, P. Conmy, and W. Talaboulma, "PROXIMA: improving measurement-based timing analysis through randomisation and probabilistic analysis," in 2016 Euromicro DSD, 2016, pp. 276--285.

[16]

F. J. Cazorla, T. Vardanega, E. Quiñones, and J. Abella, "Upper-bounding Program Execution Time with Extreme Value Theory," in 13th Int'l Workshop on Worst-Case Execution Time Analysis, ser. OASIcs, vol. 30, Germany, 2013, pp. 64--76. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2013/4123

[17]

A. K. Coskun, T. S. Rosing, K. Mihic, G. De Micheli, and Y. Leblebici, "Analysis and optimization of MPSoC reliability," Journal of Low Power Electronics, vol. 2, no. 1, pp. 56--69, 2006. [Online]. Available: https://www.ingentaconnect.com/content/asp/jolpe/2006/00000002/00000001/art0008

[18]

A. K. Coskun, T. S. Rosing, and K. C. Gross, "Temperature management in multiprocessor SoCs using online learning," in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 890--893.

[19]

W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy, "Compact thermal modeling for temperature-aware design," in Proc. 41st Design Automation Conference, July 2004, pp. 878--883.

[20]

J. K. M. Stansberry, "Uptime institute 2013 data center industry survey," 2013.

[21]

A. Seuret, A. Iranfar, M. Zapater, J. R. Thome, and D. Atienza, "Design of a two-phase gravity-driven micro-scale thermosyphon cooling system for high-performance computing data centers," in Intersociety Conf on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), 2018.

[22]

A. Sridhar, M. M. S. Aly, and D. Atienza Alonso, "A semi-analytical thermal modeling framework for liquid-cooled ICs," IEEE T Comput Aid D, vol. 33, no. 8, pp. 1145--1158, 2014.

[23]

W. Piatek, A. Oleksiak, M. vor dem Berge, J. Hagemeyer, and E. Senechal, "Intelligent thermal management in M2DC system," in Proc. 8th Int'l Conf on Future Energy Systems, 2017, pp. 309--315.

[24]

W. Piatek, A. Oleksiak, and G. Da Costa, "Energy and thermal models for simulation of workload and resource management in computing systems," Simul Model Pract Th, vol. 58, pp. 40--54, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1569190X15000684

[25]

A. Sridhar, A. Vincenzi, M. Ruggiero, and D. Atienza, "Neural network-based thermal simulation of integrated circuits on gpus," IEEE T Comput Aid D, vol. 31, no. 1, pp. 23--36, Jan 2012.

[26]

S. Raghav, M. Ruggiero, A. Marongiu, C. Pinto, D. Atienza, and L. Benini, "GPU acceleration for simulating massively parallel many-core platforms," IEEE T Parall Distr, vol. 26, no. 5, pp. 1336--1349, May 2015.

[27]

M. M. Sabry, D. Atienza Alonso, and F. Catthoor, "OCEAN: An optimized HW/SW reliability mitigation approach for scratchpad memories in real-time SoCs," ACM T Embed Comput S, vol. 13, pp. 138:1--138:26, 2014.

[28]

D. Zoni, L. Cremona, and W. Fornaciari, "PowerProbe: Run-time power modeling through automatic RTL instrumentation," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, March 2018, pp. 743--748.

[29]

D. Zoni, L. Colombo, and W. Fornaciari, "DarkCache: Energy-performance optimization of tiled multi-cores by adaptively power-gating LLC banks," ACM Trans. Archit. Code Optim., vol. 15, no. 2, pp. 21:1--21:26, May 2018.

[30]

S. Libutti, G. Massari, and W. Fornaciari, "Co-scheduling tasks on multi-core heterogeneous systems: An energy-aware perspective," IET Computers & Digital Techniques, vol. 10, no. 2, pp. 77--84, 2016.

[31]

D. Zoni, A. Barenghi, G. Pelosi, and W. Fornaciari, "A comprehensive side channel information leakage analysis of an in-order RISC CPU microarchitecture," ACM TODAES, vol. 23, no. 5, Sep. 2018.

[32]

R. Rabenseifner, G. Hager, and G. Jost, "Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes," in 17th Euromicro PDP, Feb 2009, pp. 427--436.

[33]

J. Diaz, C. Muñoz Caro, and A. Niño, "A survey of parallel programming models and tools in the multi and many-core era," IEEE T Parall Distr, vol. 23, no. 8, pp. 1369--1386, Aug 2012.

[34]

J. L. Reyes-Ortiz, L. Oneto, and D. Anguita, "Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf," Procedia Computer Science, vol. 53, pp. 121--130, 2015, INNS Conference on Big Data 2015 Program, San Francisco, CA, USA, 8--10 August 2015.

[35]

M. Jarus and A. Oleksiak, "Top-down characterization approximation based on performance counters architecture for AMD processors," Simul Model Pract Th, vol. 68, pp. 146--162, 2016.

Index Terms: Computer systems organization

Information

Published In

ACM Other conferences

SAMOS '18: Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation

July 2018

263 pages

ISBN:9781450364942

DOI:10.1145/3229631

General Chair: Trevor Mudge, University of Michigan - Ann Arbor

Program Chair: Dionisios N. Pnevmatikatos, Technical University of Crete and ICS - FORTH, Greece

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Horizon 2020

Conference

SAMOS XVIII

SAMOS XVIII: Architectures, Modeling, and Simulation

July 15 - 19, 2018

Pythagorion, Greece

Bibliometrics

Total Citations: 14
Total Downloads: 160
Downloads (last 12 months): 8
Downloads (last 6 weeks): 1

Reflects downloads up to 02 Mar 2025

Cited By

Bakhshalipour M, Gibbons P (2024). Tartan: Microarchitecting a Robotic Processor. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 548-565. Online publication date: 29-Jun-2024.
https://doi.org/10.1109/ISCA59077.2024.00047

Reghenzani F, Bhuiyan A, Fornaciari W, Guo Z (2021). A Multi-Level DPM Approach for Real-Time DAG Tasks in Heterogeneous Processors. 2021 IEEE Real-Time Systems Symposium (RTSS), pp. 14-26. Online publication date: Dec-2021.
https://doi.org/10.1109/RTSS52674.2021.00014

Reghenzani F, Santinelli L, Fornaciari W (2020). Dealing with Uncertainty in pWCET Estimations. ACM Transactions on Embedded Computing Systems, 19:5, pp. 1-23. Online publication date: 26-Sep-2020.
https://dl.acm.org/doi/10.1145/3396234

Bhuiyan A, Reghenzani F, Fornaciari W, Guo Z (2020). Optimizing Energy in Non-preemptive Mixed-Criticality Scheduling by Exploiting Probabilistic Information. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1. Online publication date: 2020.
https://doi.org/10.1109/TCAD.2020.3012231

Premi L, Reghenzani F, Massari G, Fornaciari W (2020). A Game Theory Approach to Heterogeneous Resource Management: Work-in-Progress. 2020 International Conference on Embedded Software (EMSOFT), pp. 25-27. Online publication date: 20-Sep-2020.
https://doi.org/10.1109/EMSOFT51651.2020.9244046

Reghenzani F, Massari G, Fornaciari W (2020). Timing Predictability in High-Performance Computing with Probabilistic Real-Time. IEEE Access, pp. 1-1. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.3038559

Agosta G, Fornaciari W, Atienza D, Canal R, Cilardo A, Flich Cardo J, Hernandez Luz C, Kulczewski M, Massari G, Tornero Gavilá R, Zapater M (2020). The RECIPE approach to challenges in deeply heterogeneous high performance systems. Microprocessors & Microsystems, 77:C. Online publication date: 1-Sep-2020.
https://dl.acm.org/doi/10.1016/j.micpro.2020.103185

Reghenzani F, Massari G, Fornaciari W (2020). Probabilistic-WCET reliability. Microprocessors & Microsystems, 77:C. Online publication date: 1-Sep-2020.
https://dl.acm.org/doi/10.1016/j.micpro.2020.103135

Cremona L, Fornaciari W, Galimberti A, Romanoni A, Zoni D (2020). VGM-Bench: FPU Benchmark Suite for Computer Vision, Computer Graphics and Machine Learning Applications. Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 323-335. Online publication date: 7-Oct-2020.
https://doi.org/10.1007/978-3-030-60939-9_23

Reghenzani F, Massari G, Fornaciari W (2019). The Real-Time Linux Kernel. ACM Computing Surveys, 52:1, pp. 1-36. Online publication date: 21-Feb-2019.
https://dl.acm.org/doi/10.1145/3297714
