research-article

Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems

Published: 15 July 2018

Abstract

The transition to Exascale computing is going to be characterised by an increased range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is moving towards enabling access to its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, widening the range of options for acceleration, moving from the traditional CPU-only organization to the CPU plus GPU which currently dominates the Green500, and on to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].

References

[1]
B. Koller, N. Struckmann, J. Buchholz, and M. Gienger, "Towards an environment to deliver high performance computing to small and medium enterprises," in Sustained Simulation Performance 2015. Cham: Springer International Publishing, 2015, pp. 41--50.
[2]
J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, A. Cilardo, W. Fornaciari, M. Kovac, F. Roudet, and D. Zoni, "The MANGO FET-HPC Project: An overview," in IEEE 18th Int'l Conf on Computational Science and Engineering (CSE). IEEE, 2015, pp. 351--354.
[3]
J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, A. Cilardo, W. Fornaciari, Y. Hoornenborg, M. Kovac, B. Maitre, G. Massari, H. Mlinaric, E. Papastefanakis, F. Roudet, R. Tornero, and D. Zoni, "Enabling HPC for QoS-sensitive applications: The MANGO approach," in 2016 Design, Automation Test in Europe Conference Exhibition (DATE), March 2016, pp. 702--707.
[4]
G. Agosta, W. Fornaciari, G. Massari, A. Pupykina, F. Reghenzani, and M. Zanella, "Managing Heterogeneous Resources in HPC Systems," in Proc. of PARMA-DITAM '18. ACM, 2018, pp. 7--12.
[5]
A. Pupykina and G. Agosta, "Optimizing Memory Management in Deeply Heterogeneous HPC Accelerators," in 2017 46th Int'l Conf on Parallel Processing Workshops (ICPPW), Aug 2017, pp. 291--300.
[6]
J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, E. Cappe, A. Cilardo, L. Dragic, A. Dray, A. Duspara, W. Fornaciari, E. Fusella, M. Gagliardi, G. Guillaume, D. Hofman, Y. Hoornenborg, A. Iranfar, M. Kovac, S. Libutti, B. Maitre, J. M. Martínez, G. Massari, K. Meinds, H. Mlinaric, E. Papastefanakis, T. Picornell, I. Piljic, A. Pupykina, F. Reghenzani, I. Staub, R. Tornero, M. Zanella, M. Zapater, and D. Zoni, "Exploring manycore architectures for next-generation HPC systems through the MANGO approach," Microprocessors and Microsystems, vol. 61, pp. 154--170, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933118300243
[7]
L. Huang and Q. Xu, "Characterizing the lifetime reliability of manycore processors with core-level redundancy," in 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2010, pp. 680--685.
[8]
C. L. Chou and R. Marculescu, "Farm: Fault-aware resource management in noc-based multiprocessor platforms," in 2011 Design, Automation Test in Europe, March 2011, pp. 1--6.
[9]
P. Mercati, F. Paterna, A. Bartolini, L. Benini, and T. Rosing, "Warm: Workload-aware reliability management in linux/android," IEEE Trans on CAD of Integrated Circuits and Systems, 2016.
[10]
M. H. Haghbayan, A. Miele, A. M. Rahmani, P. Liljeberg, and H. Tenhunen, "A lifetime-aware runtime mapping approach for many-core systems in the dark silicon era," in 2016 Design, Automation Test in Europe Conference Exhibition (DATE), March 2016, pp. 854--857.
[11]
P. Bellasi, G. Massari, and W. Fornaciari, "Effective runtime resource management using linux control groups with the barbequertrm framework," ACM Trans. Embed. Comput. Syst., vol. 14, no. 2, pp. 39:1--39:17, Mar. 2015.
[12]
A. Iranfar, F. Terraneo, W. A. Simon, L. Dragic, I. Piljic, M. Zapater, W. Fornaciari, M. Kovac, and D. Atienza Alonso, "Thermal characterization of next-generation workloads on heterogeneous mpsocs," in International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), 2017, pp. 1--6.
[13]
F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, "Toward exascale resilience," Int. J. High Perform. Comput. Appl., vol. 23, no. 4, pp. 374--388, Nov. 2009.
[14]
C. Curtsinger and E. D. Berger, "Stabilizer: Statistically sound performance evaluation," SIGARCH Comput. Archit. News, vol. 41, no. 1, pp. 219--228, Mar. 2013.
[15]
F. J. Cazorla, J. Abella, J. Andersson, T. Vardanega, F. Vatrinet, I. Bate, I. Broster, M. Azkarate-askasua, F. Wartel, L. Cucu, F. Cros, G. Farrall, A. Gogonel, A. Gianarro, B. Triquet, C. Hernández, C. Lo, C. Maxim, D. Morales, E. Quiñones, E. Mezzetti, L. Kosmidis, I. Agirre, M. Fernández, M. Slijepcevic, P. Conmy, and W. Talaboulma, "PROXIMA: improving measurement-based timing analysis through randomisation and probabilistic analysis," in 2016 Euromicro DSD, 2016, pp. 276--285.
[16]
F. J. Cazorla, T. Vardanega, E. Quiñones, and J. Abella, "Upper-bounding Program Execution Time with Extreme Value Theory," in 13th Int'l Workshop on Worst-Case Execution Time Analysis, ser. OASIcs, vol. 30, Germany, 2013, pp. 64--76. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2013/4123
[17]
A. K. Coskun, T. S. Rosing, K. Mihic, G. De Micheli, and Y. Leblebici, "Analysis and optimization of mpsoc reliability," Journal of Low Power Electronics, vol. 2, no. 1, pp. 56--69, 2006. [Online]. Available: https://www.ingentaconnect.com/content/asp/jolpe/2006/00000002/00000001/art0008
[18]
A. K. Coskun, T. S. Rosing, and K. C. Gross, "Temperature management in multiprocessor socs using online learning," in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 890--893.
[19]
W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy, "Compact thermal modeling for temperature-aware design," in Proceedings. 41st Design Automation Conference, 2004., July 2004, pp. 878--883.
[20]
J. K. M. Stansberry, "Uptime institute 2013 data center industry survey," 2013.
[21]
A. Seuret, A. Iranfar, M. Zapater, J. R. Thome, and D. Atienza, "Design of a two-phase gravity-driven micro-scale thermosyphon cooling system for high-performance computing data centers," in Intersociety Conf on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), 2018.
[22]
A. Sridhar, M. M. S. Aly, and D. Atienza Alonso, "A semi-analytical thermal modeling framework for liquid-cooled ics," IEEE T Comput Aid D, vol. 33, no. 8, pp. 1145--1158, 2014.
[23]
W. Piatek, A. Oleksiak, M. vor dem Berge, J. Hagemeyer, and E. Senechal, "Intelligent thermal management in M2DC system," in Proc. 8th Int'l Conf on Future Energy Systems, 2017, pp. 309--315.
[24]
W. Piatek, A. Oleksiak, and G. Da Costa, "Energy and thermal models for simulation of workload and resource management in computing systems," Simul Model Pract Th, vol. 58, pp. 40--54, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1569190X15000684
[25]
A. Sridhar, A. Vincenzi, M. Ruggiero, and D. Atienza, "Neural network-based thermal simulation of integrated circuits on gpus," IEEE T Comput Aid D, vol. 31, no. 1, pp. 23--36, Jan 2012.
[26]
S. Raghav, M. Ruggiero, A. Marongiu, C. Pinto, D. Atienza, and L. Benini, "Gpu acceleration for simulating massively parallel many-core platforms," IEEE T Parall Distr, vol. 26, no. 5, pp. 1336--1349, May 2015.
[27]
M. M. Sabry, D. Atienza Alonso, and F. Catthoor, "Ocean: An optimized hw/sw reliability mitigation approach for scratchpad memories in real-time socs," ACM T Embed Comput S, vol. 13, pp. 138.1--138.26, 2014.
[28]
D. Zoni, L. Cremona, and W. Fornaciari, "Powerprobe: Run-time power modeling through automatic RTL instrumentation," in 2018 Design, Automation & Test in Europe Conference & Exhibition, DATE 2018, Dresden, Germany, March 19-23, 2018, 2018, pp. 743--748.
[29]
D. Zoni, L. Colombo, and W. Fornaciari, "Darkcache: Energy-performance optimization of tiled multi-cores by adaptively power-gating llc banks," ACM Trans. Archit. Code Optim., vol. 15, no. 2, pp. 21:1--21:26, May 2018.
[30]
S. Libutti, G. Massari, and W. Fornaciari, "Co-scheduling tasks on multi-core heterogeneous systems: An energy-aware perspective," IET Computers Digital Techniques, vol. 10, no. 2, pp. 77--84, 2016.
[31]
D. Zoni, A. Barenghi, G. Pelosi, and W. Fornaciari, "A comprehensive side channel information leakage analysis of an in-order risc cpu microarchitecture," ACM TODAES, vol. 23, no. 5, Sep. 2018.
[32]
R. Rabenseifner, G. Hager, and G. Jost, "Hybrid mpi/openmp parallel programming on clusters of multi-core smp nodes," in 17th Euromicro PDP, Feb 2009, pp. 427--436.
[33]
J. Diaz, C. Muñoz-Caro, and A. Niño, "A survey of parallel programming models and tools in the multi and many-core era," IEEE T Parall Distr, vol. 23, no. 8, pp. 1369--1386, Aug 2012.
[34]
J. L. Reyes-Ortiz, L. Oneto, and D. Anguita, "Big data analytics in the cloud: Spark on hadoop vs mpi/openmp on beowulf," Procedia Computer Science, vol. 53, pp. 121--130, 2015, INNS Conference on Big Data 2015 Program San Francisco, CA, USA 8--10 August 2015.
[35]
M. Jarus and A. Oleksiak, "Top-down characterization approximation based on performance counters architecture for amd processors," Simul Model Pract Th, vol. 68, pp. 146--162, 2016.


    Published In

    SAMOS '18: Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation
    July 2018
    263 pages
    ISBN:9781450364942
    DOI:10.1145/3229631
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Conference

    SAMOS XVIII
    SAMOS XVIII: Architectures, Modeling, and Simulation
    July 15 - 19, 2018
    Pythagorion, Greece


    Cited By

    • (2024) Tartan: Microarchitecting a Robotic Processor. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 548-565. DOI: 10.1109/ISCA59077.2024.00047. Online publication date: 29-Jun-2024
    • (2021) A Multi-Level DPM Approach for Real-Time DAG Tasks in Heterogeneous Processors. 2021 IEEE Real-Time Systems Symposium (RTSS), pp. 14-26. DOI: 10.1109/RTSS52674.2021.00014. Online publication date: Dec-2021
    • (2020) Dealing with Uncertainty in pWCET Estimations. ACM Transactions on Embedded Computing Systems, 19:5, pp. 1-23. DOI: 10.1145/3396234. Online publication date: 26-Sep-2020
    • (2020) Optimizing Energy in Non-preemptive Mixed-Criticality Scheduling by Exploiting Probabilistic Information. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1. DOI: 10.1109/TCAD.2020.3012231. Online publication date: 2020
    • (2020) A Game Theory Approach to Heterogeneous Resource Management: Work-in-Progress. 2020 International Conference on Embedded Software (EMSOFT), pp. 25-27. DOI: 10.1109/EMSOFT51651.2020.9244046. Online publication date: 20-Sep-2020
    • (2020) Timing Predictability in High-Performance Computing with Probabilistic Real-Time. IEEE Access, pp. 1-1. DOI: 10.1109/ACCESS.2020.3038559. Online publication date: 2020
    • (2020) The RECIPE approach to challenges in deeply heterogeneous high performance systems. Microprocessors & Microsystems, 77:C. DOI: 10.1016/j.micpro.2020.103185. Online publication date: 1-Sep-2020
    • (2020) Probabilistic-WCET reliability. Microprocessors & Microsystems, 77:C. DOI: 10.1016/j.micpro.2020.103135. Online publication date: 1-Sep-2020
    • (2020) VGM-Bench: FPU Benchmark Suite for Computer Vision, Computer Graphics and Machine Learning Applications. Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 323-335. DOI: 10.1007/978-3-030-60939-9_23. Online publication date: 7-Oct-2020
    • (2019) The Real-Time Linux Kernel. ACM Computing Surveys, 52:1, pp. 1-36. DOI: 10.1145/3297714. Online publication date: 21-Feb-2019