Abstract
Many parallel applications do not scale as the number of threads increases, which means that using the maximum number of threads will not always deliver the best outcome in performance or energy consumption. Therefore, many works have proposed strategies for tuning the number of threads to optimize for performance or energy. Since parallel applications may have more than one parallel region, these tuning strategies can determine a specific number of threads for each of the application's parallel regions, or determine a fixed number of threads for the whole application execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which enables adapting the number of threads at runtime. However, the use of DCT implies overheads, such as creating/destroying threads and cache warm-up. DCT's overhead can be further aggravated on Non-Uniform Memory Access (NUMA) systems, where changing the number of threads may incur remote memory accesses or, more importantly, data migration between nodes. Therefore, tuning strategies should not only determine the best number of threads locally, for each parallel region, but also be aware of the impacts of applying DCT. This work investigates how parallel regions may influence each other when DCT is employed, showing that data migration may represent a considerable overhead. Effectively, those overheads affect the strategy's solution, impacting the overall application performance and energy consumption. We demonstrate why many approaches will very likely fail when applied in simulated environments, or will hardly reach a near-optimal solution when executed on real hardware.
Notes
The Linux thread-scheduling policy focuses on balancing the work among the available resources.
We classified the applications based on the rate of L3 cache misses per instruction, using the Intel PCM tool to collect the miss and instruction counters.
The First-Touch data mapping policy places a page on the node that is running the thread that causes its first page fault [13].
Although UA is the most significant case in the group of Sect. 4.1, it was impossible to carry out a more detailed analysis of this application for two reasons: first, the high number of parallel regions makes the evaluation process extremely costly; and second, most regions are short, and the energy hardware counters are not precise for periods shorter than 0.001 s [17].
References
Alessi F, Thoman P, Georgakoudis G, Fahringer T, Nikolopoulos DS (2015) Application-level energy awareness for OpenMP. Springer, Cham, pp 219–232
Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Dagum L, Fatoohi RA, Frederickson PO, Lasinski TA, Schreiber RS, Simon HD, Venkatakrishnan V, Weeratunga SK (1991) The NAS parallel benchmarks—summary and preliminary results. In: ACM/IEEE CS. ACM, NY, USA, pp 158–165. https://doi.org/10.1145/125826.125925
Bari MAS, Chaimov N, Malik AM, Huck KA, Chapman B, Malony AD, Sarood O (2016) ARCS: adaptive runtime configuration selection for power-constrained OpenMP applications. In: 2016 IEEE international conference on cluster computing (CLUSTER), pp 461–470
Beck ACS, Lisbôa CAL, Carro L (2012) Adaptable embedded systems. Springer, Berlin
Broquedis F, Aumage O, Goglin B, Thibault S, Wacrenier PA, Namyst R (2010) Structuring the execution of OpenMP applications for multicore architectures. In: 2010 IEEE international symposium on parallel & distributed processing (IPDPS). IEEE, pp 1–10
Broquedis F, Furmento N, Goglin B, Wacrenier PA, Namyst R (2010) ForestGOMP: an efficient OpenMP environment for NUMA architectures. Int J Parallel Prog 38(5–6):418–439
Chadha G, Mahlke S, Narayanasamy S (2012) When less is more (LIMO): controlled parallelism for improved efficiency. In: CASES. USA, pp 141–150
Corbet J. Toward better NUMA scheduling. https://lwn.net/Articles/486858/
Curtis-Maury M, Blagojevic F, Antonopoulos CD, Nikolopoulos DS (2008) Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Trans Parallel Distrib Syst 19(10):1396–1410
Curtis-Maury M, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: Int CS, pp 157–166
Dashti M, Fedorova A, Funston J, Gaud F, Lachaize R, Lepers B, Quema V, Roth M (2013) Traffic management: a holistic approach to memory placement on NUMA systems. ACM SIGARCH Comput Archit News 41(1):381–394
De Sensi D (2016) Predicting performance and power consumption of parallel applications. In: PDP, pp 200–207. https://doi.org/10.1109/PDP.2016.41
Diener M, Cruz EH, Alves MA, Navaux PO, Koren I (2016) Affinity-based thread and data mapping in shared memory systems. ACM Comput Surv (CSUR) 49(4):1–38
Diener M, Cruz EH, Navaux PO (2015) Locality vs. balance: exploring data mapping policies on NUMA systems. In: 2015 23rd Euromicro international conference on parallel, distributed, and network-based processing. IEEE, pp 9–16
Diener M, Cruz EH, Navaux PO, Busse A, Heiß HU (2014) kMAF: automatic kernel-level management of thread and data affinity. In: Proceedings of the 23rd international conference on parallel architectures and compilation. ACM, pp 277–288
Diener M, Cruz EH, Pilla LL, Dupros F, Navaux PO (2015) Characterizing communication and page usage of parallel applications for thread and data mapping. Perform Eval 88:18–36
Hähnel M, Döbel B, Völp M, Härtig H (2012) Measuring energy consumption for short code paths using RAPL. SIGMETRICS Perform Eval Rev 40(3):13–17. https://doi.org/10.1145/2425248.2425252
Joao JA, Suleman MA, Mutlu O, Patt YN (2012) Bottleneck identification and scheduling in multithreaded applications. In: ASPLOS. ACM, NY, USA, pp 223–234. https://doi.org/10.1145/2150976.2151001
Jung C, Lim D, Lee J, Han S (2005) Adaptive execution techniques for SMT multiprocessor architectures. In: ACM symposium on principles and practice of parallel programming. USA, pp 236–246
Lee J, Wu H, Ravichandran M, Clark N (2010) Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. SIGARCH Comput Archit News 38(3):270–279
Lepers B, Quéma V, Fedorova A (2015) Thread and memory placement on NUMA systems: asymmetry matters. In: 2015 USENIX annual technical conference (USENIX ATC 15), pp 277–289
Lorenzon AF, Beck ACS (2019) Parallel computing hits the power wall: principles, challenges, and a survey of solutions. Springer, Berlin
Lorenzon AF, Cera MC, Beck ACS (2016) Investigating different general-purpose and embedded multicores to achieve optimal trade-offs between performance and energy. J Parallel Distrib Comput 95:107–123
Lorenzon AF, Oliveira CCD, Souza JD, Filho ACSB (2018) Aurora: seamless optimization of OpenMP applications. IEEE Trans Parallel Distrib Syst. https://doi.org/10.1109/TPDS.2018.2872992
Lorenzon AF, Sartor AL, Cera MC, Beck ACS (2015) Optimized use of parallel programming interfaces in multithreaded embedded architectures. In: 2015 IEEE computer society annual symposium on VLSI. IEEE, pp 410–415
Lorenzon AF, Souza JD, Beck ACS (2017) LAANT: a library to automatically optimize EDP for OpenMP applications. In: DATE, pp 1229–1232. https://doi.org/10.23919/DATE.2017.7927176
McCalpin JD (1995) Memory bandwidth and machine balance in current high performance computers. In: IEEE computer society technical committee on computer architecture newsletter, pp 19–25
Mucci PJ, Browne S, Deane C, Ho G (1999) PAPI: a portable interface to hardware performance counters. In: Proceedings of the department of defense HPCMP users group conference, vol 710
Petersen W, Arbenz P (2004) Introduction to parallel computing: a practical guide with examples in C. Oxford texts in applied and engineering mathematics. OUP, Oxford
Porterfield AK, Olivier SL, Bhalachandra S, Prins JF (2013) Power measurement and concurrency throttling for energy reduction in OpenMP programs. In: IEEE IPDPS, pp 884–891
Pusukuri KK, Gupta R, Bhuyan LN (2011) Thread reinforcer: dynamically determining number of threads via OS level monitoring. In: IEEE ISWC. USA, pp 116–125
Quinn M (2004) Parallel programming in C with MPI and OpenMP. McGraw-Hill Higher Education, New York City
Raasch SE, Reinhardt SK (2003) The impact of resource partitioning on SMT processors. In: PACT, pp 15–25. https://doi.org/10.1109/PACT.2003.1237998
Schwarzrock J, Lorenzon AF, Navaux PO, Beck ACS, de Freitas EP (2017) Potential gains in EDP by dynamically adapting the number of threads for OpenMP applications in embedded systems. In: 2017 VII Brazilian symposium on computing systems engineering (SBESC). IEEE, pp 79–85
Sensi DD, Torquati M, Danelutto M (2016) A reconfiguration algorithm for power-aware parallel applications. ACM Trans Archit Code Optim 13(4):43:1–43:25. https://doi.org/10.1145/3004054
Seo S, Jo G, Lee J (2011) Performance characterization of the NAS parallel benchmarks in OpenCL. In: IEEE ISWC, pp 137–148. https://doi.org/10.1109/IISWC.2011.6114174
Sridharan S, Gupta G, Sohi GS (2014) Adaptive, efficient, parallel execution of parallel programs. In: ACM SIGPLAN PLDI. ACM, NY, USA, pp 169–180
Subramanian L, Seshadri V, Kim Y, Jaiyen B, Mutlu O (2013) MISE: providing performance predictability and improving fairness in shared main memory systems. In: IEEE HPCA, pp 639–650
Suleman MA, Qureshi MK, Patt YN (2008) Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs. SIGARCH Comput Archit News 36(1):277–286
Wang W, Davidson JW, Soffa ML (2016) Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. In: 2016 IEEE international symposium on high performance computer architecture (HPCA). IEEE, pp 419–431
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, the Fundação de Amparo à Pesquisa do Estado do RS (FAPERGS) and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).
Schwarzrock, J., Jordan, M.G., Korol, G. et al. Dynamic concurrency throttling on NUMA systems and data migration impacts. Des Autom Embed Syst 25, 135–160 (2021). https://doi.org/10.1007/s10617-020-09243-5