Dynamic concurrency throttling on NUMA systems and data migration impacts

Design Automation for Embedded Systems

Abstract

Many parallel applications do not scale as the number of threads increases, so using the maximum number of threads does not always deliver the best performance or energy consumption. Many works have therefore proposed strategies for tuning the number of threads to optimize for performance or energy. Since a parallel application may have more than one parallel region, these strategies either determine a specific number of threads for each parallel region or fix a single number of threads for the whole execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which adapts the number of threads at runtime. However, DCT implies overheads, such as creating/destroying threads and cache warm-up. These overheads can be further aggravated on Non-Uniform Memory Access (NUMA) systems, where changing the number of threads may incur remote memory accesses or, more importantly, data migration between nodes. Tuning strategies should therefore not only determine the best number of threads locally, for each parallel region, but also be aware of the impacts of applying DCT. This work investigates how parallel regions may influence each other when DCT is employed, showing that data migration may represent a considerable overhead. Effectively, these overheads affect the strategy's solution, impacting overall application performance and energy consumption. We demonstrate why many approaches will very likely fail when applied to simulated environments, or will hardly reach a near-optimum solution when executed on real hardware.


Notes

  1. The Linux policy for thread scheduling focuses on balancing work between the available resources.

  2. We classified the applications based on the rate of L3 cache misses per instruction, using the Intel PCM tool to collect the miss and instruction counters.

  3. The First-Touch data mapping policy places a page on the node running the thread that causes its first page fault [13].

  4. Although UA is the most significant case in group 4.1, a more detailed analysis of this application proved impossible for two reasons: first, its high number of parallel regions makes the evaluation process extremely costly; and second, most regions are short, and the energy hardware counters are not precise for periods shorter than 0.001 s [17].

References

  1. Alessi F, Thoman P, Georgakoudis G, Fahringer T, Nikolopoulos DS (2015) Application-level energy awareness for OpenMP. Springer, Cham, pp 219–232

  2. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Dagum L, Fatoohi RA, Frederickson PO, Lasinski TA, Schreiber RS, Simon HD, Venkatakrishnan V, Weeratunga SK (1991) The NAS parallel benchmarks—summary and preliminary results. In: ACM/IEEE CS. ACM, NY, USA, pp 158–165. https://doi.org/10.1145/125826.125925

  3. Bari MAS, Chaimov N, Malik AM, Huck KA, Chapman B, Malony AD, Sarood O (2016) ARCS: adaptive runtime configuration selection for power-constrained OpenMP applications. In: 2016 IEEE international conference on cluster computing (CLUSTER), pp 461–470

  4. Beck ACS, Lisbôa CAL, Carro L (2012) Adaptable embedded systems. Springer, Berlin

  5. Broquedis F, Aumage O, Goglin B, Thibault S, Wacrenier PA, Namyst R (2010) Structuring the execution of OpenMP applications for multicore architectures. In: 2010 IEEE international symposium on parallel & distributed processing (IPDPS). IEEE, pp 1–10

  6. Broquedis F, Furmento N, Goglin B, Wacrenier PA, Namyst R (2010) ForestGOMP: an efficient OpenMP environment for NUMA architectures. Int J Parallel Prog 38(5–6):418–439

  7. Chadha G, Mahlke S, Narayanasamy S (2012) When less is more (LIMO): controlled parallelism for improved efficiency. In: CASES. USA, pp 141–150

  8. Corbet J. Toward better NUMA scheduling. https://lwn.net/Articles/486858/

  9. Curtis-Maury M, Blagojevic F, Antonopoulos CD, Nikolopoulos DS (2008) Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Trans Parallel Distrib Syst 19(10):1396–1410

  10. Curtis-Maury M, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: ICS, pp 157–166

  11. Dashti M, Fedorova A, Funston J, Gaud F, Lachaize R, Lepers B, Quema V, Roth M (2013) Traffic management: a holistic approach to memory placement on NUMA systems. ACM SIGARCH Comput Archit News 41(1):381–394

  12. De Sensi D (2016) Predicting performance and power consumption of parallel applications. In: PDP, pp 200–207. https://doi.org/10.1109/PDP.2016.41

  13. Diener M, Cruz EH, Alves MA, Navaux PO, Koren I (2016) Affinity-based thread and data mapping in shared memory systems. ACM Comput Surv (CSUR) 49(4):1–38

  14. Diener M, Cruz EH, Navaux PO (2015) Locality vs. balance: exploring data mapping policies on NUMA systems. In: 2015 23rd Euromicro international conference on parallel, distributed, and network-based processing. IEEE, pp 9–16

  15. Diener M, Cruz EH, Navaux PO, Busse A, Heiß HU (2014) kMAF: automatic kernel-level management of thread and data affinity. In: Proceedings of the 23rd international conference on parallel architectures and compilation. ACM, pp 277–288

  16. Diener M, Cruz EH, Pilla LL, Dupros F, Navaux PO (2015) Characterizing communication and page usage of parallel applications for thread and data mapping. Perform Eval 88:18–36

  17. Hähnel M, Döbel B, Völp M, Härtig H (2012) Measuring energy consumption for short code paths using RAPL. SIGMETRICS Perform Eval Rev 40(3):13–17. https://doi.org/10.1145/2425248.2425252

  18. Joao JA, Suleman MA, Mutlu O, Patt YN (2012) Bottleneck identification and scheduling in multithreaded applications. In: ASPLOS. ACM, NY, USA, pp 223–234. https://doi.org/10.1145/2150976.2151001

  19. Jung C, Lim D, Lee J, Han S (2005) Adaptive execution techniques for SMT multiprocessor architectures. In: ACM symposium on principles and practice of parallel programming. USA, pp 236–246

  20. Lee J, Wu H, Ravichandran M, Clark N (2010) Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. SIGARCH Comput Archit News 38(3):270–279

  21. Lepers B, Quéma V, Fedorova A (2015) Thread and memory placement on NUMA systems: asymmetry matters. In: 2015 USENIX annual technical conference (USENIX ATC 15), pp 277–289

  22. Lorenzon AF, Beck ACS (2019) Parallel computing hits the power wall: principles, challenges, and a survey of solutions. Springer, Berlin

  23. Lorenzon AF, Cera MC, Beck ACS (2016) Investigating different general-purpose and embedded multicores to achieve optimal trade-offs between performance and energy. J Parallel Distrib Comput 95:107–123

  24. Lorenzon AF, Oliveira CCD, Souza JD, Filho ACSB (2018) Aurora: seamless optimization of OpenMP applications. IEEE Trans Parallel Distrib Syst, pp 1–1. https://doi.org/10.1109/TPDS.2018.2872992

  25. Lorenzon AF, Sartor AL, Cera MC, Beck ACS (2015) Optimized use of parallel programming interfaces in multithreaded embedded architectures. In: 2015 IEEE computer society annual symposium on VLSI. IEEE, pp 410–415

  26. Lorenzon AF, Souza JD, Beck ACS (2017) LAANT: a library to automatically optimize EDP for OpenMP applications. In: DATE, pp 1229–1232. https://doi.org/10.23919/DATE.2017.7927176

  27. McCalpin JD (1995) Memory bandwidth and machine balance in current high performance computers. In: IEEE computer society technical committee on computer architecture newsletter, pp 19–25

  28. Mucci PJ, Browne S, Deane C, Ho G (1999) PAPI: a portable interface to hardware performance counters. In: Proceedings of the department of defense HPCMP users group conference, vol 710

  29. Petersen W, Arbenz P (2004) Introduction to parallel computing: a practical guide with examples in C. Oxford texts in applied and engineering mathematics. OUP, Oxford

  30. Porterfield AK, Olivier SL, Bhalachandra S, Prins JF (2013) Power measurement and concurrency throttling for energy reduction in OpenMP programs. In: IEEE IPDPS, pp 884–891

  31. Pusukuri KK, Gupta R, Bhuyan LN (2011) Thread reinforcer: dynamically determining number of threads via OS level monitoring. In: IEEE ISWC. USA, pp 116–125

  32. Quinn M (2004) Parallel programming in C with MPI and OpenMP. McGraw-Hill Higher Education, New York City

  33. Raasch SE, Reinhardt SK (2003) The impact of resource partitioning on SMT processors. In: PACT, pp 15–25. https://doi.org/10.1109/PACT.2003.1237998

  34. Schwarzrock J, Lorenzon AF, Navaux PO, Beck ACS, de Freitas EP (2017) Potential gains in EDP by dynamically adapting the number of threads for OpenMP applications in embedded systems. In: 2017 VII Brazilian symposium on computing systems engineering (SBESC). IEEE, pp 79–85

  35. Sensi DD, Torquati M, Danelutto M (2016) A reconfiguration algorithm for power-aware parallel applications. TACO 13(4):43-1–43-25. https://doi.org/10.1145/3004054

  36. Seo S, Jo G, Lee J (2011) Performance characterization of the NAS parallel benchmarks in OpenCL. In: IEEE ISWC, pp 137–148. https://doi.org/10.1109/IISWC.2011.6114174

  37. Sridharan S, Gupta G, Sohi GS (2014) Adaptive, efficient, parallel execution of parallel programs. In: ACM SIGPLAN PLDI. ACM, NY, USA, pp 169–180

  38. Subramanian L, Seshadri V, Kim Y, Jaiyen B, Mutlu O (2013) MISE: providing performance predictability and improving fairness in shared main memory systems. In: IEEE HPCA, pp 639–650

  39. Suleman MA, Qureshi MK, Patt YN (2008) Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs. SIGARCH Comput Archit News 36(1):277–286

  40. Wang W, Davidson JW, Soffa ML (2016) Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. In: 2016 IEEE international symposium on high performance computer architecture (HPCA). IEEE, pp 419–431

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, the Fundação de Amparo à Pesquisa do Estado do RS (FAPERGS) and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

Author information

Corresponding author

Correspondence to Janaina Schwarzrock.

About this article

Cite this article

Schwarzrock, J., Jordan, M.G., Korol, G. et al. Dynamic concurrency throttling on NUMA systems and data migration impacts. Des Autom Embed Syst 25, 135–160 (2021). https://doi.org/10.1007/s10617-020-09243-5
