A New Hardware Counters Based Thread Migration Strategy for NUMA Systems

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2019)

Abstract

Multicore NUMA systems present on-board memory hierarchies and communication networks that influence performance when executing shared memory parallel codes. Characterising this influence is complex, and understanding the effect of particular hardware configurations on different codes is of paramount importance. In this paper, monitoring information extracted from hardware counters at runtime is used to characterise the behaviour of each thread in the processes running in the system. This characterisation is given in terms of number of instructions per second, operational intensity, and latency of memory access. We propose to use all this information to guide a thread migration strategy that improves execution efficiency by increasing locality and affinity. Different configurations of NAS Parallel OpenMP benchmarks running concurrently on multicore systems were used to validate the benefits of the proposed thread migration strategy. Our proposal produces up to 25% improvement over the OS for heterogeneous workloads, under different and realistic locality and affinity scenarios.
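As a minimal sketch of the per-thread characterisation described above, the following Python fragment derives the three metrics from one sampling interval. The `CounterSample` fields, the helper names, and the "highest latency first" selection rule are assumptions made for illustration only; they are not the paper's data structures or its migration algorithm.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical per-thread counter readings for one sampling interval."""
    tid: int
    instructions: int      # retired instructions
    flops: int             # floating-point operations
    bytes_moved: int       # bytes transferred to/from memory
    latency_cycles: float  # mean latency of sampled memory accesses
    interval_s: float      # length of the sampling interval in seconds


def characterise(s: CounterSample) -> dict:
    """Derive the three metrics named in the abstract from one sample."""
    return {
        "ips": s.instructions / s.interval_s,
        # Operational intensity as in the Roofline model: FLOPs per byte.
        "op_intensity": s.flops / s.bytes_moved if s.bytes_moved else float("inf"),
        "latency": s.latency_cycles,
    }


def migration_candidate(samples: list[CounterSample]) -> int:
    """Return the tid with the highest memory latency (illustrative rule only)."""
    return max(samples, key=lambda s: s.latency_cycles).tid


# Made-up readings for two threads, used only to exercise the functions.
samples = [
    CounterSample(tid=101, instructions=2_000_000, flops=500_000,
                  bytes_moved=1_000_000, latency_cycles=120.0, interval_s=0.5),
    CounterSample(tid=102, instructions=1_000_000, flops=100_000,
                  bytes_moved=2_000_000, latency_cycles=480.0, interval_s=0.5),
]
metrics = {s.tid: characterise(s) for s in samples}
candidate = migration_candidate(samples)
```

On Linux, a thread selected this way could then be pinned to a core on another NUMA node with `os.sched_setaffinity(tid, {cpu})`; which core to choose is exactly the decision the proposed strategy is meant to guide, and it is not modelled here.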

Acknowledgements

This work has received financial support from the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08 and reference competitive group 2019-2021, ED431C 2018/19) and the European Regional Development Fund (ERDF). It was also funded by the Ministerio de Economía, Industria y Competitividad within the project TIN2016-76373-P.

Author information

Corresponding author

Correspondence to Oscar García Lorenzo.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

García Lorenzo, O., Laso Rodríguez, R., Fernández Pena, T., Cabaleiro Domínguez, J.C., Fernández Rivera, F., Lorenzo del Castillo, J.Á. (2020). A New Hardware Counters Based Thread Migration Strategy for NUMA Systems. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12044. Springer, Cham. https://doi.org/10.1007/978-3-030-43222-5_18

  • DOI: https://doi.org/10.1007/978-3-030-43222-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43221-8

  • Online ISBN: 978-3-030-43222-5

  • eBook Packages: Computer Science, Computer Science (R0)
