Building and using application utility models to dynamically choose thread counts

The Journal of Supercomputing

Abstract

On today’s multiprocessor systems, multi-threaded applications that execute simultaneously contend for cache space and CPU time. This contention can be managed by changing each application’s thread count. In this paper, we describe a technique to configure thread count using utility models. A utility model predicts an application’s performance given its own thread count and the thread counts of the other workloads. Built offline with linear regression, utility models are used online by a system policy to dynamically configure applications’ thread counts. We present a policy that uses the models to maximize throughput while maintaining QoS. Our approach improves system throughput by 6% and meets QoS 22% more often than the best evaluated traditional policy.
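The abstract describes a utility model as a predictor, built offline with linear regression, that maps an application's own thread count and the thread counts of the other workloads to expected performance. A minimal sketch of that idea follows. It assumes a single aggregate co-runner thread count and a plain linear form, both of which are simplifications, and `fit_utility_model` and `predict` are hypothetical names rather than anything taken from the paper.

```python
def fit_utility_model(samples):
    """Ordinary least-squares fit of perf ~ b0 + b1*app + b2*other.

    samples: list of (app_threads, other_threads, measured_perf) tuples,
    i.e. offline profile points (hypothetical format).
    Returns the coefficients [b0, b1, b2].
    """
    rows = [(1.0, float(a), float(o)) for a, o, _ in samples]
    ys = [float(p) for _, _, p in samples]
    n = 3
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("profile points do not determine the model")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(coeffs, app_threads, other_threads):
    """Predicted performance for a candidate thread-count configuration."""
    b0, b1, b2 = coeffs
    return b0 + b1 * app_threads + b2 * other_threads
```

An online policy could call `predict` for each candidate thread count and keep the configuration with the best predicted outcome; the fit itself only needs to run once per application, offline.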

Notes

  1. For brevity, the term “CPU-bound threads” refers to both memory-bound threads and CPU-bound threads, although we recognize the distinction between the two.

  2. This approach is also amenable to parallelization.

  3. That is, multiple settings of (appStep, otherStep) can result in the same number of profile points being selected.

  4. We limit models to a single area of linear growth, followed by an optional area of no-growth.

  5. A performance plateau at \(\infty \) indicates expected continued scaling.

  6. The constant large number should be greater than any possible expected system throughput.

  7. An application QoS goal of \(Q\) means that the application should execute at least \(Q\) times as fast as its single-threaded performance. \(Q\) should be chosen by the user after considering minimum performance requirements and application scalability.
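Notes 4, 5, and 7 together suggest a model shape (linear growth followed by an optional plateau, with a plateau at infinity meaning continued scaling) and a QoS test (at least \(Q\) times single-threaded performance). The sketch below is one possible reading of those notes, not the paper's implementation: it ignores the effect of other workloads' thread counts, and all function and parameter names are hypothetical.

```python
import math


def plateau_utility(threads, intercept, slope, plateau=math.inf):
    """Linear growth up to `plateau` threads, then no further growth.

    plateau=math.inf models expected continued scaling (note 5);
    a real-valued plateau such as 4.5 is also accepted.
    """
    return intercept + slope * min(threads, plateau)


def meets_qos(threads, q, intercept, slope, plateau=math.inf):
    """True if predicted performance is at least q times the
    predicted single-threaded performance (note 7)."""
    single = plateau_utility(1, intercept, slope, plateau)
    return plateau_utility(threads, intercept, slope, plateau) >= q * single


def min_threads_for_qos(q, max_threads, intercept, slope, plateau=math.inf):
    """Smallest thread count predicted to meet the QoS goal, or None."""
    for t in range(1, max_threads + 1):
        if meets_qos(t, q, intercept, slope, plateau):
            return t
    return None
```

Returning `None` when no thread count satisfies the goal reflects the advice in note 7 that \(Q\) should be chosen with the application's scalability in mind: a goal above the plateau cannot be met by adding threads.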

References

  1. Moore RW, Childers BR (2012) Using utility prediction models to dynamically choose program thread counts. In: 2012 IEEE international symposium on performance analysis of systems and software (ISPASS). doi:10.1109/ISPASS.2012.6189220

  2. Olukotun K, Nayfeh BA, Hammond L, Wilson K, Chang K (1996) The case for a single-chip multiprocessor. In: ASPLOS VII: Proceedings of the seventh international conference on architectural support for programming languages and operating systems. ACM, New York, NY, USA, pp 2–11

  3. Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of the 17th international conference on parallel architectures and compilation techniques, PACT ’08. ACM, New York. doi:10.1145/1454115.1454128

  4. Moore RW, Childers BR (2011) Inflation and deflation of self-adaptive applications. In: Proceedings of the 6th international symposium on software engineering for adaptive and self-managing systems, SEAMS ’11. ACM, New York. doi:10.1145/1988008.1988041

  5. Yu C, Petrov P (2010) Adaptive multi-threading for dynamic workloads in embedded multiprocessors. In: Proceedings of the 23rd symposium on integrated circuits and system design, SBCCI ’10. ACM, New York. doi:10.1145/1854153.1854173

  6. Raman A, Zaks A, Lee JW, August DI (2012) Parcae: a system for flexible parallel execution. In: Proceedings of the 33rd ACM SIGPLAN conference on programming language design and implementation, PLDI ’12. ACM, New York. doi:10.1145/2254064.2254082

  7. Lee J, Wu H, Ravichandran M, Clark N (2010) Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. In: Proceedings of the 37th annual international symposium on computer architecture, ISCA ’10. ACM, New York. doi:10.1145/1815961.1815996

  8. Bienia C, Li K (2010) Fidelity and scaling of the PARSEC benchmark inputs. In: 2010 IEEE international symposium on workload characterization (IISWC). doi:10.1109/IISWC.2010.5649519

  9. Tian K, Jiang Y, Zhang EZ, Shen X (2010) An input-centric paradigm for program dynamic optimizations. In: Proceedings of the ACM international conference on object oriented programming systems languages and applications, OOPSLA ’10. ACM, New York. doi:10.1145/1869459.1869471

  10. LuxRender Team (2012) LuxRender v0.8. http://www.luxrender.net

  11. Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro 30(2). doi:10.1109/MM.2010.31

  12. Ahmad SB (2011) On improved processor allocation in 2D mesh-based multicomputers: controlled splitting of parallel requests. In: Proceedings of the 2011 international conference on communication computing and security, ICCCS ’11. ACM, New York. doi:10.1145/1947940.1947984

  13. Leung LF, Tsui CY, Ki WH (2004) Minimizing energy consumption of multiple-processors-core systems with simultaneous task allocation, scheduling and voltage assignment. In: Proceedings of the 2004 Asia and South Pacific design automation conference, ASP-DAC ’04, IEEE Press, Piscataway. http://portal.acm.org/citation.cfm?id=1015090.1015267

  14. Kandemir M, Muralidhara SP, Narayanan SHK, Zhang Y, Ozturk O (2009) Optimizing shared cache behavior of chip multiprocessors. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture, MICRO 42. ACM, New York. doi:10.1145/1669112.1669176

  15. Charles P, Grothoff C, Saraswat V, Donawa C, Kielstra A, Ebcioglu K, von Praun C, Sarkar V (2005) X10: an object-oriented approach to nonuniform cluster computing. In: OOPSLA ’05: Proceedings of the 20th annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications. ACM, New York, NY, USA, pp 519–538

  16. Leiserson CE (2009) The Cilk++ concurrency platform. In: Proceedings of the 46th annual design automation conference, DAC ’09. ACM, New York. doi:10.1145/1629911.1630048

  17. OpenMP Architecture Review Board. OpenMP application program interface v3.0. http://www.openmp.org/mp-documents/spec30.pdf

  18. Message Passing Interface Forum. MPI: a message-passing interface standard version 2.2. http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf

  19. Calder B, Grunwald D, Jones M, Lindsay D, Martin J, Mozer M, Zorn B. Evidence-based static branch prediction using machine learning. ACM Trans Program Lang Syst 19(1). doi:10.1145/239912.239923

  20. Chen G, Kandemir M (2005) Optimizing embedded applications using programmer-inserted hints. In: Proceedings of the 2005 Asia and South Pacific design automation conference, ASP-DAC ’05. ACM, New York. doi:10.1145/1120725.1120794

  21. Suganuma T, Yasue T, Kawahito M, Komatsu H, Nakatani T. Design and evaluation of dynamic optimizations for a Java just-in-time compiler. ACM Trans Program Lang Syst 27(4). doi:10.1145/1075382.1075386

  22. Maury MC, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: Proceedings of the 20th annual international conference on supercomputing, ICS ’06. ACM, New York. doi:10.1145/1183401.1183426

  23. Itzkowitz M, Maruyama Y (2010) HPC profiling with the Sun Studio performance tools. In: Muller MS, Resch MM, Schulz A, Nagel WE (eds) Tools for high performance computing 2009. Springer, Berlin, p 6. doi:10.1007/978-3-642-11261-4_6

  24. Pusukuri KK, Gupta R, Bhuyan LN. Thread tranquilizer: dynamically reducing performance variation. ACM Trans Archit Code Optim 8(4). doi:10.1145/2086696.2086725

  25. Becchi M, Crowley P (2006) Dynamic thread assignment on heterogeneous multiprocessor architectures. In: Proceedings of the 3rd conference on computing frontiers, CF ’06. ACM, New York. doi:10.1145/1128022.1128029

  26. Wang Z, O’Boyle MF (2009) Mapping parallelism to multi-cores: a machine learning based approach. In: Proceedings of the 14th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP ’09. ACM, New York. doi:10.1145/1504176.1504189

  27. Martinez JF, Ipek E. Dynamic multicore resource management: a machine learning approach. IEEE Micro 29(5). doi:10.1109/MM.2009.77

  28. Barnes BJ, Rountree B, Lowenthal DK, Reeves J, de Supinski B, Schulz M (2008) A regression-based approach to scalability prediction. In: Proceedings of the 22nd annual international conference on supercomputing, ICS ’08. ACM, New York. doi:10.1145/1375527.1375580

  29. Ipek E, de Supinski B, Schulz M, McKee S (2005) An approach to performance prediction for parallel applications. In: Euro-Par 2005 parallel processing, vol. 3648 of Lecture notes in computer science. Springer, Berlin. doi:10.1007/11549468_24

  30. Lee BC, Brooks DM, de Supinski BR, Schulz M, Singh K, McKee SA (2007) Methods of inference and learning for performance modeling of parallel applications. In: Proceedings of the 12th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP ’07. ACM, New York. doi:10.1145/1229428.1229479

  31. Ipek E, McKee SA, Caruana R, de Supinski BR, Schulz M (2006) Efficiently exploring architectural design spaces via predictive modeling. In: Proceedings of the 12th international conference on architectural support for programming languages and operating systems, ASPLOS-XII. ACM, New York. doi:10.1145/1168857.1168882

  32. Duan R, Nadeem F, Wang J, Zhang Y, Prodan R, Fahringer T (2009) A hybrid intelligent method for performance modeling and prediction of workflow activities in grids. In: Proceedings of the 2009 9th IEEE/ACM international symposium on cluster computing and the grid, CCGRID ’09, IEEE Computer Society, Washington. doi:10.1109/CCGRID.2009.58

  33. Zhai J, Chen W, Zheng W (2010) Phantom: predicting performance of parallel applications on large-scale parallel machines using a single node. In: Proceedings of the 15th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP ’10. ACM, New York. doi:10.1145/1693453.1693493

  34. Suleman MA, Qureshi MK, Patt YN (2008) Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs. In: Proceedings of the 13th international conference on architectural support for programming languages and operating systems, ASPLOS XIII. ACM, New York. doi:10.1145/1346281.1346317

  35. Moseley T, Grunwald D, Kihm JL, Connors DA. Methods for modeling resource contention on simultaneous multithreading processors. In: International conference on computer design. doi:10.1109/ICCD.2005.74

  36. Vengerov D (2005) Adaptive utility-based scheduling in resource-constrained systems. In: Zhang S, Jarvis R (eds) AI 2005: advances in artificial intelligence, vol. 3809 of Lecture notes in computer science. Springer, Berlin

  37. Pusukuri K, Gupta R, Bhuyan L (2011) Thread reinforcer: dynamically determining number of threads via OS level monitoring. In: 2011 IEEE international symposium on workload characterization (IISWC). doi:10.1109/IISWC.2011.6114208

  38. Grewe D, Wang Z, O’Boyle MFP (2011) A workload-aware mapping approach for data-parallel programs. In: Proceedings of the 6th international conference on high performance and embedded architectures and compilers, HiPEAC ’11. ACM, New York. doi:10.1145/1944862.1944881

  39. De P, Kothari R, Mann V (2007) Identifying sources of operating system jitter through fine-grained kernel instrumentation. In: 2007 IEEE international conference on cluster computing. doi:10.1109/CLUSTR.2007.4629247

  40. De P, Mann V, Mittal U (2009) Handling OS jitter on multicore multithreaded systems. In: IEEE international symposium on parallel distributed processing, 2009. IPDPS 2009. doi:10.1109/IPDPS.2009.5161046

  41. Nataraj A, Morris A, Malony AD, Sottile M, Beckman P (2007) The ghost in the machine: observing the effects of kernel operation on parallel application performance. In: Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC ’07. ACM, New York. doi:10.1145/1362622.1362662

  42. Shen K (2010) Request behavior variations. In: Proceedings of the 15th edition of ASPLOS on architectural support for programming languages and operating systems. ASPLOS ’10. ACM, New York. doi:10.1145/1736020.1736034

  43. Constantinou T, Sazeides Y, Michaud P, Fetis D, Seznec A. Performance implications of single thread migration on a chip multi-core. SIGARCH Comput Archit News 33(4). doi:10.1145/1105734.1105745

  44. Teng Q, Sweeney P, Duesterwald E (2009) Understanding the cost of thread migration for multi-threaded Java applications running on a multicore platform. In: IEEE international symposium on performance analysis of systems and software, ISPASS 2009. doi:10.1109/ISPASS.2009.4919644

Author information

Correspondence to Ryan W. Moore.

Additional information

This work extends [1] and includes more than 30% new content: (a) new sampling policies, including a stepwise sampling policy that subsumes the previously published sampling policy, (b) model comparisons using several time budgets, (c) more flexible models through the use of real-number performance plateau settings, (d) the use of a new benchmark, luxrender, (e) more diverse quality of service settings, and (f) experiments involving all combinations of up to four applications executing concurrently, instead of all combinations of up to two applications. This research was supported in part by the National Science Foundation through awards CNS-1012070, CCF-0811295, and CCF-0811352.

About this article

Cite this article

Moore, R.W., Childers, B.R. Building and using application utility models to dynamically choose thread counts. J Supercomput 68, 1184–1213 (2014). https://doi.org/10.1007/s11227-014-1148-3

  • DOI: https://doi.org/10.1007/s11227-014-1148-3
