
Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems

The Journal of Supercomputing

Abstract

To harness the compute resources of many-core systems with tens to hundreds of cores, applications must expose parallelism to the hardware. Researchers are actively seeking program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel ‘tasks’ and allow an underlying system layer to schedule these tasks across threads. Software-only schedulers can implement a variety of scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, on large-scale multi-core systems, software schedulers suffer significant overheads as they synchronize and communicate task information across deep cache hierarchies. To reduce these overheads, hardware-only schedulers such as Carbon have been proposed, which perform task queuing and scheduling in hardware. This paper presents a hardware scheduling approach in which the structure that task-based programming models give to programs is incorporated into the scheduler, making it aware of each task’s data requirements. This prior knowledge of a task’s data requirements allows the scheduler to place tasks better, reducing overall cache misses and memory traffic and thereby improving the program’s performance and power utilization. Simulations of this technique on a range of synthetic benchmarks and components of real applications show a reduction in cache misses of up to 72 % for the L1 cache and 95 % for the L2 cache, and up to a 30 % improvement in overall execution time compared with FIFO scheduling. This yields not only faster execution but also up to 50 % less data transfer, reducing load on the interconnect and lowering power consumption.
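The core idea, queuing a task near the data it will consume, can be illustrated with a small software sketch. The Python below is purely illustrative (the paper's scheduler is implemented in hardware, not software): the names `Task`, `LocalityAwareScheduler`, and `data_region` are hypothetical, and the placement heuristic (send a task to the core that last touched its data region, otherwise to the shortest queue) is a simplified stand-in for the data-aware placement the abstract describes, not the authors' actual mechanism.

```python
from collections import deque

class Task:
    def __init__(self, name, data_region):
        self.name = name
        # Stand-in for the task's data requirements, e.g. the base
        # address of its input data (hypothetical field).
        self.data_region = data_region

class LocalityAwareScheduler:
    """Toy data-aware placement: contrast with FIFO, which would hand
    tasks to cores in arrival order regardless of cache contents."""

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]
        self.last_core_for_region = {}  # data region -> core that last used it

    def submit(self, task):
        # Prefer the core whose cache most likely still holds this
        # region (it ran the last task that touched it); otherwise
        # fall back to the shortest queue for load balance.
        core = self.last_core_for_region.get(task.data_region)
        if core is None:
            core = min(range(len(self.queues)),
                       key=lambda c: len(self.queues[c]))
        self.queues[core].append(task)
        self.last_core_for_region[task.data_region] = core

    def pop(self, core):
        return self.queues[core].popleft() if self.queues[core] else None

scheduler = LocalityAwareScheduler(num_cores=4)
for name, region in [("t0", 0x1000), ("t1", 0x2000), ("t2", 0x1000)]:
    scheduler.submit(Task(name, region))
# t0 and t2 share region 0x1000, so they land on the same core's
# queue and the second one finds its data already cached there.
```

Under FIFO the three tasks would be spread across cores in arrival order; here the shared-region tasks co-locate, which is the source of the cache-miss reductions the simulations measure.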


References

  1. Dagum L, Menon R (1998) OpenMP: an industry-standard API for shared-memory programming. In: IEEE Computational Science & Engineering, vol 5. IEEE Computer Society Press, Los Alamitos. doi:10.1109/99.660313

  2. Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. In: Proceedings of the 5th ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP ’95. ACM, New York. doi:10.1145/209936.209958

  3. Reinders J (2007) Intel threading building blocks, 1st edn. O’Reilly & Associates Inc, Sebastopol


  4. Nickolls J, Buck I, Garland M, Skadron K (2008) Scalable parallel programming with CUDA. In: Queue, vol 6. ACM, New York. doi:10.1145/1365490.1365500

  5. Stone JE, Gohara D, Shi G (2010) OpenCL: a parallel programming standard for heterogeneous computing systems. In: Computing in Science & Engineering, vol 12. IEEE Computer Society Press, Los Alamitos. doi:10.1109/MCSE.2010.69

  6. Thies W, Karczmarek M, Amarasinghe SP (2002) StreamIt: a language for streaming applications. In: Proceedings of the 11th international conference on compiler construction, CC ’02. Springer-Verlag, London. http://dl.acm.org/citation.cfm?id=647478.727935

  7. Jenista JC, Eom YH, Demsky B (2010) OoOJava: an out-of-order approach to parallel programming. In: Proceedings of the 2nd USENIX conference on hot topics in parallelism, HotPar’10. USENIX Association, Berkeley. http://dl.acm.org/citation.cfm?id=1863086.1863097

  8. Perez JM, Badia RM, Labarta J (2008) A dependency-aware task-based programming environment for multi-core architectures. In: Proceedings of the 2008 IEEE international conference on cluster computing

  9. Watson I et al (2010) The TERAFLUX project. http://www.teraflux.org. Accessed 1 Jan 2015

  10. Gurd JR, Kirkham CC, Watson I (1985) The Manchester prototype dataflow computer. In: Communications of the ACM, vol 28. ACM, New York. doi:10.1145/2465.2468

  11. Papadopoulos GM, Culler DE (1990) Monsoon: an explicit token-store architecture. In: Proceedings of the 17th annual international symposium on computer architecture, ISCA ’90. ACM, New York. doi:10.1145/325164.325117

  12. Cann D (1992) Retire Fortran?: a debate rekindled. In: Communications of the ACM, vol 35. ACM, New York. doi:10.1145/135226.135231

  13. Watson I, Woods V, Watson P, Banach R, Greenberg M, Sargeant J (1988) Flagship: a parallel architecture for declarative programming. In: Proceedings of the 15th annual international symposium on computer architecture, ISCA ’88. IEEE Computer Society Press, Los Alamitos. http://dl.acm.org/citation.cfm?id=52400.52415

  14. Darlington J, Reeve M (1981) ALICE a multi-processor reduction machine for the parallel evaluation CF applicative languages. In: Proceedings of the 1981 conference on functional programming languages and computer architecture, FPCA ’81. ACM, New York. doi:10.1145/800223.806764

  15. Peyton Jones SL, Clack C, Salkild J, Hardie M (1987) GRIP: a high-performance architecture for parallel graph reduction. In: Proceedings of a conference on functional programming languages and computer architecture. Springer-Verlag, London. http://dl.acm.org/citation.cfm?id=36583.36590

  16. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. In: Communications of the ACM, vol 51. ACM, New York. doi:10.1145/1327452.1327492

  17. Peng D, Dabek F (2010) Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the 9th USENIX conference on operating systems design and implementation, OSDI’10. USENIX Association, Berkeley. http://dl.acm.org/citation.cfm?id=1924943.1924961

  18. Goodman D, Khan S, Seaton C, Guskov Y, Khan B, Lujan M, Watson I (2012) DFScala: high level dataflow support for Scala. In: Proceedings of the data-flow execution models for extreme scale computing

  19. Odersky M, Spoon L, Venners B (2008) Programming in Scala: a comprehensive step-by-step guide, 1st edn. Artima Incorporation, USA


  20. Roberts ES, Vandevoorde MT (1989) WorkCrews: an abstraction for controlling parallelism, vol 42. http://opac.inria.fr/record=b1047311

  21. Mohr E, Kranz DA, Halstead Jr RH (1990) Lazy task creation: a technique for increasing the granularity of parallel programs. In: Proceedings of the 1990 ACM conference on LISP and functional programming, LFP ’90. ACM, New York. doi:10.1145/91556.91631

  22. Hendler D, Shavit N (2002) Non-blocking steal-half work queues. In: Proceedings of the 21st annual symposium on principles of distributed computing, PODC ’02. ACM, New York. doi:10.1145/571825.571876

  23. Chase D, Lev Y (2005) Dynamic circular work-stealing deque. In: Proceedings of the 17th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’05. ACM, New York. doi:10.1145/1073970.1073974

  24. Acar UA, Blelloch GE, Blumofe RD (2000) The data locality of work stealing. In: Proceedings of the 12th annual ACM symposium on parallel algorithms and architectures, SPAA ’00. ACM, New York. doi:10.1145/341800.341801

  25. Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. In: Communications of the ACM, vol 13. ACM, New York. doi:10.1145/362686.362692

  26. Kumar S, Hughes CJ, Nguyen A (2007) Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In: Proceedings of the 34th annual international symposium on computer architecture, ISCA ’07. ACM, New York. doi:10.1145/1250662.1250683

  27. Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The Gem5 simulator. In: SIGARCH computer architecture news, vol 39. ACM, New York. doi:10.1145/2024716.2024718

  28. Horn B, Schunck B (1981) Determining optical flow. In: Artificial intelligence, vol 17. Elsevier, London

  29. Project Gutenberg (1971). http://www.gutenberg.org/

  30. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell


  31. Lea D (2000) A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande

  32. Halstead Jr RH (1984) Implementation of Multilisp: Lisp on a multiprocessor. In: Proceedings of the 1984 ACM symposium on LISP and functional programming, LFP ’84. ACM, New York. doi:10.1145/800055.802017

  33. Kwok YK, Ahmad I (1999) Static scheduling algorithms for allocating directed task graphs to multiprocessors. In: ACM Computing Surveys, vol 31. ACM, New York. doi:10.1145/344588.344618

  34. Su E, Tian X, Girkar M, Haab G, Shah S, Petersen P (2002) Compiler support of the workqueuing execution model for Intel SMP architectures. In: 4th European workshop on OpenMP

  35. Arora NS, Blumofe RD, Plaxton CG (1998) Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of the 10th annual ACM symposium on parallel algorithms and architectures, SPAA ’98. ACM, New York. doi:10.1145/277651.277678

  36. Sanchez D, Yoo RM, Kozyrakis C (2010) Flexible architectural support for fine-grain scheduling. In: Proceedings of the 15th edition of ASPLOS on architectural support for programming languages and operating systems, ASPLOS XV. ACM, New York. doi:10.1145/1736020.1736055

  37. Dally W, Towles B (2003) Principles and practices of interconnection networks. Morgan Kaufmann Publishers Inc., San Francisco


  38. Yoo RM, Hughes CJ, Kim C, Chen YK, Kozyrakis C (2013) Locality-aware task management for unstructured parallelism: a quantitative limit study. In: Proceedings of the 25th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’13. ACM, New York. doi:10.1145/2486159.2486175

  39. Chen S, Gibbons PB, Kozuch M, Liaskovitis V, Ailamaki A, Blelloch GE, Falsafi B, Fix L, Hardavellas N, Mowry TC, Wilkerson C (2007) Scheduling threads for constructive cache sharing on CMPs. In: Proceedings of the 19th annual ACM symposium on parallel algorithms and architectures, SPAA ’07. ACM, New York. doi:10.1145/1248377.1248396

  40. Blelloch GE, Gibbons PB (2004) Effectively sharing a cache among threads. In: Proceedings of the 16th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’04. ACM, New York. doi:10.1145/1007912.1007948

  41. Blelloch GE, Gibbons PB, Matias Y (1999) Provably efficient scheduling for languages with fine-grained parallelism. In: Journal of the ACM, vol 46. ACM, New York. doi:10.1145/301970.301974

Download references

Acknowledgments

The authors would like to thank the European Community’s Seventh Framework Programme (FP7/2007-2013) for funding this work under grant agreement no. 249013 (TERAFLUX-project). Dr. Luján is supported by a Royal Society University Research Fellowship.

Author information


Corresponding author

Correspondence to Behram Khan.


About this article


Cite this article

Khan, B., Goodman, D., Khan, S. et al. Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems. J Supercomput 71, 2309–2338 (2015). https://doi.org/10.1007/s11227-015-1383-2

