
Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems

The Journal of Supercomputing

Abstract

To harness the compute resources of many-core systems with tens to hundreds of cores, applications must expose parallelism to the hardware. Researchers are actively seeking program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel ‘tasks’ and allow an underlying system layer to schedule these tasks across threads. Software-only schedulers can implement a variety of scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, on large-scale multi-core systems, software schedulers suffer significant overheads as they synchronize and communicate task information across deep cache hierarchies. To reduce these overheads, hardware-only schedulers such as Carbon have been proposed, which perform task queuing and scheduling in hardware. This paper presents a hardware scheduling approach in which the structure that task-based programming models give to programs is incorporated into the scheduler, making it aware of each task’s data requirements. This prior knowledge of a task’s data requirements allows the scheduler to place tasks better, reducing overall cache misses and memory traffic and thereby improving the program’s performance and power utilization. Simulations of this technique on a range of synthetic benchmarks and components of real applications show a reduction in cache misses of up to 72 % for the L1 cache and 95 % for the L2 cache, and up to a 30 % improvement in overall execution time compared with FIFO scheduling. This yields not only faster execution but also up to 50 % less data transfer, reducing load on the interconnect and lowering power consumption.
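The core idea, queuing a task near the data it will consume, can be illustrated with a small software sketch. The Python below is purely illustrative (the paper's scheduler is implemented in hardware, not software): the names `Task`, `LocalityAwareScheduler`, and `data_region` are hypothetical, and the placement heuristic (send a task to the core that last touched its data region, otherwise to the shortest queue) is a simplified stand-in for the data-aware placement the abstract describes, not the authors' actual mechanism.

```python
from collections import deque

class Task:
    def __init__(self, name, data_region):
        self.name = name
        # Stand-in for the task's data requirements, e.g. the base
        # address of its input data (hypothetical field).
        self.data_region = data_region

class LocalityAwareScheduler:
    """Toy data-aware placement: contrast with FIFO, which would hand
    tasks to cores in arrival order regardless of cache contents."""

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]
        self.last_core_for_region = {}  # data region -> core that last used it

    def submit(self, task):
        # Prefer the core whose cache most likely still holds this
        # region (it ran the last task that touched it); otherwise
        # fall back to the shortest queue for load balance.
        core = self.last_core_for_region.get(task.data_region)
        if core is None:
            core = min(range(len(self.queues)),
                       key=lambda c: len(self.queues[c]))
        self.queues[core].append(task)
        self.last_core_for_region[task.data_region] = core

    def pop(self, core):
        return self.queues[core].popleft() if self.queues[core] else None

scheduler = LocalityAwareScheduler(num_cores=4)
for name, region in [("t0", 0x1000), ("t1", 0x2000), ("t2", 0x1000)]:
    scheduler.submit(Task(name, region))
# t0 and t2 share region 0x1000, so they land on the same core's
# queue and the second one finds its data already cached there.
```

Under FIFO the three tasks would be spread across cores in arrival order; here the shared-region tasks co-locate, which is the source of the cache-miss reductions the simulations measure.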


References

  1. Dagum L, Menon R (1998) OpenMP: an industry-standard API for shared-memory programming. In: IEEE Computational Science & Engineering, vol 5. IEEE Computer Society Press, Los Alamitos. doi:10.1109/99.660313

  2. Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. In: Proceedings of the 5th ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP ’95. ACM, New York. doi:10.1145/209936.209958

  3. Reinders J (2007) Intel threading building blocks, 1st edn. O’Reilly & Associates Inc, Sebastopol


  4. Nickolls J, Buck I, Garland M, Skadron K (2008) Scalable parallel programming with CUDA. In: Queue, vol 6. ACM, New York. doi:10.1145/1365490.1365500

  5. Stone JE, Gohara D, Shi G (2010) OpenCL: a parallel programming standard for heterogeneous computing systems. In: Computing in Science & Engineering, vol 12. IEEE Computer Society Press, Los Alamitos. doi:10.1109/MCSE.2010.69

  6. Thies W, Karczmarek M, Amarasinghe SP (2002) StreamIt: a language for streaming applications. In: Proceedings of the 11th international conference on compiler construction, CC ’02. Springer-Verlag, London. http://dl.acm.org/citation.cfm?id=647478.727935

  7. Jenista JC, Eom YH, Demsky B (2010) OoOJava: an out-of-order approach to parallel programming. In: Proceedings of the 2nd USENIX conference on hot topics in parallelism, HotPar’10. USENIX Association, Berkeley. http://dl.acm.org/citation.cfm?id=1863086.1863097

  8. Perez JM, Badia RM, Labarta J (2008) A dependency-aware task-based programming environment for multi-core architectures. In: Proceedings of the 2008 IEEE international conference on cluster computing

  9. Watson I et al (2010) The TERAFLUX project. http://www.teraflux.org. Accessed 1 Jan 2015

  10. Gurd JR, Kirkham CC, Watson I (1985) The Manchester prototype dataflow computer. In: Communications of the ACM, vol 28. ACM, New York. doi:10.1145/2465.2468

  11. Papadopoulos GM, Culler DE (1990) Monsoon: an explicit token-store architecture. In: Proceedings of the 17th annual international symposium on computer architecture, ISCA ’90. ACM, New York. doi:10.1145/325164.325117

  12. Cann D (1992) Retire Fortran?: a debate rekindled. In: Communications of the ACM, vol 35. ACM, New York. doi:10.1145/135226.135231

  13. Watson I, Woods V, Watson P, Banach R, Greenberg M, Sargeant J (1988) Flagship: a parallel architecture for declarative programming. In: Proceedings of the 15th annual international symposium on computer architecture, ISCA ’88. IEEE Computer Society Press, Los Alamitos. http://dl.acm.org/citation.cfm?id=52400.52415

  14. Darlington J, Reeve M (1981) ALICE a multi-processor reduction machine for the parallel evaluation CF applicative languages. In: Proceedings of the 1981 conference on functional programming languages and computer architecture, FPCA ’81. ACM, New York. doi:10.1145/800223.806764

  15. Peyton Jones SL, Clack C, Salkild J, Hardie M (1987) GRIP: a high-performance architecture for parallel graph reduction. In: Proceedings of a conference on functional programming languages and computer architecture. Springer-Verlag, London. http://dl.acm.org/citation.cfm?id=36583.36590

  16. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. In: Communications of the ACM, vol 51. ACM, New York. doi:10.1145/1327452.1327492

  17. Peng D, Dabek F (2010) Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the 9th USENIX conference on operating systems design and implementation, OSDI’10. USENIX Association, Berkeley. http://dl.acm.org/citation.cfm?id=1924943.1924961

  18. Goodman D, Khan S, Seaton C, Guskov Y, Khan B, Lujan M, Watson I (2012) DFScala: high level dataflow support for Scala. In: Proceedings of the data-flow execution models for extreme scale computing

  19. Odersky M, Spoon L, Venners B (2008) Programming in Scala: a comprehensive step-by-step guide, 1st edn. Artima Incorporation, USA


  20. Roberts ES, Vandevoorde MT (1989) WorkCrews: an abstraction for controlling parallelism, vol 42. http://opac.inria.fr/record=b1047311

  21. Mohr E, Kranz DA, Halstead Jr RH (1990) Lazy task creation: a technique for increasing the granularity of parallel programs. In: Proceedings of the 1990 ACM conference on LISP and functional programming, LFP ’90. ACM, New York. doi:10.1145/91556.91631

  22. Hendler D, Shavit N (2002) Non-blocking steal-half work queues. In: Proceedings of the 21st annual symposium on principles of distributed computing, PODC ’02. ACM, New York. doi:10.1145/571825.571876

  23. Chase D, Lev Y (2005) Dynamic circular work-stealing deque. In: Proceedings of the 17th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’05. ACM, New York. doi:10.1145/1073970.1073974

  24. Acar UA, Blelloch GE, Blumofe RD (2000) The data locality of work stealing. In: Proceedings of the 12th annual ACM symposium on parallel algorithms and architectures, SPAA ’00. ACM, New York. doi:10.1145/341800.341801

  25. Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. In: Communications of the ACM, vol 13. ACM, New York. doi:10.1145/362686.362692

  26. Kumar S, Hughes CJ, Nguyen A (2007) Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In: Proceedings of the 34th annual international symposium on computer architecture, ISCA ’07. ACM, New York. doi:10.1145/1250662.1250683

  27. Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The Gem5 simulator. In: SIGARCH computer architecture news, vol 39. ACM, New York. doi:10.1145/2024716.2024718

  28. Horn B, Schunck B (1981) Determining optical flow. In: Artificial intelligence, vol 17. Elsevier, London

  29. Project Gutenberg (1971). http://www.gutenberg.org/

  30. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell


  31. Lea D (2000) A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande

  32. Halstead Jr RH (1984) Implementation of Multilisp: Lisp on a multiprocessor. In: Proceedings of the 1984 ACM symposium on LISP and functional programming, LFP ’84. ACM, New York. doi:10.1145/800055.802017

  33. Kwok YK, Ahmad I (1999) Static scheduling algorithms for allocating directed task graphs to multiprocessors. In: ACM Computing Surveys, vol 31. ACM, New York. doi:10.1145/344588.344618

  34. Su E, Tian X, Girkar M, Haab G, Shah S, Petersen P (2002) Compiler support of the workqueuing execution model for Intel SMP architectures. In: 4th European workshop on OpenMP

  35. Arora NS, Blumofe RD, Plaxton CG (1998) Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of the 10th annual ACM symposium on parallel algorithms and architectures, SPAA ’98. ACM, New York. doi:10.1145/277651.277678

  36. Sanchez D, Yoo RM, Kozyrakis C (2010) Flexible architectural support for fine-grain scheduling. In: Proceedings of the 15th edition of ASPLOS on architectural support for programming languages and operating systems, ASPLOS XV. ACM, New York. doi:10.1145/1736020.1736055

  37. Dally W, Towles B (2003) Principles and practices of interconnection networks. Morgan Kaufmann Publishers Inc., San Francisco


  38. Yoo RM, Hughes CJ, Kim C, Chen YK, Kozyrakis C (2013) Locality-aware task management for unstructured parallelism: a quantitative limit study. In: Proceedings of the 25th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’13. ACM, New York. doi:10.1145/2486159.2486175

  39. Chen S, Gibbons PB, Kozuch M, Liaskovitis V, Ailamaki A, Blelloch GE, Falsafi B, Fix L, Hardavellas N, Mowry TC, Wilkerson C (2007) Scheduling threads for constructive cache sharing on CMPs. In: Proceedings of the 19th annual ACM symposium on parallel algorithms and architectures, SPAA ’07. ACM, New York. doi:10.1145/1248377.1248396

  40. Blelloch GE, Gibbons PB (2004) Effectively sharing a cache among threads. In: Proceedings of the 16th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’04. ACM, New York. doi:10.1145/1007912.1007948

  41. Blelloch GE, Gibbons PB, Matias Y (1999) Provably efficient scheduling for languages with fine-grained parallelism. In: Journal of the ACM, vol 46. ACM, New York. doi:10.1145/301970.301974

Download references

Acknowledgments

The authors would like to thank the European Community’s Seventh Framework Programme (FP7/2007-2013) for funding this work under grant agreement no. 249013 (TERAFLUX-project). Dr. Luján is supported by a Royal Society University Research Fellowship.

Author information


Corresponding author

Correspondence to Behram Khan.


About this article


Cite this article

Khan, B., Goodman, D., Khan, S. et al. Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems. J Supercomput 71, 2309–2338 (2015). https://doi.org/10.1007/s11227-015-1383-2

