DOI: 10.1145/2486159.2486175

Locality-aware task management for unstructured parallelism: a quantitative limit study

Published: 23 July 2013

Abstract

As we increase the number of cores on a processor die, the on-chip cache hierarchies that support these cores are growing larger, deeper, and more complex. As a result, non-uniform memory access effects are now prevalent even on a single chip. To reduce execution time and energy consumption, data access locality should be exploited. This is especially important for task-based programming systems, where a scheduler decides when and where on the chip the code segments, i.e., tasks, should execute. Capturing locality for structured task parallelism has been done effectively, but the more difficult case, unstructured parallelism, remains largely unsolved: little quantitative analysis exists to demonstrate the potential of locality-aware scheduling, or to guide future scheduler implementations in the most fruitful direction.
This paper quantifies the potential of locality-aware scheduling for unstructured parallelism on three different many-core processors. Our simulation results of 32-core systems show that locality-aware scheduling can bring up to 2.39x speedup over a randomized schedule, and 2.05x speedup over a state-of-the-art baseline scheduling scheme. At the same time, a locality-aware schedule reduces average energy consumption by 55% and 47%, relative to the random and the baseline schedule, respectively. In addition, our 1024-core simulation results project that these benefits will only increase: compared to 32-core executions, we see up to 1.83x additional locality benefits. To capture this potential in a practical setting, we also perform a detailed scheduler design space exploration to quantify the impact of different scheduling decisions. We further highlight the importance of locality-aware stealing, and demonstrate that a stealing scheme can exploit significant locality while performing load balancing. Over randomized stealing, our proposed scheme shows up to 2.0x speedup for stolen tasks.
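The two mechanisms the abstract emphasizes, placing tasks near their data and stealing from nearby cores first, can be illustrated with a small sketch. The C++ below is not the paper's scheduler; it is a minimal single-threaded sketch assuming each task carries a hint naming the cache domain (e.g., a group of cores sharing an L2) that likely holds its data. All names here (Task, LocalityScheduler, domain_hint) are hypothetical.

```cpp
// Minimal sketch of locality-aware task queuing plus locality-aware stealing.
// Assumption: tasks carry a "domain hint" identifying the cache domain where
// their working set was last touched. Not thread-safe; a real runtime would
// use concurrent deques with the usual owner-LIFO / thief-FIFO discipline.
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

struct Task {
    std::function<void()> run;
    std::size_t domain_hint;  // cache domain likely holding the task's data
};

class LocalityScheduler {
public:
    explicit LocalityScheduler(std::size_t num_domains) : queues_(num_domains) {}

    // Enqueue a task on the queue of its preferred cache domain,
    // so it tends to execute near its data.
    void submit(Task t) {
        queues_[t.domain_hint % queues_.size()].push_back(std::move(t));
    }

    // A worker in `domain` first drains its own queue; when empty, it steals
    // from the nearest non-empty domain rather than a random one, trading a
    // little load balance for lower data-movement cost.
    bool try_pop(std::size_t domain, Task& out) {
        if (pop_from(domain, out)) return true;
        for (std::size_t dist = 1; dist < queues_.size(); ++dist) {
            // Probe neighbors in increasing "distance"; a real runtime would
            // derive this order from the chip's actual cache topology.
            std::size_t right = (domain + dist) % queues_.size();
            std::size_t left  = (domain + queues_.size() - dist) % queues_.size();
            if (pop_from(right, out) || pop_from(left, out)) return true;
        }
        return false;  // no work anywhere
    }

private:
    bool pop_from(std::size_t d, Task& out) {
        if (queues_[d].empty()) return false;
        out = std::move(queues_[d].front());
        queues_[d].pop_front();
        return true;
    }

    std::vector<std::deque<Task>> queues_;  // one queue per cache domain
};
```

On a real chip the probe order would follow the cache hierarchy (cores sharing an L2 before those sharing only the last-level cache), which is what makes stolen tasks cheaper than under randomized stealing.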



Published In

SPAA '13: Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
July 2013
348 pages
ISBN:9781450315722
DOI:10.1145/2486159

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. energy
  2. locality
  3. performance
  4. task scheduling
  5. task stealing

Qualifiers

  • Research-article

Conference

SPAA '13

Acceptance Rates

SPAA '13 paper acceptance rate: 31 of 130 submissions (24%)
Overall acceptance rate: 447 of 1,461 submissions (31%)



Cited By

  • (2024) Enabling HW-Based Task Scheduling in Large Multicore Architectures. IEEE Transactions on Computers 73(1):138-151. DOI: 10.1109/TC.2023.3323781
  • (2023) Taming data locality for task scheduling under memory constraint in runtime systems. Future Generation Computer Systems 143(C):305-321. DOI: 10.1016/j.future.2023.01.024
  • (2022) Data Locality in High Performance Computing, Big Data, and Converged Systems: An Analysis of the Cutting Edge and a Future System Architecture. Electronics 12(1):53. DOI: 10.3390/electronics12010053
  • (2022) TaskStream: accelerating task-parallel workloads by recovering program structure. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1-13. DOI: 10.1145/3503222.3507706
  • (2022) Memory-Aware Scheduling of Tasks Sharing Data on Multiple GPUs with Dynamic Runtime Systems. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 694-704. DOI: 10.1109/IPDPS53621.2022.00073
  • (2022) Locality-Aware Scheduling of Independent Tasks for Runtime Systems. Euro-Par 2021: Parallel Processing Workshops, pages 5-16. DOI: 10.1007/978-3-031-06156-1_1
  • (2021) PAVER. ACM Transactions on Architecture and Code Optimization 18(3):1-26. DOI: 10.1145/3451164
  • (2020) Fidelity and Performance of State Fast-forwarding in Microscopic Traffic Simulations. ACM Transactions on Modeling and Computer Simulation 30(2):1-26. DOI: 10.1145/3366019
  • (2020) An Adaptive Persistence and Work-stealing Combined Algorithm for Load Balancing on Parallel Discrete Event Simulation. ACM Transactions on Modeling and Computer Simulation 30(2):1-26. DOI: 10.1145/3364218
  • (2020) CAB-MPI: Exploring Interprocess Work-Stealing towards Balanced MPI Communication. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1109/SC41405.2020.00040
