Research article
DOI: 10.1145/2304576.2304599

CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures

Published: 25 June 2012

ABSTRACT

Multi-socket multi-core architectures, in which the cores of each socket share a cache, have become mainstream for cases where a single multi-core chip cannot provide enough computing capacity for high-performance computing. Traditional task-stealing schedulers, however, choose their victims at random, which tends to pollute the shared cache and cause severe cache misses. To address this problem, this paper proposes a Cache Aware Task-Stealing (CATS) scheduler, which uses an online profiling method to exploit the shared cache efficiently and schedules tasks that share data to the same socket. CATS adopts an online DAG partitioner, driven by the profiling information, to ensure that tasks with shared data can make efficient use of the shared cache. A notable feature of CATS is that it requires no extra user-provided information. Experimental results show that CATS improves the performance of memory-bound programs by up to 74.4% compared with a traditional task-stealing scheduler.
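The abstract does not spell out how the DAG partitioner works, so the following is only a minimal sketch of the general idea, not the CATS algorithm itself. It assumes the task graph is a tree, that profiling yields a per-task data footprint in bytes, and that a subtree's footprint can be approximated by summing its tasks' footprints (which overcounts data shared between tasks). The names `Task`, `subtree_footprint`, `partition`, and `assign_to_sockets` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    footprint: int  # assumed: bytes touched by this task, from profiling
    children: list["Task"] = field(default_factory=list)


def subtree_footprint(task: Task) -> int:
    """Approximate footprint of a subtree as the sum over its tasks."""
    return task.footprint + sum(subtree_footprint(c) for c in task.children)


def partition(root: Task, cache_bytes: int) -> list[Task]:
    """Return the roots of the largest subtrees that fit in the shared cache.

    A subtree that is too large is split into its children's partitions;
    its own root task is then left to the ordinary scheduler. A leaf that
    is too large on its own is still returned as a single-task group.
    """
    if subtree_footprint(root) <= cache_bytes or not root.children:
        return [root]
    groups: list[Task] = []
    for child in root.children:
        groups.extend(partition(child, cache_bytes))
    return groups


def assign_to_sockets(groups: list[Task], num_sockets: int) -> dict[str, int]:
    """Map each group to a socket round-robin (a placeholder policy)."""
    return {g.name: i % num_sockets for i, g in enumerate(groups)}


if __name__ == "__main__":
    a = Task("a", 1, [Task("a1", 3), Task("a2", 3)])
    b = Task("b", 1, [Task("b1", 4), Task("b2", 4)])
    root = Task("root", 0, [a, b])
    groups = partition(root, cache_bytes=8)
    print([g.name for g in groups])
    print(assign_to_sockets(groups, num_sockets=2))
```

In this sketch, all tasks inside one returned subtree would run on the cores of a single socket, so the data they share stays in that socket's cache; stealing across sockets would then move whole groups rather than arbitrary tasks.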


Published in

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
June 2012
400 pages
ISBN: 9781450313162
DOI: 10.1145/2304576

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall acceptance rate: 584 of 2,055 submissions, 28%
