Research article
DOI: 10.1145/2254064.2254082

Parcae: a system for flexible parallel execution

Published: 11 June 2012

Abstract

Workload, platform, and available resources constitute a parallel program's execution environment. Most parallelization efforts statically target an anticipated range of environments, but performance generally degrades outside that range. Existing approaches address this problem with dynamic tuning but do not optimize a multiprogrammed system holistically. Further, they either require manual programming effort or are limited to array-based data-parallel programs.
This paper presents Parcae, a generally applicable automatic system for platform-wide dynamic tuning. Parcae includes (i) the Nona compiler, which creates flexible parallel programs whose tasks can be efficiently reconfigured during execution; (ii) the Decima monitor, which measures resource availability and system performance to detect change in the environment; and (iii) the Morta executor, which cuts short the life of executing tasks, replacing them with other functionally equivalent tasks better suited to the current environment. Parallel programs made flexible by Parcae outperform original parallel implementations in many interesting scenarios.
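The monitor-and-replace idea in the abstract can be illustrated with a small sketch. This is not Parcae's implementation or API: the names `Variant`, `choose_variant`, `execute`, and `VARIANTS` are hypothetical, the "monitor" is reduced to a list of made-up core-availability readings, and variants are swapped only at chunk boundaries rather than by interrupting running tasks. It only shows the general pattern of keeping several functionally equivalent versions of a task and picking the one that suits the current environment.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


def square(x: int) -> int:
    """Stand-in for the real per-item work."""
    return x * x


def run_sequential(chunk: Sequence[int]) -> List[int]:
    return [square(x) for x in chunk]


def make_parallel(workers: int) -> Callable[[Sequence[int]], List[int]]:
    """Build a thread-pool version that computes the same result."""
    def run(chunk: Sequence[int]) -> List[int]:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(square, chunk))  # map preserves input order
    return run


@dataclass(frozen=True)
class Variant:
    """One of several functionally equivalent task implementations (hypothetical)."""
    name: str
    cores_needed: int
    run: Callable[[Sequence[int]], List[int]]


def choose_variant(variants: Sequence[Variant], free_cores: int) -> Variant:
    """Pick the widest variant that fits; fall back to the narrowest one."""
    fitting = [v for v in variants if v.cores_needed <= free_cores]
    if not fitting:
        return min(variants, key=lambda v: v.cores_needed)
    return max(fitting, key=lambda v: v.cores_needed)


def execute(work: Sequence[int], variants: Sequence[Variant],
            core_readings: Sequence[int],
            chunk_size: int = 2) -> Tuple[List[int], List[str]]:
    """Process work in chunks, re-choosing the variant before each chunk.

    core_readings (assumed non-empty) plays the role of a monitor: entry i is
    the number of free cores observed before chunk i; the last reading is
    reused if the work outlasts the readings.
    """
    results: List[int] = []
    schedule: List[str] = []
    for step, start in enumerate(range(0, len(work), chunk_size)):
        free_cores = core_readings[min(step, len(core_readings) - 1)]
        variant = choose_variant(variants, free_cores)
        schedule.append(variant.name)
        results.extend(variant.run(work[start:start + chunk_size]))
    return results, schedule


VARIANTS = [
    Variant("seq", 1, run_sequential),
    Variant("par4", 4, make_parallel(4)),
]
```

For example, `execute([1, 2, 3, 4, 5, 6], VARIANTS, [4, 1, 4])` runs the first and last chunks with the four-worker variant and the middle chunk sequentially, while the combined result is the same as a purely sequential run.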


Published In

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2012
572 pages
ISBN: 9781450312059
DOI: 10.1145/2254064

ACM SIGPLAN Notices, Volume 47, Issue 6 (PLDI '12)
June 2012
534 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2345156

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptivity
  2. automatic parallelization
  3. code generation
  4. compiler
  5. flexible
  6. multicore
  7. parallel
  8. performance portability
  9. run-time
  10. tuning

Acceptance Rates

PLDI '12 Paper Acceptance Rate: 48 of 255 submissions, 19%
Overall Acceptance Rate: 406 of 2,067 submissions, 20%
