research-article

Composing parallel software efficiently with Lithe

Authors:

Benjamin Hindman,

Krste Asanović

PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 376 - 387

https://doi.org/10.1145/1806596.1806639

Published: 05 June 2010

Abstract

Applications composed of multiple parallel libraries perform poorly when those libraries interfere with one another by obliviously using the same physical cores, leading to destructive resource oversubscription. This paper presents the design and implementation of Lithe, a low-level substrate that provides the basic primitives and a standard interface for composing parallel codes efficiently. Lithe can be inserted underneath the runtimes of legacy parallel libraries to provide bolt-on composability without needing to change existing application code. Lithe can also serve as the foundation for building new parallel abstractions and libraries that automatically interoperate with one another.

In this paper, we show that versions of Threading Building Blocks (TBB) and OpenMP ported to Lithe perform competitively with their original implementations. Furthermore, for two applications composed of multiple parallel libraries, we show that leveraging our substrate outperforms their original, even expertly tuned, implementations.
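
The oversubscription problem described in the abstract can be illustrated with a little arithmetic. The sketch below is not Lithe's interface. It is a minimal model, assuming each nested library sizes its own thread pool independently and that a calling worker takes part in the inner pool it invokes. All function names are hypothetical and chosen only for this illustration.

```python
from math import prod


def oblivious_thread_count(pool_sizes):
    """Worst-case runnable threads when nested libraries size their pools
    independently (assumption: every worker of an outer library calls the
    inner library, which then runs its full pool on that worker's behalf)."""
    return prod(pool_sizes)


def oversubscription_factor(pool_sizes, cores):
    """Runnable threads per physical core. A value above 1.0 means the OS
    has to time-multiplex threads onto the cores."""
    return oblivious_thread_count(pool_sizes) / cores


def shared_budget_split(cores, outer_tasks):
    """Hypothetical cooperative alternative: a single budget of `cores` is
    divided among the outer tasks, so the total never exceeds the budget."""
    base, extra = divmod(cores, outer_tasks)
    return [base + (1 if i < extra else 0) for i in range(outer_tasks)]


if __name__ == "__main__":
    cores = 8
    # Two libraries, each defaulting to one thread per core.
    print(oblivious_thread_count([cores, cores]))
    print(oversubscription_factor([cores, cores], cores))
    # The same 8 cores shared explicitly among 3 outer tasks.
    print(shared_budget_split(cores, 3))
```

Under these assumptions, two libraries that each default to one thread per core put 64 runnable threads on 8 cores, whereas an explicit shared budget keeps the total at the core count. The shared-budget function only models the outcome of cooperative allocation and says nothing about how a substrate such as Lithe actually negotiates it.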

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Published In

PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2010

514 pages

ISBN:9781450300193

DOI:10.1145/1806596

General Chair: Ben Zorn, Microsoft Research
Program Chair: Alex Aiken, Stanford University

ACM SIGPLAN Notices Volume 45, Issue 6
PLDI '10
June 2010
496 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1809028

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '10

Sponsor:

SIGPLAN

PLDI '10: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 5 - 10, 2010

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%
