PPoPP is the premier forum for leading work on all aspects of parallel programming, including theoretical foundations, techniques, languages, compilers, runtime systems, tools, and practical experience. Given the rise of parallel architectures in the consumer market (desktops, laptops, and mobile devices) and data centers, we made an effort to attract work that addresses new parallel workloads and issues that arise out of extreme-scale applications or cloud platforms. In addition, we tried to attract techniques and tools that improve parallel programming productivity or work towards improved synergy with such emerging architectures.
Efficient algorithms for persistent transactional memory
Durable techniques coupled with transactional semantics provide application developers with the guarantee that data is saved consistently in persistent memory (PM), even in the event of a non-corrupting failure. Persistence fences and flush instructions ...
Investigating the semantics of futures in transactional memory systems
This paper investigates the problem of integrating two powerful abstractions for concurrent programming, namely futures and transactional memory. Our focus is on specifying the semantics of execution of "transactional futures", i.e., futures that ...
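The paper's "transactional future" semantics are not reproduced here; as general background, plain (non-transactional) futures — one of the two abstractions being combined — can be illustrated with Python's standard library. The function name `expensive` is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def expensive(x):
    # Stand-in for a computation whose result is needed later.
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    fut = pool.submit(expensive, 7)   # starts running asynchronously
    # ... the caller can do other work here ...
    result = fut.result()             # blocks until the value is ready

print(result)  # 49
```

A transactional future additionally runs the future's body with transactional (atomicity and isolation) guarantees relative to its spawning transaction, which is exactly the interaction whose semantics the paper pins down.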
Constant-time snapshots with applications to concurrent data structures
Given a concurrent data structure, we present an approach for efficiently taking snapshots of its constituent CAS objects. More specifically, we support a constant-time operation that returns a snapshot handle. This snapshot handle can later be used to ...
Reasoning about recursive tree traversals
Traversals are commonly seen in tree data structures, and performance-enhancing transformations between tree traversals are critical for many applications. Existing approaches to reasoning about tree traversals and their transformations are ad hoc, with ...
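The paper's reasoning framework is not shown in this snippet; as background, a "performance-enhancing transformation between tree traversals" typically means something like fusing two passes over the same tree into one. A minimal illustrative sketch (all names hypothetical):

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def tree_sum(t):
    # First traversal: sum of values.
    return 0 if t is None else t.val + tree_sum(t.left) + tree_sum(t.right)

def tree_count(t):
    # Second traversal: number of nodes.
    return 0 if t is None else 1 + tree_count(t.left) + tree_count(t.right)

def fused(t):
    """One traversal computing both results -- the kind of transformation
    (traversal fusion) that such frameworks aim to prove correct."""
    if t is None:
        return 0, 0
    ls, lc = fused(t.left)
    rs, rc = fused(t.right)
    return t.val + ls + rs, 1 + lc + rc

t = Node(1, Node(2), Node(3))
print(fused(t))  # (6, 3) == (tree_sum(t), tree_count(t))
```

Proving that `fused` always agrees with the pair of separate traversals is the sort of equivalence such reasoning frameworks automate.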

Synthesizing optimal collective algorithms
Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's-law bottleneck of data-parallel training.
This paper introduces SCCL (for Synthesized ...
Parallel binary code analysis
- Xiaozhu Meng,
- Jonathon M. Anderson,
- John Mellor-Crummey,
- Mark W. Krentel,
- Barton P. Miller,
- Srđan Milaković
Binary code analysis is widely used to help assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code ...
Compiler support for near data computing
Recent works from both hardware and software domains offer various optimizations that try to take advantage of near data computing (NDC) opportunities. While the results from these works indicate performance improvements of various magnitudes, the ...
Scaling implicit parallelism via dynamic control replication
- Michael Bauer,
- Wonchan Lee,
- Elliott Slaughter,
- Zhihao Jia,
- Mario Di Renzo,
- Manolis Papadakis,
- Galen Shipman,
- Patrick McCormick,
- Michael Garland,
- Alex Aiken
We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication ...
Understanding and bridging the gaps in current GNN performance optimizations
Graph Neural Networks (GNNs) have recently drawn rapidly increasing interest in many domains for their effectiveness in learning over graphs. Maximizing their performance is essential for many tasks, but remains only preliminarily understood. In this work, we ...
A fast work-efficient SSSP algorithm for GPUs
This paper presents a new Single Source Shortest Path (SSSP) algorithm for GPUs. Our key advancement is an improved work scheduler, which is central to the performance of SSSP algorithms. Previous GPU solutions for SSSP use simple work schedulers that ...
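The paper's GPU work scheduler is not shown here; as background, the classic sequential baseline for SSSP that such parallel schedulers are measured against is Dijkstra's algorithm with a priority queue (a minimal sketch, with an illustrative adjacency-list encoding):

```python
import heapq

def sssp(graph, source):
    """Dijkstra's algorithm: graph maps node -> list of (neighbor, weight)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already settled at a shorter distance
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(sssp(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

The scheduling question the abstract alludes to is which relaxations to process next and in what batches: Dijkstra's strict priority order is work-efficient but serial, which is why GPU SSSP algorithms relax that order.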
ShadowVM: accelerating data plane for data analytics with bare metal CPUs and GPUs
With the development of the big data ecosystem, large-scale data analytics has become more prevalent in the past few years. Systems such as Apache Spark provide a flexible approach for scalable processing of massive data. However, they are not designed for ...
BiPart: a parallel and deterministic hypergraph partitioner
Hypergraph partitioning is used in many problem domains including VLSI design, linear algebra, Boolean satisfiability, and data mining. Most versions of this problem are NP-complete or NP-hard, so practical hypergraph partitioners generate approximate ...
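BiPart's algorithm is not described in this snippet; as background, the objective most hypergraph partitioners approximate is minimizing the number of cut hyperedges (edges whose vertices span more than one block). A minimal sketch of that metric, with illustrative names:

```python
def cut_size(hyperedges, part):
    """Count hyperedges whose vertices span more than one block.

    hyperedges: list of vertex sets; part: dict mapping vertex -> block id.
    """
    return sum(1 for e in hyperedges if len({part[v] for v in e}) > 1)

edges = [{1, 2, 3}, {3, 4}, {4, 5}]
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
print(cut_size(edges, partition))  # 1  (only {3, 4} is cut)
```

Because exactly minimizing this metric under balance constraints is NP-hard, practical partitioners such as the one in the paper settle for good approximate solutions.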
NBR: neutralization based reclamation
Safe memory reclamation (SMR) algorithms suffer from a trade-off between bounding unreclaimed memory and the speed of reclamation. Hazard pointer (HP) based algorithms bound unreclaimed memory at all times, but tend to be slower than other approaches. ...
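NBR itself is not sketched in this snippet; as background, the hazard-pointer bookkeeping it is compared against can be illustrated with a single-threaded toy model (not a real concurrent implementation; all names are illustrative):

```python
class ToyHazardDomain:
    """Single-threaded illustration of hazard-pointer bookkeeping.

    A node retired while some hazard pointer still references it must
    stay on the retire list; everything else may be reclaimed. This is
    how HP schemes bound unreclaimed memory at all times.
    """
    def __init__(self):
        self.hazards = set()    # nodes currently protected by readers
        self.retired = []       # nodes removed but not yet freed

    def protect(self, node):
        self.hazards.add(node)

    def release(self, node):
        self.hazards.discard(node)

    def retire(self, node):
        self.retired.append(node)

    def reclaim(self):
        """Free retired nodes that no hazard pointer protects; return them."""
        freed = [n for n in self.retired if n not in self.hazards]
        self.retired = [n for n in self.retired if n in self.hazards]
        return freed

dom = ToyHazardDomain()
dom.protect("A")          # a reader is still using node A
dom.retire("A")
dom.retire("B")
print(dom.reclaim())      # ['B'] -- A stays until the reader releases it
```

The speed cost the abstract refers to comes from the per-access protection step and the retire-list scans, which a real concurrent implementation must make safe with memory fences.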
Efficiently reclaiming memory in concurrent search data structures while bounding wasted memory
Nonblocking data structures face a safe memory reclamation (SMR) problem. In these algorithms, a node removed from the data structure cannot be reclaimed (freed) immediately, as other threads may be about to access it. The goal of an SMR scheme is to ...
OrcGC: automatic lock-free memory reclamation
Dynamic lock-free data structures require a memory reclamation scheme with similar progress guarantees. To date, lock-free reclamation schemes have been applied to data structures on a case-by-case basis, often with modifications to the data structure's algorithm.
In this paper ...
Are dynamic memory managers on GPUs slow?: a survey and benchmarks
Dynamic memory management on GPUs is generally understood to be a challenging topic. On current GPUs, hundreds of thousands of threads might concurrently allocate new memory or free previously allocated memory. This leads to problems with thread ...
GPTune: multitask learning for autotuning exascale applications
Multitask learning has proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We adapt this paradigm to develop autotuning frameworks, where the objective is to find the optimal performance ...
I/O lower bounds for auto-tuning of convolutions in CNNs
Convolution is the most time-consuming part of the computation in convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependency and the increase in the amount of model ...
ApproxTuner: a compiler and runtime system for adaptive approximations
- Hashim Sharif,
- Yifan Zhao,
- Maria Kotsifakou,
- Akash Kothari,
- Ben Schreiber,
- Elizabeth Wang,
- Yasmin Sarita,
- Nathan Zhao,
- Keyur Joshi,
- Vikram S. Adve,
- Sasa Misailovic,
- Sarita Adve
Manually optimizing the tradeoffs between accuracy, performance and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware ...
EGEMM-TC: accelerating scientific computing on tensor cores with extended precision
Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision ...
Efficiently running SpMV on long vector architectures
Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. Such difficulties frequently cause SpMV to experience non-optimal ...
Improving communication by optimizing on-node data movement with data layout
We present optimizations to improve communication performance by reducing on-node data movement for a class of distributed memory applications. The primary concept is to eliminate the data movement associated with packing and unpacking subsets of the ...
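The paper's optimizations are not shown here; as background, the "packing and unpacking" the abstract refers to is the copying of a non-contiguous subset of a local array into a contiguous buffer before communication, and back out afterwards. A minimal illustration (names are hypothetical):

```python
def pack_column(grid, col):
    """Copy one (non-contiguous) column of a 2D list into a contiguous buffer,
    as would be done before sending it to a neighboring rank."""
    return [row[col] for row in grid]

def unpack_column(grid, col, buf):
    """Write a received contiguous buffer back into a column of the 2D list."""
    for row, val in zip(grid, buf):
        row[col] = val

grid = [[1, 2], [3, 4], [5, 6]]
buf = pack_column(grid, 1)
print(buf)  # [2, 4, 6]
```

Eliminating these copies — for instance by choosing a data layout in which the communicated subset is already contiguous — is the on-node data movement saving the abstract describes.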
Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory
Sparse tensor contractions appear commonly in many applications. Efficiently computing the product of two sparse tensors is challenging: it not only inherits the challenges of common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access ...
Advanced synchronization techniques for task-based runtime systems
Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks remains a ...
An ownership policy and deadlock detector for promises
Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or ...
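The paper's ownership policy is not reproduced in this snippet; as background, the classic way to detect a deadlock among tasks blocked on promises is to look for a cycle in the wait-for graph. A minimal sketch (all names illustrative):

```python
def has_deadlock(wait_for):
    """Detect a cycle in a wait-for graph via depth-first search.

    wait_for maps each task to the set of tasks it is blocked waiting on
    (e.g., the tasks expected to fulfill the promises it awaits).
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on DFS stack / done
    color = {t: WHITE for t in wait_for}

    def visit(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            c = color.get(u, WHITE)
            if c == GRAY:
                return True        # back edge: a cycle, hence a deadlock
            if c == WHITE and u in wait_for and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in wait_for)

# T1 waits on a promise only T2 can fulfill, and vice versa: deadlock.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))  # True
print(has_deadlock({"T1": {"T2"}, "T2": set()}))   # False
```

Ownership policies like the one the paper proposes aim to rule such cycles out by construction, rather than detecting them after the fact.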
Understanding a program's resiliency through error propagation
Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), which represents a serious problem; a fault ...
Lightweight preemptive user-level threads
Many-to-many mapping models for user- to kernel-level threads (or "M:N threads") have been extensively studied for decades as a lightweight substitute for current Pthreads implementations that provide a simple one-to-one mapping ("1:1 threads"). M:N ...
TurboTransformers: an efficient GPU serving system for transformer models
The transformer is the most important algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process along the sequence-length dimension in parallel, ...
Extracting clean performance models from tainted programs
Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides ...
Modernizing parallel code with pattern analysis
Fifty years of parallel programming has generated a substantial legacy parallel codebase, creating a new portability challenge: re-parallelizing already parallel code. Our solution exploits inherently portable parallel patterns, and addresses the ...
Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '21)