Welcome to the second volume of ASPLOS'24: the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. This document is dedicated to the 2024 summer review cycle.
We introduced several notable changes to ASPLOS this year, many of which were discussed in the previous message from program chairs in Volume 1. Here, to avoid repetition, we assume that readers have already read that message and describe only the differences between the current cycle and the previous one. These include: (1) developing and utilizing an automated format violation identifier script focused on uncovering disallowed vertical space manipulations that "squeeze" space; (2) incorporating author-declared best-matching topics into our review assignment process; (3) introducing the new ASPLOS role of Program Vice Chairs to cope with the increased number of submissions and the added load caused by forgoing synchronous program committee (PC) meetings, which necessitated additional managerial involvement in online dissensions; and (4) characterizing a systematic problem that ASPLOS is facing in reviewing quantum computing submissions, describing how we addressed it, and highlighting how we believe that it should be handled in the future.
Key statistics of the ASPLOS'24 summer cycle include: 409 submissions were finalized (about 1.5x more than last year's summer count and nearly 2.4x more than our spring cycle), with 107 related to accelerators/FPGAs/GPUs, 97 to machine learning, 88 to storage/memory, 80 to security, and 69 to datacenter/cloud; 179 (44%) submissions were promoted to the second review round; 54 (13.2%) papers were accepted (with 20 awarded one or more artifact evaluation badges); 33 (8.1%) submissions were allowed to submit major revisions, of which 27 were subsequently accepted during the fall cycle (with 13 awarded one or more artifact evaluation badges); 1,499 reviews were uploaded; and 5,557 comments were generated during online discussions.
Analyzing the per-submission most-related broader areas of research, which we asked authors to associate with their work in the submission form, revealed that 71%, 47%, and 28% of the submissions are categorized by their authors as related to architecture, operating systems, and programming languages, respectively, with about 45% being "interdisciplinary" submissions (associated with more than one area). The full details are available in the PDF of the front matter.
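The percentages quoted for the summer cycle follow directly from the raw counts. As a minimal sketch (the variable and function names here are illustrative, not part of the original message), they can be recomputed as shares of the 409 finalized submissions:

```python
# Counts taken from the summer-cycle statistics above.
submissions = 409
counts = {
    "promoted to second review round": 179,
    "accepted": 54,
    "allowed to submit major revisions": 33,
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of all finalized submissions for each outcome.
shares = {label: share(n, submissions) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

The second-round share comes out slightly under the 44% stated in the text, which appears to be rounded to the nearest whole percent.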
A Fault-Tolerant Million Qubit-Scale Distributed Quantum Computer
- Junpyo Kim,
- Dongmoon Min,
- Jungmin Cho,
- Hyeonseong Jeong,
- Ilkwon Byun,
- Junhyuk Choi,
- Juwon Hong,
- Jangwoo Kim
A million qubit-scale quantum computer is essential to realize the quantum supremacy. Modern large-scale quantum computers integrate multiple quantum computers located in dilution refrigerators (DR) to overcome each DR's unscaling cooling budget. However,...
A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs
- Michael Davies,
- Ian McDougall,
- Selvaraj Anandaraj,
- Deep Machchhar,
- Rithik Jain,
- Karthikeyan Sankaralingam
We are in the age of AI, with rapidly changing algorithms and a somewhat synergistic change in hardware. MLPerf is a recent benchmark suite that serves as a way to compare and evaluate hardware. However, it has several drawbacks: it is dominated by CNNs and ...
A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors
- Reese Kuper,
- Ipoom Jeong,
- Yifan Yuan,
- Ren Wang,
- Narayan Ranganathan,
- Nikhil Rao,
- Jiayu Hu,
- Sanjay Kumar,
- Philip Lantz,
- Nam Sung Kim
As semiconductor power density is no longer constant with the technology process scaling down, we need different solutions if we are to continue scaling application performance. To this end, modern CPUs are integrating capable data accelerators on the ...
Achieving Near-Zero Read Retry for 3D NAND Flash Memory
As the flash-based storage devices age with program/erase (P/E) cycles, they require an increasing number of read retries for error correction, which in turn deteriorates their read performance. The design of read-retry methods is critical to flash read ...
An Encoding Scheme to Enlarge Practical DNA Storage Capacity by Reducing Primer-Payload Collisions
Deoxyribonucleic Acid (DNA), with its ultra-high storage density and long durability, is a promising long-term archival storage medium and is attracting much attention today. A DNA storage system encodes and stores digital data with synthetic DNA ...
Atalanta: A Bit is Worth a “Thousand” Tensor Values
- Alberto Delmas Lascorz,
- Mostafa Mahmoud,
- Ali Hadi Zadeh,
- Milos Nikolic,
- Kareem Ibrahim,
- Christina Giannoula,
- Ameer Abdelhadi,
- Andreas Moshovos
Atalanta is a lossless, hardware/software co-designed compression technique for the tensors of fixed-point quantized deep neural networks. Atalanta increases effective memory capacity, reduces off-die traffic, and/or helps to achieve the desired ...
AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference
The Transformer-based generative model (TbGM), comprising summarization (Sum) and generation (Gen) stages, has demonstrated unprecedented generative performance across a wide range of applications. However, it also demands immense amounts of compute and ...
Avoiding Instruction-Centric Microarchitectural Timing Channels Via Binary-Code Transformations
With the end of Moore's Law-based scaling, novel microarchitectural optimizations are being patented, researched, and implemented at an increasing rate. Previous research has examined recently published patents and papers and demonstrated ways these ...
BitPacker: Enabling High Arithmetic Efficiency in Fully Homomorphic Encryption Accelerators
Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data. Though FHE is slow on a CPU, recent hardware accelerators compensate most of FHE's overheads, enabling real-time performance in complex programs like deep neural networks. ...
BVAP: Energy and Memory Efficient Automata Processing for Regular Expressions with Bounded Repetitions
Regular pattern matching is pervasive in applications such as text processing, malware detection, network security, and bioinformatics. Recent studies have demonstrated specialized in-memory automata processors with superior energy and memory ...
Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs
In recent years, hardware architectures optimized for general matrix multiplication (GEMM) have been well studied to deliver better performance and efficiency for deep neural networks. With trends towards batched, low-precision data, e.g., FP8 format in ...
CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators
In recent years, various computing-in-memory (CIM) processors have been presented, showing superior performance over traditional architectures. To unleash the potential of various CIM architectures, such as device precision, crossbar size, and crossbar ...
CMC: Video Transformer Acceleration via CODEC Assisted Matrix Condensing
Video Transformers (VidTs) have reached the forefront of accuracy in various video understanding tasks. Despite their remarkable achievements, the processing requirements for a large number of video frames still present a significant performance ...
Codesign of quantum error-correcting codes and modular chiplets in the presence of defects
- Sophia Fuhui Lin,
- Joshua Viszlai,
- Kaitlin N. Smith,
- Gokul Subramanian Ravi,
- Charles Yuan,
- Frederic T. Chong,
- Benjamin J. Brown
Fabrication errors pose a significant challenge in scaling up solid-state quantum devices to the sizes required for fault-tolerant (FT) quantum applications. To mitigate the resource overhead caused by fabrication errors, we combine two approaches: (1) ...
Compiling Loop-Based Nested Parallelism for Irregular Workloads
- Yian Su,
- Mike Rainey,
- Nick Wanninger,
- Nadharm Dhiantravan,
- Jasper Liang,
- Umut A. Acar,
- Peter Dinda,
- Simone Campanoni
Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however:...
Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety
- Nathaniel Wesley Filardo,
- Brett F. Gutstein,
- Jonathan Woodruff,
- Jessica Clarke,
- Peter Rugg,
- Brooks Davis,
- Mark Johnston,
- Robert Norton,
- David Chisnall,
- Simon W. Moore,
- Peter G. Neumann,
- Robert N. M. Watson
Violations of temporal memory safety ("use after free", "UAF") continue to pose a significant threat to software security. The CHERI capability architecture has shown promise as a technology for C and C++ language reference integrity and spatial memory ...
Design of Novel Analog Compute Paradigms with Ark
Previous efforts on reconfigurable analog circuits mostly focused on specialized analog circuits, produced through careful co-design, or on highly reconfigurable, but relatively resource inefficient, accelerators that implement analog compute paradigms. ...
Direct Memory Translation for Virtualized Clouds
Virtual memory translation has become a key performance bottleneck of memory-intensive workloads in virtualized cloud environments. On the x86 architecture, a nested translation needs to sequentially fetch up to 24 page table entries (PTEs). This paper ...
Efficient Microsecond-scale Blind Scheduling with Tiny Quanta
A longstanding performance challenge in datacenter-based applications is how to efficiently handle incoming client requests that spawn many very short (μs scale) jobs that must be handled with high throughput and low tail latency. When no assumptions are ...
Eliminating Storage Management Overhead of Deduplication over SSD Arrays Through a Hardware/Software Co-Design
This paper presents a hardware/software co-design solution to efficiently implement block-layer deduplication over SSD arrays. By introducing complex and varying dependency over the entire storage space, deduplication is infamously subject to high ...
Elivagar: Efficient Quantum Circuit Search for Classification
Designing performant and noise-robust circuits for Quantum Machine Learning (QML) is challenging: the design space scales exponentially with circuit size, and there are few well-supported guiding principles for QML circuit design. Although recent ...
Energy Efficient Convolutions with Temporal Arithmetic
Convolution is an important operation at the heart of many applications, including image processing, object detection, and neural networks. While data movement and coordination operations continue to be important areas for optimization in general-purpose ...
ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
This paper presents ExeGPT, a distributed system designed for constraint-aware LLM inference. ExeGPT finds and runs with an optimal execution schedule to maximize inference throughput while satisfying a given latency constraint. By leveraging the ...
FaaSGraph: Enabling Scalable, Efficient, and Cost-Effective Graph Processing with Serverless Computing
Graph processing is widely used in cloud services; however, current frameworks face challenges in efficiency and cost-effectiveness when deployed under the Infrastructure-as-a-Service model due to its limited elasticity. In this paper, we present ...
FOCAL: A First-Order Carbon Model to Assess Processor Sustainability
Sustainability in general and global warming in particular are grand societal challenges. Computer systems demand substantial materials and energy resources throughout their entire lifetime. A key question is how computer engineers and scientists can ...
FPGA Technology Mapping Using Sketch-Guided Program Synthesis
- Gus Henry Smith,
- Benjamin Kushigian,
- Vishal Canumalla,
- Andrew Cheung,
- Steven Lyubomirsky,
- Sorawee Porncharoenwase,
- René Just,
- Gilbert Louis Bernstein,
- Zachary Tatlock
FPGA technology mapping is the process of implementing a hardware design expressed in high-level HDL (hardware design language) code using the low-level, architecture-specific primitives of the target FPGA. As FPGAs become increasingly heterogeneous, ...
GIANTSAN: Efficient Memory Sanitization with Segment Folding
Memory safety sanitizers, the sharp weapon for detecting invalid memory operations during execution, employ runtime metadata to model the memory and help find memory errors hidden in the programs. However, location-based methods, the most widely deployed ...
GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- Cong Guo,
- Rui Zhang,
- Jiale Xu,
- Jingwen Leng,
- Zihan Liu,
- Ziyu Huang,
- Minyi Guo,
- Hao Wu,
- Shouren Zhao,
- Junping Zhao,
- Ke Zhang
Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational ...
Grafu: Unleashing the Full Potential of Future Value Computation for Out-of-core Synchronous Graph Processing
As graphs exponentially grow recently, out-of-core graph systems have been invented to process large-scale graphs by keeping massive data in storage. Among them, many systems process the graphs iteration-by-iteration and provide synchronous semantics ...
Greybox Fuzzing for Concurrency Testing
Uncovering bugs in concurrent programs is a challenging problem owing to the exponentially large search space of thread interleavings. Past approaches towards concurrency testing are either optimistic, relying on random sampling of these interleavings ...
Acceptance Rates
Year | Submitted | Accepted | Rate |
---|---|---|---|
ASPLOS '19 | 351 | 74 | 21% |
ASPLOS '18 | 319 | 56 | 18% |
ASPLOS '17 | 320 | 53 | 17% |
ASPLOS '16 | 232 | 53 | 23% |
ASPLOS '15 | 287 | 48 | 17% |
ASPLOS '14 | 217 | 49 | 23% |
ASPLOS XV | 181 | 32 | 18% |
ASPLOS XIII | 127 | 31 | 24% |
ASPLOS XII | 158 | 38 | 24% |
ASPLOS X | 175 | 24 | 14% |
ASPLOS IX | 114 | 24 | 21% |
ASPLOS VIII | 123 | 28 | 23% |
ASPLOS VII | 109 | 25 | 23% |
Overall | 2,713 | 535 | 20% |
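
The rate column and the overall row can be rechecked from the submitted and accepted counts. The following is a small sketch, assuming each rate is the accepted count divided by the submitted count, rounded to the nearest whole percent (the names `history` and `rate` are illustrative):

```python
# (edition, submitted, accepted) rows copied from the table above.
history = [
    ("ASPLOS '19", 351, 74),
    ("ASPLOS '18", 319, 56),
    ("ASPLOS '17", 320, 53),
    ("ASPLOS '16", 232, 53),
    ("ASPLOS '15", 287, 48),
    ("ASPLOS '14", 217, 49),
    ("ASPLOS XV", 181, 32),
    ("ASPLOS XIII", 127, 31),
    ("ASPLOS XII", 158, 38),
    ("ASPLOS X", 175, 24),
    ("ASPLOS IX", 114, 24),
    ("ASPLOS VIII", 123, 28),
    ("ASPLOS VII", 109, 25),
]


def rate(accepted: int, submitted: int) -> int:
    """Acceptance rate as a whole-number percentage."""
    return round(100 * accepted / submitted)


# Per-edition rates, in table order.
rates = [rate(accepted, submitted) for _, submitted, accepted in history]

# Totals for the "Overall" row.
total_submitted = sum(submitted for _, submitted, _ in history)
total_accepted = sum(accepted for _, _, accepted in history)
overall = rate(total_accepted, total_submitted)

for (edition, _, _), pct in zip(history, rates):
    print(f"{edition}: {pct}%")
print(f"Overall: {total_submitted} submitted, {total_accepted} accepted, {overall}%")
```

Under that rounding assumption, the recomputed values agree with every row of the table, including the overall totals.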