research-article

Open access

Mozart: Taming Taxes and Composing Accelerators with Shared-Memory

Authors:

Sarita V. Adve

PACT '24: Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques

Pages 183 - 200

https://doi.org/10.1145/3656019.3676896

Published: 13 October 2024

Abstract

Resource-constrained systems-on-chip (SoCs) are increasingly heterogeneous, with specialized accelerators for various tasks. Acceleration taxes due to control and data movement, however, diminish end-to-end speedups from hardware acceleration. Meanwhile, emerging workloads are increasingly task-diverse, with several, potentially shared, fine-grained acceleration candidates. This motivates a paradigm of parallel and disaggregated acceleration. Compared to a monolithic accelerator, disaggregation provides higher flexibility, reuse, and utilization, but at the cost of higher control and data acceleration taxes.

We propose a novel SoC architecture, Mozart, that enables efficient accelerator disaggregation by leveraging shared memory to tame control and data acceleration taxes. To address the control tax, Mozart includes a lightweight, modular, and general accelerator synchronization interface (ASI). ASI eliminates the typical CPU-centric accelerator control in favor of a decentralized, uniform synchronization interface through shared memory. This enables accelerators to directly and transparently synchronize with each other (or with CPUs) using the same shared-memory interface as CPUs. To address the data tax, Mozart leverages the Spandex-FCS heterogeneous coherence protocol, which supports decentralized data movement and per-word coherence specialization. We demonstrate the first RTL implementation of Spandex-FCS and the first evaluation of its benefits for a heterogeneous SoC with fixed-function accelerators running real-world applications on Linux. Mozart simultaneously enables, for the first time, (1) finer-grained acceleration than previously possible, (2) programmable and transparent composition of fine-grained, disaggregated accelerators, (3) efficient accelerator pipelining through shared memory and decentralization, and (4) a performance-competitive disaggregated alternative to specialized monolithic accelerators. We demonstrate these capabilities of Mozart with a comprehensive, one-of-a-kind evaluation of more than 70 hardware configurations prototyped on an FPGA, employing various accelerators and running real-world applications on Linux, together with a scalability analysis with up to 15 accelerators. We also present an analytical performance model to understand and explore system design choices and to validate the results.
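As a rough illustration of the decentralized synchronization idea described above, the following toy Python sketch chains software "accelerator" stages that hand work to each other through flags in shared buffers, with no central controller sequencing them. It is not the ASI design from the paper: the names `SharedBuffer`, `stage`, and `run_pipeline`, the single-slot full/empty flag layout, and the use of threads in place of hardware accelerators are all assumptions made for illustration.

```python
import threading
import time


class SharedBuffer:
    """Toy shared-memory region (hypothetical layout): one data slot plus a flag."""

    def __init__(self):
        self.data = None
        self.full = False  # set by the producer, cleared by the consumer


def stage(fn, src, dst, n_items):
    """One stand-in 'accelerator': wait on the input flag, compute, signal the output flag."""
    for _ in range(n_items):
        while not src.full:   # poll the producer's flag in shared memory
            time.sleep(0)     # yield so other threads can run
        value = src.data
        src.full = False      # hand the input slot back to the producer
        result = fn(value)
        while dst.full:       # wait until the consumer has drained the output slot
            time.sleep(0)
        dst.data = result     # write data first, then the flag; real hardware
        dst.full = True       # would need release/acquire ordering here


def run_pipeline(fns, inputs):
    """Chain the stages through shared buffers; each stage signals its neighbor directly."""
    buffers = [SharedBuffer() for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, buffers[i], buffers[i + 1], len(inputs)))
        for i, fn in enumerate(fns)
    ]

    def feed():
        # The host only fills the first buffer; it does not sequence the stages.
        for x in inputs:
            while buffers[0].full:
                time.sleep(0)
            buffers[0].data = x
            buffers[0].full = True

    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    outputs = []
    for _ in inputs:
        while not buffers[-1].full:
            time.sleep(0)
        outputs.append(buffers[-1].data)
        buffers[-1].full = False

    for t in workers + [feeder]:
        t.join()
    return outputs


if __name__ == "__main__":
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

In this sketch each stage touches only the flags of its two neighboring buffers, so adding or reordering stages does not require changing any coordinating code; that is the property the sketch is meant to convey, under the assumptions stated above.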
