ABSTRACT
Domain-specific languages (DSLs) improve developer productivity by abstracting away low-level details of an algorithm's implementation within a specialized domain. These languages often provide powerful primitives to describe complex operations, potentially granting flexibility during compilation to target hardware acceleration. This work proposes PriMax, a novel methodology to effectively map DSL applications to hardware accelerators. It builds decision trees based on benchmark results, which select between distinct implementations of accelerated primitives to maximize a target performance metric. In our graph analytics case study with two accelerators, PriMax produces a geometric mean speedup of 1.57x over a multicore CPU, higher than either target accelerator alone, and approaching the maximum 1.58x speedup attainable with these target accelerators.
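The selection idea can be sketched in a few lines of Python. This is a minimal illustration only, not the PriMax implementation: it learns a depth-1 decision tree (a single threshold) from benchmark runtimes and uses it to choose an implementation for a new input. The feature (average vertex degree), the implementation names `cpu`, `accel_a`, and `accel_b`, and all runtimes are hypothetical values made up for the example.

```python
from math import prod

# Hypothetical benchmark rows: (feature value, {implementation: runtime in seconds}).
# The feature (average vertex degree) and every number here are invented for illustration.
BENCHMARKS = [
    (2.0,  {"cpu": 1.00, "accel_a": 0.50, "accel_b": 1.25}),
    (4.0,  {"cpu": 1.00, "accel_a": 0.40, "accel_b": 0.80}),
    (16.0, {"cpu": 1.00, "accel_a": 1.25, "accel_b": 0.50}),
    (32.0, {"cpu": 1.00, "accel_a": 2.00, "accel_b": 0.25}),
]


def geomean(values):
    """Geometric mean of an iterable of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def best_single(rows):
    """Implementation with the highest geomean speedup over 'cpu' on these rows."""
    impls = rows[0][1].keys()
    return max(impls, key=lambda i: geomean(r["cpu"] / r[i] for _, r in rows))


def train_stump(rows):
    """Fit a depth-1 decision tree on one feature.

    Tries every midpoint between consecutive (distinct) feature values as a
    threshold, picks the best implementation on each side, and keeps the split
    with the highest overall geomean speedup.
    Returns (score, threshold, (impl_below, impl_at_or_above)).
    """
    rows = sorted(rows, key=lambda row: row[0])
    best = None
    for k in range(1, len(rows)):
        threshold = (rows[k - 1][0] + rows[k][0]) / 2
        choice = (best_single(rows[:k]), best_single(rows[k:]))
        score = geomean(
            r["cpu"] / r[choice[0] if f < threshold else choice[1]]
            for f, r in rows
        )
        if best is None or score > best[0]:
            best = (score, threshold, choice)
    return best


def select(stump, feature):
    """Choose an implementation for a new input using the trained stump."""
    _, threshold, (below, at_or_above) = stump
    return below if feature < threshold else at_or_above


stump = train_stump(BENCHMARKS)
print("threshold:", stump[1], "choices:", stump[2])
print("geomean speedup with selection:", round(stump[0], 3))
print("best single implementation:", best_single(BENCHMARKS))
```

On this toy data the learned split beats either hypothetical accelerator used alone, which mirrors the kind of comparison the abstract reports. A real selector would presumably use deeper trees and several input features.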
Index Terms
- PriMax: maximizing DSL application performance with selective primitive acceleration