Characterization of a Coherent Hardware Accelerator Framework for SoCs

López-Paradís, Guillem; Venu, Balaji; Armejach, Adriá; Moretó, Miquel

doi:10.1007/978-3-031-46077-7_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14385)

Included in the following conference series:

International Conference on Embedded Computer Systems

Abstract

Accelerator-rich architectures have become the standard in today’s SoCs. With the slowdown of Moore’s law, it is common to dedicate only a fraction of the SoC area to traditional cores and leave the rest for specialized hardware. This motivates the need for better interconnects and interfaces between accelerators and the SoC, in both hardware and software. Recent industry proposals have focused on coherent interconnects for large external accelerators. However, there are still many cases where accelerators benefit from being directly connected to the CPU's memory hierarchy inside the same chip. In this work, we demonstrate the usability of these interfaces by characterizing a framework that connects accelerators that benefit from coherent access to the memory hierarchy. We evaluate several kernels from the MachSuite benchmark suite in an FPGA environment, obtaining performance and area numbers. We obtain speedups from \(1.42\times\) up to \(10\times\) while requiring only 45k LUTs for the accelerator framework. We conclude that many accelerators can benefit from this access to the memory hierarchy and that more work is needed toward a generic framework.

Acknowledgements

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (PID2019-107255GB-C21 and TED2021-132634A-I00), by the Generalitat de Catalunya (2021-SGR-00763), and by Arm through the Arm-BSC Center of Excellence. G. López-Paradís has been supported by the Generalitat de Catalunya through an FI fellowship (2021FI-B00994), M. Moretó by a Ramón y Cajal fellowship (no. RYC-2016-21104), and A. Armejach is a Serra Húnter Fellow.

Author information

Authors and Affiliations

Barcelona Supercomputing Center (BSC), Barcelona, Spain
Guillem López-Paradís, Adriá Armejach & Miquel Moretó
Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
Guillem López-Paradís, Adriá Armejach & Miquel Moretó
Arm Ltd., Cambridge, UK
Balaji Venu

Corresponding author

Correspondence to Guillem López-Paradís.

Editor information

Editors and Affiliations

Politecnico di Milano, Milan, Italy
Cristina Silvano
Politecnico di Milano, Milan, Italy
Christian Pilato
University of Rostock, Rostock, Germany
Marc Reichenbach

About this paper

Cite this paper

López-Paradís, G., Venu, B., Armejach, A., Moretó, M. (2023). Characterization of a Coherent Hardware Accelerator Framework for SoCs. In: Silvano, C., Pilato, C., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2023. Lecture Notes in Computer Science, vol 14385. Springer, Cham. https://doi.org/10.1007/978-3-031-46077-7_7

DOI: https://doi.org/10.1007/978-3-031-46077-7_7
Published: 07 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46076-0
Online ISBN: 978-3-031-46077-7
eBook Packages: Computer Science, Computer Science (R0)
