Characterization of a Coherent Hardware Accelerator Framework for SoCs

  • Conference paper
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2023)

Abstract

Accelerator-rich architectures have become the standard in today’s SoCs. As Moore’s law wanes, it is common to dedicate only a fraction of the SoC area to traditional cores and leave the rest for specialized hardware. This motivates the need for better interconnects and interfaces between accelerators and the SoC, in both hardware and software. Recent industry proposals have focused on coherent interconnects for large external accelerators. However, there are still many cases where accelerators benefit from being directly connected to the CPU’s memory hierarchy inside the same chip. In this work, we demonstrate the usability of these interfaces by characterizing a framework that connects accelerators that benefit from coherent access to the memory hierarchy. We evaluate several kernels from the MachSuite benchmark suite in an FPGA environment, obtaining performance and area numbers. We obtain speedups from \(1.42\times\) up to \(10\times\) while requiring only 45k LUTs for the accelerator framework. We conclude that many accelerators can benefit from this access to the memory hierarchy and that more work is needed toward a generic framework.

Acknowledgements

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (PID2019-107255GB-C21 and TED2021-132634A-I00), by the Generalitat de Catalunya (2021-SGR-00763), and by Arm through the Arm-BSC Center of Excellence. G. López-Paradís has been supported by the Generalitat de Catalunya through FI fellowship 2021FI-B00994, M. Moretó by Ramón y Cajal fellowship no. RYC-2016-21104, and A. Armejach is a Serra Húnter Fellow.

Corresponding author

Correspondence to Guillem López-Paradís.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

López-Paradís, G., Venu, B., Armejach, A., Moretó, M. (2023). Characterization of a Coherent Hardware Accelerator Framework for SoCs. In: Silvano, C., Pilato, C., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2023. Lecture Notes in Computer Science, vol 14385. Springer, Cham. https://doi.org/10.1007/978-3-031-46077-7_7

  • DOI: https://doi.org/10.1007/978-3-031-46077-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46076-0

  • Online ISBN: 978-3-031-46077-7

  • eBook Packages: Computer Science, Computer Science (R0)
