Profile-driven memory bandwidth management for accelerators and CPUs in QoS-enabled platforms

Real-Time Systems

Abstract

The proliferation of multi-core, accelerator-enabled embedded systems has introduced new opportunities to consolidate real-time systems of increasing complexity. However, building confidence in the temporal behavior of co-running applications remains a formidable challenge. Most prominently, the main memory subsystem is a performance bottleneck for both CPUs and accelerators, and industry-viable frameworks for full-system main memory management and performance analysis are overdue. In this paper, we propose our Envelope-aWare Predictive model, or E-WarP for short. E-WarP is a methodology and technological framework to: (1) analyze the memory demand of applications following a profile-driven approach; (2) make realistic predictions on the temporal behavior of workloads deployed on CPUs and accelerators; and (3) perform saturation-aware system consolidation. This work aims to provide the technological foundations as well as the theoretical groundwork for truly workload-aware analysis of real-time systems. It combines traditional CPU-centric bandwidth regulation techniques with state-of-the-art hardware support for memory traffic shaping via the ARM QoS extensions. We make three key observations. First, our profile-driven methodology over-predicts the runtime of bandwidth-regulated applications by 6% on average. Second, we experimentally validate that the calculated bounds hold system-wide if the main memory subsystem operates below saturation. Third, we show that the E-WarP methodology is practical even when applications exhibit input-dependent memory access patterns. We provide a full implementation of our techniques on a commercial platform (NXP S32V234).
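
To make the idea of profile-driven runtime prediction under bandwidth regulation more concrete, the following is a minimal illustrative sketch and not the E-WarP model itself. It assumes a generic budget-per-period regulator (at most `budget` memory transactions per `period`, with the core stalled until the next period once the budget is spent) and a profile given as `(duration, transactions)` samples collected without regulation. The function name, the profile layout, and the assumption that transactions are spread evenly within each sample are all hypothetical choices made for this example.

```python
from fractions import Fraction


def predict_regulated_runtime(profile, period, budget):
    """Estimate runtime under a budget-per-period regulator (illustrative only).

    profile: list of (duration, transactions) samples measured WITHOUT
             regulation; transactions are assumed evenly spread in a sample.
    period:  regulation period, in the same time unit as the durations.
    budget:  maximum number of transactions allowed per period.
    """
    if period <= 0 or budget <= 0:
        raise ValueError("period and budget must be positive")
    period = Fraction(period)
    now = Fraction(0)        # elapsed time under regulation
    period_end = period      # end of the current regulation period
    left = Fraction(budget)  # budget remaining in the current period

    for duration, transactions in profile:
        time_left = Fraction(duration)   # unregulated work still to execute
        tx_left = Fraction(transactions)
        while time_left > 0:
            if now == period_end:        # period boundary: replenish budget
                period_end += period
                left = Fraction(budget)
            rate = tx_left / time_left   # transactions per time unit
            # Run until the sample ends, the period ends, or the budget is spent.
            run = min(time_left, period_end - now)
            if rate > 0:
                run = min(run, left / rate)
            if run == 0:                 # budget exhausted: stall to period end
                now = period_end
                continue
            now += run
            left -= rate * run
            tx_left -= rate * run
            time_left -= run
    return now


# Hypothetical profile: a light phase followed by a memory-intensive phase.
example_profile = [(10, 5), (10, 40)]
print(predict_regulated_runtime(example_profile, period=10, budget=20))
```

With this hypothetical profile, the second phase exhausts its budget halfway through a period and stalls, so the estimate exceeds the unregulated runtime of 20 time units. Exact rational arithmetic (`Fraction`) is used only to keep period boundaries free of floating-point drift; a real analysis would also have to account for effects this sketch ignores, such as memory saturation caused by other cores and accelerators.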

Notes

  1. Contributions indicated with a * are new additions in the journal extension.

  2. This was required to overcome the lack of a PSCI firmware provided by the vendor to control CPU shutdown.

  3. https://github.com/rntmancuso/jailhouse-rt

  4. The DRAM operates at half the frequency of the CPUs.

  5. Figure 15b, c: original photos by Alexander Klein and Stefan Wernthaler, respectively, from https://www.stereoscopy.com/; Figure 15e, f: original video frames from the Visual Tracker Benchmark, respectively Basketball and CarScale data sets available at http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html. The original photos have been scaled and/or cropped to match the same resolution and aspect ratios as the default SD-VBS image files.

Acknowledgements

The material presented in this paper is based upon work supported by the National Science Foundation (NSF) under Grant Number CCF-2008799. The work was also supported through the Red Hat Research program. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.

Author information

Corresponding author

Correspondence to Parul Sohal.

About this article

Cite this article

Sohal, P., Tabish, R., Drepper, U. et al. Profile-driven memory bandwidth management for accelerators and CPUs in QoS-enabled platforms. Real-Time Syst 58, 235–274 (2022). https://doi.org/10.1007/s11241-022-09382-x

  • DOI: https://doi.org/10.1007/s11241-022-09382-x
