
Attacking the one-out-of-m multicore problem by combining hardware management with mixed-criticality provisioning


Abstract

The multicore revolution is having limited impact in safety-critical application domains. A key reason is the “one-out-of-m” problem: when validating real-time constraints on an m-core platform, excessive analysis pessimism can effectively negate the processing capacity of the additional \(m-1\) cores so that only “one core’s worth” of capacity is utilized even though m cores are available. Two approaches have been investigated previously to address this problem: mixed-criticality allocation techniques, which provision less-critical software components less pessimistically, and hardware-management techniques, which make the underlying platform itself more predictable. A better way forward may be to combine both approaches, but to show this, fundamentally new criticality-cognizant hardware-management tradeoffs must be explored. Such tradeoffs are investigated herein in the context of a new variant of a mixed-criticality framework, called \(\textsf {MC}^\textsf {2} \), that supports configurable criticality-based hardware management. This framework allows specific DRAM memory banks and areas of the last-level cache (LLC) to be allocated to certain groups of tasks. A linear-programming-based optimization framework is presented for sizing such LLC areas, subject to conditions for ensuring \(\textsf {MC}^\textsf {2} \) schedulability. The effectiveness of the overall framework in resolving hardware-management and scheduling tradeoffs is investigated in the context of a large-scale overhead-aware schedulability study. This study was guided by extensive trace data obtained by executing benchmark programs on the new variant of \(\textsf {MC}^\textsf {2} \) presented herein. This study shows that mixed-criticality allocation and hardware-management techniques can be much more effective when applied together instead of alone.



Notes

  1. Multicore-related certification difficulties are extensively discussed in a recent position paper from the U.S. Federal Aviation Administration (Certification Authorities Software Team (CAST) 2014, 2016).

  2. We use the terms “processor,” “core,” and “CPU” interchangeably.

  3. The notation \(C_i\) is now commonly used to denote a task execution time, but the term “C” has a pre-existing meaning in the context of \(\textsf {MC}^\textsf {2} \).

  4. We use “PET” instead of “WCET” because under \(\textsf {MC}^\textsf {2} \), some tasks are SRT, and hence may not be provisioned on a worst-case basis.

  5. LITMUS\(^{\mathrm{RT}}\) is a real-time extension of the Linux kernel. Source code is available at http://www.litmus-rt.org.

  6. Other per-level schedulers optionally can be used, and Level-C tasks can be defined according to the sporadic task model. These options, and other considerations, such as slack reallocation, are discussed in prior papers (Herman et al. 2012; Mollison et al. 2010; Ward et al. 2013).

  7. As explained in Mills and Anderson (2011), tardiness bounds with respect to deterministic budget allocations at Level C can be used to bound tardiness in expectation when average-case task execution times are assumed.

  8. All source code for our new \(\textsf {MC}^\textsf {2} \) framework is available online at https://wiki.litmus-rt.org/litmus/Publications.

  9. According to the thesis underlying the design of \(\textsf {MC}^\textsf {2} \) (mentioned in Sect. 2), Level-A and -B tasks are expected to be fly-weight, deterministic tasks, and hence should not require dynamic memory allocation.

  10. We often use the term “area” instead of “partition” to describe these allocated LLC regions because of the potential for some regions to overlap.

References

  • Alhammad A, Pellizzoni R (2016) Trading cores for memory bandwidth in real-time systems. In: Proceedings of the 22nd IEEE real-time and embedded technology and applications symposium, pp 317–327

  • Alhammad A, Wasly S, Pellizzoni R (2015) Memory efficient global scheduling of real-time tasks. In: Proceedings of the 21st IEEE real-time and embedded technology and applications symposium, pp 285–296

  • Altmeyer S, Douma R, Lunniss W, Davis R (2014) Evaluation of cache partitioning for hard real-time systems. In: Proceedings of the 26th euromicro conference on real-time systems, pp 15–26

  • Audsley N (2013) Memory architecture for NoC-based real-time mixed criticality systems. In: Proceedings of the 1st international workshop on mixed criticality systems, pp 37–42

  • Baker T, Shaw A (1988) The cyclic executive model and ADA. In: Proceedings of the 9th IEEE real-time systems symposium, pp 120–129

  • Brandenburg B (2011) Scheduling and locking in multiprocessor real-time operating systems. PhD thesis, University of North Carolina, Chapel Hill, NC

  • Bui B, Caccamo M, Sha L, Martinez J (2008) Impact of cache partitioning on multi-tasking real time embedded systems. In: Proceedings of the 14th IEEE international conference on embedded and real-time computing systems and applications, pp 101–110

  • Burns A, Davis R (2016) Mixed criticality systems - a review. Tech. rep., Department of Computer Science, University of York

  • Campoy M, Ivars A, Mataix J (2001) Static use of locking caches in multitask preemptive real-time systems. In: Proceedings of IEEE/IEE real-time embedded systems workshop, pp 1283–1286

  • Certification Authorities Software Team (CAST) (2014) Position paper CAST-32: multi-core processors

  • Certification Authorities Software Team (CAST) (2016) Position paper CAST-32A: multi-core processors. https://www.faa.gov/aircraft/air_cert/design_approvals/air_software/cast/cast_papers/media/cast-32A.pdf

  • Chisholm M, Ward B, Kim N, Anderson J (2015) Cache sharing and isolation tradeoffs in multicore mixed-criticality systems. In: Proceedings of the 36th IEEE real-time systems symposium, pp 305–316

  • Chisholm M, Kim N, Ward B, Otterness N, Anderson J, Smith F (2016) Reconciling the tension between hardware isolation and data sharing in mixed-criticality, multicore systems. In: Proceedings of the 37th IEEE real-time systems symposium, pp 57–68

  • Devi U, Anderson J (2008) Tardiness bounds under global EDF scheduling on a multiprocessor. Real-Time Syst 38:133–189

  • Giannopoulou G, Stoimenov N, Huang P, Thiele L (2013) Scheduling of mixed-criticality applications on resource-sharing multicore systems. In: Proceedings of the 13th ACM international conference on embedded software, pp 1–15

  • Hassan M, Patel H (2016) Criticality- and requirement-aware bus arbitration for multi-core mixed criticality systems. In: Proceedings of the 22nd IEEE real-time and embedded technology and applications symposium, pp 73–83

  • Hassan M, Patel H, Pellizzoni R (2015) A framework for scheduling DRAM memory accesses for multi-core mixed-time critical systems. In: Proceedings of the 21st IEEE real-time and embedded technology and applications symposium, pp 307–316

  • Herman J, Kenna C, Mollison M, Anderson J, Johnson D (2012) RTOS support for multicore mixed-criticality systems. In: Proceedings of the 18th IEEE real-time and embedded technology and applications symposium, pp 197–208

  • Jalle J, Quinones E, Abella J, Fossati L, Zulianello M, Cazorla F (2014) A dual-criticality memory controller (DCmc) proposal and evaluation of a space case study. In: Proceedings of the 35th IEEE real-time systems symposium, pp 207–217

  • Kessler R, Hill M (1992) Page placement algorithms for large real-indexed caches. ACM Trans Comput Syst 10:338–359

  • Kim H, Kandhalu A, Rajkumar R (2013) A coordinated approach for practical OS-level cache management in multi-core real-time systems. In: Proceedings of the 25th euromicro conference on real-time systems, pp 80–89

  • Kim H, Niz DD, Andersson B, Klein M, Mutlu O, Rajkumar R (2014) Bounding memory interference delay in COTS-based multi-core systems. In: Proceedings of the 20th IEEE real-time and embedded technology and applications symposium, pp 145–154

  • Kim H, Broman D, Lee E, Zimmer M, Shrivastava A, Oh J (2015) A predictable and command-level priority-based DRAM controller for mixed-criticality systems. In: Proceedings of the 21st IEEE real-time and embedded technology and applications symposium, pp 317–326

  • Kim N, Ward B, Chisholm M, Fu C, Anderson J, Smith F (2016) Attacking the one-out-of-\(m\) multicore problem by combining hardware management with mixed-criticality provisioning. In: Proceedings of the 22nd IEEE real-time and embedded technology and applications symposium, pp 149–160

  • Kim N, Chisholm M, Otterness N, Anderson J, Smith F (2017a) Allowing shared libraries while supporting hardware isolation in multicore real-time systems. In: Proceedings of the 23rd IEEE real-time and embedded technology and applications symposium (to appear)

  • Kim N, Ward B, Chisholm M, Fu C, Anderson J, Smith F (2017b) Attacking the one-out-of-\(m\) multicore problem by combining hardware management with mixed-criticality provisioning (full version). Available at URL: http://www.cs.unc.edu/anderson/papers.html

  • Kirk D (1989) SMART (strategic memory allocation for real-time) cache design. In: Proceedings of the 10th IEEE real-time systems symposium, pp 229–237

  • Kotaba O, Nowotsch J, Paulitsch M, Petters S, Theiling H (2013) Multicore in real-time systems – temporal isolation challenges due to shared resources. In: Proceedings of the international workshop on industry-driven approaches for cost-effective certification of safety-critical, mixed-criticality systems

  • Krishnapillai Y, Wu Z, Pellizzoni R (2014) ROC: A rank-switching, open-row DRAM controller for time-predictable systems. In: Proceedings of the 26th euromicro conference on real-time systems, pp 27–38

  • Kroft D (1981) Lockup-free instruction fetch/prefetch cache organization. In: Proceedings of the 8th annual symposium on computer architecture, pp 81–87

  • Liu L, Cui Z, Xing M, Bao Y, Chen M, Wu C (2012) A software memory partition approach for eliminating bank-level interference in multicore systems. In: Proceedings of the 21st international conference on parallel architectures and compilation techniques, pp 367–376

  • Mills A, Anderson J (2011) A multiprocessor server-based scheduler for soft real-time tasks with stochastic execution demand. In: Proceedings of the 17th IEEE international conference on embedded and real-time computing systems and applications, pp 207–217

  • Mollison M, Erickson J, Anderson J, Baruah S, Scoredos J (2010) Mixed criticality real-time scheduling for multicore systems. In: Proceedings of the 7th IEEE international conference on embedded software and systems, pp 1864–1871

  • Musmanno J (2003) Data intensive systems (DIS) benchmark performance summary

  • Pellizzoni R, Schranzhofer A, Chen J, Caccamo M, Thiele L (2010) Worst case delay analysis for memory interference in multicore systems. In: Proceedings of the 2010 design, automation and test in Europe conference and exhibition, pp 741–746

  • Tabish R, Mancuso R, Wasly S, Alhammad A, Phatak S, Pellizzoni R, Caccamo M (2016) A real-time scratchpad-centric OS for multi-core embedded systems. In: Proceedings of the 22nd IEEE real-time and embedded technology and applications symposium, pp 1–11

  • Valsan P, Yun H, Farshchi F (2016) Taming non-blocking caches to improve isolation in multicore real-time systems. In: Proceedings of the 22nd IEEE real-time and embedded technology and applications symposium, pp 161–172

  • Vestal S (2007) Preemptive scheduling of multi-criticality systems with varying degrees of execution time assurance. In: Proceedings of the 28th IEEE real-time systems symposium, pp 239–243

  • Ward B, Herman J, Kenna C, Anderson J (2013) Making shared caches more predictable on multicore platforms. In: Proceedings of the 25th euromicro conference on real-time systems, pp 157–167

  • Xu M, Phan LTX, Choi HY, Lee I (2016) Analysis and implementation of global preemptive fixed-priority scheduling with dynamic cache allocation. In: Proceedings of the 22nd IEEE real-time and embedded technology and applications symposium, pp 123–134

  • Yun H, Yao G, Pellizzoni R, Caccamo M, Sha L (2012) Memory access control in multiprocessor for real-time systems with mixed criticality. In: Proceedings of the 24th euromicro conference on real-time systems, pp 299–308

  • Yun H, Mancuso R, Wu Z, Pellizzoni R (2014) PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms. In: Proceedings of the 20th IEEE real-time and embedded technology and applications symposium, pp 155–166

Acknowledgements

We are grateful to Cheng-Yang Fu for assisting with some of the implementation efforts discussed in this paper.

Corresponding author

Correspondence to Namhoon Kim.

Additional information

Work supported by U.S. National Science Foundation Grants CNS 1115284, CNS 1218693, CPS 1239135, CNS 1409175, and CPS 1446631, U.S. Air Force Office of Scientific Research Grant FA9550-14-1-0161, U.S. Army Research Office Grant W911NF-14-1-0499, and a grant from General Motors. The second author was also supported by a U.S. National Science Foundation graduate fellowship.

Appendices

Appendix 1: Overhead accounting

In our schedulability study, to account for implementation-related overheads beyond those discussed in Sect. 5.2, we applied several existing overhead-accounting techniques (Brandenburg 2011). A complete, formal description of all of these techniques is beyond the scope of this paper; in what follows, we give a high-level description of the techniques employed and highlight the most relevant ideas. We account for all overhead sources through PET inflation, i.e., by increasing the PET of each task before evaluating schedulability. In addition to CRPDs, we considered the following overhead sources, defined in Brandenburg (2011): context switching, release latency, timer ticks, scheduling, job releases, and inter-processor interrupts (IPIs).

To account for these overheads, we applied techniques pioneered by Brandenburg (2011) for PEDF and GEDF to Levels B and C, respectively, with minor modifications to account for interactions among criticality levels in \(\textsf {MC}^\textsf {2} \). When analyzing tasks at each criticality level, we used measured overheads obtained under the same assumptions as the corresponding PETs; e.g., at Level C, average-case measured overheads were used. Also, in the case of scheduling and release overheads, we must account both for per-core partitioned scheduling and release overheads at Levels A and B and for global scheduling and release overheads that may be incurred on any core at Level C. Release and IPI overheads caused by task migrations at Level C may cause delays at all criticality levels.

Our \(\textsf {MC}^\textsf {2} \) implementation makes heavy use of PET budgets, and managing these budgets gives rise to a new overhead source: overhead is incurred whenever a budget is replenished or depleted. Such budget-management overheads are accounted for like the other overhead sources, by inflating PETs.
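
To make the inflation step concrete, the sketch below shows one plausible way to fold per-job overhead charges into a task's PET before schedulability analysis. The simple additive form, the particular terms charged, and all numbers are illustrative assumptions on our part; the precise accounting follows Brandenburg (2011) and is considerably more involved.

def inflate_pet(pet_us, ovh):
    # Hedged sketch (not the exact accounting used in the study): charge each
    # job for representative per-job overheads, in microseconds. Which
    # statistic populates `ovh` (worst-case vs. average-case measurements)
    # depends on the criticality level, mirroring the treatment of PETs.
    return (pet_us
            + 2 * ovh["context_switch"]   # switching in and out
            + 2 * ovh["scheduling"]       # scheduler invocations
            + ovh["release"]              # job-release handling
            + ovh["release_latency"]      # delayed release detection
            + ovh["ipi"]                  # possible migration IPI (Level C)
            + ovh["tick"]                 # timer-tick interference share
            + 2 * ovh["budget"])          # budget replenishment/depletion

# Example with made-up worst-case measurements for a Level-B task:
ovh_b = {"context_switch": 8.0, "scheduling": 10.5, "release": 6.2,
         "release_latency": 12.0, "ipi": 4.1, "tick": 3.3, "budget": 2.7}
inflated = inflate_pet(1500.0, ovh_b)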

Appendix 2: PET-generation process

The PETs assumed in Sect. 6.1 are based on an analytical model, which we derived by distilling the measured execution-time data discussed in Sect. 4. This appendix describes this PET-generation model in greater detail. As described in Sect. 6.1, all PETs required in our schedulability experiments are defined based on EDF-scheme PETs, which correspond to A-inflated WCETs in an idle system with the full LLC allocated to the task in question. We denote this WCET parameter as \(C_i^0\) for task \(\tau _i\). In our experimental framework, the \(C_i^0\) values are obtained implicitly from the randomly generated task utilizations and periods. The execution-time values used to obtain all other PETs for \(\tau _i\) under different isolation and analysis assumptions are listed in Table 4, and Table 5 shows how these values are used to define all PETs under each scheme. The columns of Table 4 indicate how each execution-time value is defined (i.e., whether the value is a Level-A-inflated WCET, a non-inflated WCET, or an ACET, whether the system is assumed to be under load or idle, etc.). Each of these values is generated by applying one or more scaling factors to previously listed execution-time values. We present an overview of this entire process here.

Table 4 Generated PET values

Step 1: Generate \(C_i^1\) by scaling \(C_i^0\) to account for interfering workload. We choose \(C_i^1\) uniformly from [120, 150)% of \(C_i^0\), based on WCET measurement data in idle and loaded systems with the full LLC allocation.
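
As a minimal sketch of this step (the interval comes from the text; the function name and the use of Python's standard generator are our own assumptions):

import random

def gen_C1(C0):
    # Step 1: scale the idle-system, full-LLC WCET C_i^0 by a factor drawn
    # uniformly from [1.20, 1.50) to reflect a loaded system.
    return C0 * random.uniform(1.20, 1.50)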

Table 5 Assignment of execution time parameters to PETs

Step 2: Generate \(C_i^2\) by scaling \(C_i^1\) for different LLC allocations. Our \(C_i^0\) values are defined from generated utilizations. The process for generating such utilizations was carefully defined to produce trends similar to those seen from measurement data. Since our \(C_i^1\) values are simply scaled versions of our \(C_i^0\) parameters, similar utilization trends will be seen when utilizations are defined in terms of \(C_i^1\) values. Figure 23 illustrates typical generated utilizations. As seen in this figure, task utilizations monotonically decrease with increasing LLC space and converge at the ICAS. This is in accordance with Obs. 2. To reflect this, we obtain \(C_i^2\) values for different LLC-allocation choices by applying a scaling factor to \(C_i^1\) that exponentially increases with the minimum of the ICAS and LLC space. The actual scaling factors employed were selected to reflect measurement data.

Fig. 23  Utilizations generated under different LLC allocations for three example tasks
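
The sketch below illustrates only the qualitative shape of this step: execution times shrink toward the full-allocation value \(C_i^1\) as the allocation approaches the ICAS and stay flat beyond it. The exponential form and the constant alpha are placeholders; the actual scaling factors were fit to our measurement data.

import math

def gen_C2(C1, llc_ways, icas_ways, alpha=0.15):
    # Step 2 (illustrative): the further the allocation falls short of the
    # task's ICAS, the larger the penalty over the full-allocation WCET C_i^1.
    shortfall = icas_ways - min(llc_ways, icas_ways)
    return C1 * math.exp(alpha * shortfall)   # factor is 1 at or beyond the ICAS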

Task ICASs were deduced using the Load Time parameter in Table 2; both hinge on a task's cache footprint. Our Load Time parameter was defined to reflect Obs. 1, which showed that cache isolation can improve a task's WCET by up to 369%. For example, when the Light Load Time distribution is assumed, LLC isolation typically reduces WCETs by 20–50%, while when the Heavy distribution is assumed, the improvement is typically 200–500%. In addition, for all parameter combinations, tasks at Levels A and B tend to be less sensitive to LLC space than those at Level C. This reflects the underlying motivation for \(\textsf {MC}^\textsf {2} \) that Level-A and -B tasks will tend to be rather deterministic fly-weight tasks and that Level-C tasks will tend to be more complex data-intensive tasks (see Sect. 2).

Step 3: Generate \(C_i^3\) by scaling \(C_i^2\) to account for shared DRAM banks. As seen in Fig. 7, the impact of DRAM bank isolation on task execution times tended to range from imperceptible to 20% under small LLC allocations. Based on these results, we choose \(C_i^3\) uniformly from [100, 130)% of \(C_i^2\) to account for the lack of bank isolation. As in the previous step, this step is affected by the task's ICAS and LLC allocation.
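
Analogously to the Step 1 sketch, and with the same caveat that the ICAS- and allocation-dependent modulation is omitted here:

import random

def gen_C3(C2):
    # Step 3 (illustrative): account for the lack of DRAM bank isolation with
    # a factor drawn uniformly from [1.00, 1.30).
    return C2 * random.uniform(1.00, 1.30)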

Step 4: Generate \(C_i^4\) from \(C_i^3\) based on known worst-case shared-cache behavior. When the LLC is shared, cross-core interference may prevent a program from reusing any data in any shared cache block, eliminating any benefit from the LLC in the worst case. We therefore define \(C_i^4\) to equal the value of \(C_i^3\) corresponding to zero allocated LLC space.

Step 5: Generate all Level-B PETs from previously generated Level-A PETs. Using the A-Inflation Factor in Table 2, all Level-B PETs can be computed from corresponding Level-A PETs. This gives us all \(C_i^5\) values.

Step 6: Generate \(C_i^6\) and \(C_i^7\) to reflect expected ACET:WCET ratios under cache isolation and varying background workloads. ACET:WCET ratio trends depend on the given background workload, i.e., the total utilization of all competing tasks. Based on ACET:WCET ratio trends observed for benchmark and synthetic programs under different background-workload utilizations, we identified an appropriate distribution from which to uniformly choose an ACET:WCET ratio for each task. For all tasks, these ratios were chosen uniformly from a range of percentages. For Level-C tasks, these ratios range over 20–40% for the lightest background workloads and over 30–60% for the heaviest. For Level-A and -B tasks, they range over 50–70% for the lightest background workloads and 80–100% for the heaviest. This reflects our assumption that higher-criticality tasks tend to be more deterministic in their execution than Level-C tasks. Note that this process requires a means of calculating the Level-C utilization of a background workload, which depends on the ACETs being generated. This entails an iterative process, such as that in Fig. 24a.

Fig. 24  Two methods for calculating ACETs in Step 6

However, given the scale of our schedulability experiments, which involved millions of task systems, an iterative process was infeasible. As a result, we used the non-iterative process outlined in Fig. 24b. We used the EDF-scheme utilization of the background workload as an upper bound on its average-case utilization.
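
A sketch of the non-iterative variant follows. The ratio ranges are those quoted above; the linear interpolation between the "lightest" and "heaviest" endpoints, driven by the EDF-scheme utilization of the background workload, is an assumed policy, and all names are ours.

import random

def acet_ratio(level, bg_util, max_bg_util):
    # Step 6 (illustrative): sample an ACET:WCET ratio. x = 0 corresponds to
    # the lightest background workload, x = 1 to the heaviest.
    x = min(bg_util / max_bg_util, 1.0)
    if level == "C":
        lo, hi = 0.20 + 0.10 * x, 0.40 + 0.20 * x   # 20-40% ... 30-60%
    else:  # Levels A and B: assumed more deterministic, hence higher ratios
        lo, hi = 0.50 + 0.30 * x, 0.70 + 0.30 * x   # 50-70% ... 80-100%
    return random.uniform(lo, hi)

def gen_acet(C_wcet, level, bg_util, max_bg_util):
    # ACET under cache isolation: the corresponding WCET scaled by the ratio.
    return C_wcet * acet_ratio(level, bg_util, max_bg_util)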

Step 7: Generate \(C_i^8\) to reflect differences between ACETs for a fully unmanaged system and ACETs for a cache-isolated system. From Fig. 7 and Obs. 8, we see that ACETs in an unmanaged cache decline roughly linearly as the allocated LLC space increases, even beyond the ICAS of the task. However, these ACETs generally remain higher than ACETs under cache isolation. When the LLC allocation is zero, both ACETs are the same, since LLC management does not affect tasks bypassing the LLC. To reflect this behavior, we generated \(C_i^8\) as shown in Fig. 25. On the right axis, we depict a scale showing the range of \(C_i^7\)'s reduction in value as the allocated LLC space increases. On this scale, \(C_i^7\) is at 0% reduction under zero allocated LLC space, and at 100% under maximum allocated LLC space. \(C_i^8\) at maximum allocated LLC space for the Matrix program would fall at approximately 50% on this scale. For each generated task, we choose a value from 30 to 70% on this scale for our generated \(C_i^8\) at maximum LLC space.

Fig. 25  Comparison of \(C_i^7\) and \(C_i^8\) for a generated task

At zero allocated LLC space, \(C_i^8\) matches \(C_i^7\). For all other LLC allocations, we interpolate \(C_i^8\) linearly between the values generated for zero and for maximum allocated LLC space.
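
Written out, the interpolation might look as follows (the 30–70% placement at maximum LLC space comes from the text; the variable names and the use of the number of allocated ways as the interpolation variable are our assumptions):

import random

def gen_C8(C7_at_zero, C7_at_max, llc_ways, max_ways):
    # Step 7 (illustrative): C_i^8 equals C_i^7 at zero allocated LLC space;
    # at the maximum allocation it realizes only 30-70% of C_i^7's total
    # reduction; in between we interpolate linearly in the allocated space.
    placement = random.uniform(0.30, 0.70)
    C8_at_max = C7_at_zero - placement * (C7_at_zero - C7_at_max)
    return C7_at_zero + (llc_ways / max_ways) * (C8_at_max - C7_at_zero)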

From these steps, we now have all required PETs. We note once again that this process constitutes a model for producing PETs. As such, all claims resulting from our schedulability experiments apply only within the context provided by this model. Still, we have taken great pains to ensure that the range of PETs generated by this model encompasses those we have seen in real measurement data, and that trends among related PETs for the same task correspond to those seen in our measurement data.

Cite this article

Kim, N., Ward, B.C., Chisholm, M. et al. Attacking the one-out-of-m multicore problem by combining hardware management with mixed-criticality provisioning. Real-Time Syst 53, 709–759 (2017). https://doi.org/10.1007/s11241-017-9272-9
