Abstract
In embedded systems such as automotive systems, multi-core processors are expected to improve performance and reduce manufacturing cost by integrating multiple functions on a single chip. However, inter-core interference in the shared last-level cache (LLC) results in increased and unpredictable execution times for time-sensitive tasks (TSTs), which have (soft) timing constraints, thereby increasing the deadline miss rates of such systems. In this paper, we propose a time-sensitivity-aware dead block-based shared LLC architecture to mitigate these problems. First, a time-sensitivity indication bit is added to each cache block, which allows the proposed LLC architecture to be aware of instructions/data belonging to TSTs. Second, portions of the LLC space are allocated to general tasks without interfering with TSTs by a time-sensitivity-aware dead block-based cache partitioning technique. Third, to further reduce the deadline miss rate of TSTs, we propose a task-matching and cache-partitioning scheme for shared caches that considers the memory access characteristics and time-sensitivity of tasks (TATS). TATS is combined with our proposed dead block-based scheme. Our evaluation shows that the proposed schemes reduce the deadline miss rates of TSTs compared to conventional shared caches. On a dual-core system, compared to a baseline, equal partitioning, and state-of-the-art quality-of-service-aware cache partitioning, our proposed dead block-based cache partitioning provides 9.3%, 30.5%, and 2.6% lower average deadline miss rates, respectively. On a quad-core system, compared to the same three configurations, the combination of our proposed schemes provides 21.2%, 17.7%, and 4.1% lower average deadline miss rates, respectively.
Notes
In this study, we do not consider applications with hard timing constraints, such as engine control and powertrain, which impose very strict timing requirements. We focus on applications that can tolerate some deadline misses, which are called time-sensitive applications in this paper and in [12].
The task categorization depends on the timing requirement of the task. If a task is executed periodically and has a (soft) deadline, it is categorized as a TST. For example, the decoding task of a video application that requires a frame rate of 30 fps can be considered a TST because it is executed periodically and must complete within 33 ms to achieve the required frame rate. On the other hand, if a task has no deadline but requires higher overall performance, it is categorized as a GT.
In this paper, performance predictability is defined as the inverse of the difference between the shortest and longest execution times of a task. Higher performance predictability indicates smaller performance variation and better tail performance for the task.
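The definition above can be written compactly as follows (using \(t_{\min}\) and \(t_{\max}\) for the shortest and longest observed execution times of a task; the notation is ours, not from the original text):

```latex
\mathrm{Predictability} = \frac{1}{t_{\max} - t_{\min}}
```

A task whose execution times cluster tightly (small \(t_{\max} - t_{\min}\)) thus has high predictability.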
We do not consider tasks having hard timing constraints in this paper.
We assume high utilization to examine the impact of inter-core cache interference. High utilization is preferred for lower manufacturing cost of the system.
In this study, we allocate equal size cache partitions to TSTs. However, the cache space is not wasted because the actual cache partitioning is dynamically done during runtime. When some tasks are idle, the other tasks can occupy their cache space.
The parameters are also used in Algorithms 3 and 4.
This is because the maximum number of parallel tasks in a multi-core system is equal to the number of cores. The maximum number of groups is equal to that of cores when only TSTs are running.
If the number of cache partitions is not evenly divisible by the number of current groups (e.g., 16 partitions shared by 3 groups), the remaining partitions are randomly allocated to the groups.
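The remainder-handling rule above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name and dictionary representation are our own.

```python
import random

def allocate_partitions(num_partitions, groups):
    """Evenly split cache partitions among groups; any leftover
    partitions are handed to randomly chosen groups, one each."""
    base = num_partitions // len(groups)
    alloc = {g: base for g in groups}
    remainder = num_partitions - base * len(groups)
    for g in random.sample(groups, remainder):
        alloc[g] += 1
    return alloc

# e.g., 16 partitions shared by 3 groups -> one group gets the extra partition
sizes = sorted(allocate_partitions(16, ["A", "B", "C"]).values())
print(sizes)  # [5, 5, 6]
```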
The number of cores used in this paper is at most 4; therefore, a 2-bit group field is used. Even with 64 cores, only 6 bits would be needed.
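The group-field width follows directly from the core count, since the number of groups never exceeds the number of cores. A minimal sketch of the relationship (function name ours):

```python
import math

def group_field_bits(num_cores):
    """Bits needed to encode a group ID when there is
    at most one group per core."""
    return max(1, math.ceil(math.log2(num_cores)))

print(group_field_bits(4))   # 2 bits, as used in the paper
print(group_field_bits(64))  # 6 bits
```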
\(t_{CL}\): CAS (Column Address Strobe) latency, \(t_{RCD}\): row address to column address delay, \(t_{RP}\): row precharge time.
In the experiments, we used configurations that fit the working sets of the MiBench benchmarks and that model inter-core cache interference under harsh conditions.
In this paper, we profile the benchmarks with a halved LLC. For more partitions, one can profile the tasks with LLCs partitioned into more than two segments. Nevertheless, halving the LLC space can be a good estimator of the performance sensitivities of tasks. A similar categorization is used in [46].
To estimate the probability density of the execution times, kernel density estimation (KDE) is applied to the data. For the kernel function, we used the normal (Gaussian) kernel, which places a standard normal kernel at each sampled execution time and averages them.
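A Gaussian KDE of this kind can be sketched with NumPy alone. This is an illustrative sketch under our own assumptions: the function name, the sample values, and the bandwidth are hypothetical, not from the paper.

```python
import numpy as np

def gaussian_kde(samples, xs, bandwidth):
    """Kernel density estimate with a normal (Gaussian) kernel:
    the average of a Gaussian bump centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    z = (xs[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical execution-time samples (ms) and an evaluation grid
times = [31.0, 32.5, 33.1, 34.0, 36.2]
xs = np.linspace(28.0, 40.0, 241)
density = gaussian_kde(times, xs, bandwidth=1.0)

# The estimated density integrates to approximately 1 over the grid
print(round(float((density * (xs[1] - xs[0])).sum()), 2))
```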
References
Anderson JH, Bud V, Devi UC (2005) An EDF-based scheduling algorithm for multiprocessor soft real-time systems. In: Proceedings of the 17th Euromicro Conference on Real-Time Systems
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7
Bui BD, Caccamo M, Sha L, Martinez J (2008) Impact of cache partitioning on multi-tasking real time embedded systems. In: Embedded and Real-Time Computing Systems and Applications
Calandrino JM, Anderson JH (2008) Cache-aware real-time scheduling on multicore platforms: Heuristics and a case study. In: Euromicro Conference on Real-Time Systems, 2008. ECRTS’08. IEEE, pp 299–308
Chang J, Sohi GS (2014) Cooperative cache partitioning for chip multiprocessors. In: ACM International Conference on Supercomputing 25th Anniversary Volume
Chiou D, Jain P, Devadas S, Rudolph L (2000) Dynamic cache partitioning via columnization. In: DAC
Chisholm M, Kim N, Ward BC, Otterness N, Anderson JH, Smith FD (2016) Reconciling the tension between hardware isolation and data sharing in mixed-criticality, multicore systems. In: RTSS
Ding H, Liang Y, Mitra T (2012) WCET-centric partial instruction cache locking. In: DAC
Ding H, Liang Y, Mitra T (2013) Integrated instruction cache analysis and locking in multitasking real-time systems. In: DAC
Ebert C, Favaro J (2017) Automotive software. IEEE Softw 34(3):33–39
El-Sayed N, Mukkara A, Tsai PA, Kasture H, Ma X, Sanchez D (2018) KPart: a hybrid cache partitioning-sharing technique for commodity multicores. In: HPCA
Goel A, Abeni L, Krasic C, Snow J, Walpole J (2002) Supporting time-sensitive applications on a commodity OS. SIGOPS Oper Syst Rev 36(SI):165–180
Guan N, Stigge M, Yi W, Yu G (2009) Cache-aware scheduling and analysis for multicores. In: Proceedings of the Seventh ACM International Conference on Embedded Software, ACM, pp 245–254
Guo F, Solihin Y, Zhao L, Iyer R (2010) Quality of service shared cache management in chip multiprocessor architecture. ACM Trans Archit Code Optim 7(3):14
Herdrich A, Verplanke E, Autee P, Illikkal R, Gianos C, Singhal R, Iyer R (2016) Cache QoS: from concept to reality in the Intel Xeon processor E5-2600 v3 product family. In: HPCA
Iyer R (2004) CQoS: A framework for enabling QoS in shared caches of CMP platforms. In: Proceedings of the 18th Annual International Conference on Supercomputing, ICS ’04
Jaleel A, Theobald KB, Steely SC Jr, Emer J (2010) High performance cache replacement using re-reference interval prediction (RRIP). In: ISCA
Kaxiras S, Hu Z, Martonosi M (2001) Cache decay: exploiting generational behavior to reduce cache leakage power. In: ACM SIGARCH Computer Architecture News
Kern D, Schmidt A (2009) Design space for driver-based automotive user interfaces. In: AutomotiveUI
Kim H, Rajkumar RR (2018) Predictable shared cache management for multi-core real-time virtualization. TECS 17(1):22
Kim H, Kandhalu A, Rajkumar R (2013) A coordinated approach for practical OS-level cache management in multi-core real-time systems. In: 2013 25th Euromicro Conference on Real-Time Systems
Kim S, Chandra D, Solihin Y (2004) Fair cache sharing and partitioning in a chip multiprocessor architecture. In: PACT
Kirk D, Strosnider J (1990) SMART (strategic memory allocation for real-time) cache design using the MIPS R3000. In: RTSS
Lesage B, Puaut I, Seznec A (2012) PRETI: partitioned real-time shared cache for mixed-criticality real-time systems. In: Proceedings of the 20th International Conference on Real-Time and Network Systems
Lin J, Lu Q, Ding X, Zhang Z, Zhang X, Sadayappan P (2008) Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In: HPCA
Liu T, Li M, Xue CJ (2009) Instruction cache locking for real-time embedded systems with multi-tasks. In: RTCSA
Lo D, Song T, Suh GE (2015) Prediction-guided performance-energy trade-off for interactive applications. In: MICRO
Manikantan R, Rajan K, Govindarajan R (2012) Probabilistic shared cache management (PriSM). In: ISCA
Paolieri M, Quiñones E, Cazorla FJ, Bernat G, Valero M (2009) Hardware support for WCET analysis of hard real-time multicore systems. In: ISCA
Paolieri M, Quiñones E, Cazorla FJ, Bernat G, Valero M (2009) Hardware support for WCET analysis of hard real-time multicore systems. In: ACM SIGARCH Computer Architecture News, pp 57–68
Puaut I, Decotigny D (2002) Low-complexity algorithms for static cache locking in multitasking hard real-time systems. In: RTSS
Puaut I, Pais C (2007) Scratchpad memories vs. locked caches in hard real-time systems: a quantitative comparison. In: DATE
Qureshi M, Patt Y (2006) Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In: MICRO
Rafique N, Lim WT, Thottethodi M (2006) Architectural support for operating system-driven CMP cache management. In: PACT
Sanchez D, Kozyrakis C (2011) Vantage: scalable and efficient fine-grain cache partitioning. SIGARCH Comput Archit News 39(3):57–68
Sangiovanni-Vincentelli A, Di Natale M (2007) Embedded system design for automotive applications. Computer 40(10):42–51
Srikantaiah S, Kandemir M, Wang Q (2009) SHARP control: controlled shared cache management in chip multiprocessors. In: MICRO
Subramanian L, Seshadri V, Ghosh A, Khan S, Mutlu O (2015) The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory. In: Proceedings of the 48th International Symposium on Microarchitecture, pp 62–75
Suh GE, Rudolph L, Devadas S (2004) Dynamic partitioning of shared cache memory. J Supercomput 28(1):7–26
Tam D, Azimi R, Stumm M (2007) Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. ACM SIGOPS Oper Syst Rev 41:47–58
Usui H, Subramanian L, Chang KKW, Mutlu O (2016) DASH: deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators. ACM Trans Archit Code Optim 12(4):65
Vasilios K, Georgios K, Nikolaos V (2018) Combining software cache partitioning and loop tiling for effective shared cache management. ACM Trans Embedded Comput Syst (TECS) 17(3):72
Wang X, Chen S, Setter J, Martínez JF (2017) Swap: Effective fine-grain management of shared last-level caches with minimum hardware support. In: HPCA
Ward B, Herman J, Kenna C, Anderson J (2013) Making shared caches more predictable on multicore platforms. In: ECRTS
Wilhelm R, Engblom J, Ermedahl A, Holsti N, Thesing S, Whalley D, Bernat G, Ferdinand C, Heckmann R, Mitra T, Mueller F, Puaut I, Puschner P, Staschulat J, Stenström P (2008) The worst-case execution-time problem: overview of methods and survey of tools. ACM Trans Embed Comput Syst 7(3):36
Xie Y, Loh G (2008) Dynamic classification of program memory behaviors in CMPs. In: 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects
Xie Y, Loh GH (2009) PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. SIGARCH Comput Archit News 37(3):174–183
Xu C, Rajamani K, Ferreira A, Felter W, Rubio J, Li Y (2018) dCat: dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service. In: EuroSys
Ye Y, West R, Cheng Z, Li Y (2014) Coloris: a dynamic cache partitioning system using page coloring. In: 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT)
Acknowledgements
This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean Government (2018R1A2B2005277).
Lee, M., Kim, S. Time-sensitivity-aware shared cache architecture for multi-core embedded systems. J Supercomput 75, 6746–6776 (2019). https://doi.org/10.1007/s11227-019-02891-w