Abstract
The accurate and quantitative analysis of cache behavior on a chip multiprocessor (CMP) has long been a challenging task. So far there has been no practical way to predict the cache allocation, i.e., the cache size allocated to a running program. For many applications, especially those that interact heavily with users, the cache allocation should be estimated with high accuracy, because its variation is closely related to the stability of system performance, which is important to the efficient operation of servers and strongly influences user experience. For these reasons, this paper proposes an accurate model for predicting the last level cache (LLC) allocation of co-runners. Using the predicted cache allocation, we further implement a performance-stability-oriented co-runner scheduling algorithm that aims to maximize the number of co-runners running in a performance-stable state and to minimize the performance variation of the unstable ones. We demonstrate that the proposed prediction algorithm is highly accurate, with an average error of 5.7 %, and that the co-runner scheduling algorithm finds the optimal solution for the specified target with a time complexity of O(n).
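
The abstract does not say how the 5.7 % average error is computed. The sketch below assumes it is the mean absolute relative error between the predicted and the measured LLC allocation of each co-runner; this is an assumption for illustration, not the authors' stated definition, and the function name and sample numbers are hypothetical.

```python
from typing import Sequence


def mean_relative_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Average of |predicted - measured| / measured over all co-runners.

    Assumption: "average error" is read here as the mean absolute relative
    error of the predicted LLC allocation; the abstract does not define it.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if len(measured) == 0:
        raise ValueError("need at least one co-runner")
    total = 0.0
    for p, m in zip(predicted, measured):
        if m <= 0:
            raise ValueError("measured allocation must be positive")
        # Relative error of one co-runner's predicted cache allocation.
        total += abs(p - m) / m
    return total / len(measured)


# Hypothetical example: predicted vs. measured LLC allocation (MB) of four
# co-runners sharing one cache. The numbers are invented for illustration.
predicted_mb = [2.1, 3.8, 1.0, 1.1]
measured_mb = [2.0, 4.0, 1.0, 1.0]
print(f"{mean_relative_error(predicted_mb, measured_mb):.1%}")  # prints 5.0%
```

A relative rather than absolute error is used in this sketch so that co-runners with small and large allocations contribute on the same scale.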

Acknowledgments
This work was supported by the Huawei Innovation Research Program (HIPRO, Grant No. YB2015080028).
About this article
Cite this article
Wang, F., Gao, X. & Chen, G. Lowering the volatility: a practical cache allocation prediction and stability-oriented co-runner scheduling algorithms. J Supercomput 72, 1126–1151 (2016). https://doi.org/10.1007/s11227-016-1645-7
DOI: https://doi.org/10.1007/s11227-016-1645-7