Performance Metrics and Models for Shared Cache

Ding, Chen; Xiang, Xiaoya; Bao, Bin; Luo, Hao; Luo, Ying-Wei; Wang, Xiao-Lin

doi:10.1007/s11390-014-1460-7

Survey
Published: 04 July 2014

Volume 29, pages 692–712 (2014)

Journal of Computer Science and Technology

Chen Ding¹,
Xiaoya Xiang¹,
Bin Bao¹,
Hao Luo¹,
Ying-Wei Luo² &
Xiao-Lin Wang²

Abstract

Performance metrics and models are prerequisites for scientific understanding and optimization. This paper introduces a new footprint-based theory and reviews the research of the past four decades that led to it. The review groups the past work into metrics and their models, in particular those of the reuse distance; metrics conversion; models of shared cache; performance and optimization; and other related techniques.

Author information

Authors and Affiliations

Department of Computer Science, University of Rochester, Rochester, NY, 14627-0226, U.S.A.
Chen Ding, Xiaoya Xiang, Bin Bao & Hao Luo
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Ying-Wei Luo & Xiao-Lin Wang

Corresponding author

Correspondence to Chen Ding.

Additional information

The work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61232008, the NSFC Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao under Grant No. 61328201, the National Science Foundation of USA under Contract Nos. CNS-1319617, CCF-1116104, CCF-0963759, an IBM CAS Faculty Fellowship and a research grant from Huawei. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding organizations.

Xiang has graduated and is now working at Twitter Inc. Bao has graduated and is now working at Qualcomm Inc.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 84 kb)

About this article

Cite this article

Ding, C., Xiang, X., Bao, B. et al. Performance Metrics and Models for Shared Cache. J. Comput. Sci. Technol. 29, 692–712 (2014). https://doi.org/10.1007/s11390-014-1460-7

Received: 01 March 2014
Revised: 14 May 2014
Published: 04 July 2014
Issue Date: July 2014
DOI: https://doi.org/10.1007/s11390-014-1460-7
