Abstract
Data access delay has become the prominent performance bottleneck of high-end computing systems. The key to reducing data access delay in system design is to diminish data stall time. Memory locality and concurrency are the two essential factors influencing the performance of modern memory systems. However, existing studies on reducing data stall time rarely focus on exploiting data access concurrency, because the impact of memory concurrency on overall memory system performance is not well understood. In this study, a pair of novel data stall time models is presented: the L-C model for the combined effect of locality and concurrency, and the P-M model for the effect of pure misses on data stall time. The models provide a new understanding of data access delay and new directions for performance optimization. Based on these models, a summary table of advanced cache optimizations is presented. It has 38 entries contributed by data concurrency but only 21 contributed by data locality, which shows the value of data concurrency. The L-C and P-M models, together with the associated results and opportunities introduced in this study, are important and necessary for future data-centric architecture and algorithm design of modern computing systems.
Additional information
Special Section on Applications and Industry
The work was supported in part by the National Science Foundation of USA under Grant Nos. CNS-1162540, CCF-0937877, and CNS-0751200.
About this article
Cite this article
Liu, YH., Sun, XH. Reevaluating Data Stall Time with the Consideration of Data Access Concurrency. J. Comput. Sci. Technol. 30, 227–245 (2015). https://doi.org/10.1007/s11390-015-1517-2
DOI: https://doi.org/10.1007/s11390-015-1517-2