DOI: 10.1145/1064212.1064232
Article

Fast data-locality profiling of native execution

Published: 06 June 2005

Abstract

Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insights and flexibility and model many different cache configurations, but have the drawback of large run-time overhead.

We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view.

We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
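
The abstract does not spell out the statistical model. As a rough illustration of how sparse reuse-distance samples could yield a miss-ratio estimate for many cache sizes, the sketch below assumes a fully associative cache with random replacement, in which a resident line survives each replacement with probability 1 - 1/L. The function name, its parameters, and the fixed-point iteration are illustrative assumptions, not StatCache's actual implementation, and cold misses are ignored.

```python
def estimate_miss_ratio(reuse_distances, cache_lines, max_iter=1000, tol=1e-12):
    """Estimate the miss ratio of a random-replacement cache (hypothetical helper).

    reuse_distances: sampled counts of memory accesses between two touches
                     of the same cache line (assumed input format).
    cache_lines:     number of lines L in the modeled cache.
    """
    if not reuse_distances:
        raise ValueError("need at least one reuse-distance sample")

    survive = 1.0 - 1.0 / cache_lines  # chance a line survives one replacement
    ratio = 1.0                        # start from the pessimistic end

    for _ in range(max_iter):
        # With miss ratio `ratio`, about ratio * d replacements happen during
        # a reuse distance d, so the reused line is gone with probability
        # 1 - survive ** (ratio * d). Average that over all samples.
        new_ratio = sum(1.0 - survive ** (ratio * d)
                        for d in reuse_distances) / len(reuse_distances)
        if abs(new_ratio - ratio) < tol:
            return new_ratio
        ratio = new_ratio
    return ratio


if __name__ == "__main__":
    # Made-up sample: half the accesses are reused quickly, half much later.
    samples = [10] * 50 + [5000] * 50
    for lines in (256, 1024, 4096):
        print(lines, estimate_miss_ratio(samples, lines))
```

Because the samples are independent of the cache size, the same profile can be re-evaluated for any number of cache lines without rerunning the application, which is the kind of flexibility the abstract attributes to model-based tools.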

Published In

SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
June 2005
428 pages
ISBN: 1595930221
DOI: 10.1145/1064212
  • ACM SIGMETRICS Performance Evaluation Review, Volume 33, Issue 1
    June 2005
    417 pages
    ISSN: 0163-5999
    DOI: 10.1145/1071690
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache behavior
  2. profiling tool

Qualifiers

  • Article

Conference

SIGMETRICS05

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%

Cited By

  • (2024) GPU Scale-Model Simulation. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1125-1140. DOI: 10.1109/HPCA57654.2024.00088. Online publication date: 2-Mar-2024
  • (2023) FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance. Proceedings of the 37th International Conference on Supercomputing, pp. 25-36. DOI: 10.1145/3577193.3593740. Online publication date: 21-Jun-2023
  • (2022) MemSweeper: virtualizing cluster memory management for high memory utilization and isolation. Proceedings of the 2022 ACM SIGPLAN International Symposium on Memory Management, pp. 15-28. DOI: 10.1145/3520263.3534651. Online publication date: 14-Jun-2022
  • (2020) OSCA. Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, pp. 785-798. DOI: 10.5555/3489146.3489200. Online publication date: 15-Jul-2020
  • (2019) Building a Polyhedral Representation from an Instrumented Execution. ACM Transactions on Architecture and Code Optimization 16:4, pp. 1-26. DOI: 10.1145/3363785. Online publication date: 17-Dec-2019
  • (2019) DynaSprint. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 426-439. DOI: 10.1145/3352460.3358301. Online publication date: 12-Oct-2019
  • (2019) Directed Statistical Warming through Time Traveling. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1037-1049. DOI: 10.1145/3352460.3358264. Online publication date: 12-Oct-2019
  • (2019) Beating OPT with Statistical Clairvoyance and Variable Size Caching. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 243-256. DOI: 10.1145/3297858.3304067. Online publication date: 4-Apr-2019
  • (2018) A Performance Prediction Framework for Irregular Applications. 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pp. 304-313. DOI: 10.1109/HiPC.2018.00042. Online publication date: Dec-2018
  • (2017) Enhancing the Malloc System with Pollution Awareness for Better Cache Performance. IEEE Transactions on Parallel and Distributed Systems 28:3, pp. 731-745. DOI: 10.1109/TPDS.2016.2587644. Online publication date: 1-Mar-2017
