DOI: 10.1145/3330345.3330371
research-article

Can we trust profiling results?: understanding and fixing the inaccuracy in modern profilers

Published: 26 June 2019

Abstract

Profilers are an indispensable component in the modern software stack of data centers and supercomputers. Profilers collect detailed performance data during program execution and guide code optimization across the entire software stack. The accuracy of profiling results is vital for gaining performance insights effectively. Unfortunately, inaccuracy may arise from measurement techniques or hardware limits, and it can waste optimization effort.
However, few studies evaluate the accuracy of modern profiling techniques. In this paper, we study statistical sampling based on performance monitoring units (PMUs), a profiling technique widely adopted by state-of-the-art profilers. While PMU sampling based profilers collect performance data efficiently, they suffer from inaccurate instruction measurement due to an intrinsic limit in the PMU design. To understand and fix the instruction profiling inaccuracy, we propose a novel 3-step approach. First, we investigate multiple modern architectures and quantify their PMU instruction profiling inaccuracy with mathematical modeling. Second, we design a systematic framework to evaluate the impact of PMU inaccuracy on the profiling results. Finally, we propose a software-based technique to rectify the measurement inaccuracy introduced by the PMU and demonstrate its effectiveness.
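
To make the notion of instruction profiling inaccuracy concrete, the following is a minimal sketch and not the paper's model or framework. It assumes exact per-block instruction counts are available (for example from instrumentation) and compares them with a sampled profile in which each sample stands for `period` retired instructions. The function name `profile_error`, the block labels, and both error metrics are assumptions chosen for illustration.

```python
def profile_error(true_counts, sample_counts, period):
    """Compare a sampled instruction profile against ground truth.

    true_counts:   {block: exact instructions retired} (assumed non-empty)
    sample_counts: {block: number of samples attributed to the block}
    period:        retired instructions represented by one sample
    Returns ({block: relative error}, normalized half-L1 distance).
    """
    blocks = set(true_counts) | set(sample_counts)
    # Scale sample counts up to an instruction-count estimate per block.
    estimated = {b: sample_counts.get(b, 0) * period for b in blocks}

    rel_err = {}
    for b in blocks:
        truth = true_counts.get(b, 0)
        if truth:
            rel_err[b] = (estimated[b] - truth) / truth
        else:
            # Samples landed on a block that never executed any instruction.
            rel_err[b] = float("inf") if estimated[b] else 0.0

    total = sum(true_counts.values())
    # When the totals agree, half the L1 distance is the fraction of
    # instructions credited to the wrong block.
    l1 = sum(abs(estimated[b] - true_counts.get(b, 0)) for b in blocks)
    return rel_err, l1 / (2 * total)


# Hypothetical data: 10 samples at a period of 1000 instructions.
truth = {"loop": 9000, "call": 1000}
samples = {"loop": 8, "call": 2}
rel_err, misattributed = profile_error(truth, samples, period=1000)
```

In this made-up case the cold block `call` is overestimated by a factor of two while the hot block `loop` is underestimated by about 11%, which is the kind of skew that can send optimization effort to the wrong place.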

      Information & Contributors

      Information

      Published In

      ICS '19: Proceedings of the ACM International Conference on Supercomputing
      June 2019
      533 pages
      ISBN:9781450360791
      DOI:10.1145/3330345
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. PMU
      2. accuracy
      3. call path profiling
      4. statistical sampling

      Qualifiers

      • Research-article

      Conference

      ICS '19

      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%

      Cited By

      • (2024) Making Sense of Multi-threaded Application Performance at Scale with NonSequitur. Proceedings of the ACM on Programming Languages, 8:OOPSLA2 (2325-2354). DOI: 10.1145/3689793. Online publication date: 8-Oct-2024
      • (2024) Survival Prediction Across Diverse Cancer Types Using Neural Networks. Proceedings of the 2024 7th International Conference on Machine Vision and Applications (134-138). DOI: 10.1145/3653946.3653966. Online publication date: 12-Mar-2024
      • (2024) Stale Profile Matching. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction (162-173). DOI: 10.1145/3640537.3641573. Online publication date: 17-Feb-2024
      • (2024) Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (996-1005). DOI: 10.1109/SCW63240.2024.00139. Online publication date: 17-Nov-2024
      • (2024) Event Monitor Validation in High-Integrity Systems. 2024 27th Euromicro Conference on Digital System Design (DSD) (394-402). DOI: 10.1109/DSD64264.2024.00059. Online publication date: 28-Aug-2024
      • (2024) OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (373-385). DOI: 10.1109/CGO57630.2024.10444771. Online publication date: 2-Mar-2024
      • (2023) Investigation of Creating Accessibility Linked Data Based on Publicly Available Accessibility Datasets. Proceedings of the 2023 13th International Conference on Communication and Network Security (77-81). DOI: 10.1145/3638782.3638794. Online publication date: 6-Dec-2023
      • (2023) Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison. IEEE Transactions on Parallel and Distributed Systems, 34:5 (1594-1608). DOI: 10.1109/TPDS.2023.3257105. Online publication date: May-2023
      • (2023) DDoS Attack Dataset (CICEV2023) against EV Authentication in Charging Infrastructure. 2023 20th Annual International Conference on Privacy, Security and Trust (PST) (1-9). DOI: 10.1109/PST58708.2023.10320202. Online publication date: 21-Aug-2023
      • (2022) Profile inference revisited. Proceedings of the ACM on Programming Languages, 6:POPL (1-24). DOI: 10.1145/3498714. Online publication date: 12-Jan-2022