ParLoT: Efficient Whole-Program Call Tracing for HPC Applications

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11027)

Abstract

The complexity of HPC software and hardware is quickly increasing. As a consequence, the need for efficient execution tracing to gain insight into HPC application behavior is steadily growing. Unfortunately, available tools either do not produce traces with enough detail or incur large overheads. An efficient tracing method that overcomes the tradeoff between maximum information and minimum overhead is therefore urgently needed. This paper presents such a method and tool, called ParLoT, with the following key features. (1) It describes a technique that makes low-overhead on-the-fly compression of whole-program call traces feasible. (2) It presents a new, efficient, incremental trace-compression approach that reduces the trace volume dynamically, which lowers not only the needed bandwidth but also the tracing overhead. (3) It collects all caller/callee relations, call frequencies, call stacks, as well as the full trace of all calls and returns executed by each thread, including in library code. (4) It works on top of existing dynamic binary instrumentation tools, thus requiring neither source-code modifications nor recompilation. (5) It supports program analysis and debugging at the thread, thread-group, and program level. This paper establishes that comparable capabilities are currently unavailable. Our experiments with the NAS parallel benchmarks running on the Comet supercomputer with up to 1,024 cores show that ParLoT can collect whole-program function-call traces at an average tracing bandwidth of just 56 kB/s per core.
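
To make the approach concrete, below is a minimal sketch of a Pin-based call/return tracer in the spirit of ParLoT. It is not ParLoT's actual implementation: the output file name, the single global lock, and the plain first-seen ID dictionary (a simplistic stand-in for ParLoT's incremental compressor) are illustrative assumptions.

```cpp
// Minimal Pin-tool sketch (not ParLoT's source): trace every call and return
// per thread, encoding each call target as a compact incremental ID.
#include "pin.H"
#include <fstream>
#include <map>

static std::ofstream Out;              // trace sink (ParLoT streams compressed data instead)
static std::map<ADDRINT, UINT32> Ids;  // first-seen call targets get small incremental IDs
static PIN_LOCK Lock;

// Analysis routine: runs before every call instruction.
static VOID OnCall(THREADID tid, ADDRINT target) {
    PIN_GetLock(&Lock, tid + 1);
    std::map<ADDRINT, UINT32>::iterator it = Ids.find(target);
    UINT32 id;
    if (it == Ids.end()) {
        id = (UINT32)Ids.size();       // incremental dictionary: new symbol on first sight
        Ids[target] = id;
    } else {
        id = it->second;
    }
    Out << tid << " C " << id << "\n"; // "C" = call event
    PIN_ReleaseLock(&Lock);
}

// Analysis routine: runs before every return instruction.
static VOID OnRet(THREADID tid) {
    PIN_GetLock(&Lock, tid + 1);
    Out << tid << " R\n";              // "R" = return event
    PIN_ReleaseLock(&Lock);
}

// Instrumentation-time callback: attach analysis calls to calls and returns.
static VOID Instruction(INS ins, VOID*) {
    if (INS_IsCall(ins)) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)OnCall,
                       IARG_THREAD_ID, IARG_BRANCH_TARGET_ADDR, IARG_END);
    } else if (INS_IsRet(ins)) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)OnRet,
                       IARG_THREAD_ID, IARG_END);
    }
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    PIN_InitLock(&Lock);
    Out.open("calltrace.txt");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();                // never returns
    return 0;
}
```

The dictionary captures the essence of incremental compression: each call target costs a full address only once, and every later occurrence is encoded as a small integer. ParLoT goes further, keeping per-thread traces and compressing them on the fly before they are written out, which is what keeps the required bandwidth low.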

Notes

  1. Given the absence of tools directly comparable to ParLoT, we employ Callgrind as a "close-enough" tool in the comparisons elaborated in Sect. 4.3. In this capacity, Callgrind is similar to ParLoT(m), a variant of ParLoT that only collects traces from the main image. We perform this comparison to gauge how ParLoT fares relative to at least one other tool; in Sect. 5, we additionally present a separate self-assessment of ParLoT. A hypothetical sketch of the main-image restriction appears below.
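
As a hypothetical illustration of the ParLoT(m) restriction, a Pin tool can skip everything outside the main executable image at instrumentation time. The callback below assumes the tracer sketched earlier; the filter itself is an assumption about how such a variant could be built, not ParLoT's actual code.

```cpp
// Hypothetical main-image filter illustrating the ParLoT(m) idea: only code in
// the main executable is instrumented; shared-library code is skipped.
static VOID Instruction(INS ins, VOID*) {
    IMG img = IMG_FindByAddress(INS_Address(ins));   // image containing this instruction
    if (!IMG_Valid(img) || !IMG_IsMainExecutable(img))
        return;                                      // library code: leave uninstrumented
    // ... attach the call/return analysis routines exactly as before ...
}
```

For the Callgrind side of such a comparison, the standard invocation `valgrind --tool=callgrind ./app` likewise records caller/callee relations and call counts for the application, which makes it the closest available baseline.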

Acknowledgment

This research was supported by NSF Awards CCF 1439002 and CCF 1817073. We thank our colleague Dr. Hari Sundar of the University of Utah, who provided insight and expertise that greatly assisted this research. We also thank the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC) for providing the infrastructure on which we ran our experiments.

Author information

Correspondence to Ganesh Gopalakrishnan or Martin Burtscher.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Taheri, S., Devale, S., Gopalakrishnan, G., Burtscher, M. (2019). ParLoT: Efficient Whole-Program Call Tracing for HPC Applications. In: Bhatele, A., Boehme, D., Levine, J., Malony, A., Schulz, M. (eds) Programming and Performance Visualization Tools. ESPT 2017, ESPT 2018, VPA 2017, VPA 2018. Lecture Notes in Computer Science, vol 11027. Springer, Cham. https://doi.org/10.1007/978-3-030-17872-7_10

  • DOI: https://doi.org/10.1007/978-3-030-17872-7_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17871-0

  • Online ISBN: 978-3-030-17872-7

  • eBook Packages: Computer Science (R0)
