Abstract
The complexity of HPC software and hardware is quickly increasing. As a consequence, the need for efficient execution tracing to gain insight into HPC application behavior is steadily growing. Unfortunately, available tools either do not produce traces with enough detail or incur large overheads. An efficient tracing method that overcomes the tradeoff between maximum information and minimum overhead is therefore urgently needed. This paper presents such a method and tool, called ParLoT, with the following key features. (1) It describes a technique that makes low-overhead on-the-fly compression of whole-program call traces feasible. (2) It presents a new, efficient, incremental trace-compression approach that reduces the trace volume dynamically, which lowers not only the needed bandwidth but also the tracing overhead. (3) It collects all caller/callee relations, call frequencies, call stacks, as well as the full trace of all calls and returns executed by each thread, including in library code. (4) It works on top of existing dynamic binary instrumentation tools, thus requiring neither source-code modifications nor recompilation. (5) It supports program analysis and debugging at the thread, thread-group, and program level. This paper establishes that comparable capabilities are currently unavailable. Our experiments with the NAS parallel benchmarks running on the Comet supercomputer with up to 1,024 cores show that ParLoT can collect whole-program function-call traces at an average tracing bandwidth of just 56 kB/s per core.
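The abstract describes compressing each thread's call/return event stream incrementally, as events arrive, so that only compact codes ever need to be written out. The sketch below is purely illustrative and is not ParLoT's actual compressor: it shows the general idea using a simple LZ78-style dictionary scheme (the kind of sequential method cited in the references), applied to a toy call/return trace; all event names are made up.

```python
# Illustrative sketch (not ParLoT's actual algorithm): incremental
# LZ78-style compression of a per-thread call/return event stream.
# Events are encoded as they arrive, so only (dictionary-index, event)
# pairs need to be emitted, reducing trace bandwidth on the fly.

def compress_stream(events):
    """Yield (prefix_index, event) pairs LZ78-style as events arrive."""
    dictionary = {(): 0}          # phrase -> index; empty phrase is 0
    phrase = ()
    for ev in events:
        candidate = phrase + (ev,)
        if candidate in dictionary:
            phrase = candidate    # keep extending the known phrase
        else:
            yield dictionary[phrase], ev
            dictionary[candidate] = len(dictionary)
            phrase = ()
    if phrase:                    # flush any trailing matched phrase
        yield dictionary[phrase[:-1]], phrase[-1]

def decompress(pairs):
    """Invert compress_stream, reconstructing the original event list."""
    phrases = [()]
    out = []
    for idx, ev in pairs:
        phrase = phrases[idx] + (ev,)
        phrases.append(phrase)
        out.extend(phrase)
    return out

# A toy trace: calls (+name) and returns (-name) from one thread.
trace = ["+main", "+solve", "-solve", "+solve", "-solve", "+io", "-io", "-main"]
codes = list(compress_stream(trace))
assert decompress(codes) == trace
```

Because the dictionary grows with the stream, repeated call patterns (such as the `+solve`/`-solve` pair above) are replaced by single index references, which is why repetitive whole-program traces compress well online.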
Notes
- 1. Given the absence of tools comparable to ParLoT, we employ Callgrind as a “close-enough” tool in the comparisons elaborated in Sect. 4.3. In this capacity, Callgrind is similar to ParLoT(m), a variant of ParLoT that only collects traces from the main image. We perform this comparison to gauge how ParLoT fares relative to at least one other tool. In Sect. 5, we also present a separate self-assessment of ParLoT.
References
Aguilar, X., Fürlinger, K., Laure, E.: Online MPI trace compression using event flow graphs and wavelets. Procedia Comput. Sci. 80(Supp. C), 1497–1506 (2016). https://doi.org/10.1016/j.procs.2016.05.471. http://www.sciencedirect.com/science/article/pii/S1877050916309565. International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA
Arnold, D.C., Ahn, D.H., de Supinski, B.R., Lee, G.L., Miller, B.P., Schulz, M.: Stack trace analysis for large scale debugging. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–10 (2007)
Bailey, D.H., et al.: The NAS parallel benchmarks—summary and preliminary results. In: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing 1991, pp. 158–165. ACM, New York (1991). https://doi.org/10.1145/125826.125925
Burtscher, M., Rabeti, H.: A scalable heterogeneous parallelization framework for iterative local searches. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 1289–1298, May 2013. https://doi.org/10.1109/IPDPS.2013.27
Burtscher, M., Mukka, H., Yang, A., Hesaaraki, F.: Real-time synthesis of compression algorithms for scientific data. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, pp. 23:1–23:12. IEEE Press, Piscataway, NJ, USA (2016). http://dl.acm.org/citation.cfm?id=3014904.3014935
Claggett, S., Azimi, S., Burtscher, M.: SPDP: an automatically synthesized lossless compression algorithm for floating-point data. In: 2018 Data Compression Conference (2018)
Coplin, J., Yang, A., Poppe, A., Burtscher, M.: Increasing telemetry throughput using customized and adaptive data compression. In: AIAA SPACE and Astronautics Forum and Exposition (2016)
Freitag, F., Caubet, J., Labarta, J.: On the scalability of tracing mechanisms. In: Monien, B., Feldmann, R. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 97–104. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45706-2_10
Gamblin, T., de Supinski, B.R., Schulz, M., Fowler, R., Reed, D.A.: Scalable load-balance measurement for SPMD codes. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 1–12, November 2008. https://doi.org/10.1109/SC.2008.5222553
Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations, 1st edn. Springer, Secaucus (1997). https://doi.org/10.1007/978-3-642-59830-2
Godin, R., Missaoui, R., Alaoui, H.: Incremental concept formation algorithms based on Galois (concept) lattices. Comput. Intell. 11(2), 246–267 (1995)
Gopalakrishnan, G., et al.: Report of the HPC correctness summit, 25–26 January 2017, Washington, DC. CoRR abs/1705.07478 (2017). http://arxiv.org/abs/1705.07478
Hazelwood, K., Klauser, A.: A dynamic binary instrumentation engine for the ARM architecture. In: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2006, pp. 261–270. ACM, New York (2006). https://doi.org/10.1145/1176760.1176793
Heroux, M.A., et al.: Improving performance via mini-applications. Sandia National Laboratories, Technical report SAND2009-5574 3 (2009)
Intel: Pin - a dynamic binary instrumentation tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
Jurenz, M., Brendel, R., Knüpfer, A., Müller, M., Nagel, W.E.: Memory allocation tracing with VampirTrace. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4488, pp. 839–846. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72586-2_118
Knüpfer, A., et al.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, Tau, and Vampir. In: Brunst, H., Müller, M., Nagel, W., Resch, M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-31476-6_7
Luk, C.K., et al.: Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2005, pp. 190–200. ACM, New York (2005). https://doi.org/10.1145/1065010.1065034
Miller, B.P., et al.: The Paradyn parallel performance measurement tool. IEEE Comput. 28(11), 37–46 (1995). https://doi.org/10.1109/2.471178
Mohror, K., Karavanic, K.L.: Evaluating similarity-based trace reduction techniques for scalable performance analysis. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 2009, pp. 55:1–55:12. ACM, New York (2009). https://doi.org/10.1145/1654059.1654115
Nataraj, A., Malony, A., Morris, A., Arnold, D.C., Miller, B.: A framework for scalable, parallel performance monitoring. Concurr. Comput.: Pract. Exper. 22, 720–735 (2009)
Nethercote, N., Seward, J.: How to shadow every byte of memory used by a program. In: Proceedings of the 3rd International Conference on Virtual Execution Environments, VEE 2007, pp. 65–74. ACM, New York (2007)
Nethercote, N., Seward, J.: Valgrind: a program supervision framework. Electr. Notes Theor. Comput. Sci. 89(2), 44–66 (2003). https://doi.org/10.1016/S1571-0661(04)81042-9
Microsoft Docs: C sequence points. https://msdn.microsoft.com/en-us/library/azk8zbxd.aspx
Noeth, M., Ratn, P., Mueller, F., Schulz, M., de Supinski, B.R.: ScalaTrace: scalable compression and replay of communication traces for high-performance computing. J. Parallel Distrib. Comput. 69(8), 696–710 (2009). https://doi.org/10.1016/j.jpdc.2008.09.001. Best Paper Awards: 21st International Parallel and Distributed Processing Symposium (IPDPS 2007)
de Oliveira, D.C.B., Rakamarić, Z., Gopalakrishnan, G., Humphrey, A., Meng, Q., Berzins, M.: Systematic debugging of concurrent systems using Coalesced Stack Trace Graphs. In: Brodman, J., Tu, P. (eds.) LCPC 2014. LNCS, vol. 8967, pp. 317–331. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17473-0_21. http://www.sci.utah.edu/publications/Oli2014a/OliveiraLCPC2014.pdf
Ratanaworabhan, P., Burtscher, M.: Program phase detection based on critical basic block transitions. In: ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software, pp. 11–21, April 2008. https://doi.org/10.1109/ISPASS.2008.4510734
Roth, P.C., Arnold, D.C., Miller, B.P.: MRNet: a software-based multicast/reduction network for scalable tools. In: 2003 ACM/IEEE Conference Supercomputing, p. 21, November 2003. https://doi.org/10.1145/1048935.1050172
Schulz, M., Galarowicz, J., Maghrak, D., Hachfeld, W., Montoya, D., Cranford, S.: Open|SpeedShop: an open source infrastructure for parallel performance analysis. Sci. Prog. 16(2–3), 105–121 (2008). https://doi.org/10.3233/SPR-2008-0256
Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20, 287–311 (2006). https://doi.org/10.1177/1094342006064482. http://portal.acm.org/citation.cfm?id=1125980.1125982
Strande, S.M., et al.: Comet: Tales from the Long Tail: Two Years in and 10,000 users later. In: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC 2017, pp. 38:1–38:7. ACM, New York (2017). https://doi.org/10.1145/3093338.3093383
Tikir, M.M., Laurenzano, M., Carrington, L., Snavely, A.: PMaC binary instrumentation library for PowerPC/AIX. In: Workshop on Binary Instrumentation and Applications (2006)
Weidendorfer, J.: Sequential performance analysis with Callgrind and KCachegrind. In: Resch, M., Keller, R., Himmler, V., Krammer, B., Schulz, A. (eds.) Tools for High Performance Computing, pp. 93–113. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68564-7_7
Yang, A., Mukka, H., Hesaaraki, F., Burtscher, M.: MPC: a massively parallel compression algorithm for scientific data. In: 2015 IEEE International Conference on Cluster Computing, pp. 381–389, September 2015. https://doi.org/10.1109/CLUSTER.2015.59
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714
Acknowledgment
This research was supported by NSF Awards CCF-1439002 and CCF-1817073. We thank our colleague Dr. Hari Sundar of the University of Utah, who provided insight and expertise that greatly assisted this research. We also thank the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC) for providing the infrastructure for our experiments.
© 2019 Springer Nature Switzerland AG
Cite this paper
Taheri, S., Devale, S., Gopalakrishnan, G., Burtscher, M. (2019). ParLoT: Efficient Whole-Program Call Tracing for HPC Applications. In: Bhatele, A., Boehme, D., Levine, J., Malony, A., Schulz, M. (eds) Programming and Performance Visualization Tools. ESPT 2017, ESPT 2018, VPA 2017, VPA 2018. Lecture Notes in Computer Science, vol 11027. Springer, Cham. https://doi.org/10.1007/978-3-030-17872-7_10
Print ISBN: 978-3-030-17871-0
Online ISBN: 978-3-030-17872-7