Abstract
The complexity of HPC software and hardware is quickly increasing. As a consequence, the need for efficient execution tracing to gain insight into HPC application behavior is steadily growing. Unfortunately, available tools either do not produce traces with enough detail or incur large overheads. An efficient tracing method that overcomes the tradeoff between maximum information and minimum overhead is therefore urgently needed. This paper presents such a method and tool, called ParLoT, with the following key features. (1) It describes a technique that makes low-overhead on-the-fly compression of whole-program call traces feasible. (2) It presents a new, efficient, incremental trace-compression approach that reduces the trace volume dynamically, which lowers not only the needed bandwidth but also the tracing overhead. (3) It collects all caller/callee relations, call frequencies, call stacks, as well as the full trace of all calls and returns executed by each thread, including in library code. (4) It works on top of existing dynamic binary instrumentation tools, thus requiring neither source-code modifications nor recompilation. (5) It supports program analysis and debugging at the thread, thread-group, and program level. This paper establishes that comparable capabilities are currently unavailable. Our experiments with the NAS parallel benchmarks running on the Comet supercomputer with up to 1,024 cores show that ParLoT can collect whole-program function-call traces at an average tracing bandwidth of just 56 kB/s per core.
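The abstract describes compressing each thread's call/return event stream incrementally, as events arrive, so that only compact codes ever need to be written out. The sketch below is purely illustrative and is not ParLoT's actual compressor: it shows the general idea using a simple LZ78-style dictionary scheme (the kind of sequential method cited in the references), applied to a toy call/return trace; all event names are made up.

```python
# Illustrative sketch (not ParLoT's actual algorithm): incremental
# LZ78-style compression of a per-thread call/return event stream.
# Events are encoded as they arrive, so only (dictionary-index, event)
# pairs need to be emitted, reducing trace bandwidth on the fly.

def compress_stream(events):
    """Yield (prefix_index, event) pairs LZ78-style as events arrive."""
    dictionary = {(): 0}          # phrase -> index; empty phrase is 0
    phrase = ()
    for ev in events:
        candidate = phrase + (ev,)
        if candidate in dictionary:
            phrase = candidate    # keep extending the known phrase
        else:
            yield dictionary[phrase], ev
            dictionary[candidate] = len(dictionary)
            phrase = ()
    if phrase:                    # flush any trailing matched phrase
        yield dictionary[phrase[:-1]], phrase[-1]

def decompress(pairs):
    """Invert compress_stream, reconstructing the original event list."""
    phrases = [()]
    out = []
    for idx, ev in pairs:
        phrase = phrases[idx] + (ev,)
        phrases.append(phrase)
        out.extend(phrase)
    return out

# A toy trace: calls (+name) and returns (-name) from one thread.
trace = ["+main", "+solve", "-solve", "+solve", "-solve", "+io", "-io", "-main"]
codes = list(compress_stream(trace))
assert decompress(codes) == trace
```

Because the dictionary grows with the stream, repeated call patterns (such as the `+solve`/`-solve` pair above) are replaced by single index references, which is why repetitive whole-program traces compress well online.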
Notes
- 1. Given the absence of tools comparable to ParLoT, we employ Callgrind as a “close-enough” tool in the comparisons elaborated in Sect. 4.3. In this capacity, Callgrind is similar to ParLoT(m), a variant of ParLoT that only collects traces from the main image. We perform this comparison to gauge how ParLoT fares relative to at least one other tool. In Sect. 5, we also present a separate self-assessment of ParLoT.
References
Aguilar, X., Fürlinger, K., Laure, E.: Online MPI trace compression using event flow graphs and wavelets. Procedia Comput. Sci. 80(Supp. C), 1497–1506 (2016). https://doi.org/10.1016/j.procs.2016.05.471. http://www.sciencedirect.com/science/article/pii/S1877050916309565. International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA
Arnold, D.C., Ahn, D.H., de Supinski, B.R., Lee, G.L., Miller, B.P., Schulz, M.: Stack trace analysis for large scale debugging. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–10 (2007)
Bailey, D.H., et al.: The NAS parallel benchmarks—summary and preliminary results. In: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing 1991, pp. 158–165. ACM, New York (1991). https://doi.org/10.1145/125826.125925
Burtscher, M., Rabeti, H.: A scalable heterogeneous parallelization framework for iterative local searches. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 1289–1298, May 2013. https://doi.org/10.1109/IPDPS.2013.27
Burtscher, M., Mukka, H., Yang, A., Hesaaraki, F.: Real-time synthesis of compression algorithms for scientific data. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, pp. 23:1–23:12. IEEE Press, Piscataway, NJ, USA (2016). http://dl.acm.org/citation.cfm?id=3014904.3014935
Claggett, S., Azimi, S., Burtscher, M.: SPDP: an automatically synthesized lossless compression algorithm for floating-point data. In: 2018 Data Compression Conference (2018)
Coplin, J., Yang, A., Poppe, A., Burtscher, M.: Increasing telemetry throughput using customized and adaptive data compression. In: AIAA SPACE and Astronautics Forum and Exposition (2016)
Freitag, F., Caubet, J., Labarta, J.: On the scalability of tracing mechanisms. In: Monien, B., Feldmann, R. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 97–104. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45706-2_10
Gamblin, T., de Supinski, B.R., Schulz, M., Fowler, R., Reed, D.A.: Scalable load-balance measurement for SPMD codes. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 1–12, November 2008. https://doi.org/10.1109/SC.2008.5222553
Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations, 1st edn. Springer, Secaucus (1997). https://doi.org/10.1007/978-3-642-59830-2
Godin, R., Missaoui, R., Alaoui, H.: Incremental concept formation algorithms based on Galois (concept) lattices. Comput. Intell. 11(2), 246–267 (1995)
Gopalakrishnan, G., et al.: Report of the HPC correctness summit, 25–26 January 2017, Washington, DC. CoRR abs/1705.07478 (2017). http://arxiv.org/abs/1705.07478
Hazelwood, K., Klauser, A.: A dynamic binary instrumentation engine for the ARM architecture. In: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2006, pp. 261–270. ACM, New York (2006). https://doi.org/10.1145/1176760.1176793
Heroux, M.A., et al.: Improving performance via mini-applications. Sandia National Laboratories, Technical report SAND2009-5574 3 (2009)
Intel: Pin - a dynamic binary instrumentation tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
Jurenz, M., Brendel, R., Knüpfer, A., Müller, M., Nagel, W.E.: Memory allocation tracing with VampirTrace. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4488, pp. 839–846. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72586-2_118
Knüpfer, A., et al.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, Tau, and Vampir. In: Brunst, H., Müller, M., Nagel, W., Resch, M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-31476-6_7
Luk, C.K., et al.: Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2005, pp. 190–200. ACM, New York (2005). https://doi.org/10.1145/1065010.1065034
Miller, B.P., et al.: The Paradyn parallel performance measurement tool. IEEE Comput. 28(11), 37–46 (1995). https://doi.org/10.1109/2.471178
Mohror, K., Karavanic, K.L.: Evaluating similarity-based trace reduction techniques for scalable performance analysis. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 2009, pp. 55:1–55:12. ACM, New York (2009). https://doi.org/10.1145/1654059.1654115
Nataraj, A., Malony, A., Morris, A., Arnold, D.C., Miller, B.: A framework for scalable, parallel performance monitoring. Concurr. Comput.: Pract. Exper. 22, 720–735 (2009)
Nethercote, N., Seward, J.: How to shadow every byte of memory used by a program. In: Proceedings of the 3rd International Conference on Virtual Execution Environments, VEE 2007, pp. 65–74. ACM, New York (2007)
Nethercote, N., Seward, J.: Valgrind: a program supervision framework. Electr. Notes Theor. Comput. Sci. 89(2), 44–66 (2003). https://doi.org/10.1016/S1571-0661(04)81042-9
Microsoft Docs: C sequence points. https://msdn.microsoft.com/en-us/library/azk8zbxd.aspx
Noeth, M., Ratn, P., Mueller, F., Schulz, M., de Supinski, B.R.: ScalaTrace: scalable compression and replay of communication traces for high-performance computing. J. Parallel Distrib. Comput. 69(8), 696–710 (2009). https://doi.org/10.1016/j.jpdc.2008.09.001. Best Paper Awards: 21st International Parallel and Distributed Processing Symposium (IPDPS 2007)
de Oliveira, D.C.B., Rakamarić, Z., Gopalakrishnan, G., Humphrey, A., Meng, Q., Berzins, M.: Systematic debugging of concurrent systems using Coalesced Stack Trace Graphs. In: Brodman, J., Tu, P. (eds.) LCPC 2014. LNCS, vol. 8967, pp. 317–331. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17473-0_21. http://www.sci.utah.edu/publications/Oli2014a/OliveiraLCPC2014.pdf
Ratanaworabhan, P., Burtscher, M.: Program phase detection based on critical basic block transitions. In: ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software, pp. 11–21, April 2008. https://doi.org/10.1109/ISPASS.2008.4510734
Roth, P.C., Arnold, D.C., Miller, B.P.: MRNet: a software-based multicast/reduction network for scalable tools. In: 2003 ACM/IEEE Conference Supercomputing, p. 21, November 2003. https://doi.org/10.1145/1048935.1050172
Schulz, M., Galarowicz, J., Maghrak, D., Hachfeld, W., Montoya, D., Cranford, S.: Open|SpeedShop: an open source infrastructure for parallel performance analysis. Sci. Prog. 16(2–3), 105–121 (2008). https://doi.org/10.3233/SPR-2008-0256
Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20, 287–311 (2006). https://doi.org/10.1177/1094342006064482. http://portal.acm.org/citation.cfm?id=1125980.1125982
Strande, S.M., et al.: Comet: Tales from the Long Tail: Two Years in and 10,000 users later. In: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC 2017, pp. 38:1–38:7. ACM, New York (2017). https://doi.org/10.1145/3093338.3093383
Tikir, M.M., Laurenzano, M., Carrington, L., Snavely, A.: PMaC binary instrumentation library for PowerPC/AIX. In: Workshop on Binary Instrumentation and Applications (2006)
Weidendorfer, J.: Sequential performance analysis with Callgrind and KCachegrind. In: Resch, M., Keller, R., Himmler, V., Krammer, B., Schulz, A. (eds.) Tools for High Performance Computing, pp. 93–113. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68564-7_7
Yang, A., Mukka, H., Hesaaraki, F., Burtscher, M.: MPC: a massively parallel compression algorithm for scientific data. In: 2015 IEEE International Conference on Cluster Computing, pp. 381–389, September 2015. https://doi.org/10.1109/CLUSTER.2015.59
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714
Acknowledgment
This research was supported by NSF Awards CCF-1439002 and CCF-1817073. We thank our colleague Dr. Hari Sundar of the University of Utah, who provided insight and expertise that greatly assisted this research. We also thank the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC) for providing the infrastructure for our experiments.
© 2019 Springer Nature Switzerland AG
Cite this paper
Taheri, S., Devale, S., Gopalakrishnan, G., Burtscher, M. (2019). ParLoT: Efficient Whole-Program Call Tracing for HPC Applications. In: Bhatele, A., Boehme, D., Levine, J., Malony, A., Schulz, M. (eds) Programming and Performance Visualization Tools. ESPT 2017, ESPT 2018, VPA 2017, VPA 2018. Lecture Notes in Computer Science, vol 11027. Springer, Cham. https://doi.org/10.1007/978-3-030-17872-7_10
Print ISBN: 978-3-030-17871-0
Online ISBN: 978-3-030-17872-7