Abstract
Supercomputers come in a variety of sizes and architectures, often with thousands of interconnected nodes. Most organizations must report metrics to their funding sources to show that these machines are being utilized and are meeting their availability requirements. While tracking the state of an individual server is trivial, measuring the uptime of a supercomputer with several thousand nodes, spanning tens to hundreds of cabinets and rows and mounting one or more file systems, is a complex task. Additionally, supercomputers have diverse architectures and System Logic (the unique characteristics of the machine itself, such as networking topology, size, partitions, hardware layout, physical configuration, and component hierarchy). These differences complicate the computation of standardized metrics such as Mean Time To Interrupt (MTTI), Mean Time To Failure (MTTF), availability, and utilization.
At the Argonne Leadership Computing Facility (ALCF), we developed a tool that standardizes the analysis of these machines so that these metrics can be computed accurately and efficiently. We call this tool the Operational Data Processing System (ODPS) and use it to process the data generated by Theta, a 4,392-node Cray XC40. The tool also works with Mira, a 49,152-node IBM BG/Q system housed at ALCF. This paper explores how ODPS processes the data from Theta and Mira, including the storage design decisions and the architecture-independent approach to metric calculations. We quantitatively evaluate our approach, comparing it to alternative methods for storing and processing supercomputer machine state in the database.
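To make the metrics named above concrete, the following is a minimal sketch of how availability, MTTI, and utilization might be derived from a list of outage records. It is not the ODPS implementation: the `Outage` record, the function names, the sample outages, and the simplified metric definitions (node-weighted availability, MTTI counted over full-system outages only, non-overlapping outages) are all assumptions made for illustration. Only the node count of 4,392 comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outage:
    """Hypothetical outage record; times are hours from the window start."""
    start_h: float
    end_h: float
    nodes_down: int

    @property
    def duration_h(self) -> float:
        return self.end_h - self.start_h


def availability(window_h: float, total_nodes: int, outages: List[Outage]) -> float:
    """Fraction of node-hours in the window that were up (assumed definition).

    Assumes outages do not overlap on the same nodes.
    """
    total_node_hours = window_h * total_nodes
    lost_node_hours = sum(o.duration_h * o.nodes_down for o in outages)
    return (total_node_hours - lost_node_hours) / total_node_hours


def mtti(window_h: float, total_nodes: int, outages: List[Outage]) -> float:
    """Mean time to interrupt, counting only full-system outages (assumed).

    Uses uptime divided by (number of interrupts + 1).
    """
    full = [o for o in outages if o.nodes_down == total_nodes]
    downtime_h = sum(o.duration_h for o in full)
    return (window_h - downtime_h) / (len(full) + 1)


def utilization(used_node_hours: float, window_h: float,
                total_nodes: int, outages: List[Outage]) -> float:
    """Node-hours consumed by jobs divided by node-hours that were available."""
    available = availability(window_h, total_nodes, outages) * window_h * total_nodes
    return used_node_hours / available


# Illustrative 30-day window on a 4,392-node machine; the outages are made up.
WINDOW_H = 720.0
NODES = 4392
sample_outages = [
    Outage(start_h=100.0, end_h=112.0, nodes_down=NODES),  # full-system outage
    Outage(start_h=300.0, end_h=306.0, nodes_down=64),     # partial outage
]

if __name__ == "__main__":
    print("availability:", availability(WINDOW_H, NODES, sample_outages))
    print("MTTI (hours):", mtti(WINDOW_H, NODES, sample_outages))
    print("utilization:", utilization(2_800_000.0, WINDOW_H, NODES, sample_outages))
```

Even in this simplified form, the partial outage affects availability but not MTTI, which hints at why machine-specific layout and partitioning make a standardized calculation harder than it is for a single server.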
Notes
1. Denotes the primary source of the data; all other data is supplemental.
Acknowledgement
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. The authors would also like to acknowledge Nick Scope for his help with review and editing.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lenard, B., Pershey, E., Nault, Z., Rasin, A. (2023). An Approach for Efficient Processing of Machine Operational Data. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_9
DOI: https://doi.org/10.1007/978-3-031-39847-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39846-9
Online ISBN: 978-3-031-39847-6
eBook Packages: Computer Science, Computer Science (R0)