
An Approach for Efficient Processing of Machine Operational Data

  • Conference paper
Database and Expert Systems Applications (DEXA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14146)


Abstract

Supercomputers come in a variety of sizes and architectures with thousands of interconnected nodes. Most organizations are required to produce metrics for their funding sources to prove that these machines are being utilized and are meeting their availability requirements. While tracking the state of an individual server is trivial, measuring the uptime of a supercomputer with several thousand nodes spanning tens to hundreds of cabinets and rows, with one or more mounted file systems, is a complex task. Additionally, supercomputers have diverse architectures and System Logic (which includes unique characteristics of the machine itself such as networking topology, size, partitions, hardware layout, physical configuration and component hierarchy). These constraints complicate the computation of standardized metrics such as Mean Time To Interrupt (MTTI), Mean Time To Failure (MTTF), availability, and utilization.

At the Argonne Leadership Computing Facility (ALCF), we developed a tool that standardizes the analyses of these machines so that these metrics can be computed accurately and efficiently. We call this tool the Operational Data Processing System (ODPS), and use it to process the data generated by Theta, a 4,392-node Cray XC40. In addition to the XC40, this tool also works with Mira, a 49,152-node IBM BG/Q system that ALCF houses. This paper explores how ODPS processes the data from Theta and Mira, including the storage design decisions and the architecture-independent approach to metric calculations. We quantitatively evaluate our approach, comparing it to alternative methods for storing and processing supercomputer machine state in the database.
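
To make the metrics named in the abstract concrete, the following is a minimal sketch of how availability, MTTI, and MTTF could be derived from a list of whole-machine outage intervals. It is not the ODPS implementation. The `Outage` record, the `machine_metrics` function, and the formulas themselves (uptime divided by the number of interruptions plus one, a formulation commonly seen in facility operational assessments) are assumptions made for illustration; the paper's own definitions may differ. Outages are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """Hypothetical record of one whole-machine outage (not an ODPS type)."""
    start_h: float    # hours since the start of the reporting period
    end_h: float      # hours since the start of the reporting period
    scheduled: bool   # True for planned maintenance, False for a failure


def machine_metrics(period_h: float, outages: list[Outage]) -> dict[str, float]:
    """Compute illustrative metrics for one reporting period.

    Assumed formulas (labelled assumptions, not taken from the paper):
      overall availability   = (period - all downtime) / period
      scheduled availability = (period - all downtime) / (period - scheduled downtime)
      MTTI = (period - all downtime) / (number of outages + 1)
      MTTF = (period - unscheduled downtime) / (number of unscheduled outages + 1)
    """
    unscheduled = [o for o in outages if not o.scheduled]
    sched_down = sum(o.end_h - o.start_h for o in outages if o.scheduled)
    unsched_down = sum(o.end_h - o.start_h for o in unscheduled)
    total_down = sched_down + unsched_down

    return {
        "overall_availability": (period_h - total_down) / period_h,
        "scheduled_availability": (period_h - total_down) / (period_h - sched_down),
        "mtti_h": (period_h - total_down) / (len(outages) + 1),
        "mttf_h": (period_h - unsched_down) / (len(unscheduled) + 1),
    }


# Example: a 30-day (720 h) period with one planned and two unplanned outages.
example = machine_metrics(
    720.0,
    [
        Outage(0.0, 12.0, scheduled=True),
        Outage(100.0, 104.0, scheduled=False),
        Outage(300.0, 308.0, scheduled=False),
    ],
)
```

The sketch treats the machine as a single unit. Extending it to thousands of nodes with machine-specific System Logic is the harder problem the abstract describes, and is not attempted here.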


Notes

  1. Denotes the primary source of the data; all other data is supplemental.


Acknowledgement

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH1135. The authors would also like to acknowledge the review and editing help by Nick Scope.

Author information


Corresponding author

Correspondence to Alexander Rasin.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lenard, B., Pershey, E., Nault, Z., Rasin, A. (2023). An Approach for Efficient Processing of Machine Operational Data. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_9


  • DOI: https://doi.org/10.1007/978-3-031-39847-6_9


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39846-9

  • Online ISBN: 978-3-031-39847-6

  • eBook Packages: Computer Science, Computer Science (R0)
