Scalable, fault-tolerant job step management for high-performance systems | IBM Journals & Magazine | IEEE Xplore