CAPES: unsupervised storage performance tuning using neural network-based deep reinforcement learning

Authors:

Ethan L. Miller,

Darrell D. E. Long

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 42, Pages 1 - 14

https://doi.org/10.1145/3126908.3126951

Published: 12 November 2017

Abstract

Parameter tuning is an important part of storage performance optimization. Current practice usually involves numerous tweak-benchmark cycles that are slow and costly. To address this issue, we developed CAPES, a model-less, unsupervised parameter tuning system based on deep reinforcement learning and driven by a deep neural network (DNN). It is designed to find the optimal values of tunable parameters in computer systems, from a simple client-server system to a large data center, where human tuning can be costly and often cannot achieve optimal performance. CAPES takes periodic measurements of a target computer system's state and trains a DNN that uses Q-learning to suggest changes to the system's current parameter values. CAPES is minimally intrusive and can be deployed into a production system to collect training data and suggest tuning actions during the system's daily operation. Evaluation of a prototype on a Lustre file system demonstrates an increase in I/O throughput of up to 45% at the saturation point.
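
The abstract describes the tuning loop only at a high level: observe the system state, choose a change to a parameter, measure the result, and update value estimates with Q-learning. As a rough illustration of that loop, and not the CAPES implementation, the sketch below replaces the DNN with a small Q-table and the storage system with a made-up one-parameter model whose throughput peaks at a known value. Every name here (`throughput`, `step`, `train`, `greedy_tune`) and every constant is a hypothetical choice for the example.

```python
import random

# Hypothetical toy system: one integer parameter in [0, 10] whose
# throughput peaks at 7. None of these names or values come from CAPES.
P_MIN, P_MAX, BEST = 0, 10, 7
ACTIONS = (-1, 0, 1)  # decrease, keep, or increase the parameter


def throughput(p):
    """Made-up performance metric: higher is better, maximum at BEST."""
    return -float((p - BEST) ** 2)


def step(p, action):
    """Apply a tuning action, clamp to the valid range, return (state, reward)."""
    nxt = min(P_MAX, max(P_MIN, p + action))
    return nxt, throughput(nxt)


def train(episodes=2000, steps=20, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in range(P_MIN, P_MAX + 1) for a in ACTIONS}
    for _ in range(episodes):
        p = rng.randint(P_MIN, P_MAX)  # start each episode at a random setting
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)  # explore
            else:
                a = max(ACTIONS, key=lambda act: q[(p, act)])  # exploit
            nxt, reward = step(p, a)
            # Q-learning update: move Q(s, a) toward r + gamma * max_b Q(s', b).
            target = reward + gamma * max(q[(nxt, b)] for b in ACTIONS)
            q[(p, a)] += alpha * (target - q[(p, a)])
            p = nxt
    return q


def greedy_tune(q, p, steps=20):
    """Follow the learned greedy policy from parameter value p."""
    for _ in range(steps):
        a = max(ACTIONS, key=lambda act: q[(p, act)])
        p = step(p, a)[0]
    return p
```

Calling `greedy_tune(train(), 0)` walks the parameter from 0 toward the peak of the toy throughput curve. A real deployment would differ in the ways the abstract indicates: the state would be a vector of periodic system measurements, and a DNN would approximate the Q-function instead of a lookup table.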

Index Terms

CAPES: unsupervised storage performance tuning using neural network-based deep reinforcement learning
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information storage systems
    1. Storage architectures
      1. Distributed storage
    2. Storage management

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2017

801 pages

ISBN: 9781450351140

DOI: 10.1145/3126908

General Chair: Bernd Mohr, Jülich Supercomputing Center, Jülich, Germany

Program Chair: Padma Raghavan, Vanderbilt University, Nashville, TN

Copyright © 2017 ACM.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, Colorado

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 44
Total Downloads: 847
Downloads (last 12 months): 47
Downloads (last 6 weeks): 1

Reflects downloads up to 16 Feb 2025

Citations

Cited By

Robert-Hayek S, Zertal S, Couvée P (2025). EVADyR: A new dynamic resampling algorithm for auto-tuning noisy High Performance Computing systems. Journal of Computational Science 84 (102468). Online publication date: Jan-2025
https://doi.org/10.1016/j.jocs.2024.102468
Jeannot E, Lemarinier P, Mercier G, Robert-Hayek S, Sartori R (2024). Application-Agnostic Auto-Tuning of Open MPI Collectives Using Bayesian Optimization. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (771-781). Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPSW63119.2024.00141
Tian Z, Yu C, Shan Z, Yin S, Xu H, Zhao B (2024). Experience-Based Parallel DDPG for Automatic Tuning of Ceph Knobs. 2024 3rd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE) (780-786). Online publication date: 11-Oct-2024
https://doi.org/10.1109/CBASE64041.2024.10824653
Liu Z, Wang J, Wu H, Ma Q, Peng L, Tang Z (2024). Auto-tuning for HPC storage stack: an optimization perspective. CCF Transactions on High Performance Computing. Online publication date: 13-Dec-2024
https://doi.org/10.1007/s42514-024-00198-8
Mondal A, Sanyal M, Barua H, Chattopadhyay S, Mondal K (2024). Comparative Analysis of Object-Based Big Data Storage Systems on Architectures and Services: A Recent Survey. Journal of The Institution of Engineers (India): Series B 105:3 (685-700). Online publication date: 8-Feb-2024
https://doi.org/10.1007/s40031-023-00983-z
Robert-Hayek S, Zertal S, Couvée P (2024). EVADyR: A New Dynamic Resampling Algorithm for Optimizing Noisy Expensive Systems. Metaheuristics and Nature Inspired Computing (261-278). Online publication date: 15-Sep-2024
https://doi.org/10.1007/978-3-031-69257-4_19
Ding Q, Zheng P, Kudari S, Venkataraman S, Zhang Z, Mohror K, Arnold D, Badia R (2023). Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607042
Liu Z, Zhang C, Wu H, Fang J, Peng L, Ye G, Tang Z (2023). Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning. 2023 IEEE International Conference on Cluster Computing (CLUSTER) (234-246). Online publication date: 31-Oct-2023
https://doi.org/10.1109/CLUSTER52292.2023.00027
Zhang Y, Zhang X, Shi Y, Huang Z (2023). Sg: Automated tuning algorithm for storage systems based on simulated environments and group climbing. Cluster Computing 27:4 (4841-4853). Online publication date: 27-Dec-2023
https://doi.org/10.1007/s10586-023-04206-4
Ather H, Bez J, Norris B, Byna S (2023). Illuminating the I/O Optimization Path of Scientific Applications. High Performance Computing (22-41). Online publication date: 10-May-2023
https://doi.org/10.1007/978-3-031-32041-5_2