DOI: 10.1145/3126908.3126951
research-article

CAPES: unsupervised storage performance tuning using neural network-based deep reinforcement learning

Published: 12 November 2017

Abstract

Parameter tuning is an important task in storage performance optimization. Current practice usually involves numerous tweak-benchmark cycles that are slow and costly. To address this issue, we developed CAPES, a model-less, unsupervised parameter tuning system based on deep reinforcement learning and driven by a deep neural network (DNN). It is designed to find the optimal values of tunable parameters in computer systems, from a simple client-server system to a large data center, where human tuning can be costly and often cannot achieve optimal performance. CAPES takes periodic measurements of a target computer system's state and trains a DNN that uses Q-learning to suggest changes to the system's current parameter values. CAPES is minimally intrusive and can be deployed into a production system to collect training data and suggest tuning actions during the system's daily operation. Evaluation of a prototype on a Lustre file system demonstrates an increase in I/O throughput of up to 45% at the saturation point.
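The measure-then-adjust loop described in the abstract can be illustrated with a deliberately small sketch. It is not the CAPES implementation: it swaps the DNN for a plain Q-table (tabular Q-learning), and the simulated system, the single integer parameter, the `throughput` function, and every constant below are hypothetical stand-ins chosen only to keep the example self-contained.

```python
import random

# Hypothetical toy "system": throughput peaks when the parameter equals BEST.
# This stands in for a real storage system and is NOT taken from the paper.
PARAM_MIN, PARAM_MAX, BEST = 0, 10, 6
ACTIONS = (-1, 0, 1)  # decrease, keep, or increase the parameter by one step


def throughput(param):
    """Simulated measurement: higher is better, maximal at BEST."""
    return 100.0 - 5.0 * (param - BEST) ** 2


def step(param, action):
    """Apply a tuning action (clamped to the valid range); return (new_param, reward)."""
    new_param = min(PARAM_MAX, max(PARAM_MIN, param + action))
    return new_param, throughput(new_param)


def train(episodes=2000, steps=20, alpha=0.2, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (a table, not a DNN)."""
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in range(PARAM_MIN, PARAM_MAX + 1) for a in ACTIONS}
    for _ in range(episodes):
        param = rng.randint(PARAM_MIN, PARAM_MAX)  # random starting configuration
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(param, a)])  # exploit
            new_param, reward = step(param, action)
            best_next = max(q[(new_param, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(param, action)] += alpha * (reward + gamma * best_next - q[(param, action)])
            param = new_param
    return q


def tune(q, param, steps=20):
    """Follow the learned greedy policy from a starting parameter value."""
    for _ in range(steps):
        action = max(ACTIONS, key=lambda a: q[(param, a)])
        param, _ = step(param, action)
    return param
```

Calling `tune(train(), 0)` walks the parameter from a poor starting value toward the setting with the highest simulated throughput. A real deployment would replace `throughput` with periodic measurements of the live system and, as the abstract describes, replace the table with a DNN so that large state spaces remain tractable.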


Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. performance tuning
  3. q-learning

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
