ABSTRACT
Modern distributed systems should be built to anticipate performance degradation. Requests in these systems often involve tens to thousands of Remote Procedure Calls (RPCs), each of which can be a source of performance degradation. The PhD programme presented here aims to address this issue by providing automated instruments that effectively drive performance fault injection in distributed systems. The envisioned approach exploits multi-objective search-based techniques to automatically find small combinations of tiny performance degradations, induced by specific RPCs, that have a significant impact on user-perceived performance. Automating the search for such events will improve the ability to inject performance issues in production, forcing developers to anticipate and mitigate them.
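The search problem stated in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the request model (sequential stages of parallel RPCs), the RPC names, the latencies, the candidate delays, and the three objectives. It also replaces the search-based technique with exhaustive enumeration followed by a Pareto filter, which is only feasible for a toy request and is not the approach the programme proposes.

```python
from itertools import combinations, product

# Hypothetical request: sequential stages, each a set of RPCs issued in parallel.
# Values are baseline latencies in milliseconds (made-up numbers).
STAGES = [
    {"auth": 20},
    {"profile": 30, "feed": 80, "ads": 50},
    {"render": 40},
]
DELAYS_MS = (10, 25)  # candidate "tiny" degradations per RPC (assumed)


def end_to_end(injection):
    """Request latency: sum over stages of the slowest RPC in each stage."""
    return sum(
        max(base + injection.get(rpc, 0) for rpc, base in stage.items())
        for stage in STAGES
    )


def objectives(injection):
    """(RPCs touched, total injected delay, -slowdown); all to be minimised."""
    slowdown = end_to_end(injection) - end_to_end({})
    return (len(injection), sum(injection.values()), -slowdown)


def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and differs."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def pareto_front(max_rpcs=2):
    """Enumerate all injections touching up to max_rpcs RPCs; keep non-dominated ones."""
    rpcs = [rpc for stage in STAGES for rpc in stage]
    candidates = []
    for k in range(1, max_rpcs + 1):
        for subset in combinations(rpcs, k):
            for delays in product(DELAYS_MS, repeat=k):
                candidates.append(dict(zip(subset, delays)))
    scored = [(objectives(c), c) for c in candidates]
    return [c for s, c in scored if not any(dominates(t, s) for t, _ in scored)]


if __name__ == "__main__":
    for injection in pareto_front():
        print(injection, objectives(injection))
```

In this toy model, delays injected into RPCs that are off the critical path (here `profile` and `ads`) are dominated, so the front keeps only injections that slow the request as a whole. A realistic system has far too many RPC combinations to enumerate, which is the motivation for a search-based technique.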
Index Terms
- A multi-objective framework for effective performance fault injection in distributed systems