
Using black-box performance models to detect performance regressions under varying workloads: an empirical study


Abstract

Performance regressions of large-scale software systems often lead to both financial and reputational losses. To detect performance regressions, performance tests are typically conducted in an in-house (non-production) environment using test suites with predefined workloads. Performance analysis is then performed to check whether a software version has a performance regression against an earlier version. However, the real workloads in the field are constantly changing, making it unrealistic to replicate field workloads in predefined test suites. More importantly, performance testing is usually very expensive, as it requires extensive resources and lasts for an extended period. In this work, we leverage black-box machine learning models to automatically detect performance regressions in the field operations of large-scale software systems. Practitioners can leverage our approaches to complement or replace resource-demanding performance tests that may not even be realistic in a fast-paced environment. Our approaches use black-box models to capture the relationship between the performance of a software system (e.g., CPU usage) under varying workloads and the runtime activities that are recorded in readily available logs. Our approaches then compare the black-box models derived from the current software version with those derived from an earlier version to detect performance regressions between the two versions. We performed empirical experiments on two open-source systems and applied our approaches to a large-scale industrial system. Our results show that such black-box models can effectively detect both real and injected performance regressions in a timely manner, even under varying workloads that are unseen when training these models. Our approaches have been adopted in practice to detect performance regressions of a large-scale industrial system on a daily basis.
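To make the general idea concrete, the following minimal sketch (a hypothetical illustration, not the authors' exact pipeline) fits a black-box model, here a random forest, that maps per-interval log event counts to CPU usage on the earlier version, and then flags a regression when the model's prediction errors shift significantly on the new version's field data. The function name, data layout, and statistical test below are assumptions made for illustration only.

    # Hypothetical sketch of black-box regression detection (not the authors' exact pipeline):
    # model the old version's workload-to-CPU relationship, then check whether the
    # model's prediction errors grow on data collected from the new version.
    import numpy as np
    from scipy.stats import mannwhitneyu
    from sklearn.ensemble import RandomForestRegressor

    def detect_regression(old_logs, old_cpu, new_logs, new_cpu, alpha=0.05):
        """old_logs / new_logs: 2-D arrays of per-interval log event counts (workload);
        old_cpu / new_cpu: 1-D arrays of the matching CPU usage samples."""
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(old_logs, old_cpu)

        # Absolute prediction errors on each version's field data.
        err_old = np.abs(model.predict(old_logs) - old_cpu)
        err_new = np.abs(model.predict(new_logs) - new_cpu)

        # If the new version behaves like the old one, the two error distributions
        # should be similar; a significant upward shift flags a possible regression.
        _, p_value = mannwhitneyu(err_old, err_new, alternative="two-sided")
        return p_value < alpha and err_new.mean() > err_old.mean()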


Notes

  1. https://www.salesforce.com

  2. https://jmeter.apache.org

  3. Note that our approach serves different purposes and has different usage scenarios from A/B testing and canary releasing, as discussed in Section 9.

  4. Our experimental setup, workloads, and results are shared online at https://github.com/senseconcordia/EMSE2020Data as a replication package.


Acknowledgements

We would like to thank ERA Environmental Management Solutions for providing access to the enterprise system used in our case study. The findings and opinions expressed in this paper are those of the authors and do not necessarily represent or reflect those of ERA Environmental Management Solutions and/or its subsidiaries and affiliates. Moreover, our results do not reflect the quality of ERA Environmental Management Solutions’ products.

Author information

Corresponding author

Correspondence to Lizhi Liao.

Additional information

Communicated by: Ali Ouni, David Lo, Xin Xia, Alexander Serebrenik and Christoph Treude

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Recommendation Systems for Software Engineering


Cite this article

Liao, L., Chen, J., Li, H. et al. Using black-box performance models to detect performance regressions under varying workloads: an empirical study. Empir Software Eng 25, 4130–4160 (2020). https://doi.org/10.1007/s10664-020-09866-z
