
Using black-box performance models to detect performance regressions under varying workloads: an empirical study


Abstract

Performance regressions of large-scale software systems often lead to both financial and reputational losses. To detect performance regressions, performance tests are typically conducted in an in-house (non-production) environment using test suites with predefined workloads. Performance analysis is then performed to check whether a software version has a performance regression against an earlier version. However, the real workloads in the field are constantly changing, making it unrealistic to replicate field workloads in predefined test suites. More importantly, performance testing is usually very expensive, as it requires extensive resources and lasts for an extended period. In this work, we leverage black-box machine learning models to automatically detect performance regressions in the field operations of large-scale software systems. Practitioners can leverage our approaches to complement or replace resource-demanding performance tests that may not even be realistic in a fast-paced environment. Our approaches use black-box models to capture the relationship between the performance of a software system (e.g., CPU usage) under varying workloads and the runtime activities that are recorded in readily available logs. Our approaches then compare the black-box models derived from the current software version with those derived from an earlier version to detect performance regressions between the two versions. We performed empirical experiments on two open-source systems and applied our approaches to a large-scale industrial system. Our results show that such black-box models can effectively detect both real and injected performance regressions in a timely manner, even under varying workloads that are unseen when training these models. Our approaches have been adopted in practice to detect performance regressions of a large-scale industrial system on a daily basis.
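To make the general idea concrete, the following minimal sketch (a hypothetical illustration, not the authors' exact pipeline) fits a black-box model, here a random forest, that maps per-interval log event counts to CPU usage on the earlier version, and then flags a regression when the model's prediction errors shift significantly on the new version's field data. The function name, data layout, and statistical test below are assumptions made for illustration only.

    # Hypothetical sketch of black-box regression detection (not the authors' exact pipeline):
    # model the old version's workload-to-CPU relationship, then check whether the
    # model's prediction errors grow on data collected from the new version.
    import numpy as np
    from scipy.stats import mannwhitneyu
    from sklearn.ensemble import RandomForestRegressor

    def detect_regression(old_logs, old_cpu, new_logs, new_cpu, alpha=0.05):
        """old_logs / new_logs: 2-D arrays of per-interval log event counts (workload);
        old_cpu / new_cpu: 1-D arrays of the matching CPU usage samples."""
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(old_logs, old_cpu)

        # Absolute prediction errors on each version's field data.
        err_old = np.abs(model.predict(old_logs) - old_cpu)
        err_new = np.abs(model.predict(new_logs) - new_cpu)

        # If the new version behaves like the old one, the two error distributions
        # should be similar; a significant upward shift flags a possible regression.
        _, p_value = mannwhitneyu(err_old, err_new, alternative="two-sided")
        return p_value < alpha and err_new.mean() > err_old.mean()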


Notes

  1. https://www.salesforce.com

  2. https://jmeter.apache.org

  3. Note that our approach serves different purposes and has different usage scenarios from A/B testing and canary releasing, as discussed in Section 9.

  4. Our experimental setup, workloads, and results are shared online at https://github.com/senseconcordia/EMSE2020Data as a replication package.


Acknowledgements

We would like to thank ERA Environmental Management Solutions for providing access to the enterprise system used in our case study. The findings and opinions expressed in this paper are those of the authors and do not necessarily represent or reflect those of ERA Environmental Management Solutions and/or its subsidiaries and affiliates. Moreover, our results do not reflect the quality of ERA Environmental Management Solutions’ products.

Author information

Corresponding author

Correspondence to Lizhi Liao.

Additional information

Communicated by: Ali Ouni, David Lo, Xin Xia, Alexander Serebrenik and Christoph Treude

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Recommendation Systems for Software Engineering


Cite this article

Liao, L., Chen, J., Li, H. et al. Using black-box performance models to detect performance regressions under varying workloads: an empirical study. Empir Software Eng 25, 4130–4160 (2020). https://doi.org/10.1007/s10664-020-09866-z
