Abstract
The rise of large-scale software systems poses many new challenges for the software performance engineering field. Failures in these systems are often associated with performance issues, rather than with feature bugs. Therefore, performance testing has become essential to ensuring the problem-free operation of these systems. However, the performance testing process faces a major challenge: evolving field workloads, in terms of evolving feature sets and usage patterns, often lead to “outdated” tests that are not reflective of the field. Hence, performance analysts must continually validate whether their tests are still reflective of the field. Such validation may be performed by comparing execution logs from the test and the field. However, the size and unstructured nature of execution logs make such a comparison infeasible without automated support. In this paper, we propose an automated approach to validate whether a performance test resembles the field workload and, if not, to determine how they differ. Performance analysts can then update their tests to eliminate such differences, hence creating more realistic tests. We perform six case studies on two large systems: one open-source system and one enterprise system. Our approach identifies differences between performance tests and the field with a precision of 92%, compared to only 61% for the state-of-the-practice and 19% for a conventional statistical comparison.


Acknowledgments
We would like to thank BlackBerry for providing access to the enterprise system used in our case study. The findings and opinions expressed in this paper are those of the authors and do not necessarily represent or reflect those of BlackBerry and/or its subsidiaries and affiliates. Moreover, our results do not reflect the quality of BlackBerry’s products. We would also like to thank Microsoft Azure for (1) providing us access to a large-scale deployment and (2) working closely with us to set up and troubleshoot our deployment.
Cite this article
Syer, M.D., Shang, W., Jiang, Z.M. et al. Continuous validation of performance test workloads. Autom Softw Eng 24, 189–231 (2017). https://doi.org/10.1007/s10515-016-0196-8
DOI: https://doi.org/10.1007/s10515-016-0196-8