Abstract
This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce, and ZooKeeper). We present the TaxPerf database, which organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf classifies the bugs by root cause into six categories (and 18 subcategories): resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can serve as a benchmark for performance-bug studies and for the design of debugging tools. Although it is impractical to automatically detect every category of performance bug in TaxPerf, we find that one important category, blocking bugs, can be effectively addressed by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance-dependency relationships; and (iii) uses dynamic tracking to determine whether the slowdown propagation is contained within one job. Evaluation shows that PCatch can accurately detect blocking bugs in representative distributed storage and computing systems by observing system executions under small-scale workloads.
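The three steps attributed to PCatch in the abstract can be illustrated with a minimal sketch. This is not PCatch's implementation: the event trace is hand-written, and every name here (`Event`, `CausalityGraph`, `find_cascades`, `build_example`, the edge kinds) is hypothetical. The sketch models causality as a directed graph whose edges stand for program order and lock contention, then asks whether a loop that scales with the workload can delay a time-sensitive operation in a different job.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One traced operation (hypothetical model, not PCatch's data structure)."""
    name: str
    job: str
    scales_with_workload: bool = False  # step (i): duration grows with input size
    time_sensitive: bool = False        # e.g. a heartbeat that must beat a timeout


class CausalityGraph:
    """Step (ii): happens-before edges extended with resource contention."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, src, dst, kind):
        self.edges[src].append((dst, kind))

    def reachable_from(self, source):
        """Breadth-first search; yields (event, path) for each event that
        the source may delay."""
        seen = {source}
        queue = deque([(source, [source.name])])
        while queue:
            event, path = queue.popleft()
            for nxt, kind in self.edges[event]:
                if nxt not in seen:
                    seen.add(nxt)
                    new_path = path + [f"--{kind}-->", nxt.name]
                    yield nxt, new_path
                    queue.append((nxt, new_path))


def find_cascades(graph, events):
    """Step (iii): report only slowdowns that escape the originating job."""
    reports = []
    for src in events:
        if not src.scales_with_workload:
            continue
        for sink, path in graph.reachable_from(src):
            if sink.time_sensitive and sink.job != src.job:
                reports.append(" ".join(path))
    return reports


def build_example():
    """Assumed toy trace: job1 holds a lock during a long loop; job2 needs
    the same lock before it can send its heartbeat."""
    loop = Event("copy_loop", "job1", scales_with_workload=True)
    release = Event("release_lock", "job1")
    acquire = Event("acquire_lock", "job2")
    heartbeat = Event("send_heartbeat", "job2", time_sensitive=True)

    graph = CausalityGraph()
    graph.add_edge(loop, release, "program_order")
    graph.add_edge(release, acquire, "lock_contention")
    graph.add_edge(acquire, heartbeat, "program_order")
    return graph, [loop, release, acquire, heartbeat]


if __name__ == "__main__":
    graph, events = build_example()
    for report in find_cascades(graph, events):
        print(report)
```

Because the check is on graph reachability, not on measured latency, a chain like this can be flagged from a small-workload run, which matches the abstract's claim in spirit; a real tool would also need to derive the events and edges from program analysis and runtime tracing.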
Index Terms
- Performance Bug Analysis and Detection for Distributed Storage and Computing Systems