Designing effective tools for detecting and recovering from software failures requires a deep understanding of software bug characteristics. We study software bug characteristics by sampling 2,060 real-world bugs in three large, representative open-source projects: the Linux kernel, Mozilla, and Apache. We manually study these bugs in three dimensions: root causes, impacts, and components. We further study the correlation between categories in different dimensions, and the trend of different types of bugs. The findings include: (1) semantic bugs are the dominant root cause. As software evolves, semantic bugs increase while memory-related bugs decrease, calling for more research effort to address semantic bugs; (2) the Linux kernel operating system (OS) has more concurrency bugs than its non-OS counterparts, suggesting that more effort should go into detecting concurrency bugs in operating system code; and (3) reported security bugs are increasing, and the majority of them are caused by semantic bugs, suggesting that developers need more support to diagnose and fix security bugs, especially semantic security bugs. In addition, to reduce the manual effort in building bug benchmarks for evaluating bug detection and diagnosis tools, we use machine learning techniques to classify 109,014 bugs automatically.

We thank Luyang Wang and Yaoqiang Li for classifying some bug reports. We thank Shan Lu for the early discussion and feedback. The work is partially supported by the Natural Sciences and Engineering Research Council of Canada, the United States National Science Foundation, the United States Department of Energy, a Google gift grant, and an Intel gift grant.
1.1 A Combining Two Data Sets
To leverage the randomly sampled bug reports studied in our prior work in 2005 (Li et al. 2006), each of the two bug report samples from the Mozilla and Apache Bugzilla databases is combined from two random samples. The combination is performed in the following way to preserve the randomness of the sampling. The goal is to ensure that the combined set of fixed bug reports is no different from a random sample of the fixed bug reports in the entire Bugzilla database as of the new sampling date.
Figure 9 illustrates the combination approach. In our prior work, we randomly sampled 2X% of the fixed bug reports in one Bugzilla database as of the cutoff date of that work, referred to as Date1. We now randomly select half of that 2X% sample, referred to as Set1, and discard the other half. Note that Set1 is a random sample of X% of the bug reports fixed by Date1. On our new sampling date (Table 1), denoted as Date2, we sample another X% of the fixed bug reports that were opened after Date1 and before Date2, denoted as Set2. We keep only half of the sampled bug reports fixed by Date1 so that the sampled bug reports before Date1 and those after Date1 are in proportion to the numbers of bug reports belonging to the two time ranges.
Fig. 9 Combining two data sets. Not Fixed denotes any status other than Fixed, e.g., reopened, invalid, etc. Old Data Set is the data set used in our previous work (Li et al. 2006), and New Data Set is the data set used in this paper.
The status of a bug report may have changed since Date1 in two ways: (1) a bug report that was fixed by Date1 is no longer fixed on Date2; or (2) a bug report that was unfixed by Date1 is fixed by Date2. To compensate for these two scenarios, we first identify all the bug reports in Set1 that are no longer marked as fixed on Date2, denoted as Set3. Bug reports in Set3 should not be included in our sample, because a random sample of fixed bug reports taken on Date2 would not contain them, as they are not fixed. Second, from the bug reports that were unfixed by Date1 but are fixed by Date2, we randomly sample X% of them, denoted as Set4. Our final random sample is the union of Set1, Set2, and Set4 with Set3 excluded.
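The combination procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts used in the study: the `BugReport` record, its field names, and the helpers `sample_fraction` and `combine_samples` are hypothetical, and each sampling step is assumed to be simple random sampling without replacement.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BugReport:
    # Hypothetical record; the field names are assumptions for illustration.
    bug_id: int
    opened_after_date1: bool  # opened after Date1 and before Date2
    fixed_on_date1: bool      # status was Fixed at Date1
    fixed_on_date2: bool      # status is Fixed at Date2


def sample_fraction(reports, fraction, rng):
    """Draw a simple random sample of the given fraction, without replacement."""
    ordered = sorted(reports, key=lambda r: r.bug_id)  # deterministic order
    k = round(len(ordered) * fraction)
    return set(rng.sample(ordered, k))


def combine_samples(all_reports, old_sample, x, rng):
    """Combine the old 2X% sample with new samples drawn at rate x.

    old_sample is the 2X% sample of reports that were fixed by Date1.
    """
    # Set1: a random half of the old 2X% sample, i.e. X% of reports fixed by Date1.
    set1 = sample_fraction(old_sample, 0.5, rng)
    # Set2: X% of the fixed reports opened after Date1 and before Date2.
    set2 = sample_fraction(
        [r for r in all_reports if r.opened_after_date1 and r.fixed_on_date2],
        x, rng)
    # Set3: reports in Set1 that are no longer marked Fixed on Date2.
    set3 = {r for r in set1 if not r.fixed_on_date2}
    # Set4: X% of the reports that were unfixed by Date1 but are fixed by Date2.
    set4 = sample_fraction(
        [r for r in all_reports
         if not r.opened_after_date1
         and not r.fixed_on_date1
         and r.fixed_on_date2],
        x, rng)
    # Final sample: union of Set1, Set2, and Set4, with Set3 excluded.
    return (set1 | set2 | set4) - set3
```

Passing a seeded `random.Random` instance as `rng` keeps the draw reproducible, which is a design choice of this sketch rather than something the procedure requires.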
Table 16 lists sizes of the four data sets for the three software projects. No combining is needed for the Linux kernel since it was sampled in 2010.
There is no difference between our combined sample and a sample randomly picked from the fixed bug reports on Date2, as either is a random sample of X% of the population. Therefore, the combined sample is representative of the fixed bug reports in the Bugzilla database. In addition, our results show that the distributions of these sets are similar, meaning that the variance across the data sets is small, which increases the confidence in and the reproducibility of our results.
Developers may update bug reports after the bugs are fixed. Therefore, we check all bugs in Set1 to find out whether the later activities affect our classifications of the bugs. Fortunately, only 5 of those bug reports in Mozilla and none in Apache have activities after Date1. We manually read these 5 bug reports again and find that those activities change the product, the QA contact, or the component of the bug reports, and do not change our original classifications. The component field used in the bug reports is finer-grained than the definition of component in Table 3. Therefore, the finer-grained component change in bug reports does not affect the higher-level component used in this paper.
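The equal-rate argument can be spelled out per group of bug reports. As a sketch, write $p = X/100$ and assume every sampling step is simple random sampling without replacement; the symbols $p$, $r$, and $S$ (the final combined sample) are notation introduced here for illustration only. For a bug report $r$ that is fixed on Date2:

```latex
\begin{align*}
\Pr[\, r \in S \mid r \text{ fixed by Date1 and still fixed on Date2} \,]
  &= 2p \cdot \tfrac{1}{2} = p && \text{(via Set1, not in Set3)} \\
\Pr[\, r \in S \mid r \text{ opened after Date1 and fixed by Date2} \,]
  &= p && \text{(via Set2)} \\
\Pr[\, r \in S \mid r \text{ unfixed by Date1 and fixed by Date2} \,]
  &= p && \text{(via Set4)}
\end{align*}
```

These three groups partition the bug reports that are fixed on Date2, so each of them is included at the same rate $p$. Bug reports that are not fixed on Date2 are never included, because Set3 is removed.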
1.2 B Bug Examples
1.3 C Detailed Numbers for the Figures
Tan, L., Liu, C., Li, Z. et al. Bug characteristics in open source software. Empir Software Eng 19, 1665–1705 (2014). https://doi.org/10.1007/s10664-013-9258-8