
The Debsources Dataset: two decades of free and open source software

Empirical Software Engineering

Abstract

We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them, such as size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package, containing which source code files, was released when), and license information (GPL, BSD, etc.). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study shows how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), giving a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
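The idea of storing deduplicated files organized by their SHA1 checksums can be illustrated with a minimal sketch. It is not the dataset's actual tooling: the function names `sha1_of_file` and `store_deduplicated`, and the two-level directory fan-out, are assumptions made for illustration only.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_deduplicated(source_files, store_root: Path) -> dict:
    """Copy each file into store_root under a path derived from its SHA1.

    Returns a mapping from original path to checksum. The fan-out
    directory (first two hex digits) is a hypothetical layout chosen
    for this sketch, not the dataset's documented one.
    """
    index = {}
    for src in source_files:
        checksum = sha1_of_file(src)
        dest = store_root / checksum[:2] / checksum
        if not dest.exists():  # identical content is stored only once
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dest)
        index[str(src)] = checksum
    return index
```

With such a layout, the mapping from file paths to checksums plays the role of the metadata, while the store holds each distinct file content exactly once, however many packages or releases ship it.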




Notes

  1. https://www.debian.org.

  2. http://www.dwheeler.com/sloccount/.

  3. https://github.com/AlDanial/cloc.

  4. http://www.darwinsys.com/file/.

  5. http://ctags.sourceforge.net/.

  6. http://www.postgresql.org.

  7. https://zenodo.org/.

  8. https://packages.debian.org/sid/debmirror.

  9. A list of Debian mirrors organized by geographical location is available at https://www.debian.org/mirror/list.

  10. https://packages.debian.org/sid/python-debian.

  11. https://en.wikipedia.org/wiki/List_of_Debian_releases.

  12. https://github.com/AlDanial/cloc.

  13. http://www.dwheeler.com/sloccount/.

  14. http://ctags.sourceforge.net/.

  15. https://www.westgrid.ca/.

  16. Note that two different SLOC metrics are available in the dataset: one computed by sloccount and one by cloc. Each tool has its strengths and weaknesses. For this case study we use the sloccount numbers.

  17. https://www.blackducksoftware.com/top-open-source-licenses.

  18. http://bugs.debian.org/740883.

  19. http://www.ubuntu.com/.


Author information


Corresponding author

Correspondence to Stefano Zacchiroli.

Additional information

Communicated by: Romain Robbes, Martin Pinzger and Yasutaka Kamei

This work has been partially performed at IRILL, center for Free Software Research and Innovation in Paris, France (http://www.irill.org). Unless noted otherwise, all URLs in the text have been retrieved on September 1st, 2016. Authors are listed alphabetically.


Cite this article

Caneill, M., Germán, D.M. & Zacchiroli, S. The Debsources Dataset: two decades of free and open source software. Empir Software Eng 22, 1405–1437 (2017). https://doi.org/10.1007/s10664-016-9461-5


  • DOI: https://doi.org/10.1007/s10664-016-9461-5
