Abstract
We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset covers more than 3 billion lines of source code, together with metadata about them: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package, containing which source code files, was released when), and license information (GPL, BSD, etc.). The Debsources Dataset comes as a set of tarballs containing deduplicated source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study shows how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), giving a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
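To illustrate what "deduplicated source code files organized by their SHA1 checksums" can mean in practice, the following is a minimal sketch of content-addressed storage. It is not the tooling used to build the dataset: the function names and the two-character sharding layout in `store_path` are assumptions made for illustration, and the actual directory layout inside the tarballs is not specified here.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_path(checksum: str, root: Path) -> Path:
    """Hypothetical layout: shard by the first two hex characters."""
    return root / checksum[:2] / checksum


def deduplicate(files, root: Path) -> dict:
    """Store each distinct content once; map every input path to its checksum."""
    index = {}
    for path in files:
        checksum = sha1_of_file(path)
        target = store_path(checksum, root)
        if not target.exists():
            # First time this content is seen: copy it into the store.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(path.read_bytes())
        index[str(path)] = checksum
    return index
```

With a layout of this kind, identical files shipped by different packages or releases occupy storage once, while the mapping from paths to checksums (kept in the metadata database in the case of the dataset) preserves where each file occurred.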
Notes
A list of Debian mirrors organized by geographical location is available at https://www.debian.org/mirror/list.
Note that two different SLOC metrics are available in the dataset: one computed by sloccount and one by cloc. Each tool has its strengths and weaknesses. For this case study we use the sloccount numbers.
Additional information
Communicated by: Romain Robbes, Martin Pinzger and Yasutaka Kamei
This work was partially performed at IRILL, center for Free Software Research and Innovation in Paris, France (http://www.irill.org). Unless noted otherwise, all URLs in the text were retrieved on September 1st, 2016. Authors are listed alphabetically.
Cite this article
Caneill, M., Germán, D.M. & Zacchiroli, S. The Debsources Dataset: two decades of free and open source software. Empir Software Eng 22, 1405–1437 (2017). https://doi.org/10.1007/s10664-016-9461-5
DOI: https://doi.org/10.1007/s10664-016-9461-5