DOI: 10.1145/3524842.3528003
short-paper

GitDelver enterprise dataset (GDED): an industrial closed-source dataset for socio-technical research

Published: 17 October 2022

Abstract

Conducting socio-technical software engineering research on closed-source software is difficult because most organizations are unwilling to give access to their code repositories. Most experiments and publications therefore focus on open-source projects, which provide only a partial view of software development communities. Yet closing the gap between the open-source and closed-source software industries is essential to increase the validity and applicability of results stemming from socio-technical software engineering research. We contribute to this effort by sharing our work in a large company with 4,800 employees. We mined 101 repositories and produced the GDED dataset, which contains socio-technical information about 106,216 commits, 470,940 file modifications, and 3,471,556 method modifications made by 164 developers over the last 13 years in various programming languages. To do so, we used GitDelver, an open-source tool we developed on top of PyDriller, and we anonymized and scrambled the data to comply with legal and corporate requirements. The dataset can be used for various purposes and provides information about code complexity, self-admitted technical debt, and bug fixes, as well as temporal information. We also share our experience with processing sensitive data to help other organizations make datasets publicly available to the research community.
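
The abstract does not say how the data were anonymized, so the following is only an illustrative sketch of one common approach, not the authors' method: replacing each developer identity with a stable pseudonym derived from a keyed hash (HMAC-SHA256). The function names, record fields, and key below are all hypothetical and are not part of GitDelver or GDED.

```python
import hashlib
import hmac


def pseudonymize(identity: str, secret_key: bytes, length: int = 12) -> str:
    """Map a developer identity to a stable pseudonym using a keyed hash.

    Assumption for illustration only: the same identity always yields the
    same pseudonym, so per-developer analyses still work on the output.
    """
    # Normalize so that trivial variations of one identity collapse together.
    normalized = identity.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return "dev_" + digest[:length]


def anonymize_commits(commits, secret_key):
    """Return copies of the commit records with the author pseudonymized."""
    return [
        {**commit, "author": pseudonymize(commit["author"], secret_key)}
        for commit in commits
    ]


# Hypothetical commit records; the field names are not taken from GDED.
commits = [
    {"hash": "a1", "author": "Alice@example.com", "files_changed": 3},
    {"hash": "b2", "author": "bob@example.com", "files_changed": 1},
    {"hash": "c3", "author": "alice@example.com ", "files_changed": 7},
]

anonymized = anonymize_commits(commits, secret_key=b"replace-with-a-private-key")
for record in anonymized:
    print(record["hash"], record["author"], record["files_changed"])
```

Keeping the key private matters in this sketch: without it, a third party cannot rebuild the mapping by hashing a list of candidate names, while the data owner can still regenerate the same pseudonyms when the dataset is updated.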


Published In

MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories
May 2022
815 pages
ISBN:9781450393034
DOI:10.1145/3524842
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset showcase
  2. development teams
  3. socio-technical aspects

Qualifiers

  • Short-paper

Conference

MSR '22