ABSTRACT
With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.