DOI: 10.1145/2597073.2597074
Article

The promises and perils of mining GitHub

Published: 31 May 2014

ABSTRACT

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

    • Published in

      MSR 2014: Proceedings of the 11th Working Conference on Mining Software Repositories
      May 2014
      427 pages
      ISBN: 9781450328630
      DOI: 10.1145/2597073
      General Chair: Premkumar Devanbu
      Program Chairs: Sung Kim, Martin Pinzger

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States
