
Three trillion lines: infrastructure for mining GitHub in the classroom

Published: 04 August 2020

Abstract

The increasing interest in collaborative software development on platforms like GitHub has made large amounts of data about development activities available. The GHTorrent project has recorded a significant proportion of GitHub’s public event stream and hosts what is currently the largest public dataset of metadata about open-source development. We describe our infrastructure, which makes this data locally available to researchers and students; examples of research activities carried out on this infrastructure; and what we learned from building the system. We identify a need for domain-specific tools, especially databases, that can deal with large-scale code repositories and their associated metadata, and we outline open challenges in using them more effectively in research and machine learning settings.

Cited By

  • (2024) Integrated Visual Software Analytics on the GitHub Platform. Computers 13:2 (33). DOI: 10.3390/computers13020033. Online publication date: 25-Jan-2024
  • (2022) Tooling for time- and space-efficient git repository mining. Proceedings of the 19th International Conference on Mining Software Repositories (413-417). DOI: 10.1145/3524842.3528503. Online publication date: 23-May-2022

Published In

Programming '20: Companion Proceedings of the 4th International Conference on Art, Science, and Engineering of Programming
March 2020
228 pages
ISBN: 9781450375078
DOI: 10.1145/3397537
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Big Code
  2. GitHub
  3. Repository Mining
  4. Teaching
  5. TravisCI

Qualifiers

  • Research-article

Conference

<Programming> '20
