Predicting long-time contributors for GitHub projects using machine learning

https://doi.org/10.1016/j.infsof.2021.106616
Under a Creative Commons license
open access

Abstract

Context:

Many organizations develop software systems using open source software (OSS), which is risky because of the high possibility of losing support. Contributors are critical for the survival of OSS projects, but very few new contributors remain with OSS projects long enough to become long-time contributors (LTCs). Identifying the factors that contribute to becoming an LTC can help OSS project owners use their limited resources to retain new contributors.

Objective:

In this paper, we investigate whether we can effectively predict which new contributors to OSS repositories will become long-time contributors, based on repository and contributor metadata collected from GitHub.

Method:

We construct a dataset containing 70,899 observations from the 888 most popular repositories, with 56,766 contributors. Each observation represents a contributor who joined a repository and is categorized as either an LTC or a non-LTC, depending on whether their project tenure is longer than 3 years. Each observation has 31 features, which are calculated from information about the new contributor and the repository at the time the contributor joins the project. We build several machine learning models, including naive Bayes, k-nearest neighbor, logistic regression, decision tree, and random forest, to predict LTC status, and validate them using 10-fold cross-validation. We compare our best model with the state-of-the-art model in terms of precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the curve (AUC).
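The threshold-based metrics named above have standard confusion-matrix definitions. The following is a minimal sketch of those definitions in plain Python, not the authors' evaluation code; the function name `binary_metrics` and the toy labels are hypothetical, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1, and MCC for binary labels.

    Labels are assumed to be 1 for LTC and 0 for non-LTC
    (an illustrative encoding, not taken from the paper).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy example: 4 true LTCs out of 10 contributors, 2 predicted as LTC.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case the classifier is right about half of its LTC predictions (precision 0.5) but finds only one of the four true LTCs (recall 0.25), the same high-precision, low-recall pattern that imbalanced LTC data tends to produce.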

Results:

In 10-fold cross-validation, the precision, recall, F1-score, MCC, and AUC of our best model (random forest) are 0.695, 0.079, 0.140, 0.226, and 0.913, respectively. These values are 27.29%, 92.68%, 86.67%, 56.94%, and 0.55% better, respectively, than those of the best state-of-the-art baseline model (also a random forest).
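As a rough sanity check on these figures, the sketch below backs out the baseline values they imply, assuming each percentage is a relative improvement over the baseline, that is, (ours - baseline) / baseline. That reading is an assumption; the abstract does not state the formula, and the derived baselines are not reported values. The variable and function names are hypothetical.

```python
# Values reported in the abstract for the best model (random forest).
reported = {"precision": 0.695, "recall": 0.079, "f1": 0.140,
            "mcc": 0.226, "auc": 0.913}

# Reported improvement over the best baseline, in percent.
improvement_pct = {"precision": 27.29, "recall": 92.68, "f1": 86.67,
                   "mcc": 56.94, "auc": 0.55}


def implied_baseline(value, pct):
    """Invert value = baseline * (1 + pct / 100) (assumed formula)."""
    return value / (1 + pct / 100)


implied = {name: round(implied_baseline(reported[name], improvement_pct[name]), 3)
           for name in reported}

# Harmonic mean of the reported precision and recall, for comparison
# with the reported F1-score of 0.140.
p, r = reported["precision"], reported["recall"]
f1_from_pr = 2 * p * r / (p + r)
```

Under that assumption, the implied baseline values are about 0.546, 0.041, 0.075, 0.144, and 0.908. The harmonic mean of the reported precision and recall is about 0.142, slightly above the reported F1-score of 0.140; a small gap of this kind would be expected if each metric were averaged over folds separately, though the abstract does not say how the averages were taken.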

Conclusion:

Compared to state-of-the-art models, the models built using our approach use fewer than 50% of the features (31 vs. 63), require no one-month wait after the contributor joins before predicting future LTC status, and produce better results.

Keywords

Long-time contributor
GitHub
GHTorrent
BigQuery
Machine learning models

Vijaya Kumar Eluri received the B.Tech. degree in computer science from Andhra University, India, in 1993, the M.Tech. degree in computer science from the Indian Statistical Institute, Calcutta, India, in 1996, an MBA degree from the University of Illinois at Urbana–Champaign, USA, in 2017, and an MS in Supply Chain Management from Rutgers Business School, USA, in 2019. He is currently pursuing the Ph.D. degree in systems engineering with George Washington University, Washington, DC, USA. Since 1996, he has been in the software industry in various roles. Currently, he is working as a Senior Software Development Manager at Amazon Inc., in the Supply Chain Optimization Technology group.

Thomas A. Mazzuchi, School of Engineering Management and Systems Engineering, George Washington University, Washington, DC, USA.

Thomas A. Mazzuchi received the B.A. degree in mathematics from Gettysburg College, Gettysburg, PA, USA, in 1978, and the M.S. and D.Sc. degrees in operations research from George Washington University, Washington, DC, USA, in 1979 and 1982, respectively.

Since 2000, he has been the Chair of the Department of Engineering Management and Systems Engineering, George Washington University, where he is also a Professor of Operations Research and of Engineering Management, following three and a half years as an Interim Dean with the School of Engineering and Applied Science, from 1997 to 2000, and three years as the Chair of the Department of Operations Research from 1994 to 1997. He also holds responsibility as an Associate Director of the Laboratory for Infrastructure Safety and Reliability, where he conducts reliability research. His current research interests include applied statistics, Bayesian inference, quality control, reliability analysis, risk analysis, stochastic models of operations research, and time series analysis.

Dr. Mazzuchi is an Elected Member of the International Statistics Institute and remains active in professional society matters. He is an Elected Member of the Operations Research Honor Society, Omega Rho, and the Science Honor Society, Sigma Xi. Since 1985, he has published scientific papers in such scholarly, refereed journals as the Journal of Risk Analysis, Safety Science, Technometrics, Lifetime Data Analysis, the IEEE Transactions on Reliability, the Journal of Statistical Planning and Inference, and Systems Engineering.

Shahram Sarkani, Engineering Management and Systems Engineering Off Campus Programs, George Washington University, Washington, DC, USA.

Shahram Sarkani received the B.S. and M.S. degrees in civil engineering from Louisiana State University, Baton Rouge, LA, USA, in 1980 and 1981, respectively, and the Ph.D. degree in civil engineering from Rice University, Houston, TX, USA, in 1987.

He is the Faculty Adviser and the Head of the Engineering Management and Systems Engineering Off-Campus Programs Office, George Washington University (GWU), Washington, DC, USA. He has been a Professor of Engineering Management and Systems Engineering since 1999, and the Founder and the Director of GWU’s Laboratory for Infrastructure Safety and Reliability since 1993. He served as an Interim Associate Dean for research and development from 1997 to 2001, and as the Chair of the Department of Civil, Mechanical, and Environmental Engineering from 1994 to 1997. His current research interests include engineering management, systems engineering, civil engineering, and logistics management. He has conducted contract research work with organizations such as the National Aeronautics and Space Administration, Washington, the National Institute of Standards and Technology, Gaithersburg, MD, USA, the National Science Foundation, Arlington, VA, USA, the U.S. Department of the Interior, Washington, the U.S. Department of the Navy, Washington, the U.S. Department of Transportation, Washington, and Walcoff and Associates Inc., McLean, VA, USA.

Dr. Sarkani is a member of Sigma Xi (as the GW Chapter President from 1994 to 1995), a member of the Probabilistic Methods Committee of the Engineering Mechanics Division, ASCE, and served as the Chair of the Committee on Fatigue and Fracture Reliability from 1990 to 1994. He is a member of the Technical Administration Committee of Structural Safety and Reliability, ASCE. He is a Registered Professional Engineer in the State of Virginia.