DOI: 10.1145/3350768.3351800
research-article

Is There an Interplay Between Library Usage and Repository Features?: An Analysis with Regression Models

Published: 23 September 2019

Abstract

The advent of open source has changed the way developers reuse software. The availability of libraries and their source code in public software repositories enables new ways of analyzing project aspects that can provide clues about their stability and maintainability. However, the literature lacks studies that aim to identify whether, and which, repository features correlate with the likelihood that a library is used. To address this gap, we present a factorial experiment using three regression models (Multiple Linear Regression, Random Forest, and Neural Networks) to analyze whether there is a correlation between library usage and a set of features extracted from release management and version control repositories. The results allowed us to map features with a positive learning impact, such as the number of stars, pull requests, and downloads, as well as features that contributed much less to the models (e.g., the repository size). Although the impact of each feature varied from model to model, the regression results also showed that the models achieved higher accuracy when considering only a subset of the features.
Paper category: Experimental; Language: English
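
As a rough illustration of the kind of analysis the abstract describes, the sketch below fits a multiple linear regression on a full feature set and on a subset of it. It is not the authors' pipeline: the data are hypothetical values constructed so that usage depends only on stars and pull requests, the helper names (fit_ols, r_squared, select_columns) are invented for this example, and only the linear model is shown, using the standard library alone.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Return [intercept, coef_1, ...] from the normal equations X'X b = X'y."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs: Sequence[float], row: Sequence[float]) -> float:
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def r_squared(coefs, rows, targets) -> float:
    """Coefficient of determination of the fitted model on the given data."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - predict(coefs, r)) ** 2 for r, t in zip(rows, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


def select_columns(rows, indices):
    """Keep only the feature columns listed in indices."""
    return [[row[i] for i in indices] for row in rows]


# Hypothetical repositories: (stars, pull_requests, size_mb). Not real data.
rows = [(10, 2, 5), (20, 1, 3), (30, 4, 8), (40, 3, 1), (50, 6, 9), (60, 5, 2)]
# Usage is constructed as 1 + 2*stars + 3*pull_requests; size_mb plays no role.
usage = [1 + 2 * s + 3 * p for s, p, _ in rows]

full_coefs = fit_ols(rows, usage)            # all three features
subset_rows = select_columns(rows, [0, 1])   # stars and pull requests only
subset_coefs = fit_ols(subset_rows, usage)

print("full model:  ", [round(c, 6) for c in full_coefs])
print("subset model:", [round(c, 6) for c in subset_coefs])
print("subset R^2:  ", round(r_squared(subset_coefs, subset_rows, usage), 6))
```

Because the synthetic target ignores repository size, the full model assigns that feature a coefficient close to zero and the subset model fits equally well. With real, noisy data the comparison would need held-out evaluation, and the Random Forest and Neural Network models would need their own feature-impact measures.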



Published In

SBES '19: Proceedings of the XXXIII Brazilian Symposium on Software Engineering
September 2019
583 pages
ISBN: 9781450376518
DOI: 10.1145/3350768

In-Cooperation

  • SBC: Sociedade Brasileira de Computação

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Regression models
  2. library usage
  3. mining software repositories
  4. software reuse

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SBES 2019

Acceptance Rates

SBES '19 Paper Acceptance Rate 67 of 153 submissions, 44%;
Overall Acceptance Rate 147 of 427 submissions, 34%
