Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem

Cheng, Can; Li, Bing; Li, Zengyang; Liang, Peng; Han, Xiaofeng; Zhang, Jiahua

doi:10.1007/s10515-022-00322-4

Published: 18 March 2022

Volume 29, article number 33, (2022)

Automated Software Engineering

Can Cheng ORCID: orcid.org/0000-0003-4260-4825¹,
Bing Li¹,
Zengyang Li²,
Peng Liang¹,
Xiaofeng Han¹ &
Jiahua Zhang¹

Abstract

With the tools and datasets available in the GitHub ecosystem, researchers have the opportunity to study diverse software engineering problems on large-scale data. However, using large-scale datasets directly carries many potential threats; an important one is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). For researchers who want to study the cooperative behavior of developers or the development process of projects, research samples should contain neither private projects nor non-development projects. To address this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of the extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms; Extended_MLMs for short). We then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and performs best among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM on accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM is improved from 0.690 to 0.810 and the precision of Extended_RFM is improved from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.
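To see how the reported scores relate to each other, the sketch below assumes that the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state this explicitly. Under that assumption, the recall implied by each reported precision and F-measure pair can be recovered. The function names are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1 (assumes F1 is the metric)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denominator


# Values reported in the abstract for Extended_RFM under cross validation.
before = implied_recall(precision=0.559, f1=0.690)
after = implied_recall(precision=0.785, f1=0.810)
print(f"implied recall before: {before:.3f}, after: {after:.3f}")
```

If the F1 assumption holds, the implied recall is roughly 0.90 before and 0.84 after, which would suggest that the improvement in F-measure is driven mainly by the gain in precision.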

Notes

Many GitHub datasets could potentially be used in this study (e.g., GitHub Archive, Boa (Nguyen 2013)). Because the GHTorrent dataset is easy to use and is adopted by 64% of studies on mining software repositories (MSR) (Kotti et al. 2020), we decided to test our method on this dataset.
They defined an engineered software project as a software project that leverages sound software engineering practices in each of its dimensions such as documentation, testing, and project management.
Feature_set has two values: selecting all features (all for short) and selecting the basic features (basic for short). ML_method has six values: J48 decision tree (J48), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and Bayesian Network (BN). MLM is a special case of the Extended_MLMs, in which Feature_set is set to all and ML_method is set to J48.
By querying the GitHub API (e.g., https://api.github.com/repos/Sulcalibur/Responsive-Web-Design-Photoshop-Blueprint), we can obtain useful statistical information about a project.
Weka is a data mining tool. https://www.cs.waikato.ac.nz/ml/weka.
There are many personal projects, such as “my personal blog” and “my homework”.
Dimensionality here refers to the phenomenon that, if a dataset contains too many features, fitting a model on it becomes very inefficient. For the task of predicting PDPs, a model that uses more keywords is likely to produce better results; however, too many features seriously reduce the efficiency of generating the model.
We stopped at 100 because when num_samples_keyword is larger than 100, the training dataset exceeds half of the Standard Dataset. A testing dataset with too few samples is not enough to verify the generality of the fitted model.
In this study, Enhanced_RFM and Fusion_DL_RFM are based on a randomly selected training dataset. Hence, stability is important, since researchers cannot control the random process of collecting the training dataset.
A free open source software, http://www.openssl.org/.
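The note above that defines Feature_set and ML_method describes the Extended_MLMs as combinations of those two parameters. A minimal sketch of that configuration space follows, assuming every Feature_set value can be paired with every ML_method value (which would give 2 × 6 = 12 variants); the variable names and the dictionary layout are illustrative, not taken from the paper.

```python
from itertools import product

# Parameter values listed in the notes.
FEATURE_SETS = ("all", "basic")
ML_METHODS = ("J48", "LR", "NB", "RF", "SVM", "BN")

# Assumption: every Feature_set value can be combined with every ML_method value.
extended_mlms = [
    {"feature_set": fs, "ml_method": method}
    for fs, method in product(FEATURE_SETS, ML_METHODS)
]

# MLM is the special case that uses all features and the J48 decision tree.
MLM = {"feature_set": "all", "ml_method": "J48"}

print(len(extended_mlms))    # 12
print(MLM in extended_mlms)  # True
```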

References

Aggarwal, K., Hindle, A., Stroulia, E.: Co-evolution of project documentation and popularity within github. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 360–363 (2014)
Bao, L., Xia, X., Lo, D., Murphy, G.C.: A large scale study of long-time contributor prediction for github projects. IEEE Trans. Softw. Eng. 47(6), 1277–1298 (2021)
Beel, J., Gipp, B., Langer, S., Breitinger, C.: Research-paper recommender systems: a literature survey. Int. J. Digit. Libr. 17(4), 1–34 (2015)
Bertoncello, M.V., Pinto, G., Wiese, I.S., Steinmacher, I.: Pull requests or commits? which method should we use to study contributors’ behavior? In: Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 592–601 (2020)
Brindescu, C., Codoban, M., Shmarkatiuk, S., Dig, D.: How do centralized and distributed version control systems impact software changes? In: Proceedings of the 36th International Conference on Software Engineering (ICSE), pp. 322–333 (2014)
Burlet, G., Hindle, A.: An empirical study of end-user programmers in the computer music community. In: Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), pp. 292–302 (2015)
Cheng, C., Li, B., Li, Z., Liang, P., Han, X., Zhang, J.: Datasets for “Improving generality and accuracy of existing public development project selection methods: a study on github ecosystem”. https://github.com/vyqrvwgf1/Study_dataset (2021)
Cheng, C., Li, B., Li, Z., Liang, P.: Automatic detection of public development projects in large open source ecosystems: an exploratory study on github. In: Proceedings of the 30th International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 193–198 (2018)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960)
Constantinou, E., Mens, T.: Socio-technical evolution of the ruby ecosystem in github. In: Proceedings of the 24th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 34–44 (2017)
Cosentino, V., Izquierdo, J.L.C., Cabot, J.: A systematic mapping study of software development with github. IEEE Access 5(99), 7173–7192 (2017)
Del Carpio, A.F., Angarita, L.B.: Trends in software engineering processes using deep learning: a systematic literature review. In: Proceedings of the 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 445–454 (2020)
Elazhary, O., Storey, M.A., Ernst, N., Zaidman, A.: Do as I do, not as I say: do contribution guidelines match the github contribution process. In: Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 286–290 (2019)
Falessi, D., Smith, W., Serebrenik, A.: Stress: a semi-automated, fully replicable approach for project selection. In: Proceedings of the 11th International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 151–156 (2017)
Fu, W., Menzies, T.: Easy over hard: a case study on deep learning. In: Proceedings of the 12th Joint Meeting on Foundations of Software Engineering (FSE), pp. 49–60 (2017)
Gousios, G., Pinzger, M., Deursen, A.V.: An exploratory study of the pull-based software development model. In: Proceedings of the 36th International Conference on Software Engineering (ICSE), pp. 345–355 (2014)
Gousios, G., Spinellis, D.: Ghtorrent: Github’s data from a firehose. In: Proceedings of the 9th Working Conference on Mining Software Repositories (MSR), pp. 12–21 (2012)
Gousios, G., Zaidman, A., Storey, M.A., Van Deursen, A.: Work practices and challenges in pull-based development: the integrator’s perspective. In: Proceedings of the 37th IEEE International Conference on Software Engineering (ICSE), pp. 358–368 (2015)
Gousios, G.: The ghtorrent dataset and tool suite. In: Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pp. 233–236 (2013)
Goyal, R., Ferreira, G., Kästner, C., Herbsleb, J.: Identifying unusual commits on github. J. Softw. Evol. Process 30(1), e1893 (2018)
Hata, H., Todo, T., Onoue, S., Matsumoto, K.: Characteristics of sustainable oss projects: a theoretical and empirical study. In: Proceedings of the 8th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), pp. 15–21 (2015)
He, P., Li, B., Liu, X., Chen, J., Ma, Y.: An empirical study on software defect prediction with a simplified metric set. Inf. Softw. Technol. 59(3), 170–190 (2015)
Hilton, M., Tunnell, T., Huang, K., Marinov, D., Dig, D.: Usage, costs, and benefits of continuous integration in open-source projects. In: Proceedings of the 31st International Conference on Automated Software Engineering (ASE), pp. 426–437 (2016)
Jiang, J., Zhang, L., Li, L.: Understanding project dissemination on a social coding site. In: Proceedings of the 20th Working Conference on Reverse Engineering (WCRE), pp. 132–141 (2013)
Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: FastText.zip: compressing text classification models. arXiv:1612.03651 (2016)
Jun-Wei, L., Navid, S., Sam, M.: Test automation in open-source android apps: a large-scale empirical study. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1078–1089 (2020)
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: The promises and perils of mining github. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 92–101 (2014)
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: An in-depth study of the promises and perils of mining github. Empir. Softw. Eng. 21(5), 2035–2071 (2016)
Kikas, R., Dumas, M., Pfahl, D.: Using dynamic and contextual features to predict issue lifetime in github projects. In: Proceedings of the 13th Working Conference on Mining Software Repositories (MSR), pp. 291–302 (2016)
Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)
Kotti, Z., Kravvaritis, K., Dritsa, K., Spinellis, D.: Standing on shoulders or feet? An extended study on the usage of the msr data papers. Empir. Softw. Eng. 25(5), 3288–3322 (2020)
Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 2267–2273 (2015)
Mechelli, A., Vieira, S.: Machine Learning: Methods and Applications to Brain Disorders. Academic Press, Cambridge (2019)
Meli, M., Mcniece, M.R., Reaves, B.: How bad can it git? Characterizing secret leakage in public github repositories. In: Proceedings of the 26th Network and Distributed System Security Symposium (NDSS), pp. 1–15 (2019)
Meng, Y., Wang, G., Liu, Q.: Multi-layer convolutional neural network model based on prior knowledge of knowledge graph for text classification. In: Proceedings of the 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp. 618–624 (2019)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)
Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating github for engineered software projects. Empir. Softw. Eng. 22(3), 1–35 (2016)
Murphy, G.C., Terra, R., Figueiredo, J., Serey, D.: Do developers discuss design? In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 340–343 (2014)
Nguyen, T.N.: Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), pp. 422–431 (2013)
Overney, C., Meinicke, J., Kästner, C., Vasilescu, B.: How to not get rich: an empirical study of donations in open source. In: Proceedings of the 42nd International Conference on Software Engineering (ICSE), pp. 1209–1221 (2020)
Padhye, R., Mani, S., Sinha, V.S.: A study of external community contribution to open-source projects on github. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 332–335 (2014)
Rausch, T., Hummer, W., Leitner, P., Schulte, S.: An empirical analysis of build failures in the continuous integration workflows of java-based open-source software. In: Proceedings of the 14th International Conference on Mining Software Repositories (MSR), pp. 345–355 (2017)
Robles, G., Ho-Quang, T., Hebig, R., Chaudron, M.R.V., Fernandez, M.A.: An extensive dataset of uml models in github. In: Proceedings of the 14th International Conference on Mining Software Repositories (MSR), pp. 519–522 (2017)
Saha, A.K., Saha, R.K., Schneider, K.A.: A discriminative model approach for suggesting tags automatically for stack overflow questions. In: Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pp. 73–76 (2013)
Sharma, T., Fragkoulis, M., Spinellis, D.: Does your configuration code smell? In: Proceedings of the 13th Working Conference on Mining Software Repositories (MSR), pp. 189–200 (2016)
Surhone, L.M., Tennoe, M.T., Henssonow, S.F., Breiman, L.: Random forest. Mach. Learn. 45(1), 5–32 (2010)
Vasilescu, B., Blincoe, K., Xuan, Q., Casalnuovo, C., Filkov, V.: The sky is not the limit: multitasking across github projects. In: Proceedings of the 38th International Conference on Software Engineering (ICSE), pp. 994–1005 (2016)
Vasilescu, B., Yu, Y., Wang, H., Devanbu, P., Filkov, V.: Quality and productivity outcomes relating to continuous integration in github. In: Proceedings of the 23rd Joint Meeting on Foundations of Software Engineering (FSE), pp. 805–816 (2015)
Wang, T., Wang, H., Yin, G., Ling, C.X., Li, X., Zou, P.: Tag recommendation for open source software. Front. Comput. Sci. 8(1), 69–82 (2014)
Xavier, J., Macedo, A., Maia, M.D.A.: Understanding the popularity of reporters and assignees in the github. In: Proceedings of the 26th International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 484–489 (2014)
Yu, Y., Wang, H., Filkov, V., Devanbu, P., Vasilescu, B.: Wait for it: determinants of pull request evaluation latency on github. In: Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), pp. 367–371 (2015)
Yu, Y., Wang, H., Yin, G., Wang, T.: Reviewer recommendation for pull-requests in github: what can we learn from code review and bug assignment? Inf. Softw. Technol. 74, 204–218 (2016)
Zahavy, T., Krishnan, A., Magnani, A., Mannor, S.: Is a picture worth a thousand words? A deep multi-modal architecture for product classification in e-commerce. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 7873–7880 (2018)
Zhao, Y., Serebrenik, A., Zhou, Y., Filkov, V., Vasilescu, B.: The impact of continuous integration on other software development practices: a large-scale empirical study. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 60–71 (2017)
Zhao, G., Da Costa, D.A., Zou, Y.: Improving the pull requests review process using learning-to-rank algorithms. Empir. Softw. Eng. 24(4), 2140–2170 (2019)
Zhou, P., Liu, J., Liu, X., Yang, Z., Grundy, J.: Is deep learning better than traditional approaches in tag recommendation for software information sites? Inf. Softw. Technol. 109, 1–13 (2019)

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61832014, 62172311, 62032016) and the Natural Science Foundation of Hubei Province of China (No. 2021CFB577).

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, Wuhan, China
Can Cheng, Bing Li, Peng Liang, Xiaofeng Han & Jiahua Zhang
School of Computer Science, Central China Normal University, Wuhan, China
Zengyang Li

Corresponding authors

Correspondence to Bing Li or Peng Liang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Table 11.

Table 11 Glossary used in the study

About this article

Cite this article

Cheng, C., Li, B., Li, Z. et al. Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem. Autom Softw Eng 29, 33 (2022). https://doi.org/10.1007/s10515-022-00322-4

Received: 19 June 2021
Accepted: 03 January 2022
Published: 18 March 2022
DOI: https://doi.org/10.1007/s10515-022-00322-4

Part of a collection:

Special Issue on Deep Learning in Open-Source Software Ecosystems
