
Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem

Automated Software Engineering

Abstract

With the tools and datasets available in the GitHub ecosystem, researchers have the opportunity to study diverse software engineering problems on large-scale data. However, using large-scale datasets directly carries many potential threats, and one important threat is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). For researchers who want to study the cooperative behavior of developers or the development process of projects, research samples should contain neither private projects nor non-development projects. To solve this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of the extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms, Extended_MLMs for short). We then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and performs best among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM on accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM is improved from 0.690 to 0.810 and its precision is improved from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.
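
The abstract reports precision and F-measure but not recall. As a rough sketch, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall, which the abstract does not state explicitly), the recall implied by each reported pair can be back-computed. The function names are illustrative and do not come from the paper, and the derived recall values are approximations from rounded figures, not results reported by the authors.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denom


# Precision and F-measure as reported in the abstract (cross validation).
# ASSUMPTION: the reported F-measure is the balanced F1 score.
reported = {
    "Extended_RFM (before)": {"precision": 0.559, "f_measure": 0.690},
    "Extended_RFM with Enhanced_RFM (after)": {"precision": 0.785, "f_measure": 0.810},
}

for name, m in reported.items():
    recall = implied_recall(m["precision"], m["f_measure"])
    # Sanity check: plugging the recall back in reproduces the F-measure.
    assert abs(f_measure(m["precision"], recall) - m["f_measure"]) < 1e-9
    print(f"{name}: implied recall is about {recall:.2f}")
```

Under this assumption, the improvement in F-measure comes mainly from the gain in precision, since the implied recall does not increase.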




Notes

  1. There are many GitHub datasets that could potentially be used in this study (e.g., GitHub Archive, Boa (Nguyen 2013)). Considering that the GHTorrent dataset is easy to use and is adopted by 64% of the studies on mining software repositories (MSR) (Kotti et al. 2020), we decided to test our method on this dataset.

  2. They defined an engineered software project as a software project that leverages sound software engineering practices in each of its dimensions such as documentation, testing, and project management.

  3. Feature_set has two values: selecting all features (all for short) and selecting the basic features (basic for short). ML_method has six values: J48 decision tree (J48), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and Bayesian Network (BN). MLM is a special case of the Extended_MLMs, in which Feature_set is set to all and ML_method is set to J48.

  4. By querying the GitHub API (e.g., https://api.github.com/repos/Sulcalibur/Responsive-Web-Design-Photoshop-Blueprint), we can obtain useful statistical information about a project.

  5. Weka is a data mining tool. https://www.cs.waikato.ac.nz/ml/weka.

  6. There are many personal projects, such as “my personal blog” and “my homework”.

  7. Dimensionality refers to the phenomenon that fitting a model becomes much less efficient when the dataset contains too many features. For the task of predicting PDPs, a model that contains more keywords is likely to give better results. However, too many features will seriously reduce the efficiency of generating the model.

  8. We stopped at 100 because when num_samples_keyword is larger than 100, the training dataset exceeds half of the Standard Dataset. A testing dataset with too few samples is not enough to verify the generality of the fitted model.

  9. In this study, Enhanced_RFM and Fusion_DL_RFM are based on a randomly selected training dataset. Hence, stability is important, since researchers cannot control the random process of collecting the training dataset.

  10. Free, open-source software: http://www.openssl.org/.
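
Note 3 describes the Extended_MLMs as parameterized by a feature set and an ML algorithm. A minimal sketch of that configuration space follows, assuming every feature set can be paired with every algorithm (the note does not state this explicitly). The variable and function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Values listed in Note 3.
FEATURE_SETS = ("all", "basic")
ML_METHODS = ("J48", "LR", "NB", "RF", "SVM", "BN")


def extended_mlm_configs():
    """Enumerate every (Feature_set, ML_method) pair.

    ASSUMPTION: the full cross-product is valid; the note only lists
    the values each parameter can take.
    """
    return [
        {"feature_set": fs, "ml_method": ml}
        for fs, ml in product(FEATURE_SETS, ML_METHODS)
    ]


def is_original_mlm(config):
    """MLM is the special case Feature_set = all, ML_method = J48."""
    return config["feature_set"] == "all" and config["ml_method"] == "J48"


configs = extended_mlm_configs()
print(len(configs))  # 2 feature sets x 6 algorithms = 12
print([c for c in configs if is_original_mlm(c)])
```

Exactly one of the enumerated configurations corresponds to the original MLM.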

References

  • Aggarwal, K., Hindle, A., Stroulia, E.: Co-evolution of project documentation and popularity within github. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 360–363 (2014)

  • Bao, L., Xia, X., Lo, D., Murphy, G.C.: A large scale study of long-time contributor prediction for github projects. IEEE Trans. Softw. Eng. 47(6), 1277–1298 (2021)


  • Beel, J., Gipp, B., Langer, S., Breitinger, C.: Research-paper recommender systems: a literature survey. Int. J. Digit. Libr. 17(4), 1–34 (2015)


  • Bertoncello, M.V., Pinto, G., Wiese, I.S., Steinmacher, I.: Pull requests or commits? which method should we use to study contributors’ behavior? In: Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 592–601 (2020)

  • Brindescu, C., Codoban, M., Shmarkatiuk, S., Dig, D.: How do centralized and distributed version control systems impact software changes? In: Proceedings of the 36th International Conference on Software Engineering (ICSE), pp. 322–333 (2014)

  • Burlet, G., Hindle, A.: An empirical study of end-user programmers in the computer music community. In: Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), pp. 292–302 (2015)

  • Cheng, C., Li, B., Li, Z., Liang, P., Han, X., Zhang, J.: Datasets for “Improving generality and accuracy of existing public development project selection methods: a study on github ecosystem”. https://github.com/vyqrvwgf1/Study_dataset (2021)

  • Cheng, C., Li, B., Li, Z., Liang, P.: Automatic detection of public development projects in large open source ecosystems: an exploratory study on github. In: Proceedings of the 30th International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 193–198 (2018)

  • Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960)


  • Constantinou, E., Mens, T.: Socio-technical evolution of the ruby ecosystem in github. In: Proceedings of the 24th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 34–44 (2017)

  • Cosentino, V., Izquierdo, J.L.C., Cabot, J.: A systematic mapping study of software development with github. IEEE Access 5(99), 7173–7192 (2017)


  • Del Carpio, A.F., Angarita, L.B.: Trends in software engineering processes using deep learning: a systematic literature review. In: Proceedings of the 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 445–454 (2020)

  • Elazhary, O., Storey, M.A., Ernst, N., Zaidman, A.: Do as I do, not as I say: do contribution guidelines match the github contribution process. In: Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 286–290 (2019)

  • Falessi, D., Smith, W., Serebrenik, A.: Stress: a semi-automated, fully replicable approach for project selection. In: Proceedings of the 11th International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 151–156 (2017)

  • Fu, W., Menzies, T.: Easy over hard: a case study on deep learning. In: Proceedings of the 12th Joint Meeting on Foundations of Software Engineering (FSE), pp. 49–60 (2017)

  • Gousios, G., Pinzger, M., Deursen, A.V.: An exploratory study of the pull-based software development model. In: Proceedings of the 36th International Conference on Software Engineering (ICSE), pp. 345–355 (2014)

  • Gousios, G., Spinellis, D.: Ghtorrent: Github’s data from a firehose. In: Proceedings of the 9th Working Conference on Mining Software Repositories (MSR), pp. 12–21 (2012)

  • Gousios, G., Zaidman, A., Storey, M.A., Van Deursen, A.: Work practices and challenges in pull-based development: the integrator’s perspective. In: Proceedings of the 37th IEEE International Conference on Software Engineering (ICSE), pp. 358–368 (2015)

  • Gousios, G.: The ghtorrent dataset and tool suite. In: Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pp. 233–236 (2013)

  • Goyal, R., Ferreira, G., Kästner, C., Herbsleb, J.: Identifying unusual commits on github. J. Softw. Evol. Process 30(1), e1893 (2018)


  • Hata, H., Todo, T., Onoue, S., Matsumoto, K.: Characteristics of sustainable oss projects: a theoretical and empirical study. In: Proceedings of the 8th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), pp. 15–21 (2015)

  • He, P., Li, B., Liu, X., Chen, J., Ma, Y.: An empirical study on software defect prediction with a simplified metric set. Inf. Softw. Technol. 59(3), 170–190 (2015)


  • Hilton, M., Tunnell, T., Huang, K., Marinov, D., Dig, D.: Usage, costs, and benefits of continuous integration in open-source projects. In: Proceedings of the 31st International Conference on Automated Software Engineering (ASE), pp. 426–437 (2016)

  • Jiang, J., Zhang, L., Li, L.: Understanding project dissemination on a social coding site. In: Proceedings of the 20th Working Conference on Reverse Engineering (WCRE), pp. 132–141 (2013)

  • Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: Fasttext.zip: compressing text classification models. arXiv:1612.03651 (2016)

  • Jun-Wei, L., Navid, S., Sam, M.: Test automation in open-source android apps: a large-scale empirical study. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1078–1089 (2020)

  • Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: The promises and perils of mining github. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 92–101 (2014)

  • Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: An in-depth study of the promises and perils of mining github. Empir. Softw. Eng. 21(5), 2035–2071 (2016)


  • Kikas, R., Dumas, M., Pfahl, D.: Using dynamic and contextual features to predict issue lifetime in github projects. In: Proceedings of the 13th Working Conference on Mining Software Repositories (MSR), pp. 291–302 (2016)

  • Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)

  • Kotti, Z., Kravvaritis, K., Dritsa, K., Spinellis, D.: Standing on shoulders or feet? An extended study on the usage of the msr data papers. Empir. Softw. Eng. 25(5), 3288–3322 (2020)


  • Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 2267–2273 (2015)

  • Mechelli, A., Vieira, S.: Machine Learning: Methods and Applications to Brain Disorders. Academic Press, Cambridge (2019)


  • Meli, M., Mcniece, M.R., Reaves, B.: How bad can it git? Characterizing secret leakage in public github repositories. In: Proceedings of the 26th Network and Distributed System Security Symposium (NDSS), pp. 1–15 (2019)

  • Meng, Y., Wang, G., Liu, Q.: Multi-layer convolutional neural network model based on prior knowledge of knowledge graph for text classification. In: Proceedings of the 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp. 618–624 (2019)

  • Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)

  • Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating github for engineered software projects. Empir. Softw. Eng. 22(3), 1–35 (2016)


  • Murphy, G.C., Terra, R., Figueiredo, J., Serey, D.: Do developers discuss design? In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 340–343 (2014)

  • Nguyen, T.N.: Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), pp. 422–431 (2013)

  • Overney, C., Meinicke, J., Kästner, C., Vasilescu, B.: How to not get rich: an empirical study of donations in open source. In: Proceedings of the 42nd International Conference on Software Engineering (ICSE), pp. 1209–1221 (2020)

  • Padhye, R., Mani, S., Sinha, V.S.: A study of external community contribution to open-source projects on github. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), pp. 332–335 (2014)

  • Rausch, T., Hummer, W., Leitner, P., Schulte, S.: An empirical analysis of build failures in the continuous integration workflows of java-based open-source software. In: Proceedings of the 14th International Conference on Mining Software Repositories (MSR), pp. 345–355 (2017)

  • Robles, G., Ho-Quang, T., Hebig, R., Chaudron, M.R.V., Fernandez, M.A.: An extensive dataset of uml models in github. In: Proceedings of the 14th International Conference on Mining Software Repositories (MSR), pp. 519–522 (2017)

  • Saha, A.K., Saha, R.K., Schneider, K.A.: A discriminative model approach for suggesting tags automatically for stack overflow questions. In: Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pp. 73–76 (2013)

  • Sharma, T., Fragkoulis, M., Spinellis, D.: Does your configuration code smell? In: Proceedings of the 13th Working Conference on Mining Software Repositories (MSR), pp. 189–200 (2016)

  • Surhone, L.M., Tennoe, M.T., Henssonow, S.F., Breiman, L.: Random forest. Mach. Learn. 45(1), 5–32 (2010)


  • Vasilescu, B., Blincoe, K., Xuan, Q., Casalnuovo, C., Filkov, V.: The sky is not the limit: multitasking across github projects. In: Proceedings of the 38th International Conference on Software Engineering (ICSE), pp. 994–1005 (2016)

  • Vasilescu, B., Yu, Y., Wang, H., Devanbu, P., Filkov, V.: Quality and productivity outcomes relating to continuous integration in github. In: Proceedings of the 23rd Joint Meeting on Foundations of Software Engineering (FSE), pp. 805–816 (2015)

  • Wang, T., Wang, H., Yin, G., Ling, C.X., Li, X., Zou, P.: Tag recommendation for open source software. Front. Comput. Sci. 8(1), 69–82 (2014)


  • Xavier, J., Macedo, A., Maia, M.D.A.: Understanding the popularity of reporters and assignees in the github. In: Proceedings of the 26th International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 484–489 (2014)

  • Yu, Y., Wang, H., Filkov, V., Devanbu, P., Vasilescu, B.: Wait for it: determinants of pull request evaluation latency on github. In: Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), pp. 367–371 (2015)

  • Yu, Y., Wang, H., Yin, G., Wang, T.: Reviewer recommendation for pull-requests in github: what can we learn from code review and bug assignment? Inf. Softw. Technol. 74, 204–218 (2016)


  • Zahavy, T., Krishnan, A., Magnani, A., Mannor, S.: Is a picture worth a thousand words? A deep multi-modal architecture for product classification in e-commerce. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 7873–7880 (2018)

  • Zhao, Y., Serebrenik, A., Zhou, Y., Filkov, V., Vasilescu, B.: The impact of continuous integration on other software development practices: a large-scale empirical study. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 60–71 (2017)

  • Zhao, G., Da Costa, D.A., Zou, Y.: Improving the pull requests review process using learning-to-rank algorithms. Empir. Softw. Eng. 24(4), 2140–2170 (2019)


  • Zhou, P., Liu, J., Liu, X., Yang, Z., Grundy, J.: Is deep learning better than traditional approaches in tag recommendation for software information sites? Inf. Softw. Technol. 109, 1–13 (2019)



Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61832014, 62172311, 62032016) and the Natural Science Foundation of Hubei Province of China (No. 2021CFB577).

Author information

Corresponding authors

Correspondence to Bing Li or Peng Liang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


See Table 11.

Table 11 Glossary used in the study


About this article


Cite this article

Cheng, C., Li, B., Li, Z. et al. Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem. Autom Softw Eng 29, 33 (2022). https://doi.org/10.1007/s10515-022-00322-4


  • DOI: https://doi.org/10.1007/s10515-022-00322-4
