Abstract
Mining software repositories (MSR) has recently become a popular research area. MSR analyzes diverse data sources, such as version control systems, code repositories, defect tracking systems, archived communication, and deployment logs, to uncover interesting and actionable insights for improving software development, maintenance, and evolution. This chapter provides an overview of MSR and explains how to conduct an MSR study: setting up the study, formulating research goals and questions, identifying repositories, extracting and cleaning the data, performing data analysis and synthesis, and discussing the study's limitations. The chapter also discusses MSR as part of a mixed-method study, explains how to mine data ethically, surveys recent trends in MSR, and reflects on the future. As a teaching aid, the chapter provides tips for educators, exercises for students at all levels, and a list of repositories that can serve as a starting point for an MSR study.
Notes
- 1. GitHub Copilot: https://github.com/features/copilot/
- 2. In this scenario, the commit order is particularly relevant [19].
- 14. GitHub's acceptable use policy: https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies.
- 15. MSR 2024 Mining Challenge: https://2024.msrconf.org/track/msr-2024-mining-challenge?
References
Abdelrazek, A., Eid, Y., Gawish, E., Medhat, W., Hassan, A.: Topic modeling algorithms and applications: a survey. Inform. Syst. 112, 102131 (2023)
Azeem, M.I., Palomba, F., Shi, L., Wang, Q.: Machine learning techniques for code smell detection: a systematic literature review and meta-analysis. Inform. Softw. Technol. 108, 115–138 (2019)
Barros, D., Horita, F., Wiese, I., Silva, K.: A mining software repository extended cookbook: lessons learned from a literature review. In: Proceedings of the XXXV Brazilian Symposium on Software Engineering, pp. 1–10 (2021)
Basili, V.R.: Goal, question, metric paradigm. Encyclopedia Softw. Eng. 1, 528–532 (1994)
Binkley, D.: Source code analysis: a road map. In: Future of Software Engineering (FOSE’07), pp. 104–119 (2007)
Borges, H., Tulio Valente, M.: What’s in a GitHub star? Understanding repository starring practices in a social coding platform. J. Syst. Softw. 146, 112–129 (2018)
Catolino, G., Palomba, F., Zaidman, A., Ferrucci, F.: Not all bugs are the same: understanding, characterizing, and classifying bug types. J. Syst. Softw. 152, 165–181 (2019)
Chatterjee, P., Sharma, T., Ralph, P.: Empirical standards for repository mining. MSR ’22, pp. 142–143. Association for Computing Machinery, New York (2022)
Chen, T.H., Thomas, S.W., Hassan, A.E.: A survey on the use of topic models when mining software repositories. Empir. Softw. Eng. 21, 1843–1919 (2016)
Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)
Creswell, J.W.: Mixed-method research: Introduction and application. In: Handbook of Educational Policy, pp. 455–472. Elsevier, Amsterdam (1999)
Dalla Palma, S., Di Nucci, D., Palomba, F., Tamburri, D.A.: Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Trans. Softw. Eng. 48(6), 2086–2104 (2021)
di Biase, M., Rastogi, A., Bruntink, M., van Deursen, A.: The delta maintainability model: measuring maintainability of fine-grained code changes. In: 2019 IEEE/ACM International Conference on Technical Debt (TechDebt), pp. 113–122. IEEE, Piscataway (2019)
de Oliveira Neto, F.G., Torkar, R., Feldt, R., Gren, L., Furia, C.A., Huang, Z.: Evolution of statistical analysis in empirical software engineering research: Current state and steps forward. J. Syst. Softw. 156, 246–267 (2019)
Dey, T., Mousavi, S., Ponce, E., Fry, T., Vasilescu, B., Filippova, A., Mockus, A.: Detecting and characterizing bots that commit code. In: Proceedings of the 17th International Conference on Mining Software Repositories, pp. 209–219 (2020)
Dey, T., Vasilescu, B., Mockus, A.: An exploratory study of bot commits. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW’20, pp. 61–65. Association for Computing Machinery, New York (2020)
Dit, B., Revelle, M., Gethers, M., Poshyvanyk, D.: Feature location in source code: a taxonomy and survey. J. Softw. Evol. Process 25(1), 53–95 (2013)
Emanuelsson, P., Nilsson, U.: A comparative study of industrial static analysis tools. Electron. Notes Theor. Comput. Sci. 217, 5–21 (2008)
Falessi, D., Huang, J., Narayana, L., Thai, J.F., Turhan, B.: On the need of preserving order of data when validating within-project defect classifiers. Empir. Softw. Eng. 25, 4805–4830 (2020)
Falessi, D., Juristo, N., Wohlin, C., Turhan, B., Münch, J., Jedlitschka, A., Oivo, M.: Empirical software engineering experts on the use of students and professionals in experiments. Empir. Softw. Eng. 23, 452–489 (2018)
Feitelson, D.G.: We do not appreciate being experimented on: developer and researcher views on the ethics of experiments on open-source projects. J. Syst. Softw. 204, 111774 (2023)
Giordano, G., Festa, G., Catolino, G., Palomba, F., Ferrucci, F., Gravino, C.: On the adoption and effects of source code reuse on defect proneness and maintenance effort. Empir. Softw. Eng. 29(1), 20 (2024)
Gold, N.E., Krinke, J.: Ethics in the mining of software repositories. Empir. Softw. Eng. 27(1), 17 (2022)
Gonzalez-Barahona, J.M., Robles, G., Izquierdo-Cortazar, D.: The MetricsGrimoire database collection. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp. 478–481. IEEE, Piscataway (2015)
Gousios, G., Kalliamvakou, E., Spinellis, D.: Measuring developer contribution from software repository data. In: Proceedings of the 2008 International Working Conference on Mining Software Repositories, pp. 129–132 (2008)
Güemes-Peña, D., López-Nozal, C., Marticorena-Sánchez, R., Maudes-Raedo, J.: Emerging topics in mining software repositories: machine learning in software repositories and datasets. Progr. Artif. Intell. 7, 237–247 (2018)
Gupta, M., Sureka, A., Padmanabhuni, S.: Process mining multiple repositories for software defect resolution from control and organizational perspective. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pp. 122–131. Association for Computing Machinery, New York (2014)
Hassan, A.E.: The road ahead for mining software repositories. In: 2008 Frontiers of Software Maintenance, pp. 48–57. IEEE, Piscataway (2008)
Herzig, K., Just, S., Zeller, A.: The impact of tangled code changes on defect prediction models. Empir. Softw. Eng. 21, 303–336 (2016)
Herzig, K., Zeller, A.: The impact of tangled code changes. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 121–130. IEEE, Piscataway (2013)
Hoda, R.: Socio-technical grounded theory for software engineering. IEEE Trans. Softw. Eng. 48(10), 3808–3832 (2021)
Kagdi, H., Collard, M.L., Maletic, J.I.: A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J. Softw. Maintenance Evol. Res. Practice 19(2), 77–131 (2007)
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: The promises and perils of mining GitHub. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 92–101 (2014)
Kamei, Y., Shihab, E., Adams, B., Hassan, A.E., Mockus, A., Sinha, A., Ubayashi, N.: A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. 39(6), 757–773 (2013)
Kitchenham, B., Madeyski, L., Budgen, D., Keung, J., Brereton, P., Charters, S., Gibbs, S., Pohthong, A.: Robust statistical methods for empirical software engineering. Empir. Softw. Eng. 22, 579–630 (2017)
Kovalenko, V., Palomba, F., Bacchelli, A.: Mining file histories: should we consider branches? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE ’18, pp. 202–213. Association for Computing Machinery, New York (2018)
Liu, Z., Xia, X., Hassan, A.E., Lo, D., Xing, Z., Wang, X.: Neural-machine-translation-based commit message generation: how far are we? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE ’18, pp. 373–384. Association for Computing Machinery, New York (2018)
Mahmood, Z., Bowes, D., Hall, T., Lane, P.C., Petri, J.: Reproducibility and replicability of software defect prediction studies. Inform. Softw. Technol. 99, 148–163 (2018)
Marcus, A., Sergeyev, A., Rajlich, V., Maletic, J.I.: An information retrieval approach to concept location in source code. In: 11th Working Conference on Reverse Engineering, pp. 214–223. IEEE, Piscataway (2004)
Mens, T.: An ecosystemic and socio-technical view on software maintenance and evolution. In: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 1–8. IEEE, Piscataway (2016)
Moser, R., Pedrycz, W., Succi, G.: A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th International Conference on Software Engineering, pp. 181–190 (2008)
Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating GitHub for engineered software projects. Empir. Softw. Eng. 22, 3219–3253 (2017)
Nguyen, H.A., Nguyen, A.T., Nguyen, T.N.: Filtering noise in mixed-purpose fixing commits to improve defect prediction and localization. In: 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), pp. 138–147. IEEE, Piscataway (2013)
Nguyen, N., Nadi, S.: An empirical evaluation of GitHub Copilot’s code suggestions. In: Proceedings of the 19th International Conference on Mining Software Repositories, MSR ’22, pp. 1–5. Association for Computing Machinery, New York (2022)
Poncin, W., Serebrenik, A., Van Den Brand, M.: Process mining software repositories. In: 2011 15th European Conference on Software Maintenance and Reengineering, pp. 5–14. IEEE, Piscataway (2011)
Qiu, H.S., Nolte, A., Brown, A., Serebrenik, A., Vasilescu, B.: Going farther together: the impact of social capital on sustained participation in open source. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 688–699. IEEE, Piscataway (2019)
Ram, A., Sawant, A.A., Castelluccio, M., Bacchelli, A.: What makes a code change easier to review: an empirical investigation on code change reviewability. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 201–212 (2018)
Rao, S., Kak, A.: Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: Proceedings of the 8th Working Conference on Mining Software Repositories, pp. 43–52 (2011)
Rosa, G., Pascarella, L., Scalabrino, S., Tufano, R., Bavota, G., Lanza, M., Oliveto, R.: Evaluating SZZ implementations through a developer-informed oracle. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 436–447. IEEE, Piscataway (2021)
Salza, P., Palomba, F., Di Nucci, D., D’Uva, C., De Lucia, A., Ferrucci, F.: Do developers update third-party libraries in mobile apps? In: Proceedings of the 26th Conference on Program Comprehension, ICPC ’18, pp. 255–265. Association for Computing Machinery, New York (2018)
Silva, C.C., Galster, M., Gilson, F.: Topic modeling in software engineering research. Empir. Softw. Eng. 26(6), 120 (2021)
Śliwerski, J., Zimmermann, T., Zeller, A.: When do changes induce fixes? ACM SIGSOFT Softw. Eng. Notes 30(4), 1–5 (2005)
Spadini, D., Aniche, M., Bacchelli, A.: PyDriller: Python framework for mining software repositories. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 908–911 (2018)
Storey, M.A., Ernst, N.A., Williams, C., Kalliamvakou, E.: The who, what, how of software engineering research: a socio-technical framework. Empir. Softw. Eng. 25, 4097–4129 (2020)
Storey, M.A., Hoda, R., Milani, A.M.P., Baldassarre, M.T.: Guidelines for using mixed and multi methods research in software engineering (2024). arXiv preprint arXiv:2404.06011
Storey, M.A., Russo, D., Novielli, N., Kobayashi, T., Wang, D.: A disruptive research playbook for studying disruptive innovations (2024). arXiv preprint arXiv:2402.13329
Sullivan, G., Feinn, R.: Using effect size—or why the p value is not enough. J. Grad. Med. Educ. 4(3), 279–282 (2012). https://doi.org/10.4300
Tao, Y., Dang, Y., Xie, T., Zhang, D., Kim, S.: How do software engineers understand code changes? An exploratory study in industry. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pp. 1–11 (2012)
Teo, W., Teoh, Z., Arabi, D.A., Aboushadi, M., Lai, K., Ng, Z., Pant, A., Hoda, R., Tantithamthavorn, C., Turhan, B.: What would you do? An ethical AI quiz. In: 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 112–116. IEEE, Piscataway (2023)
Tufano, M., Palomba, F., Bavota, G., Oliveto, R., Di Penta, M., De Lucia, A., Poshyvanyk, D.: When and why your code starts to smell bad. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 1, pp. 403–414 (2015)
Verdecchia, R., Engström, E., Lago, P., Runeson, P., Song, Q.: Threats to validity in software engineering research: a critical reflection. Inform. Softw. Technol. 164, 107329 (2023)
Wankhade, M., Rao, A.C.S., Kulkarni, C.: A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 55(7), 5731–5780 (2022)
Wen, F., Nagy, C., Lanza, M., Bavota, G.: An empirical study of quick remedy commits. In: Proceedings of the 28th International Conference on Program Comprehension, ICPC ’20, pp. 60–71. Association for Computing Machinery, New York (2020)
Wen, F., Nagy, C., Lanza, M., Bavota, G.: Quick remedy commits and their impact on mining software repositories. Empir. Softw. Eng. 27, 1–31 (2022)
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering. Springer, Berlin (2012)
Yamaguchi, F., Rieck, K., et al.: Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning. In: 5th USENIX Workshop on Offensive Technologies (WOOT 11) (2011)
Yamashita, A., Abtahizadeh, S.A., Khomh, F., Guéhéneuc, Y.G.: Software evolution and quality data from controlled, multiple, industrial case studies. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp. 507–510. IEEE, Piscataway (2017)
Yang, Y., Xia, X., Lo, D., Grundy, J.: A survey on deep learning for software engineering. ACM Comput. Surv. 54(10s), 1–73 (2022)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Codabux, Z., Fard, F., Verdecchia, R., Palomba, F., Di Nucci, D., Recupito, G. (2024). Teaching Mining Software Repositories. In: Mendez, D., Avgeriou, P., Kalinowski, M., Ali, N.B. (eds) Handbook on Teaching Empirical Software Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-71769-7_12
DOI: https://doi.org/10.1007/978-3-031-71769-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-71768-0
Online ISBN: 978-3-031-71769-7
eBook Packages: Education, Education (R0)