Teaching Mining Software Repositories

  • Chapter
Handbook on Teaching Empirical Software Engineering

Abstract

Mining software repositories (MSR) has become a popular research area in recent years. MSR analyzes data sources such as version control systems, code repositories, defect tracking systems, archived communication, and deployment logs to uncover interesting and actionable insights that improve software development, maintenance, and evolution. This chapter provides an overview of MSR and of how to conduct an MSR study: setting up the study, formulating research goals and questions, identifying repositories, extracting and cleaning the data, performing data analysis and synthesis, and discussing the study's limitations. The chapter also discusses MSR as part of a mixed-method study, explains how to mine data ethically, surveys recent trends in MSR, and reflects on its future. As a teaching aid, it provides tips for educators, exercises for students at all levels, and a list of repositories that can serve as a starting point for an MSR study.

Notes

  1. GitHub Copilot: https://github.com/features/copilot/

  2. In this scenario, the commit order is particularly relevant [19].

  3. https://owasp.org/www-community/Source_Code_Analysis_Tools

  4. https://pmd.github.io/

  5. https://spotbugs.github.io/

  6. https://www.microfocus.com/en-us/cyberres/application-security/static-code-analyzer

  7. https://github.com/analysis-tools-dev/dynamic-analysis

  8. https://valgrind.org/

  9. https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/application-verifier

  10. https://code-pulse.com/

  11. https://www.synopsys.com/software-integrity.html

  12. http://www.openml.org/

  13. https://zenodo.org/

  14. GitHub's acceptable use policies: https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies

  15. MSR 2024 Mining Challenge: https://2024.msrconf.org/track/msr-2024-mining-challenge?

References

  1. Abdelrazek, A., Eid, Y., Gawish, E., Medhat, W., Hassan, A.: Topic modeling algorithms and applications: a survey. Inform. Syst. 112, 102131 (2023)

  2. Azeem, M.I., Palomba, F., Shi, L., Wang, Q.: Machine learning techniques for code smell detection: a systematic literature review and meta-analysis. Inform. Softw. Technol. 108, 115–138 (2019)

  3. Barros, D., Horita, F., Wiese, I., Silva, K.: A mining software repository extended cookbook: lessons learned from a literature review. In: Proceedings of the XXXV Brazilian Symposium on Software Engineering, pp. 1–10 (2021)

  4. Basili, V.R.: Goal, question, metric paradigm. Encyclopedia Softw. Eng. 1, 528–532 (1994)

  5. Binkley, D.: Source code analysis: a road map. In: Future of Software Engineering (FOSE’07), pp. 104–119 (2007)

  6. Borges, H., Tulio Valente, M.: What’s in a GitHub star? Understanding repository starring practices in a social coding platform. J. Syst. Softw. 146, 112–129 (2018)

  7. Catolino, G., Palomba, F., Zaidman, A., Ferrucci, F.: Not all bugs are the same: understanding, characterizing, and classifying bug types. J. Syst. Softw. 152, 165–181 (2019)

  8. Chatterjee, P., Sharma, T., Ralph, P.: Empirical standards for repository mining. MSR ’22, pp. 142–143. Association for Computing Machinery, New York (2022)

  9. Chen, T.H., Thomas, S.W., Hassan, A.E.: A survey on the use of topic models when mining software repositories. Empir. Softw. Eng. 21, 1843–1919 (2016)

  10. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)

  11. Creswell, J.W.: Mixed-method research: Introduction and application. In: Handbook of Educational Policy, pp. 455–472. Elsevier, Amsterdam (1999)

  12. Dalla Palma, S., Di Nucci, D., Palomba, F., Tamburri, D.A.: Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Trans. Softw. Eng. 48(6), 2086–2104 (2021)

  13. di Biase, M., Rastogi, A., Bruntink, M., van Deursen, A.: The delta maintainability model: measuring maintainability of fine-grained code changes. In: 2019 IEEE/ACM International Conference on Technical Debt (TechDebt), pp. 113–122. IEEE, Piscataway (2019)

  14. de Oliveira Neto, F.G., Torkar, R., Feldt, R., Gren, L., Furia, C.A., Huang, Z.: Evolution of statistical analysis in empirical software engineering research: Current state and steps forward. J. Syst. Softw. 156, 246–267 (2019)

  15. Dey, T., Mousavi, S., Ponce, E., Fry, T., Vasilescu, B., Filippova, A., Mockus, A.: Detecting and characterizing bots that commit code. In: Proceedings of the 17th International Conference on Mining Software Repositories, pp. 209–219 (2020)

  16. Dey, T., Vasilescu, B., Mockus, A.: An exploratory study of bot commits. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW’20, pp. 61–65. Association for Computing Machinery, New York (2020)

  17. Dit, B., Revelle, M., Gethers, M., Poshyvanyk, D.: Feature location in source code: a taxonomy and survey. J. Softw. Evol. Process 25(1), 53–95 (2013)

  18. Emanuelsson, P., Nilsson, U.: A comparative study of industrial static analysis tools. Electron. Notes Theor. Comput. Sci. 217, 5–21 (2008)

  19. Falessi, D., Huang, J., Narayana, L., Thai, J.F., Turhan, B.: On the need of preserving order of data when validating within-project defect classifiers. Empir. Softw. Eng. 25, 4805–4830 (2020)

  20. Falessi, D., Juristo, N., Wohlin, C., Turhan, B., Münch, J., Jedlitschka, A., Oivo, M.: Empirical software engineering experts on the use of students and professionals in experiments. Empir. Softw. Eng. 23, 452–489 (2018)

  21. Feitelson, D.G.: We do not appreciate being experimented on: developer and researcher views on the ethics of experiments on open-source projects. J. Syst. Softw. 204, 111774 (2023)

  22. Giordano, G., Festa, G., Catolino, G., Palomba, F., Ferrucci, F., Gravino, C.: On the adoption and effects of source code reuse on defect proneness and maintenance effort. Empir. Softw. Eng. 29(1), 20 (2024)

  23. Gold, N.E., Krinke, J.: Ethics in the mining of software repositories. Empir. Softw. Eng. 27(1), 17 (2022)

  24. Gonzalez-Barahona, J.M., Robles, G., Izquierdo-Cortazar, D.: The MetricsGrimoire database collection. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp. 478–481. IEEE, Piscataway (2015)

  25. Gousios, G., Kalliamvakou, E., Spinellis, D.: Measuring developer contribution from software repository data. In: Proceedings of the 2008 International Working Conference on Mining Software Repositories, pp. 129–132 (2008)

  26. Güemes-Peña, D., López-Nozal, C., Marticorena-Sánchez, R., Maudes-Raedo, J.: Emerging topics in mining software repositories: machine learning in software repositories and datasets. Progr. Artif. Intell. 7, 237–247 (2018)

  27. Gupta, M., Sureka, A., Padmanabhuni, S.: Process mining multiple repositories for software defect resolution from control and organizational perspective. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pp. 122–131. Association for Computing Machinery, New York (2014)

  28. Hassan, A.E.: The road ahead for mining software repositories. In: 2008 Frontiers of Software Maintenance, pp. 48–57. IEEE, Piscataway (2008)

  29. Herzig, K., Just, S., Zeller, A.: The impact of tangled code changes on defect prediction models. Empir. Softw. Eng. 21, 303–336 (2016)

  30. Herzig, K., Zeller, A.: The impact of tangled code changes. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 121–130. IEEE, Piscataway (2013)

  31. Hoda, R.: Socio-technical grounded theory for software engineering. IEEE Trans. Softw. Eng. 48(10), 3808–3832 (2021)

  32. Kagdi, H., Collard, M.L., Maletic, J.I.: A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J. Softw. Maintenance Evol. Res. Practice 19(2), 77–131 (2007)

  33. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: The promises and perils of mining GitHub. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 92–101 (2014)

  34. Kamei, Y., Shihab, E., Adams, B., Hassan, A.E., Mockus, A., Sinha, A., Ubayashi, N.: A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. 39(6), 757–773 (2013)

  35. Kitchenham, B., Madeyski, L., Budgen, D., Keung, J., Brereton, P., Charters, S., Gibbs, S., Pohthong, A.: Robust statistical methods for empirical software engineering. Empir. Softw. Eng. 22, 579–630 (2017)

  36. Kovalenko, V., Palomba, F., Bacchelli, A.: Mining file histories: should we consider branches? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE ’18, pp. 202–213. Association for Computing Machinery, New York (2018)

  37. Liu, Z., Xia, X., Hassan, A.E., Lo, D., Xing, Z., Wang, X.: Neural-machine-translation-based commit message generation: how far are we? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE ’18, pp. 373–384. Association for Computing Machinery, New York (2018)

  38. Mahmood, Z., Bowes, D., Hall, T., Lane, P.C., Petri, J.: Reproducibility and replicability of software defect prediction studies. Inform. Softw. Technol. 99, 148–163 (2018)

  39. Marcus, A., Sergeyev, A., Rajlich, V., Maletic, J.I.: An information retrieval approach to concept location in source code. In: 11th Working Conference on Reverse Engineering, pp. 214–223. IEEE, Piscataway (2004)

  40. Mens, T.: An ecosystemic and socio-technical view on software maintenance and evolution. In: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 1–8. IEEE, Piscataway (2016)

  41. Moser, R., Pedrycz, W., Succi, G.: A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th International Conference on Software Engineering, pp. 181–190 (2008)

  42. Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating GitHub for engineered software projects. Empir. Softw. Eng. 22, 3219–3253 (2017)

  43. Nguyen, H.A., Nguyen, A.T., Nguyen, T.N.: Filtering noise in mixed-purpose fixing commits to improve defect prediction and localization. In: 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), pp. 138–147. IEEE, Piscataway (2013)

  44. Nguyen, N., Nadi, S.: An empirical evaluation of GitHub Copilot’s code suggestions. In: Proceedings of the 19th International Conference on Mining Software Repositories, MSR ’22, pp. 1–5. Association for Computing Machinery, New York (2022)

  45. Poncin, W., Serebrenik, A., Van Den Brand, M.: Process mining software repositories. In: 2011 15th European Conference on Software Maintenance and Reengineering, pp. 5–14. IEEE, Piscataway (2011)

  46. Qiu, H.S., Nolte, A., Brown, A., Serebrenik, A., Vasilescu, B.: Going farther together: the impact of social capital on sustained participation in open source. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 688–699. IEEE, Piscataway (2019)

  47. Ram, A., Sawant, A.A., Castelluccio, M., Bacchelli, A.: What makes a code change easier to review: an empirical investigation on code change reviewability. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 201–212 (2018)

  48. Rao, S., Kak, A.: Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: Proceedings of the 8th Working Conference on Mining Software Repositories, pp. 43–52 (2011)

  49. Rosa, G., Pascarella, L., Scalabrino, S., Tufano, R., Bavota, G., Lanza, M., Oliveto, R.: Evaluating SZZ implementations through a developer-informed oracle. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 436–447. IEEE, Piscataway (2021)

  50. Salza, P., Palomba, F., Di Nucci, D., D’Uva, C., De Lucia, A., Ferrucci, F.: Do developers update third-party libraries in mobile apps? In: Proceedings of the 26th Conference on Program Comprehension, ICPC ’18, pp. 255–265. Association for Computing Machinery, New York (2018)

  51. Silva, C.C., Galster, M., Gilson, F.: Topic modeling in software engineering research. Empir. Softw. Eng. 26(6), 120 (2021)

  52. Śliwerski, J., Zimmermann, T., Zeller, A.: When do changes induce fixes? ACM SIGSOFT Softw. Eng. Notes 30(4), 1–5 (2005)

  53. Spadini, D., Aniche, M., Bacchelli, A.: PyDriller: Python framework for mining software repositories. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 908–911 (2018)

  54. Storey, M.A., Ernst, N.A., Williams, C., Kalliamvakou, E.: The who, what, how of software engineering research: a socio-technical framework. Empir. Softw. Eng. 25, 4097–4129 (2020)

  55. Storey, M.A., Hoda, R., Milani, A.M.P., Baldassarre, M.T.: Guidelines for using mixed and multi methods research in software engineering (2024). arXiv preprint arXiv:2404.06011

  56. Storey, M.A., Russo, D., Novielli, N., Kobayashi, T., Wang, D.: A disruptive research playbook for studying disruptive innovations (2024). arXiv preprint arXiv:2402.13329

  57. Sullivan, G., Feinn, R.: Using effect size—or why the p value is not enough. J. Grad. Med. Educ. 4(3), 279–282 (2012). https://doi.org/10.4300

  58. Tao, Y., Dang, Y., Xie, T., Zhang, D., Kim, S.: How do software engineers understand code changes? An exploratory study in industry. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pp. 1–11 (2012)

  59. Teo, W., Teoh, Z., Arabi, D.A., Aboushadi, M., Lai, K., Ng, Z., Pant, A., Hoda, R., Tantithamthavorn, C., Turhan, B.: What would you do? An ethical AI quiz. In: 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 112–116. IEEE, Piscataway (2023)

  60. Tufano, M., Palomba, F., Bavota, G., Oliveto, R., Di Penta, M., De Lucia, A., Poshyvanyk, D.: When and why your code starts to smell bad. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 1, pp. 403–414 (2015)

  61. Verdecchia, R., Engström, E., Lago, P., Runeson, P., Song, Q.: Threats to validity in software engineering research: a critical reflection. Inform. Softw. Technol. 164, 107329 (2023)

  62. Wankhade, M., Rao, A.C.S., Kulkarni, C.: A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 55(7), 5731–5780 (2022)

  63. Wen, F., Nagy, C., Lanza, M., Bavota, G.: An empirical study of quick remedy commits. In: Proceedings of the 28th International Conference on Program Comprehension, ICPC ’20, pp. 60–71. Association for Computing Machinery, New York (2020)

  64. Wen, F., Nagy, C., Lanza, M., Bavota, G.: Quick remedy commits and their impact on mining software repositories. Empir. Softw. Eng. 27, 1–31 (2022)

  65. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering. Springer, Berlin (2012)

  66. Yamaguchi, F., Rieck, K., et al.: Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning. In: 5th USENIX Workshop on Offensive Technologies (WOOT 11) (2011)

  67. Yamashita, A., Abtahizadeh, S.A., Khomh, F., Guéhéneuc, Y.G.: Software evolution and quality data from controlled, multiple, industrial case studies. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp. 507–510. IEEE (2017)

  68. Yang, Y., Xia, X., Lo, D., Grundy, J.: A survey on deep learning for software engineering. ACM Comput. Surv. 54(10s), 1–73 (2022)

Corresponding author

Correspondence to Zadia Codabux.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Codabux, Z., Fard, F., Verdecchia, R., Palomba, F., Di Nucci, D., Recupito, G. (2024). Teaching Mining Software Repositories. In: Mendez, D., Avgeriou, P., Kalinowski, M., Ali, N.B. (eds) Handbook on Teaching Empirical Software Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-71769-7_12

  • DOI: https://doi.org/10.1007/978-3-031-71769-7_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71768-0

  • Online ISBN: 978-3-031-71769-7

  • eBook Packages: Education (R0)
