Skip to main content
Log in

Emerging topics in mining software repositories

Machine learning in software repositories and datasets

  • Regular Paper
  • Published:
Progress in Artificial Intelligence Aims and scope Submit manuscript

Abstract

A software process is a set of related activities that culminates in the production of a software package: specification, design, implementation, testing, evolution into new versions, and maintenance. There are also other supporting activities such as configuration and change management, quality assurance, project management, evaluation of user experience, etc. Software repositories are infrastructures to support all these activities. They can be composed with several systems that include code change management, bug tracking, code review, build system, release binaries, wikis, forums, etc. This position paper on mining software repositories presents a review and a discussion of research in this field over the past decade. We also identify applied machine learning strategies, current working topics, and future challenges for the improvement of company decision-making systems. Machine learning is defined as the process of discovering patterns in data. It can be applied to software repositories, since every change is recorded as data. Companies can then use these patterns as the basis for their decision-making systems and for knowledge discovery.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

Notes

  1. https://sourceforge.net/.

  2. https://github.com/.

  3. https://about.gitlab.com/.

  4. https://bitbucket.org/.

  5. https://travis-ci.org/.

  6. https://www.codacy.com/.

  7. http://www.msrconf.org/.

  8. http://text-processing.com/demo/sentiment/.

References

  1. Ali, N., Guhneuc, Y.G., Antoniol, G.: Trustrace: mining software repositories to improve the accuracy of requirement traceability links. IEEE Trans. Softw. Eng. 39(5), 725–741 (2013). https://doi.org/10.1109/TSE.2012.71

    Article  Google Scholar 

  2. Arnaoudova, V., Eshkevari, L., Penta, M., Oliveto, R., Antoniol, G., Guhneuc, Y.G.: Repent: analyzing the nature of identifier renamings. IEEE Trans. Softw. Eng. 40(5), 502–532 (2014). https://doi.org/10.1109/TSE.2014.2312942

    Article  Google Scholar 

  3. Bavota, G., Linares-Vsquez, M., Bernal-Crdenas, C., Di Penta, M., Oliveto, R., Poshyvanyk, D.: The impact of api change- and fault-proneness on the user ratings of android apps. IEEE Trans. Softw. Eng. 41(4), 384–407 (2015). https://doi.org/10.1109/TSE.2014.2367027

    Article  Google Scholar 

  4. Brown, W.H., Malveau, R.C., McCormick, H.W.S., Mowbray, T.J.: AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis, 1st edn. Wiley, New York, NY (1998)

    Google Scholar 

  5. Canfora, G., Cerulo, L., Cimitile, M., Di Penta, M.: How changes affect software entropy: an empirical study. Empir. Softw. Eng. 19(1), 1–38 (2014). https://doi.org/10.1007/s10664-012-9214-z

    Article  Google Scholar 

  6. Chen, T.H., Thomas, S., Hassan, A.: A survey on the use of topic models when mining software repositories. Empir. Softw. Eng. 21(5), 1843–1919 (2016). https://doi.org/10.1007/s10664-015-9402-8

    Article  Google Scholar 

  7. Chowdhury, S.A., Hindle, A.: Mining stackoverflow to filter out off-topic irc discussion. In: Proceedings of the 12th Working Conference on Mining Software Repositories, MSR ’15, pp. 422–425. IEEE Press, Piscataway, NJ, USA (2015). http://dl.acm.org/citation.cfm?id=2820518.2820577

  8. Dagenais, B., Robillard, M.: Recommending adaptive changes for framework evolution. ACM Trans. Softw. Eng. Methodol. 20(4), 9 (2011). https://doi.org/10.1145/2000799.2000805

    Article  Google Scholar 

  9. Destefanis, G., Ortu, M., Counsell, S., Swift, S., Marchesi, M., Tonelli, R.: Software development: do good manners matter? PeerJ Comput. Sci. (2016). https://doi.org/10.7717/peerj-cs.73

    Article  Google Scholar 

  10. Dyer, R., Nguyen, H., Rajan, H., Nguyen, T.: Boa: Ultra-large-scale software repository and source-code mining. ACM Trans. Softw. Eng. Methodol. 25(1), 7 (2015). https://doi.org/10.1145/2803171

    Article  Google Scholar 

  11. German, D.M.: A study of the contributors of postgresql. In: Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR ’06, pp. 163–164. ACM, New York, NY, USA (2006). https://doi.org/10.1145/1137983.1138022

  12. Gonzalez-Barahona, J., Robles, G., Herraiz, I., Ortega, F.: Studying the laws of software evolution in a long-lived floss project. J. Softw. Evolut. Process 26(7), 589–612 (2014). https://doi.org/10.1002/smr.1615

    Article  Google Scholar 

  13. Gonzlez-Barahona, J., Robles, G.: On the reproducibility of empirical software engineering studies based on data retrieved from development repositories. Empir. Softw. Eng. 17(1–2), 75–89 (2012). https://doi.org/10.1007/s10664-011-9181-9

    Article  Google Scholar 

  14. Goyal, A., Sardana, N.: Nrfixer: Sentiment based model for predicting the fixability of non-reproducible bugs. E-Inf. Softw. Eng. J. 11(1), 103–116 (2017). https://doi.org/10.5277/e-Inf170105

    Article  Google Scholar 

  15. Grant, S., Betts, B.: Encouraging user behaviour with achievements: an empirical study. In: Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pp. 65–68. IEEE Press, Piscataway, NJ, USA (2013). http://dl.acm.org/citation.cfm?id=2487085.2487101

  16. Guana, V., Rocha, F., Hindle, A., Stroulia, E.: Do the stars align? Multidimensional analysis of android’s layered architecture. In: 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), pp. 124–127 (2012)

  17. Guzman, E., Azócar, D., Li, Y.: Sentiment analysis of commit comments in GitHub: an empirical study. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pp. 352–355. ACM, New York, NY, USA (2014). https://doi.org/10.1145/2597073.2597118

  18. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)

    Article  Google Scholar 

  19. Hammad, M., Hammad, M., Bani-Salameh, H.: Identifying designers and their design knowledge. Int. J. Softw. Eng. Its Appl. 7(6), 277–288 (2013). https://doi.org/10.14257/ijseia.2013.7.6.23

    Article  Google Scholar 

  20. Han, J., Jung, W.: Extracting communication structure of a development organization from a software repository. Pers. Ubiquit. Comput. 18(6), 1413–1421 (2014). https://doi.org/10.1007/s00779-013-0742-3

    Article  Google Scholar 

  21. Hassan, A., Holt, R.: Replaying development history to assess the effectiveness of change propagation tools. Empir. Softw. Eng. 11(3), 335–367 (2006). https://doi.org/10.1007/s10664-006-9006-4

    Article  Google Scholar 

  22. Hindle, A.: Green mining: a methodology of relating software change and configuration to power consumption. Empir. Softw. Eng. 20(2), 374–409 (2015). https://doi.org/10.1007/s10664-013-9276-6

    Article  Google Scholar 

  23. Holmes, R., Walker, R.J.: A newbie’s guide to eclipse APIs. In: Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10–11, 2008, Proceedings, pp. 149–152 (2008). https://doi.org/10.1145/1370750.1370787

  24. Hora, A., Anquetil, N., Etien, A., Ducasse, S., Valente, M.: Automatic detection of system-specific conventions unknown to developers. J. Syst. Softw. 109, 192–204 (2015). https://doi.org/10.1016/j.jss.2015.08.007

    Article  Google Scholar 

  25. Jacobson, I., Booch, G., Rumbaugh, J.: The Unified Software Development Process. Addison-Wesley Longman Publishing Co., Inc., Boston, MA (1999)

    Google Scholar 

  26. Kagdi, H., Gethers, M., Poshyvanyk, D., Hammad, M.: Assigning change requests to software developers. J. Softw. Evolut. Process 24(1), 3–33 (2012). https://doi.org/10.1002/smr.530

    Article  Google Scholar 

  27. Kamei, Y., Shihab, E., Adams, B., Hassan, A., Mockus, A., Sinha, A., Ubayashi, N.: A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. 39(6), 757–773 (2013). https://doi.org/10.1109/TSE.2012.70

    Article  Google Scholar 

  28. Khomh, F., Penta, M., Guhneuc, Y.G., Antoniol, G.: An exploratory study of the impact of antipatterns on class change- and fault-proneness. Empir. Softw. Eng. 17(3), 243–275 (2012). https://doi.org/10.1007/s10664-011-9171-y

    Article  Google Scholar 

  29. Kim, S., Shivaji, S., Whitehead Jr., E.: Kenyon-web: Reconfigurable web-based feature extractor. Vancouver, BC, pp. 287–288 (2009). https://doi.org/10.1109/ICPC.2009.5090061

  30. Kirbas, S., Caglayan, B., Hall, T., Counsell, S., Bowes, D., Sen, A., Bener, A.: The relationship between evolutionary coupling and defects in large industrial software. J. Softw. Evolut. Process (2017). https://doi.org/10.1002/smr.1842

    Article  Google Scholar 

  31. Krinke, J., Gold, N., Jia, Y., Binkley, D.: Cloning and copying between gnome projects. In: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pp. 98–101 (2010)

  32. Kumaresh, S., Baskaran, R.: Mining software repositories for defect categorization. J. Commun. Softw. Syst. 11(1), 31–36 (2015)

    Article  Google Scholar 

  33. Lehman, M.M., Belady, L.A. (eds.): Program Evolution: Processes of Software Change, 1st edn. Academic Press Professional Inc, San Diego, CA (1985)

    Google Scholar 

  34. Li, H., Shang, W., Zou, Y., Hassan, A.E.: Towards just-in-time suggestions for log changes. Empir. Softw. Eng. 22(4), 1831–1865 (2017). https://doi.org/10.1007/s10664-016-9467-z

    Article  Google Scholar 

  35. Linares-Vásquez, M., Vendome, C., Tufano, M., Poshyvanyk, D.: How developers micro-optimize android apps. J. Syst. Softw. 130, 1–23 (2017). https://doi.org/10.1016/j.jss.2017.04.018

    Article  Google Scholar 

  36. Linstead, E., Rigor, P., Bajracharya, S., Lopes, C., Baldi, P.: Mining eclipse developer contributions via author-topic models. In: Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007), pp. 30–30 (2007)

  37. López-Fernández, L., Robles, G., Gonzalez-Barahona, J., Herraiz, I.: Applying social network analysis techniques to community-driven libre software projects. Int. J. Inf. Technol. Web Eng. (IJITWE) 1(3), 27–48 (2006). https://doi.org/10.4018/jitwe.2006070103

    Article  Google Scholar 

  38. Louridas, P., Ebert, C.: Machine learning. IEEE Softw. 33(5), 110–115 (2016). https://doi.org/10.1109/MS.2016.114

    Article  Google Scholar 

  39. Munaiah, N., Camilo, F., Wigham, W., Meneely, A., Nagappan, M.: Do bugs foreshadow vulnerabilities? An in-depth study of the chromium project. Empir. Softw. Eng. 22(3), 1305–1347 (2017). https://doi.org/10.1007/s10664-016-9447-3

    Article  Google Scholar 

  40. Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating github for engineered software projects. Empir. Softw. Eng. 22(6), 3219–3253 (2017). https://doi.org/10.1007/s10664-017-9512-6

    Article  Google Scholar 

  41. Penta, M., Cerulo, L., Aversano, L.: The life and death of statically detected vulnerabilities: an empirical study. Inf. Softw. Technol. 51(10), 1469–1484 (2009). https://doi.org/10.1016/j.infsof.2009.04.013

    Article  Google Scholar 

  42. Porter, M.: An algorithm for suffix stripping. Program 3, 130–137 (1980)

    Article  Google Scholar 

  43. Prechelt, L., Pepper, A.: Why software repositories are not used for defect-insertion circumstance analysis more often: a case study. Inf. Softw. Technol. 56(10), 1377–1389 (2014). https://doi.org/10.1016/j.infsof.2014.05.001

    Article  Google Scholar 

  44. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, Burlington (1993)

    Google Scholar 

  45. Rebouças, M., Santos, R.O., Pinto, G., Castor, F.: How does contributors’ involvement influence the build status of an open-source software project? In: Proceedings of the 14th International Conference on Mining Software Repositories, MSR ’17, pp. 475–478. IEEE Press, Piscataway, NJ, USA (2017). https://doi.org/10.1109/MSR.2017.32

  46. Robles, G., Gonzalez-Barahona, J.: Contributor turnover in libre software projects. IFIP Int. Fed. Inf. Process. 203, 273–286 (2006). https://doi.org/10.1007/0-387-34226-5_28

    Article  Google Scholar 

  47. Santos, E.A., Hindle, A.: Judging a commit by its cover: Correlating commit message entropy with build status on travis-ci. In: Proceedings of the 13th International Conference on Mining Software Repositories, MSR ’16, pp. 504–507. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2901739.2903493

  48. Schröter, A.: Msr challenge 2011: Eclipse, netbeans, firefox, and chrome. In: Proceedings of the 8th Working Conference on Mining Software Repositories, MSR ’11, pp. 227–229. ACM, New York, NY, USA (2011). https://doi.org/10.1145/1985441.1985478

  49. Shihab, E., Jiang, Z.M., Hassan, A.E.: On the use of internet relay chat (IRC) meetings by developers of the gnome gtk+ project. In: 2009 6th IEEE International Working Conference on Mining Software Repositories, pp. 107–110 (2009)

  50. Sun, X., Li, B., Duan, Y., Shi, W., Liu, X.: Mining software repositories for automatic interface recommendation. Sci. Program. (2016). https://doi.org/10.1155/2016/5475964

    Article  Google Scholar 

  51. Sun, X., Li, B., Leung, H., Li, B., Li, Y.: Msr4sm: using topic models to effectively mining software repositories for software maintenance tasks. Inf. Softw. Technol. 66, 1–12 (2015). https://doi.org/10.1016/j.infsof.2015.05.003

    Article  Google Scholar 

  52. Sun, Y., Wang, Q., Yang, Y.: Frlink: improving the recovery of missing issue-commit links by revisiting file relevance. Inf. Softw. Technol. 84, 33–47 (2017). https://doi.org/10.1016/j.infsof.2016.11.010

    Article  Google Scholar 

  53. Tappolet, J., Kiefer, C., Bernstein, A.: Semantic web enabled software analysis. J. Web Semant. 8(2–3), 225–240 (2010). https://doi.org/10.1016/j.websem.2010.04.009

    Article  Google Scholar 

  54. Teixeira, J., Robles, G., Gonzlez-Barahona, J.: Lessons learned from applying social network analysis on an industrial free/libre/open source software ecosystem. J. Internet Serv. Appl. 6(1), 14 (2015). https://doi.org/10.1186/s13174-015-0028-2

    Article  Google Scholar 

  55. Thummalapenta, S., Cerulo, L., Aversano, L., Di Penta, M.: An empirical study on the maintenance of source code clones. Empir. Softw. Eng. 15(1), 1–34 (2010). https://doi.org/10.1007/s10664-009-9108-x

    Article  Google Scholar 

  56. Vanya, A., Klusener, S., Premraj, R., Van Vliet, H.: Supporting software architects to improve their software system’s decomposition—lessons learned. J. Softw. Evolut. Process 25(3), 219–232 (2013). https://doi.org/10.1002/smr.574

    Article  Google Scholar 

  57. Vendome, C., Bavota, G., Penta, M., Linares-Vsquez, M., German, D., Poshyvanyk, D.: License usage and changes: a large-scale study on github. Empir. Softw. Eng. 22(3), 1537–1577 (2017). https://doi.org/10.1007/s10664-016-9438-4

    Article  Google Scholar 

  58. Voinea, L., Telea, A.: Visual querying and analysis of large software repositories. Empir. Softw. Eng. 14(3), 316–340 (2009). https://doi.org/10.1007/s10664-008-9068-6

    Article  Google Scholar 

  59. Xuan, J., Jiang, H., Hu, Y., Ren, Z., Zou, W., Luo, Z., Wu, X.: Towards effective bug triage with software data reduction techniques. IEEE Trans. Knowl. Data Eng. 27(1), 264–280 (2015). https://doi.org/10.1109/TKDE.2014.2324590

    Article  Google Scholar 

  60. Yamashita, K., Kamei, Y., McIntosh, S., Hassan, A., Ubayashi, N.: Magnet or sticky? Measuring project characteristics from the perspective of developer attraction and retention. J. Inf. Process. 24(2), 339–348 (2016). https://doi.org/10.2197/ipsjjip.24.339

    Article  Google Scholar 

  61. Yuan, Z., Yu, L.L., Liu, C.: Bug prediction method for fine-grained source code changes. Ruan Jian Xue Bao/J. Softw. 25(11), 2499–2517 (2014). https://doi.org/10.13328/j.cnki.jos.004559

    Article  Google Scholar 

  62. Zamani, S., Lee, S., Shokripour, R., Anvik, J.: A feature location approach supported by time-aware weighting of terms associated with developer expertise profiles. Knowl. Inf. Syst. 49(2), 629–659 (2016). https://doi.org/10.1007/s10115-015-0909-5

    Article  Google Scholar 

  63. Zhou, M., Mockus, A.: Who will stay in the floss community? Modeling participant’s initial behavior. IEEE Trans. Software Eng. 41(1), 82–99 (2015). https://doi.org/10.1109/TSE.2014.2349496

    Article  Google Scholar 

Download references

Acknowledgements

We would like to thank the Ministerio de Economía y Competitividad of the Spanish Government for financing the project TIN2015-67534-P (MINECO/FEDER, UE) and the Junta de Castilla y León for financing the project BU085P17 (JCyL/FEDER, UE) both cofinanced from European Union FEDER funds.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Carlos López-Nozal.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Güemes-Peña, D., López-Nozal, C., Marticorena-Sánchez, R. et al. Emerging topics in mining software repositories. Prog Artif Intell 7, 237–247 (2018). https://doi.org/10.1007/s13748-018-0147-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13748-018-0147-7

Keywords

Navigation