
Creating Evolving Project Data Sets in Software Engineering

  • Chapter
  • In: Integrating Research and Practice in Software Engineering

Abstract

While the amount of research in software engineering is ever increasing, selecting a research data set remains a challenge. Quite a number of data sets have been proposed, but we still lack a systematic approach to creating ones that evolve together with the industry. We aim to present a systematic method for selecting data sets of industry-relevant software projects for the purposes of software engineering research. We present a set of guidelines for filtering GitHub projects and implement those guidelines in the form of an R script. In particular, we select mostly projects from the biggest industrial open source contributors and remove from the data set any project that falls in the first quartile of any of several categories. We use the GitHub GraphQL API to select the desired set of repositories and evaluate the technique on Java projects. The presented technique systematizes the creation and evolution of software development data sets. The proposed algorithm has reasonable precision, between 0.65 and 0.80, and can serve as a baseline for further refinements.
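The repository-selection step described above can be illustrated with a short R sketch that queries the GitHub GraphQL API. This is a minimal sketch under our own assumptions: the search string, the selected fields, and the use of the httr and jsonlite packages are illustrative choices, not the authors' actual script.

    # Minimal sketch: fetch candidate Java repositories via the GitHub
    # GraphQL API. Assumes a personal access token in the GITHUB_TOKEN
    # environment variable; the query and fields are illustrative.
    library(httr)
    library(jsonlite)

    query <- '
    {
      search(query: "language:Java stars:>100", type: REPOSITORY, first: 50) {
        nodes {
          ... on Repository {
            nameWithOwner
            isFork
            forkCount
            stargazers { totalCount }
          }
        }
      }
    }'

    resp <- POST(
      "https://api.github.com/graphql",
      add_headers(Authorization = paste("bearer", Sys.getenv("GITHUB_TOKEN"))),
      body = toJSON(list(query = query), auto_unbox = TRUE)
    )
    repos <- fromJSON(content(resp, as = "text"))$data$search$nodes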
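The first-quartile filter can likewise be sketched in a few lines of base R: a project is kept only if it is above the first quartile on every metric. The metric names in the usage example (commits, contributors, stars) are hypothetical placeholders, not necessarily the categories used in the chapter.

    # Hedged sketch of the first-quartile filter: drop every project that
    # falls into the bottom 25% of any of the given metrics.
    filter_first_quartile <- function(projects, metrics) {
      keep <- rep(TRUE, nrow(projects))
      for (m in metrics) {
        q1 <- quantile(projects[[m]], probs = 0.25, na.rm = TRUE)
        keep <- keep & projects[[m]] > q1
      }
      projects[keep, ]
    }

    # Hypothetical usage:
    # filtered <- filter_first_quartile(repos, c("commits", "contributors", "stars"))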


Notes

  1. Except for specific cases such as insufficient performance or reverse engineering, which we ignore in this discussion.

  2. This also means that we rejected all forks and kept only the main repository (see the sketch after these notes).
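As a minimal illustration of the fork rejection in note 2, assuming repository metadata with an isFork flag as in the GraphQL sketch above:

    # Keep only main repositories; drop all forks (assumes an isFork column).
    repos <- repos[!repos$isFork, ]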


Acknowledgements

This work has been conducted as part of the research and development project POIR.01.01.01-00-0792/16 supported by the National Centre for Research and Development (NCBiR). We would like to thank Tomasz Korzeniowski and Marek Skrajnowski from code quest sp. z o.o. for their comments and feedback from a real-world software engineering environment.

Author information

Correspondence to Tomasz Lewowski.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Lewowski, T., Madeyski, L. (2020). Creating Evolving Project Data Sets in Software Engineering. In: Jarzabek, S., Poniszewska-Marańda, A., Madeyski, L. (eds) Integrating Research and Practice in Software Engineering. Studies in Computational Intelligence, vol 851. Springer, Cham. https://doi.org/10.1007/978-3-030-26574-8_1
