ForkXplorer: an approach of fork summary generation

Zhang, Zhang; Mao, Xinjun; Zhang, Chao; Lu, Yao

doi:10.1007/s11704-020-0047-4

Research Article
Published: 20 October 2021

Volume 16, article number 162202, (2022)
Frontiers of Computer Science

Zhang Zhang¹,
Xinjun Mao¹,
Chao Zhang¹ &
…
Yao Lu¹

Abstract

Pull-based development has become an important paradigm for distributed software development. In this model, each developer works independently on a copy (i.e., a fork) of the central repository. Developers need to stay aware of the state of other forks to collaborate efficiently. In this paper, we propose a method to automatically generate a summary of a fork. We first use the random forest method to generate the label of a fork, i.e., a feature implementation or a bug fix. Based on the information in the fork-related commits, we then use the TextRank algorithm to generate detailed activity information for the fork. Finally, we apply a set of rules to integrate all related information into a complete fork summary. To validate the effectiveness of our method, we conduct 30 groups of manual experiments and 77 groups of case studies on GitHub. We propose Fea_avg to evaluate the quality of the generated fork summary, considering content accuracy, content integrity, sentence fluency, and label extraction accuracy. The results show that the average Fea_avg of the fork summaries generated by this method is 0.672. More than 63% of project maintainers and contributors believe that the fork summary can improve development efficiency.
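
The TextRank step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word-overlap similarity (taken from the original TextRank formulation), the damping factor of 0.85, and the idea of treating each commit message as one sentence are all assumptions made for illustration.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Word-overlap similarity: shared words over the sum of log lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    # denom is zero when both sentences contain a single distinct word
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style power iteration."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric similarity graph without self-loops
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_k=2):
    """Return the top_k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]


if __name__ == "__main__":
    # Hypothetical commit messages from a fork, used only as sample input
    commits = [
        "Fix null pointer crash in parser when input is empty",
        "Fix crash in parser for empty input lines",
        "Update README badge",
        "Add regression test for parser crash on empty input",
    ]
    for line in summarize(commits, top_k=2):
        print(line)
```

In this sketch, a commit message that shares no words with the others receives only the baseline score of 1 - damping, so unrelated housekeeping commits tend to fall out of the selected activity description.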

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018YFB1004202).

Author information

Authors and Affiliations

Key Laboratory of Software Engineering for Complex Systems, College of Computer, National University of Defense Technology, Changsha, 410073, China
Zhang Zhang, Xinjun Mao, Chao Zhang & Yao Lu

Corresponding author

Correspondence to Xinjun Mao.

Additional information

Zhang Zhang is a master's candidate in the College of Computer, National University of Defense Technology, China. His work interests include open source software engineering, data mining, and crowdsourced learning.

Xinjun Mao is a professor in the College of Computer, National University of Defense Technology, China. He received his PhD degree in computer science from the National University of Defense Technology, China in 1998. His research interests include software engineering, multi-agent systems, robot systems, self-adaptive systems, and crowdsourcing.

Chao Zhang is a master in the College of Computer, National University of Defense Technology, China. His work interests include open source software engineering and crowdsourced learning.

Yao Lu is a lecturer in the College of Computer, National University of Defense Technology, China. He received his PhD degree in software engineering from the National University of Defense Technology, China in 2019. His research interests include open source software engineering, data mining, and crowdsourced learning.