Cross-project code clones in GitHub

Gharehyazie, Mohammad; Ray, Baishakhi; Keshani, Mehdi; Zavosht, Masoumeh Soleimani; Heydarnoori, Abbas; Filkov, Vladimir

doi:10.1007/s10664-018-9648-z

Published: 05 September 2018

Volume 24, pages 1538–1573, (2019)

Empirical Software Engineering

Mohammad Gharehyazie¹,² (ORCID: orcid.org/0000-0002-7567-6991),
Baishakhi Ray³,
Mehdi Keshani⁴,
Masoumeh Soleimani Zavosht⁴,
Abbas Heydarnoori⁵ &
Vladimir Filkov¹

Abstract

Code reuse has well-known benefits for code quality, coding efficiency, and maintenance. Open Source Software (OSS) programmers gladly share their own code and happily reuse others’. Social programming platforms like GitHub have normalized code foraging via their common platforms, enabling code search and reuse across different projects. Removing project borders may facilitate more efficient code foraging, and consequently faster programming. But looking for code across projects takes longer, and the code, once found, may be more challenging to tailor to one’s needs. Learning how much code reuse goes on across projects, and identifying emerging patterns in past cross-project search behavior, may help future foraging efforts. Our contribution is twofold. First, to understand cross-project code reuse, we present an in-depth empirical study of cloning in GitHub. Using Deckard, a popular clone-finding tool, we identified copies of code fragments across projects, and investigated their prevalence and characteristics using statistical and network science approaches, and with multiple case studies. By triangulating findings from different analysis methods, we find that cross-project cloning is prevalent in GitHub, ranging from cloning a few lines of code to whole project repositories. Some of the projects serve as popular sources of clones, and others seem to contain more clones than their fair share. Moreover, we find that ecosystem cloning follows an onion model: most clones come from the same project, then from projects in the same application domain, and finally from projects in different domains. Second, we utilized these results to develop a novel tool named CLONE-HUNTRESS that streamlines finding and tracking code clones in GitHub. The tool is GitHub integrated, built around a user-friendly interface, and runs efficiently over a modern database system. We describe the tool and make it publicly available at http://clone-det.ictic.sharif.edu/.
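
The onion model named in the abstract can be illustrated with a small Python sketch. This is not the authors' analysis code: the `Fragment` type, its `project` and `domain` fields, and the example pairs are all hypothetical, introduced only to show how a clone pair might be assigned to one of the three layers.

```python
from collections import Counter
from typing import NamedTuple


class Fragment(NamedTuple):
    project: str  # hypothetical project identifier, e.g. "owner/repo"
    domain: str   # hypothetical application-domain label


def onion_layer(a: Fragment, b: Fragment) -> str:
    """Return the onion layer that a pair of cloned fragments falls into."""
    if a.project == b.project:
        return "same project"
    if a.domain == b.domain:
        return "same domain"
    return "different domain"


def layer_counts(pairs):
    """Count clone pairs per layer."""
    return Counter(onion_layer(a, b) for a, b in pairs)


# Made-up clone pairs, for illustration only (not data from the study).
pairs = [
    (Fragment("a/x", "web"), Fragment("a/x", "web")),
    (Fragment("a/x", "web"), Fragment("b/y", "web")),
    (Fragment("a/x", "web"), Fragment("c/z", "database")),
    (Fragment("b/y", "web"), Fragment("b/y", "web")),
]
counts = layer_counts(pairs)
```

Under the onion model, one would expect the "same project" count to dominate, followed by "same domain", then "different domain".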

Notes

See StackExchange question: http://programmers.stackexchange.com/questions/193415/best-practices-for-sharing-tiny-snippets-of-code-across-projects
https://searchcode.com
Since shorter exact clones can capture, to some extent, more variability during longer code evolution.
All these operations are done using MySQL Server and SQL queries.
The project domain identification process was implemented with Python.
https://github.com/cloudera/kitten
https://github.com/mongodb/mongo-hadoop
https://github.com/Netflix/servo
https://github.com/ideaconsult/apps-ambit
https://github.com/vivo-project/Vitro
https://github.com/scottfrazer/hermes
https://github.com/mendhak/gpslogger
https://github.com/nifty-gui/nifty-gui
https://github.com/hoegertn/restdoc-java-server
We used R and the “igraph” package for all graph constructions, comparisons and analyses.
The graphs are created using Gephi.
We also investigated the correlation between clone density and project size. The results were similar to those in Fig. 2.
This number is among the set of the projects that had any clones at all. So the total sum of all domain sizes adds up to the first row numbers of Table 3.
This number is derived from the implementation of the queries described in Section 4.1, applied to GHTorrent’s 2018-04-01 dump of GitHub projects.

Acknowledgments

We thank Prof. Prem Devanbu and members of the DECAL lab at UC Davis for valuable discussions. We also thank Mr. Seyed Mohammad Masoud Sadrnezhaad for his help in updating CLONE-HUNTRESS’s database.

Author information

Authors and Affiliations

Department of Computer Science, University of California, Davis, CA, 95616, USA
Mohammad Gharehyazie & Vladimir Filkov
AICT Innovation Center, Sharif University of Technology, Tehran, Iran
Mohammad Gharehyazie
Columbia University, 500 West 120 Street, MC0401, New York, NY, 10027, USA
Baishakhi Ray
Sharif University of Technology, Tehran, Iran
Mehdi Keshani & Masoumeh Soleimani Zavosht
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Abbas Heydarnoori

Corresponding author

Correspondence to Mohammad Gharehyazie.

Additional information

Communicated by: Abram Hindle and Lin Tan

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: CLONE-HUNTRESS tool description and use

Here we describe CLONE-HUNTRESS, our online tool for (1) identifying clones between a user selected source project and a target list of Java-based GitHub projects, and (2) tracking changes to the clones over time. Our design goal was to provide a GitHub integrated, comprehensive, and efficient tool that users can interact with transparently, without the need to experience the mechanics of the clone search process. We wanted users to be able to come back to the tool over time and be able to monitor the changes to the cloned code. The tool is available at http://clone-det.ictic.sharif.edu/.

Searching for clones among all the projects that exist in GitHub is very time consuming, and computationally infeasible under a reasonable response time limit. Moreover, as per our findings in the main text of this paper, clones are often found between pairs of projects in the same domain. Hence, to speed up the search, CLONE-HUNTRESS allows users to search for and track clones between projects in the same domain.

We selected 39,422 Java-based GitHub projects as an initial preset list, which will grow over time through the automatic addition of users’ projects. This number is derived from implementing the queries described in Section 4.1 and applying them to the April 1st, 2018 GHTorrent MySQL dump. In other words, we selected Java projects that had at least 2 developers, were at least 1 year old, and had more than 10 commits. We also eliminated projects that were forked.
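
The selection criteria above can be sketched as a simple filter. The paper applies them as SQL queries over the GHTorrent dump; the Python version below is only an illustration, and the `Project` record, its field names, the 365-day reading of "1 year", and the sample rows are assumptions, not GHTorrent's schema or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Project:
    # Hypothetical record; field names are not taken from GHTorrent.
    name: str
    language: str
    developers: int
    created: date
    commits: int
    is_fork: bool


def is_eligible(p: Project, as_of: date) -> bool:
    """Apply the stated criteria: Java, >= 2 developers, >= 1 year old,
    more than 10 commits, and not a fork."""
    age_days = (as_of - p.created).days
    return (
        p.language == "Java"
        and p.developers >= 2
        and age_days >= 365  # assumption: "1 year" taken as 365 days
        and p.commits > 10
        and not p.is_fork
    )


dump_date = date(2018, 4, 1)  # date of the GHTorrent dump named in the text
projects = [  # made-up rows, for illustration only
    Project("a/ok", "Java", 3, date(2016, 1, 1), 120, False),
    Project("b/young", "Java", 2, date(2017, 12, 1), 50, False),
    Project("c/fork", "Java", 5, date(2015, 1, 1), 300, True),
    Project("d/python", "Python", 4, date(2014, 1, 1), 900, False),
    Project("e/few-commits", "Java", 2, date(2015, 6, 1), 10, False),
]
eligible = [p.name for p in projects if is_eligible(p, dump_date)]
```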

The front page of the tool is shown in Fig. 12.

A.1 Login, registration and settings

CLONE-HUNTRESS is GitHub integrated. To use CLONE-HUNTRESS, a user must first authenticate through GitHub. Once authenticated, CLONE-HUNTRESS automatically pulls the list of the user’s publicly available projects and adds them to their profile within the tool. Users can choose one of these projects, or add other projects manually, as described later, as the source project for clone detection.

By clicking on the user’s GitHub name, email, or avatar on the dashboard, the Profile page is shown, where users can change the tool’s tracking frequency settings. As shown in Fig. 13, there are two options that govern CLONE-HUNTRESS’s behavior. The first one is the update frequency of the tracked clones. This frequency determines how often the tool should update the changes that are taking place on the tracked clone code. The second one is the frequency at which clone detection is executed from scratch. This option exists because after a sufficiently long time, many of the tracked clones may change via commits, and thus may not be similar anymore to the original clone in the user’s project.
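
A minimal sketch of how the two frequency settings might drive scheduling is given below. The `TrackingSettings` class, the `due_actions` function, and the choice to treat the two timers independently are assumptions made for illustration; the text does not describe how the tool implements this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TrackingSettings:
    # Hypothetical names for the two options on the Profile page.
    update_every: timedelta    # how often tracked clones are checked for changes
    redetect_every: timedelta  # how often clone detection is re-run from scratch


def due_actions(settings, last_update, last_detection, now):
    """Return which actions are due at time `now`.

    Assumption: the two timers are evaluated independently of each other.
    """
    actions = []
    if now - last_update >= settings.update_every:
        actions.append("update")
    if now - last_detection >= settings.redetect_every:
        actions.append("redetect")
    return actions


settings = TrackingSettings(
    update_every=timedelta(days=1),
    redetect_every=timedelta(days=30),
)
now = datetime(2018, 6, 1)
actions = due_actions(
    settings,
    last_update=now - timedelta(days=2),
    last_detection=now - timedelta(days=5),
    now=now,
)
```

Here only the update check is due, because the last full detection ran five days ago and the re-detection interval in this example is thirty days.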

A.2 Detecting and tracking clones

The main functionality of CLONE-HUNTRESS, i.e., tracking clones, is accessible through the “Add project” button in the top right corner of the dashboard (Fig. 14), which redirects the user to the corresponding page (Fig. 15, top), where users can select a project from their list of GitHub projects. In addition to the user’s own GitHub projects, any other project of interest can be selected as the source by providing its URL directly, as illustrated in Fig. 15 (top). Once a project is specified, the tool asks for the project’s application domain. Once the domain is specified and “Get projects” is pressed, the tool presents a list of all projects (within its current project list) in that application domain (Fig. 15, bottom).

Users can select up to 20 target projects from the given list, to detect clones between them and the source project. This limit is imposed for two reasons: 1) hardware resource and response time limits, and 2) the fact that tracking a large number of projects eventually leads to confusion rather than providing benefits. Users are also able to add any other GitHub project to the target list by specifying the project link directly, using the “Add other project” button below the list, as illustrated in Fig. 16. The target list can be reset to its original form using the “Reset project list” button at the bottom of the list.
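
The target selection step can be sketched as follows. The function names, the dictionary-based project catalog, and the decision to count manually added projects toward the limit of 20 are assumptions for illustration; only the limit itself and the same-domain listing come from the text.

```python
MAX_TARGETS = 20  # limit on target projects stated in the text


def projects_in_domain(catalog, domain, source):
    """List catalog projects in the given domain, excluding the source project."""
    return sorted(
        name for name, d in catalog.items() if d == domain and name != source
    )


def build_target_list(selected, extra=()):
    """Combine selected and manually added projects, enforcing the cap.

    Assumption: manually added projects count toward the same limit.
    """
    targets = list(dict.fromkeys([*selected, *extra]))  # drop duplicates, keep order
    if len(targets) > MAX_TARGETS:
        raise ValueError(f"at most {MAX_TARGETS} target projects are allowed")
    return targets


# Made-up catalog mapping project name to application domain.
catalog = {
    "a/web1": "web",
    "b/web2": "web",
    "c/db1": "database",
    "me/app": "web",
}
candidates = projects_in_domain(catalog, "web", source="me/app")
targets = build_target_list(candidates, extra=["d/other"])
```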

With the source and target projects chosen, clone detection is initiated by pressing the “Detect-Clones” button at the bottom of the page. It could take the tool a few minutes to show the results of clone detection. When done, CLONE-HUNTRESS will redirect the user to the result page, which will resemble Fig. 17. If any clones are found, the results will show the clone instances from the source project and those from the target projects.

Users can choose to track any clone instances they want by selecting them and clicking on the “Save and track” button, and over time see the changes that occur in these selected instances. The number of trackable clones is limited: users can track up to 20 clone instances, for the reasons given above. After choosing some clone instances to track, users are returned to their dashboard. Every clone detection that the user has run is displayed as a row in a table on the dashboard page, as shown in Fig. 14.

A.3 Tracking reports

CLONE-HUNTRESS provides View, Edit, and Delete functions in each row of the clone detection table (see the buttons in the ACTION column in Fig. 14). The View button reports the tracked changes made to the respective clone instances. Our tool checks at pre-specified intervals whether or not the clone instances have changed, and if so, the number of changes is displayed as a notification on the View button. The intervals are determined by the update frequency of tracked clones, set on the Profile page, as mentioned before. Clicking on the View button redirects users to an “Alerts and Reports” page for that clone, similar to Fig. 18. There, clones from the user’s source project are shown, and below each are the tracked clone instances and links to the actual code. Changed clone instances are marked, and users can visit the changed files. Users can also untrack any clones or clone instances from this page.
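
One way a periodic change check could work is to store a fingerprint of each tracked clone instance and compare it against the current file contents. The sketch below uses a SHA-256 hash over the tracked line range; this technique, the record layout, and the function names are assumptions for illustration, since the text does not say how the tool detects changes. It also counts changed instances rather than individual changes, which is a simplification.

```python
import hashlib


def fingerprint(lines, start, end):
    """Hash lines start..end (1-indexed, inclusive) of a file's content."""
    snippet = "\n".join(lines[start - 1:end])
    return hashlib.sha256(snippet.encode("utf-8")).hexdigest()


def count_changed(tracked, current_files):
    """Count tracked instances whose line range no longer matches its fingerprint.

    A file that is missing from `current_files` is treated as changed.
    """
    changed = 0
    for inst in tracked:
        lines = current_files.get(inst["path"])
        if lines is None:
            changed += 1
        elif fingerprint(lines, inst["start"], inst["end"]) != inst["fingerprint"]:
            changed += 1
    return changed


# Made-up file contents at the time the clones were saved for tracking.
old = {
    "src/A.java": ["int a = 1;", "int b = 2;", "return a + b;"],
    "src/B.java": ["foo();", "bar();"],
}
tracked = [
    {"path": path, "start": 1, "end": 2, "fingerprint": fingerprint(lines, 1, 2)}
    for path, lines in old.items()
]

# Later snapshot: one line inside the tracked range of A.java was edited.
new = {
    "src/A.java": ["int a = 1;", "int b = 3;", "return a + b;"],
    "src/B.java": ["foo();", "bar();"],
}
changed = count_changed(tracked, new)
```

A fixed line range is fragile when code is inserted above the clone; a production tracker would more likely follow the fragment through the commit diffs.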

Edit directs users to a page similar to the first page of the process (Fig. 19), where users can repeat the steps of clone detection. The tool shows them all the steps they have already taken, and they can change anything they want and re-run clone detection. Through the Delete button, the corresponding entry is deleted, and so the results of clone detection for that specific project disappear.

A.4 Future improvements

While we have tried our best to provide a polished and useful product, there are many ways in which our tool can be improved. The first and foremost is to improve its hardware resources, so that clone detection and checking for updates take less time and users are able to check for clones across more projects. The second area of improvement is to provide documentation for, and access to, CLONE-HUNTRESS’s web services, so that other developers may integrate its functionality within other tools and environments such as Eclipse.

About this article

Cite this article

Gharehyazie, M., Ray, B., Keshani, M. et al. Cross-project code clones in GitHub. Empir Software Eng 24, 1538–1573 (2019). https://doi.org/10.1007/s10664-018-9648-z

Published: 05 September 2018
Issue Date: 15 June 2019
DOI: https://doi.org/10.1007/s10664-018-9648-z
