Article

Constructing universal version history

Authors:
Hung-Fu Chang, University of Southern California, Los Angeles, CA
Audris Mockus, Avaya Labs Research, Basking Ridge, NJ

MSR '06: Proceedings of the 2006 international workshop on Mining software repositories, May 2006, Pages 76–79
https://doi.org/10.1145/1137983.1138002

Published: 22 May 2006

ABSTRACT

Developers often copy the code of part or all of a product to start a new product or a new release. To understand software change history and to determine code authorship, we propose constructing a universal version history from multiple version control repositories. To that end we create two practical methods for detecting code copying at the level of the source code file: the prefix-postfix algorithm and the prefix algorithm. Both use the full pathname of a file and its version history, and both assume that developers often duplicate files by copying entire directories. Once copying is identified, we propose an algorithm that links the change histories of files that had the same code at any point in the past, across multiple repositories, to construct the universal version history of a file. The results show that about 41.32% of source files (in the repository involving more than 6M versions of around 2M files) were duplicated among Avaya's source code repositories for more than ten different projects. After validation against known copying behaviors, the prefix-postfix algorithm is more suitable than the prefix algorithm because its error rates are reasonable.
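
The linking step described above, joining the change histories of files that had the same code at some point in the past, can be illustrated with a minimal sketch. This is not the paper's prefix-postfix or prefix algorithm, whose details the abstract does not give; it only assumes that identical content is detected by hashing each version and that linked files are grouped with a union-find structure. The `Version` record, the `link_histories` function, and the sample repository names and paths are all hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple


class Version(NamedTuple):
    repo: str       # hypothetical repository name
    path: str       # full pathname of the file within that repository
    timestamp: int  # commit time; any sortable value works
    content: str    # file contents at this version


def link_histories(versions):
    """Group files whose histories share at least one identical version.

    Returns one time-ordered list of versions per group of linked files.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_file_with = {}  # content digest -> first file seen with that content
    for v in versions:
        file_id = (v.repo, v.path)
        find(file_id)  # register the file even if it is never linked
        digest = hashlib.sha1(v.content.encode("utf-8")).hexdigest()
        if digest in first_file_with:
            union(first_file_with[digest], file_id)
        else:
            first_file_with[digest] = file_id

    groups = defaultdict(list)
    for v in versions:
        groups[find((v.repo, v.path))].append(v)
    return [sorted(g, key=lambda v: v.timestamp) for g in groups.values()]


# Hypothetical data: projB copied util.c from projA at its second version.
sample = [
    Version("projA", "src/util.c", 1, "int f(){return 1;}"),
    Version("projA", "src/util.c", 2, "int f(){return 2;}"),
    Version("projB", "lib/src/util.c", 3, "int f(){return 2;}"),
    Version("projB", "lib/src/util.c", 4, "int f(){return 3;}"),
    Version("projC", "main.c", 5, "int main(){}"),
]
histories = link_histories(sample)
```

In this sketch the two `util.c` files end up in one combined history because one version of each has identical content, while `main.c` keeps a history of its own. Exact hashing only catches verbatim copies; the pathname-based methods named in the abstract address the case where whole directories are copied.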

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Software reverse engineering

Published in
MSR '06: Proceedings of the 2006 international workshop on Mining software repositories
May 2006
191 pages
ISBN: 1595933972
DOI: 10.1145/1137983
General Chairs:
Stephan Diehl, University Trier, Germany
Harald Gall, University of Zurich, Switzerland
Ahmed E. Hassan, Research in Motion RIM, Canada
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
change history
cloning
code authorship
code copying
version control