Selecting Features in Origin Analysis

Green, Pam; Lane, Peter C.R.; Rainer, Austen; Scholz, Sven-Bodo

doi:10.1007/978-0-85729-130-1_29

Conference paper
First Online: 29 October 2010

Abstract

When applying a machine-learning approach to develop classifiers in a new domain, an important question is what measurements to take and how they will be used to construct informative features. This paper develops a novel set of machine-learning classifiers for files taken from software projects; the target classifications are based on origin analysis. Our approach adapts the output of four copy-analysis tools, generating a number of different measurements. Combining the measures with the files on which they operate generates a large set of features in a semi-automatic manner. Standard attribute selection and classifier training techniques then yield a pool of high-quality classifiers (accuracy in the range of 90%) and information on the most relevant features.
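The feature-generation and selection steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tool names, measure names, and file pairings below are hypothetical placeholders, and a plain Pearson-correlation ranking stands in for whatever attribute selection method the paper actually uses. The sketch only shows how crossing measures with the files they operate on multiplies into a large feature set, which a filter then ranks against the class label.

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical placeholders: the abstract does not name the tools,
# the measurements, or the file pairings.
TOOLS = ["tool_a", "tool_b", "tool_c", "tool_d"]
MEASURES = ["copied_lines", "largest_block"]
FILE_PAIRS = ["candidate_vs_source", "candidate_vs_sibling"]


def feature_names():
    """Cross every tool, measure, and file pairing into one feature name each."""
    return [f"{t}.{m}.{p}" for t, m, p in product(TOOLS, MEASURES, FILE_PAIRS)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def rank_features(rows, labels, names):
    """Rank features by absolute correlation with a 0/1 class label.

    A simple filter used here for illustration only; it scores each
    feature independently and ignores redundancy between features.
    """
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), name))
    return sorted(scores, reverse=True)
```

With four tools, two measures, and two file pairings, `feature_names()` yields 16 features; adding one more measure or pairing grows the set multiplicatively, which is why a selection step is needed before training classifiers.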

Author information

Authors and Affiliations

University of Hertfordshire, College Lane, Hatfield, Herts, AL10 9AB, UK
Pam Green, Peter C.R. Lane, Austen Rainer & Sven-Bodo Scholz

Corresponding authors

Correspondence to Pam Green, Peter C.R. Lane, Austen Rainer or Sven-Bodo Scholz.

Editor information

Editors and Affiliations

Dept. of Computer Science and Software Engineering, University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth, PO1 3HE, United Kingdom
Max Bramer
School of Computing & Mathematical Sciences, University of Greenwich, Park Row 30, London, SE10 9LS, United Kingdom
Miltos Petridis
Faculty of Technology, De Montfort University, The Gateway, Leicester, LE1 9BH, United Kingdom
Adrian Hopgood

Cite this paper

Green, P., Lane, P.C., Rainer, A., Scholz, SB. (2011). Selecting Features in Origin Analysis. In: Bramer, M., Petridis, M., Hopgood, A. (eds) Research and Development in Intelligent Systems XXVII. SGAI 2010. Springer, London. https://doi.org/10.1007/978-0-85729-130-1_29

DOI: https://doi.org/10.1007/978-0-85729-130-1_29
Published: 29 October 2010
Publisher Name: Springer, London
Print ISBN: 978-0-85729-129-5
Online ISBN: 978-0-85729-130-1
eBook Packages: Computer Science, Computer Science (R0)
