Skip to main content

Selecting Features in Origin Analysis

  • Conference paper
  • First Online:

Abstract

When applying a machine-learning approach to develop classifiers in a new domain, an important question is what measurements to take and how they will be used to construct informative features. This paper develops a novel set of machine-learning classifiers for the domain of classifying files taken from software projects; the target classifications are based on origin analysis. Our approach adapts the output of four copy-analysis tools, generating a number of different measurements. By combining the measures and the files on which they operate, a large set of features is generated in a semi-automatic manner. After which, standard attribute selection and classifier training techniques yield a pool of high quality classifiers (accuracy in the range of 90%), and information on the most relevant features.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   169.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   219.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Ammann, C.M.: Duplo - code clone detection tool. Sourceforge project (2005) http://sourceforge.net/projects/duplo/

  2. Antoniol, G., Penta, M.D., Merlo, E.: An automatic approach to identify class evolution discontinuities. In: IWPSE ’04: Proceedings of the Principles of Software Evolution, 7th International Workshop, pp. 31–40. IEEE Computer Society, Washington, DC, USA (2004)

    Chapter  Google Scholar 

  3. Glocer, K., Eads, D., Theiler, J.: Online feature selection for pixel classification. In: ICML ’05: Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany (2005)

    Google Scholar 

  4. Godfrey, M.W., Zou, L.: Using origin analysis to detect merging and splitting of source code entities. IEEE Trans. Software Eng. 31(2), 166–181 (2005)

    Article  Google Scholar 

  5. Green, P., Lane, P.C.R., Rainer, A., Scholz, S.B.: Building classifiers to identify split files. In: P. Perner (ed.) MLDM Posters, pp. 1–8. IBaI Publishing (2009)

    Google Scholar 

  6. Green, P., Lane, P.C.R., Rainer, A., Scholz, S.B.: Analysing ferret XML reports to estimate the density of copied code. Tech. Rep. 501, Science and Technology Research Institute, University of Hertfordshire, UK (2010)

    Google Scholar 

  7. Green, P., Lane, P.C.R., Rainer, A., Scholz, S.B.: Unscrambling code clones for one-to-one matching of duplicated code. Tech. Rep. 502, Science and Technology Research Institute, University of Hertfordshire, UK (2010)

    Google Scholar 

  8. Hall, M.A.: Correlation-based feature subset selection for machine learning. Ph.D. thesis, University of Waikato, Hamilton, New Zealand (1998)

    Google Scholar 

  9. Harris, S.: Simian. http://www.redhillconsulting.com.au/products/simian/. Copyright (c) 2003-08 RedHill Consulting Pty. Ltd.

  10. Kamiya, T., Kusumoto, S., Inoue, K.: CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng. 28(7), 654–670 (2002)

    Article  Google Scholar 

  11. Kim, M., Notkin, D.: Program element matching for multi-version program analyses. In: MSR ’06: Proceedings of the 2006 international workshop on Mining software repositories, pp. 58–64. ACM, New York, NY, USA (2006)

    Chapter  Google Scholar 

  12. Kim, S., Pan, K., Jr., E.J.W.: When functions change their names: Automatic detection of origin relationships. In: 12th Working Conference on Reverse Engineering (WCRE 2005), 7-11 November 2005, Pittsburgh, PA, USA, pp. 143–152. IEEE Computer Society (2005)

    Google Scholar 

  13. Kramer, S., de Raedt, L.: Feature construction with version spaces for biochemical applications. In ICML ’01: Proceedings of the 18th International Conference on Machine Learning, (2001)

    Google Scholar 

  14. Krawiec, K.: Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 3, 329–343 (2002)

    Article  MATH  Google Scholar 

  15. Rainer, A.W., Lane, P.C.R., Malcolm, J.A., Scholz, S.B.: Using n-grams to rapidly characterise the evolution of software code. In: The Fourth International ERCIM Workshop on Software Evolution and Evolvability (2008)

    Google Scholar 

  16. Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Sci. Comput. Program. 74(7), 470–495 (2009)

    Article  MATH  Google Scholar 

  17. Sourceforge open source software repository : http://sourceforge.net/ (1998)

  18. Witten, I.H., Frank, E.: Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufman, San Francisco, CA, USA (2000) http://www.cs.waikato.ac.nz/ml/weka

  19. Yamamoto, T., Matsushita, M., Kamiya, T., Inoue, K.: Similarity of software system and its measurement tool SMMT. Systems and Computers in Japan 38(6), 91–99 (2007)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Pam Green , Peter C.R. Lane , Austen Rainer or Sven-Bodo Scholz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag London Limited

About this paper

Cite this paper

Green, P., Lane, P.C., Rainer, A., Scholz, SB. (2011). Selecting Features in Origin Analysis. In: Bramer, M., Petridis, M., Hopgood, A. (eds) Research and Development in Intelligent Systems XXVII. SGAI 2010. Springer, London. https://doi.org/10.1007/978-0-85729-130-1_29

Download citation

  • DOI: https://doi.org/10.1007/978-0-85729-130-1_29

  • Published:

  • Publisher Name: Springer, London

  • Print ISBN: 978-0-85729-129-5

  • Online ISBN: 978-0-85729-130-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics