
Software Bertillonage

Determining the provenance of software development artifacts

Empirical Software Engineering

Abstract

Deployed software systems are typically composed of many pieces, not all of which may have been created by the main development team. Often, the provenance of included components—such as external libraries or cloned source code—is not clearly stated, and this uncertainty can introduce technical and ethical concerns that make it difficult for system owners and other stakeholders to manage their software assets. In this work, we motivate the need for the recovery of the provenance of software entities by a broad set of techniques that could include signature matching, source code fact extraction, software clone detection, call flow graph matching, string matching, historical analyses, and other techniques. We liken our provenance goals to those of Bertillonage, a simple and approximate forensic analysis technique based on bio-metrics that was developed in 19th century France before the advent of fingerprints. As an example, we have developed a fast, simple, and approximate technique called anchored signature matching for identifying the source origin of binary libraries within a given Java application. This technique involves a type of structured signature matching performed against a database of candidates drawn from the Maven2 repository, a 275 GB collection of open source Java libraries. To show that the approach is both valid and effective, we conducted an empirical study on 945 jars from the Debian GNU/Linux distribution, as well as an industrial case study on 81 jars from an e-commerce application.

Notes

  1. The interdependence of the Bertillonage bio-metrics was recognized by Francis Galton, and it inspired him to devise the notion of statistical correlation.

  2. The GPL Compliance Engineering Guide recommends the extraction of literal strings to determine potential licensing violations (Hemel 2010).

  3. This is analogous to a policeman asking a suspect for her/his name and expecting a correct answer.

  4. Identifying the class’s own fully qualified name is determinate. The indeterminism only arises when we try to resolve internal references that point to other classes.

  5. http://repo1.maven.org/maven2/

  6. Debian pushes critical security updates out to its stable releases. These usually represent the smallest possible changes necessary to patch the discovered security holes.

  7. We were unable to process the beta implementations of generics sometimes found in Java 1.4 class files produced by a few brave bleeding-edge developers of that time.

  8. Our source code contains the full list of signature canonicalizations that we apply. The source code is available to download from our replication package: http://juliusdavies.ca/2013/j.emse/bertillonage/.

  9. We suspect a file named servlet-api-2.5.jar is the true origin of this large equivalence class of perfect matches. JSP & Servlet technologies have been an important part of Java’s popularity on servers for over 10 years, and servlet-api-2.5.jar is a critical interface library, originally published by Sun Microsystems, which all Java web and application servers must implement, including Tomcat, JBoss, Glassfish, Jetty, and many others. The 6.1.12 in this case probably comes from a version of Jetty. The Jetty project tends to rename its own critical dependencies so that they contain Jetty’s own version number alongside the original dependency’s version number.

  10. Values are rounded to nearest 10,000.

  11. We only count outer classes. Class files containing a $ (dollar-sign) character in their name are assumed to be inner classes, and are not included in these tallies. For example, only 3 of the class files listed earlier in Table 2 would count: A.class, B.class, and C.class, since these do not contain $ in their names.

  12. The chance of a birthday collision from SHA1 in our data set is less than 10⁻¹⁸.

  13. http://juliusdavies.ca/2011/icse/src/

  14. Unfortunately, we did not instrument our tools to collect unzip timings.

  15. See email from Bob Lee to dev@hc.apache.org on 18 Mar 2010 23:47:14 GMT, subject “Re: HttpClient in Android”.
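
The counting rule in note 11 can be illustrated with a minimal Java sketch. It is not the authors' tool: the class name OuterClassCount, its two methods, and the sample entry names are hypothetical, and the sketch assumes the $ test is applied to the file name portion of each archive entry.

```java
import java.util.Arrays;
import java.util.List;

public class OuterClassCount {

    // Assumption: an entry is treated as an inner class exactly when the
    // file name part of its path contains '$', as note 11 describes.
    static boolean isOuterClass(String entryName) {
        String fileName = entryName.substring(entryName.lastIndexOf('/') + 1);
        return fileName.endsWith(".class") && fileName.indexOf('$') < 0;
    }

    // Counts only the entries that pass the outer-class test above.
    static long countOuterClasses(List<String> entryNames) {
        return entryNames.stream()
                .filter(OuterClassCount::isOuterClass)
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical entry names, made up for illustration only.
        List<String> entries = Arrays.asList(
                "a/A.class", "a/A$1.class",
                "a/B.class", "a/B$Inner.class",
                "a/C.class", "META-INF/MANIFEST.MF");
        System.out.println(countOuterClasses(entries)); // prints 3
    }
}
```

As the note itself says, treating every $ as an inner-class marker is an assumption; a name-based check like this would also skip a top-level class whose own name happens to contain $.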

References

  • Cubranic D, Murphy GC, Singer J, Booth KS (2005) Hipikat: a project memory for software development. IEEE Trans Softw Eng 31(6):446–465

  • Davies J (2011) Measuring subversions: security and legal risk in reused software artifacts. In: Taylor RN, Gall H, Medvidovic N (eds) ICSE, pp 1149–1151, ACM

  • Davies J, Germán DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: van Deursen A, Xie T, Zimmermann T (eds) Proceedings of the 8th international working conference on mining software repositories, MSR 2011 (Co-located with ICSE), IEEE. Waikiki, Honolulu, HI, USA, May 21–28, pp 183–192

  • Di Penta M, Germán DM, Antoniol G (2010) Identifying licensing of jar archives using a code-search approach. In: MSR’10 Proc. of the intl. working conf. on mining software repositories, pp 151–160

  • Germán DM, Di Penta M, Guéhéneuc YG, Antoniol G (2009) Code siblings: technical and legal implications of copying code between applications. In: MSR ’09: Proc. of the Working Conf. on Mining Software Repositories, pp 81–90

  • Godfrey M, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Trans Softw Eng 31(2):166–181

  • Gosling J, Joy B, Steele G, Bracha G (2005) The java language specification, 2nd edn, section 3.8: Identifiers. http://docs.oracle.com/javase/specs/jls/se5.0/html/lexical.html#3.8. Accessed 27 March 2012

  • Hemel A (2010) The GPL compliance engineering guide version 3.5. http://www.loohuis-consulting.nl/downloads/compliance-manual.pdf. Accessed 27 March 2012

  • Hemel A, Kalleberg KT, Vermaas R, Dolstra E (2011) Finding software license violations through binary code clone detection. In: van Deursen A, Xie T, Zimmermann T (eds) Proceedings of the 8th international working conference on mining software repositories, MSR 2011 (Co-located with ICSE), IEEE. Waikiki, Honolulu, HI, USA, May 21–28, pp 63–72

  • Holmes R, Walker RJ (2010) Customized awareness: recommending relevant external change events. In: Kramer J, Bishop J, Devanbu PT, Uchitel S (eds) ICSE (1), ACM, pp 465–474

  • Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12):952–970

  • Houck MM, Siegel JA (2006) Fundamentals of forensic science. Academic Press

  • Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670

  • Kapser C, Godfrey MW (2008) ‘Cloning considered harmful’ considered harmful: patterns of cloning in software. Empir Software Eng 13(6):645–692

  • Kersten M, Murphy GC (2005) Mylar: a degree-of-interest model for ides. In: Mezini M, Tarr PL (eds) AOSD. ACM, pp 159–168

  • Kim M, Sazawal V, Notkin D, Murphy G (2005) An empirical study of code clone genealogies. ESEC/FSE 30(5):187–196

  • Krinke J (2008) Is cloned code more stable than non-cloned code? In: SCAM’08, pp 57–66

  • Livieri S, Higo Y, Matsushita M, Inoue K (2007) Very-large scale code clone analysis and visualization of open source programs using distributed ccfinder: D-ccfinder. In: ICSE, pp 106–115

  • Lozano A (2008) A methodology to assess the impact of source code flaws in changeability and its application to clones. In: ICSM 08: Proc. of the int. conf. of software maintenance, pp 424–427

  • Lozano A, Wermelinger M, Nuseibeh B (2007) Evaluating the harmfulness of cloning: a change based experiment. In: MSR ’07: proc. of the 4th int. workshop on mining soft. Repositories, p 18

  • Ossher J, Sajnani H, Lopes CV (2011) File cloning in open source java projects: the good, the bad, and the ugly. In: ICSM, IEEE, pp 283–292

  • PCI Security Standards Council (2009) Payment card industry data security standard (PCI DSS), version 1.2.1. https://www.pcisecuritystandards.org/security_standards

  • Robillard MP, Walker RJ, Zimmermann T (2010) Recommendation systems for software engineering. IEEE Softw 27(4):80–86

  • Siegel J, Saukko P, Knupfer G (2000) Encyclopedia of forensic sciences. Academic Press

  • Thummalapenta S, Cerulo L, Aversano L, Di Penta M (2009) An empirical study on the maintenance of source code clones. Empir Software Eng 15(1):1–34

  • Western Canada Research Grid. http://www.westgrid.ca/. Accessed 27 March 2012

  • Wheeler D Counting Source Lines of Code (SLOC). http://www.dwheeler.com/sloc/. Accessed 27 March 2012

Acknowledgement

We thank Dr. Anton Chuvakin of Security Warrior Consulting (www.chuvakin.org) for his advice on PCI DSS.

Author information

Corresponding author

Correspondence to Julius Davies.

Additional information

Editors: Tao Xie, Thomas Zimmermann and Arie van Deursen

About this article

Cite this article

Davies, J., German, D.M., Godfrey, M.W. et al. Software Bertillonage. Empir Software Eng 18, 1195–1237 (2013). https://doi.org/10.1007/s10664-012-9199-7

  • DOI: https://doi.org/10.1007/s10664-012-9199-7
