Persistent code contribution: a ranking algorithm for code contribution in crowdsourced software

Empirical Software Engineering

Abstract

Measuring code contribution in crowdsourced software is essential for ranking contributors to a project or distributing revenue. Past studies have demonstrated that code contribution measures vary, both from one another and in their ability to rank users accurately. This study proposes a new code contribution ranking algorithm, Persistent Code Contribution (PCC), that aims to be language independent and quality aware, and to balance the ranking of new and senior users. PCC tracks the number of characters contributed by each user and ranks each character by the number of subsequent revisions it survived. It also tracks lines that may have been moved between revisions and attributes character changes to the user who committed them to the repository. PCC rankings are compared with those of existing code contribution measures to determine similarities and differences, and quantitative as well as qualitative evidence is presented to validate the algorithm.
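The idea of scoring characters by how many later revisions they survive can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single file, uses Python's standard `difflib.SequenceMatcher` for character matching, awards one point per character per surviving revision (an assumed weighting), and does not attempt the moved-line tracking described above. The function name `persistence_scores` and the `(author, text)` input format are hypothetical.

```python
from difflib import SequenceMatcher


def persistence_scores(revisions):
    """Score authors by how long their characters survive.

    revisions: list of (author, text) pairs in commit order (assumed format).
    Returns {author: points}, where a character earns one point for every
    later revision in which the matcher still finds it.
    """
    scores = {}
    owners = []      # owners[i] is the author of character i in `previous`
    previous = ""
    for author, text in revisions:
        scores.setdefault(author, 0)
        matcher = SequenceMatcher(None, previous, text, autojunk=False)
        # Characters not matched to the previous revision belong to this author.
        new_owners = [author] * len(text)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Surviving characters keep their author and earn one point each.
                new_owners[j1:j2] = owners[i1:i2]
                for owner in owners[i1:i2]:
                    scores[owner] += 1
        owners, previous = new_owners, text
    return scores


history = [("alice", "abc"), ("bob", "abcd"), ("carol", "bcd")]
print(persistence_scores(history))
```

In this toy history, alice's three characters survive bob's revision and two of them survive carol's, while bob's single character survives once, so long-lived contributions accumulate more points than recent or quickly removed ones.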



Author information

Correspondence to Michail Tsikerdekis.

Additional information

Communicated by: Maurizio Morisio

Appendix

1.1 A Proof of Claim 1

Let \(\lfloor x \rceil \in \mathbb {Z}^{+}\) be the closest integer to x. It holds that:

$$x - \frac{1}{2} \le \lfloor x \rceil < x + \frac{1}{2} $$

For \(x = n*mt\), where \(n > 0\) and \(mt \in \mathbb {R}\) with \(0 \le mt \le 1\), it follows that:

$$\begin{array}{@{}rcl@{}} &&n*mt - \frac{1}{2} \le \lfloor n*mt \rceil < n*mt + \frac{1}{2} \\ &&\implies \frac {n*mt - \frac{1}{2}}{n} \le \frac{\lfloor n*mt \rceil}{n} < \frac{n*mt + \frac{1}{2}}{n} \\ &&\implies \frac {n*mt}{n} - \frac{\frac{1}{2}}{n} \le \frac{\lfloor n*mt \rceil}{n} < \frac{n*mt}{n} + \frac{\frac{1}{2}}{n} \\ &&\implies mt - \frac{1}{2n} \le \frac{\lfloor n*mt \rceil}{n} < mt + \frac{1}{2n} \\ &&\implies mt - mt - \frac{1}{2n} \le \underset{\underset{\text{discrete form}}{\text{percentage over}}}{\underbrace{\frac{\lfloor n*mt \rceil}{n}}} - mt < mt - mt + \frac{1}{2n} \end{array} $$

As the middle part represents the difference (or distortion) due to applying a percentage to a discrete set, we can represent it simply as d.

$$\begin{array}{@{}rcl@{}} &&- \frac{1}{2n} \le d < \frac{1}{2n} \\ &&\implies |d| \le \frac{1}{2n} \end{array} $$

Therefore the largest absolute value that d can take is \(|d_{max}| = \frac {1}{2n}\).
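The bound can be checked numerically. The sketch below is an illustration rather than part of the proof: it uses exact rational arithmetic and Python's built-in `round`, which resolves ties to the nearest even integer. That tie rule differs from the convention implied by the strict inequality above, but it does not change the bound on \(|d|\). The helper names `distortion` and `max_abs_distortion` and the grid of 1000 steps are assumptions made for the example.

```python
from fractions import Fraction


def distortion(n, mt):
    """Return d = round(n*mt)/n - mt for a set of n items and a fraction mt."""
    return Fraction(round(n * mt), n) - mt


def max_abs_distortion(n, steps=1000):
    """Largest |d| over mt = 0, 1/steps, ..., 1 (a finite grid, not all reals)."""
    return max(abs(distortion(n, Fraction(k, steps))) for k in range(steps + 1))


for n in (1, 2, 5, 10, 50):
    bound = Fraction(1, 2 * n)
    # The observed maximum should never exceed 1/(2n).
    print(n, max_abs_distortion(n), bound, max_abs_distortion(n) <= bound)
```

For n = 2 and mt = 1/4 the product n*mt is exactly 1/2, so rounding in either direction gives |d| = 1/4 = 1/(2n), which shows that the bound is attained and cannot be tightened.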


Cite this article

Tsikerdekis, M. Persistent code contribution: a ranking algorithm for code contribution in crowdsourced software. Empir Software Eng 23, 1871–1894 (2018). https://doi.org/10.1007/s10664-017-9575-4


  • DOI: https://doi.org/10.1007/s10664-017-9575-4
