ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems

Amreen, Sadika; Mockus, Audris; Zaretzki, Russell; Bogart, Christopher; Zhang, Yuxia

doi:10.1007/s10664-019-09786-7

Published: 03 January 2020

Empirical Software Engineering, volume 25, pages 1136–1167 (2020)

Abstract

An accurate determination of developer identities is important for software engineering research and practice. Without it, even simple questions such as “how many developers does a project have?” cannot be answered. The commonly used version control data from Git is full of identity errors, and the existing approaches to correct these errors are difficult to validate at large scale and cannot be easily improved. We therefore aim to develop a scalable, highly accurate, easy-to-use, and easy-to-improve approach to correct software developer identity errors. We first amalgamate developer identities from version control systems in open source software repositories and investigate the nature and prevalence of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using a collection of over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models to predict the identities for the remaining author strings in OpenStack. We then compare these predictions with a competing commercially available effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find commits made from different environments, misspellings, organizational IDs, default values, and anonymous IDs to be the major sources of errors. We also find that supervised learning methods reduce errors severalfold compared with existing research and commercial methods, and that the active learning approach is an effective way to create validated datasets. Results also indicate that correction of developer identity has a large impact on the inference of the social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that involve developer identities.
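To make the fingerprint idea concrete, the following is a minimal sketch of how two of the three behavioral fingerprints named in the abstract (time-zone frequencies and the set of files modified) could be turned into similarity features for a candidate pair of author identities. The record layout, the function names, and the choice of cosine and Jaccard similarity are illustrative assumptions, not the measures confirmed by the article; the commit-message embedding is omitted to keep the example dependency-free.

```python
from collections import Counter
from math import sqrt


def timezone_similarity(tz_a, tz_b):
    """Cosine similarity between two time-zone frequency profiles.

    Assumption: each argument is a list of UTC offsets, one per commit.
    """
    fa, fb = Counter(tz_a), Counter(tz_b)
    dot = sum(fa[z] * fb[z] for z in fa.keys() & fb.keys())
    norm = sqrt(sum(v * v for v in fa.values())) * sqrt(sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0


def file_similarity(files_a, files_b):
    """Jaccard overlap between the sets of files two identities modified."""
    a, b = set(files_a), set(files_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(id_a, id_b):
    """Feature vector for one candidate pair of author identities.

    Assumption: each identity is a dict with hypothetical keys
    "timezones" and "files"; a real pipeline would add string and
    commit-message features before fitting a supervised model.
    """
    return [
        timezone_similarity(id_a["timezones"], id_b["timezones"]),
        file_similarity(id_a["files"], id_b["files"]),
    ]


# Hypothetical identities, invented for illustration only.
alias_1 = {"timezones": ["+0100", "+0100", "+0200"], "files": ["a.py", "b.py"]}
alias_2 = {"timezones": ["+0100", "+0100", "+0200"], "files": ["b.py", "c.py"]}
features = pair_features(alias_1, alias_2)
```

Feature vectors of this kind, computed for labeled pairs, are the sort of input a supervised classifier could use to decide whether two author strings belong to the same developer.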

Notes

https://www.openstack.org/
https://opensource.com/resources/what-is-openstack
https://radimrehurek.com/gensim/index.html
On large and diverse bodies of text, a larger vector size of 300 is recommended (Řehůřek and Sojka 2010)
We found that more accurate predictors can be obtained by training the learner only on the matched pairs, since the transitive closure typically results in some pairs that are extremely dissimilar, leading the learner to learn from such pairs and, subsequently, produce many more false positives
Assuming independence of observations and using binomial distribution.
https://bitergia.com/
https://github.com/bvasiles/ght_unmasking_aliases
The author obtained much better results than we could obtain using their published code without modifications.
https://www.bioconductor.org
https://luarocks.org
https://www.stackage.org/lts-10.5
https://www.olcf.ornl.gov/
https://github.com/ssc-oscar/ALFAA-Replication
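The note above on training only on matched pairs can be illustrated with a small sketch. Taking the transitive closure of matched pairs groups identities into clusters, and every pair inside a cluster becomes an implied match, including pairs that were never directly compared and may look very dissimilar. The union-find helper, the function names, and the example identities below are hypothetical and are not taken from the article's replication package.

```python
from itertools import combinations


def clusters_from_pairs(pairs):
    """Group identities into clusters using union-find over matched pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def closure_pairs(pairs):
    """All pairs implied by the transitive closure of the matched pairs."""
    implied = set()
    for group in clusters_from_pairs(pairs):
        implied.update(frozenset(p) for p in combinations(sorted(group), 2))
    return implied


# Hypothetical matched pairs, invented for illustration only.
matched = [
    ("J. Smith <js@a.org>", "John Smith <js@a.org>"),
    ("John Smith <js@a.org>", "John Smith <john@b.com>"),
]
direct = {frozenset(p) for p in matched}
# Pairs that exist only because of the closure, never matched directly.
extra = closure_pairs(matched) - direct
```

Here `extra` holds the single pair linking "J. Smith <js@a.org>" to "John Smith <john@b.com>", which shares neither the full name nor the email address. Training on the pairs in `direct` alone avoids teaching the learner that such dissimilar pairs are matches.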

References

Badashian AS, Esteki A, Gholipour A, Hindle A, Stroulia E (2014) Involvement, contribution and influence in github and stack overflow. In: Proceedings of 24th annual international conference on computer science and software engineering, pp 19–33. IBM Corp
Baltes S, Diehl S (2018) Usage and attribution of stack overflow code snippets in github projects. Empir Softw Eng 24:1–37
Bird C, Gourley A, Devanbu P, Gertz M, Swaminathan A (2006) Mining email social networks. In: Proceedings of the 2006 international workshop on mining software repositories, MSR ’06. https://doi.org/10.1145/1137983.1138016. ACM, New York, pp 137–143
Bird C, Rigby PC, Barr ET, Hamilton DJ, German DM, Devanbu P (2009) The promises and perils of mining git
Bonacich P (1987) Power and centrality: A family of measures. Am J Soc 92 (5):1170–1182. https://doi.org/10.1086/228631
Burt RS (1992) Structural holes. Harvard University Press, Harvard
Cataldo M, Wagstrom PA, Herbsleb JD, Carley KM (2006) Identification of coordination requirements: implications for the design of collaboration and awareness tools. In: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, pp 353–362. ACM
Cataldo M, Herbsleb JD, Carley KM (2008) Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In: Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, pp 2–11. ACM
Christen P (2006) A comparison of personal name matching: Techniques and practical issues. In: 6th IEEE international conference on data mining - workshops (ICDMW’06), pp 290–294. https://doi.org/10.1109/ICDMW.2006.2
Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string metrics for matching names and records. In: KDD Workshop on data cleaning and object consolidation
Czerwonka J, Nagappan N, Schulte W, Murphy B (2013) Codemine: Building a software development data analytics platform at microsoft. IEEE Softw 30 (4):64–71
Edberg DT, Bowman BJ (1996) User-developed applications: An empirical study of application quality and developer productivity. J Manag Inf Syst 13(1):167–185
Fellegi IP, Sunter AB (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183–1210. https://doi.org/10.1080/01621459.1969.10501049. https://www.tandfonline.com/doi/abs/10.1080/01621459.1969.10501049
Freeman LC (1978) Centrality in social networks conceptual clarification. Soc Netw 1(3):215–239. https://doi.org/10.1016/0378-8733(78)90021-7. http://www.sciencedirect.com/science/article/pii/0378873378900217
German DM (2004) Mining cvs repositories, the softchange experience. In: 1st international workshop on mining software repositories, pp 17–21. Citeseer
German D, Mockus A (2003) Automating the measurement of open source projects. In: Proceedings of the 3rd workshop on open source software engineering, pp 63–67. University College Cork Cork Ireland
Gharehyazie M, Posnett D, Vasilescu B, Filkov V (2015) Developer initiation and social interactions in oss: A case study of the apache software foundation. Empirical Softw Eng 20(5):1318–1353. https://doi.org/10.1007/s10664-014-9332-x
Goeminne M, Mens T (2013) A comparison of identity merge algorithms for software repositories. Sci Comput Progr 78(8):971–986. https://doi.org/10.1016/j.scico.2011.11.004. http://www.sciencedirect.com/science/article/pii/S0167642311002048
Hallgren KA (2012) Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in quantitative methods for psychology 8(1):23
Jergensen C, Sarma A, Wagstrom P (2011) The onion patch: migration in open source ecosystems. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pp 70–80. ACM
Kouters E, Vasilescu B, Serebrenik A, van den Brand MGJ (2012) Who’s who in gnome: using lsa to merge software repository identities. In: 28th IEEE international conference on software maintenance (ICSM). IEEE
Lawrence S, Giles CL, Bollacker K (1999) Digital libraries and autonomous citation indexing. Computer 32(6):67–71. https://doi.org/10.1109/2.769447
Le Q, Mikolov T (2014) Distributed representation of sentences and documents. In: Proceedings of the 31st international conference on machine learning. https://cs.stanford.edu/quocle/paragraph_vector.pdf, vol 32. JMLR, Beijing
Ma Y, Bogart C, Amreen S, Zaretzki R, Mockus A (2019) World of code: An infrastructure for mining the universe of open source vcs data. In: Proceedings of the 2019 international conference on mining software repositories
Martinez-Romo J, Robles G, Gonzalez-Barahona JM, Ortuño-Perez M (2008) Using social network analysis techniques to study collaboration between a floss community and a company. In: Russo B, Damiani E, Hissam S, Lundell B, Succi G (eds) Open source development, communities and quality. Springer, Boston, pp 171–186
Mockus A (2009a) Amassing and indexing a large sample of version control systems: towards the census of public source code history. In: 6th IEEE working conference on mining software repositories. IEEE. papers/amassing.pdf
Mockus A (2009b) Succession: Measuring transfer of code and developer productivity. In: Proceedings of the 31st international conference on software engineering, pp 67–77. IEEE Computer Society
Mockus A (2009c) Succession: Measuring transfer of code and developer productivity. In: 2009 international conference on software engineering. papers/succession.pdf. ACM Press, Vancouver
Mockus A (2014) Engineering big data solutions. In: ICSE’14 FOSE, pp 85–99. http://dl.acm.org/authorize?N14216
Mockus A, Herbsleb JD (2002) Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th international conference on software engineering, pp 503–512. ACM
Nagappan N, Murphy B, Basili V (2008) The influence of organizational structure on software quality. In: 2008 ACM/IEEE 30th international conference on software engineering, pp 521–530. IEEE
Nesbitt A, Nickolls B (2017) Libraries.io open source repository and dependency metadata. https://doi.org/10.5281/zenodo.808273
Ostrouchov G, Chen WC, Schmidt D, Patel P (2012) Programming with big data in r. URL http://r-pbd.org
Petersen K, Wohlin C (2011) Measuring the flow in lean software development. Software: Practice and experience 41(9):975–996
Pinzger M, Nagappan N, Murphy B (2008) Can developer-module networks predict failures?. In: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering, SIGSOFT ’08/FSE-16. https://doi.org/10.1145/1453101.1453105. ACM, New York, pp 2–12
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, Valletta, pp 45–50
Robles G, Gonzalez-Barahona JM (2005) Developer identification methods for integrated data from various sources. In: Proceedings of the 2005 international workshop on mining software repositories, MSR ’05. https://doi.org/10.1145/1082983.1083162. ACM, New York, pp 1–5
Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02. https://doi.org/10.1145/775047.775087. ACM, New York, pp 269–278
Sariyar M, Borg A (2010) The recordlinkage package: Detecting errors in data. The R J 2(1):61–67. https://journal.r-project.org/archive/2010-2/RJournal_2010-_Sariyar+Borg.pdf
Smalheiser NR, Torvik VI (2011) Author name disambiguation. Annual Review of Information Science and Technology 43(1):1–43. https://doi.org/10.1002/aris.2009.1440430113. https://onlinelibrary.wiley.com/doi/abs/10.1002/aris.2009.1440430113
Spencer D, Warfel T (2004) Card sorting: A definitive guide. Boxes and Arrows, pp 2
Thung F, Bissyande TF, Lo D, Jiang L (2013) Network structure of social coding in github. In: 2013 17th European conference on software maintenance and reengineering, pp 323–326. IEEE
Vasilescu B, Serebrenik A, Filkov V (2015) A data set for social diversity studies of github teams. In: Proceedings of the 12th working conference on mining software repositories, pp 514–517. ACM. https://dl.acm.org/citation.cfm?id=2820601
Ventura SL, Nugent R, Fuchs ER (2015) Seeing the non-starts: (some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records. Elsevier
Wang DJ, Shi X, McFarland DA, Leskovec J (2012) Measurement error in network data: A re-classification. Soc Netw 34(4):396–409
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393:440–442. https://doi.org/10.1038/30918
Wiese IS, da Silva JT, Steinmacher I, Treude C, Gerosa MA (2016) Who is who in the mailing list? comparing six disambiguation heuristics to identify multiple addresses of a participant. In: 2016 IEEE international conference on software maintenance and evolution (ICSME), pp 345–355. https://doi.org/10.1109/ICSME.2016.13 (to appear in print)
Winkler WE (2006) Overview of record linkage and current research directions. Tech. rep., Bureau of the Census
Wolf T, Schröter A, Damian D, Panjer LD, Nguyen THD (2009) Mining task-based social networks to explore collaboration in software teams. IEEE Softw 26(1):58–66. https://doi.org/10.1109/MS.2009.16
Xiong Y, Meng Z, Shen B, Yin W (2017) Mining developer behavior across github and stackoverflow. In: The 29th international conference on software engineering and knowledge engineering, pp 578–583. https://doi.org/10.18293/SEKE2017-062
Zhou M, Mockus A, Ma X, Zhang L, Mei H (2016) Inflow and retention in oss communities with commercial involvement: A case study of three hybrid projects. ACM Transactions on Software Engineering and Methodology (TOSEM) 25(2):13
Zhu J, Wei J (2019) An empirical study of multiple names and email addresses in oss version control repositories. In: Proceedings of 16th international conference on mining software repositories (MSR). IEEE/ACM

Acknowledgments

This research material is based on work supported by the National Science Foundation (NSF) grants IIS-1633437 and IIS-1901102. We would like to thank our collaborators on the Open Source Supply Chains and Avoidance of Risk (OSCAR) team at the University of Tennessee and at the Institute for Software Research (ISR) at Carnegie Mellon University for their valuable feedback on this work.

Author information

Authors and Affiliations

University of Tennessee, Knoxville, TN, USA
Sadika Amreen, Audris Mockus & Russell Zaretzki
Carnegie Mellon University, Pittsburgh, PA, USA
Christopher Bogart
Peking University, Beijing, China
Yuxia Zhang

Corresponding author

Correspondence to Sadika Amreen.

Additional information

Communicated by: Daniel Méndez

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Amreen, S., Mockus, A., Zaretzki, R. et al. ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems. Empir Software Eng 25, 1136–1167 (2020). https://doi.org/10.1007/s10664-019-09786-7

Published: 03 January 2020
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10664-019-09786-7
