DOI: 10.1145/3463274.3463343
research-article

Assessing Developer Expertise from the Statistical Distribution of Programming Syntax Patterns

Published: 21 June 2021

ABSTRACT

Accurate assessment of developer expertise is crucial when assigning an individual to a task or, more generally, involving them in a project that requires an adequate level of knowledge. Potential programmers can come from a large pool. Therefore, automatic means of assessing expertise from written programs would be highly valuable in such a context.

Previous work toward this goal has generally relied on heuristics such as the Line 10 Rule, or on linguistic information in source files such as comments and identifiers, to represent developers’ knowledge and evaluate their expertise. In this paper, we focus on mastery of syntactic patterns as evidence of programming knowledge and propose a theoretical definition of programming knowledge based on the distribution of Syntax Patterns (SPs) in source code, namely Zipf’s law. We first validate the model and its scalability on synthetic data of “Expert” and “Novice” programmers. This provides a ground truth and allows us to explore the space of validity of the model. We then assess the performance of the model on real data from programmers. The results show that our proposed approach outperforms recent state-of-the-art approaches for the task of classifying programming experts.
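
As a rough illustration of what measuring a Zipf-like distribution of syntax patterns might look like, the sketch below counts Python AST node types and estimates a rank-frequency exponent. It is not the method from the paper: treating AST node types as "syntax patterns", the function names `syntax_pattern_counts` and `zipf_exponent`, and the least-squares fit on log-log data are all assumptions made for this example, since the abstract does not specify how SPs are extracted or how the distribution is fitted.

```python
import ast
import math
from collections import Counter


def syntax_pattern_counts(source: str) -> Counter:
    """Count AST node types in Python source.

    Assumption: node types stand in for "syntax patterns"; the paper's
    actual SP definition is not given in the abstract.
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def zipf_exponent(counts: Counter) -> float:
    """Estimate s in frequency ~ rank**(-s) by least squares on log-log data.

    Assumption: a simple regression is used here for brevity; it is not
    necessarily the fitting procedure used by the authors.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct patterns")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; return its magnitude.
    return -covariance / variance
```

A call such as `zipf_exponent(syntax_pattern_counts(source_text))` would give one number per program or per developer. Least-squares fitting on log-log data is known to be a biased estimator for power laws, so a maximum-likelihood fit would generally be preferable in a real analysis.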

  • Published in

    EASE '21: Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering
    June 2021
    417 pages
    ISBN:9781450390538
    DOI:10.1145/3463274

    Copyright © 2021 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 71 of 232 submissions, 31%
