Scalable deep learning for bug detection
Date: 31/07/2021
Author: Karampatsis, Rafael-Michael
Abstract
The application of machine learning (ML) and natural language processing (NLP) methods
to building software engineering (SE) tools is an emerging trend. A crucial early
decision is how to model software's vocabulary. Unlike in natural language, software
developers are free to create any identifiers they like, and can make them arbitrarily
complex, resulting in a severe out-of-vocabulary problem. This fundamental fact
prohibits the training of neural models on large-scale software corpora.
This thesis aims to address this problem. As an initial step, we studied the most
common vocabulary-reduction methods previously considered in the software engineering
literature and found that they are not sufficient to obtain a vocabulary of manageable
size. Instead, this goal was reached by adapting the Byte-Pair Encoding (BPE)
algorithm, which yields an open-vocabulary neural language model (NLM).
Experiments on large corpora show that the resulting NLM outperforms other LMs in
both perplexity and code completion performance across several programming languages.
The thesis continues by showing that the improvement in language modelling transfers to
downstream SE tasks: the BPE NLMs are more effective at highlighting buggy code
than previous LMs. Driven by this finding and by recent advances in NLP, it also
investigates the idea of transferring language model representations to program repair
systems.
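The open-vocabulary idea can be illustrated with a minimal sketch of BPE. The toy corpus of Java-style identifiers below is an assumption for demonstration only; the thesis's actual adaptation operates on full tokenized source corpora and a tuned number of merges.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge operations from a corpus of identifiers.

    Each word starts as a sequence of characters; the most frequent
    adjacent pair is merged repeatedly, yielding subword units that keep
    the vocabulary bounded (the idea behind open-vocabulary NLMs).
    """
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split an unseen identifier into subwords using the learned merges."""
    pieces = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        pieces = out
    return pieces
```

Because any identifier can be decomposed into learned subwords (or, in the limit, single characters), no token is ever out of vocabulary.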
Program repair is an important but difficult software engineering problem. One way
to achieve a "sweet spot" of low false positive rates, while maintaining high enough
recall to be usable, is to focus on repairing classes of simple bugs, such as bugs with
single-statement fixes, or bugs that match a small set of templates. However, it is very
difficult to estimate the recall of repair techniques based on templates or on repairing
simple bugs, as there are no datasets recording how often the associated bugs occur in code.
To fill this gap, the thesis contributes a large dataset of single-statement Java bug-fix
changes, annotated by whether they match any of a set of 16 bug templates, along with
a methodology for mining similar datasets. These patterns were selected on the criteria
that they appear often in open-source Java code and relate to those used by mutation
and pattern-based repair tools. They also aim to capture bugs that compile both before
and after repair, as such bugs can be quite tedious to spot manually, yet their fixes
are simple. The mined bugs are quite frequent, appearing about once every 2000 lines of
code, and their fixes are very often already present elsewhere in the code, satisfying
the popular plastic surgery hypothesis.
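The template-matching step of such mining can be sketched as follows. The template names, the crude tokenizer, and the two patterns shown are assumptions for illustration, not the thesis's exact taxonomy or tooling:

```python
import re

# Binary operators ordered so multi-character operators match first.
BINARY_OPS = ["==", "!=", "<=", ">=", "<", ">", "+", "-", "*", "/"]

def tokenize(stmt):
    # Crude Java-ish tokenizer: operators, then words, then any symbol.
    pattern = "|".join(re.escape(op) for op in BINARY_OPS)
    return re.findall(pattern + r"|\w+|\S", stmt)

def match_template(before, after):
    """Return a template name if the (before, after) statement pair matches
    one of two simple single-statement bug patterns, else None."""
    tb, ta = tokenize(before), tokenize(after)
    if len(tb) == len(ta):
        diffs = [i for i, (x, y) in enumerate(zip(tb, ta)) if x != y]
        if len(diffs) == 1:  # exactly one token changed
            i = diffs[0]
            if tb[i] in BINARY_OPS and ta[i] in BINARY_OPS:
                return "CHANGE_OPERATOR"
            if tb[i].isidentifier() and ta[i].isidentifier():
                return "CHANGE_IDENTIFIER"
    return None
```

For example, the fix `if (a < b)` to `if (a <= b)` would be classified as a changed operator, while `foo(a)` to `bar(a)` would be a changed identifier; fixes that add or delete statements match neither pattern.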
Furthermore, the thesis introduces the hypothesis that contextual embeddings offer
modelling advantages specifically suited to source code, due to its nature. Contextual
embeddings are common in natural language processing but had not previously been
applied in software engineering. Another contribution is therefore the introduction of
a new set of deep contextualized word representations for computer programs, based on
the ELMo (Embeddings from Language Models) framework of Peters et al. (2018). It is
shown that even a low-dimensional embedding trained on a relatively small corpus of
programs can improve a state-of-the-art machine learning system for detecting bugs
with single-statement fixes. The systems were evaluated on the DeepBugs dataset of
synthetic bugs, a new synthetic test dataset, and a small dataset of real JavaScript
bugs. Lastly, the final contribution takes the first steps towards answering whether
neural bug finding is useful in practice, via an evaluation study over a small set of
real bugs.
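What "contextual" means here can be shown with a toy sketch: the same token receives a different vector in each surrounding context, unlike a static embedding table. Everything below is an assumption for demonstration; real ELMo uses character convolutions and large trained LSTM layers, whereas these weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"x": 0, "=": 1, "y": 2, "+": 3}
D = 8
E = rng.normal(size=(len(VOCAB), D))       # static token embeddings
W_f = rng.normal(size=(2 * D, D)) * 0.1    # forward RNN weights
W_b = rng.normal(size=(2 * D, D)) * 0.1    # backward RNN weights

def rnn(inputs, W):
    # Minimal recurrent encoder: each state mixes the current input
    # with the previous hidden state.
    h, states = np.zeros(D), []
    for x in inputs:
        h = np.tanh(np.concatenate([x, h]) @ W)
        states.append(h)
    return states

def contextual_embed(tokens):
    """Return one context-dependent vector per token, concatenating a
    forward and a backward pass as in a bidirectional language model."""
    xs = [E[VOCAB[t]] for t in tokens]
    fwd = rnn(xs, W_f)
    bwd = rnn(xs[::-1], W_b)[::-1]  # realign backward states to positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

With a static table, the identifier `x` maps to one fixed row of `E`; with the bidirectional encoder, its vector also reflects the tokens around it, which is the property the hypothesis argues suits source code, where an identifier's meaning depends heavily on its surroundings.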