Scalable deep learning for bug detection
Date: 31/07/2021
Author: Karampatsis, Rafael-Michael
Abstract
The application of machine learning (ML) and natural language processing (NLP) methods
to building software engineering (SE) tools is an emerging trend. A crucial early
decision is how to model software's vocabulary. Unlike in natural language, software
developers are free to create any identifiers they like, and can make them arbitrarily
complex, resulting in a severe out-of-vocabulary problem. This fundamental fact
prohibits the training of neural models on large-scale software corpora.
This thesis aims to address this problem. As an initial step, we studied the most
common vocabulary-reduction methods previously considered in the software engineering
literature and found that they are not sufficient to obtain a vocabulary of manageable
size. Instead, this goal was reached by adapting the Byte-Pair Encoding (BPE)
algorithm, which yields an open-vocabulary neural language model (NLM).
Experiments on large corpora show that the resulting NLM outperforms other LMs in
both perplexity and code completion performance across several programming languages.
The thesis continues by showing that the improvement in language modelling transfers to
downstream SE tasks: the BPE NLMs are more effective at highlighting buggy code
than previous LMs. Driven by this finding and by recent advances in NLP, it also
investigates the idea of transferring language model representations to program repair
systems.
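The open-vocabulary idea can be illustrated with a minimal sketch of BPE. The toy corpus of Java-style identifiers below is an assumption for demonstration only; the thesis's actual adaptation operates on full tokenized source corpora and a tuned number of merges.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge operations from a corpus of identifiers.

    Each word starts as a sequence of characters; the most frequent
    adjacent pair is merged repeatedly, yielding subword units that keep
    the vocabulary bounded (the idea behind open-vocabulary NLMs).
    """
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split an unseen identifier into subwords using the learned merges."""
    pieces = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        pieces = out
    return pieces
```

Because any identifier can be decomposed into learned subwords (or, in the limit, single characters), no token is ever out of vocabulary.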
Program repair is an important but difficult software engineering problem. One way
to achieve a "sweet spot" of low false positive rates, while maintaining high enough
recall to be usable, is to focus on repairing classes of simple bugs, such as bugs with
single-statement fixes, or bugs that match a small set of templates. However, it is very
difficult to estimate the recall of repair techniques based on templates or on repairing
simple bugs, as there are no datasets recording how often the associated bugs occur in code.
To fill this gap, the thesis contributes a large dataset of single-statement Java bug-fix
changes, annotated by whether they match any of a set of 16 bug templates, along with
a methodology for mining similar datasets. These patterns were selected on the criteria
that they appear often in open-source Java code and relate to those used by mutation
and pattern-based repair tools. They also aim to capture bugs that compile both before
and after repair, as such bugs can be quite tedious to spot manually, yet their fixes
are simple. The mined bugs are quite frequent, appearing about once every 2000 lines of
code, and their fixes are very often already present elsewhere in the code, satisfying
the popular plastic surgery hypothesis.
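The template-matching step of such mining can be sketched as follows. The template names, the crude tokenizer, and the two patterns shown are assumptions for illustration, not the thesis's exact taxonomy or tooling:

```python
import re

# Binary operators ordered so multi-character operators match first.
BINARY_OPS = ["==", "!=", "<=", ">=", "<", ">", "+", "-", "*", "/"]

def tokenize(stmt):
    # Crude Java-ish tokenizer: operators, then words, then any symbol.
    pattern = "|".join(re.escape(op) for op in BINARY_OPS)
    return re.findall(pattern + r"|\w+|\S", stmt)

def match_template(before, after):
    """Return a template name if the (before, after) statement pair matches
    one of two simple single-statement bug patterns, else None."""
    tb, ta = tokenize(before), tokenize(after)
    if len(tb) == len(ta):
        diffs = [i for i, (x, y) in enumerate(zip(tb, ta)) if x != y]
        if len(diffs) == 1:  # exactly one token changed
            i = diffs[0]
            if tb[i] in BINARY_OPS and ta[i] in BINARY_OPS:
                return "CHANGE_OPERATOR"
            if tb[i].isidentifier() and ta[i].isidentifier():
                return "CHANGE_IDENTIFIER"
    return None
```

For example, the fix `if (a < b)` to `if (a <= b)` would be classified as a changed operator, while `foo(a)` to `bar(a)` would be a changed identifier; fixes that add or delete statements match neither pattern.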
Furthermore, the thesis introduces the hypothesis that contextual embeddings offer
modelling advantages specifically suited to source code, due to its nature. Contextual
embeddings are common in natural language processing but had not previously been
applied in software engineering. Another contribution is therefore the introduction of
a new set of deep contextualized word representations for computer programs, based on
the ELMo (Embeddings from Language Models) framework of Peters et al. (2018). It is
shown that even a low-dimensional embedding trained on a relatively small corpus of
programs can improve a state-of-the-art machine learning system for detecting bugs
with single-statement fixes. The systems were evaluated on the DeepBugs dataset of
synthetic bugs, a new synthetic test dataset, and a small dataset of real JavaScript
bugs. Lastly, the final contribution takes the first steps towards answering whether
neural bug finding is useful in practice, via an evaluation study over a small set of
real bugs.
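What "contextual" means here can be shown with a toy sketch: the same token receives a different vector in each surrounding context, unlike a static embedding table. Everything below is an assumption for demonstration; real ELMo uses character convolutions and large trained LSTM layers, whereas these weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"x": 0, "=": 1, "y": 2, "+": 3}
D = 8
E = rng.normal(size=(len(VOCAB), D))       # static token embeddings
W_f = rng.normal(size=(2 * D, D)) * 0.1    # forward RNN weights
W_b = rng.normal(size=(2 * D, D)) * 0.1    # backward RNN weights

def rnn(inputs, W):
    # Minimal recurrent encoder: each state mixes the current input
    # with the previous hidden state.
    h, states = np.zeros(D), []
    for x in inputs:
        h = np.tanh(np.concatenate([x, h]) @ W)
        states.append(h)
    return states

def contextual_embed(tokens):
    """Return one context-dependent vector per token, concatenating a
    forward and a backward pass as in a bidirectional language model."""
    xs = [E[VOCAB[t]] for t in tokens]
    fwd = rnn(xs, W_f)
    bwd = rnn(xs[::-1], W_b)[::-1]  # realign backward states to positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

With a static table, the identifier `x` maps to one fixed row of `E`; with the bidirectional encoder, its vector also reflects the tokens around it, which is the property the hypothesis argues suits source code, where an identifier's meaning depends heavily on its surroundings.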