Abstract:
As one of the most important resources for expressing the semantics of source code, identifiers are usually composed of several common or domain-specific terms and abbreviations, which heavily hinders developers in analyzing and comprehending source code. Hence, it is necessary to normalize identifiers, that is, to align the vocabulary found in identifiers with natural language words found in other software artifacts. Although researchers have proposed several identifier normalization approaches, these approaches rely only on the lexical information in identifiers and related source code entities, and thus lack a deep semantic understanding of identifiers. In this paper, we propose BEQAIN, an effective and efficient identifier normalization approach that splits identifiers into their constituent words and expands the abbreviations they contain. Specifically, BEQAIN employs a deep learning model, composed mainly of a Bidirectional Encoder Representations from Transformers (BERT) layer and a Conditional Random Fields (CRF) layer, to embed identifiers into low-level vectors and learn identifier splitting patterns. The BERT-CRF network is combined with a pre-processing component and a post-processing component to resolve over-splitting and under-splitting and thereby improve identifier splitting performance. Furthermore, BEQAIN employs a Question Answering (Q&A) system to learn abbreviation expansion mappings and leverages the current programming context to determine the correct expansion when a specific abbreviation has multiple candidate expansions. After BEQAIN is fully trained, it can be used to normalize identifiers. We conduct extensive experiments to validate the effectiveness and efficiency of BEQAIN on two publicly available datasets with nine projects.
Experimental results show that BEQAIN achieves the overall average Accuracy of 80.20% and outperforms the ex...
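
To make the two sub-tasks named in the abstract concrete (identifier splitting framed as per-character sequence labeling, followed by context-aware abbreviation expansion), here is a minimal Python sketch. It is not BEQAIN's implementation: a simple rule-based tagger stands in for the BERT-CRF network, a hard-coded dictionary stands in for the learned Q&A component, and every name in it (`EXPANSIONS`, `rule_based_labels`, `labels_to_words`, `expand`, `normalize`) is hypothetical and chosen only for illustration.

```python
# Hypothetical abbreviation table. BEQAIN learns such mappings;
# here they are hard-coded purely for illustration.
EXPANSIONS = {
    "idx": ["index"],
    "btn": ["button"],
    "str": ["string", "stream"],  # ambiguous: needs context to resolve
}

def rule_based_labels(identifier):
    """Stand-in for a learned tagger: label each character as
    'B' (begins a word), 'I' (inside a word) or 'O' (separator)."""
    labels = []
    prev = ""
    for ch in identifier:
        if not ch.isalnum():
            labels.append("O")
        elif prev == "" or not prev.isalnum() or (ch.isupper() and prev.islower()):
            labels.append("B")
        else:
            labels.append("I")
        prev = ch
    return labels

def labels_to_words(identifier, labels):
    """Decode a label sequence into lower-cased words, dropping separators."""
    words = []
    for ch, label in zip(identifier, labels):
        if label == "B":
            words.append(ch)
        elif label == "I" and words:
            words[-1] += ch
    return [w.lower() for w in words]

def expand(word, context_words):
    """Pick the first candidate expansion that also occurs in the
    surrounding context; otherwise fall back to the first candidate."""
    candidates = EXPANSIONS.get(word, [word])
    for candidate in candidates:
        if candidate in context_words:
            return candidate
    return candidates[0]

def normalize(identifier, context_words=frozenset()):
    """Split an identifier, then expand each abbreviation it contains."""
    words = labels_to_words(identifier, rule_based_labels(identifier))
    return [expand(w, context_words) for w in words]

if __name__ == "__main__":
    print(normalize("readStr_idx", {"file", "stream"}))
    print(normalize("readStr_idx"))
```

The point of the sketch is the shape of the pipeline: a tagger emits one boundary label per character, the labels are decoded into words, and an ambiguous abbreviation such as `str` is resolved by looking at words from the surrounding code. A learned model would replace both the hand-written rules and the dictionary lookup.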
Published in: IEEE Transactions on Software Engineering (Volume: 49, Issue: 4, 01 April 2023)