Supporting concept location through identifier parsing and ontology extraction

https://doi.org/10.1016/j.jss.2013.07.009

Abstract

Identifier names play a key role in program understanding and in particular in concept location. Programmers can easily “parse” identifiers and understand the intended meaning. This, however, is not trivial for tools that try to exploit the information in the identifiers to support program understanding. To address this problem, we resort to natural language analyzers, which parse tokenized identifier names and provide the syntactic relationships (dependencies) among the terms composing the identifiers. Such relationships are then mapped to semantic relationships.

In this study, we have evaluated the use of off-the-shelf and trained natural language analyzers to parse identifier names, extract an ontology and use it to support concept location. In the evaluation, we assessed whether the concepts taken from the ontology can be used to improve the efficiency of queries used in concept location. We have also investigated if the use of different natural language analyzers has an impact on the ontology extracted and the support it provides to concept location. Results show that using the concepts from the ontology significantly improves the efficiency of concept location queries (e.g., in some cases, an improvement of 127% is observed). The results also indicate that the efficiency of concept location queries is not affected by the differences in the ontologies produced by different analyzers.

Introduction

During program understanding, exploring the source code in search of a specific concept is a typical activity. One of the key source code elements that affect this activity is identifiers. Identifiers serve as a link between the intention of a concept and its extension in the source code (Rajlich, 2009). Various approaches that take advantage of this fact have been proposed to improve code search and support program understanding (see Marcus et al., 2004, Hill et al., 2009, Gay et al., 2009, Abebe and Tonella, 2010, Ratiu et al., 2008, Shepherd et al., 2007).

The intention of a concept in an identifier is reflected in the terms chosen, their relative order and the relationships among them. For example, an identifier name constructed from the terms record and event might convey two different meanings depending on the order of the terms: recordEvent might mean logging an event, while eventRecord denotes a type of record. While this difference is easy to grasp for a human, it is not obvious for tools designed to support program understanding. To address this problem, we resort to natural language dependency analyzers, which can be used to extract the semantic relationships among the identifier terms.
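As an illustration, recovering the term sequence from an identifier is the first step before any parsing can happen. The following sketch shows a simple camelCase/snake_case tokenizer; the splitting rules here are our own assumptions, not the exact tokenization used in the study:

```python
import re

def tokenize(identifier):
    """Split a camelCase/snake_case identifier into lowercase terms."""
    parts = re.split(r'[_\s]+', identifier)
    terms = []
    for part in parts:
        # Match capitalized words, all-caps runs, or digit runs.
        terms += re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', part)
    return [t.lower() for t in terms]

print(tokenize("recordEvent"))  # ['record', 'event']
print(tokenize("eventRecord"))  # ['event', 'record']
```

Note that the tokenizer preserves term order, which is exactly the information the dependency analyzer needs to distinguish recordEvent from eventRecord.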

Natural language dependency analyzers express the syntactic relationships between identifier terms through natural language dependencies (e.g., noun-specifier or direct-object). Though these relations are syntactic, the terms they connect are very often semantically related. Of course this is not always the case: for example, the relation between a determiner and the corresponding noun is not semantically relevant. However, the relation between a verb and its object usually has a semantic nature. Therefore, we focus on selected categories of the dependencies produced by the analyzer in order to obtain semantic relations relevant to our aims.
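The filtering step described above can be sketched as follows. The relation labels (`det`, `dobj`, `nn`) are generic dependency-grammar conventions used for illustration; the actual label set depends on the analyzer employed:

```python
# Dependencies as (relation, head, dependent) tuples. Which categories count
# as "semantic" is an illustrative assumption, not the paper's exact list.
SEMANTIC_RELATIONS = {"dobj", "nn"}   # direct object, noun modifier: usually meaningful
                                      # ("det" - determiner - is dropped as not relevant)

def semantic_pairs(dependencies):
    """Keep only dependencies whose category usually carries meaning."""
    return [(rel, head, dep) for rel, head, dep in dependencies
            if rel in SEMANTIC_RELATIONS]

deps = [("det", "event", "the"), ("dobj", "record", "event")]
print(semantic_pairs(deps))  # [('dobj', 'record', 'event')]
```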

In our previous work (Abebe and Tonella, 2010), we exploited such information to extract an ontology from identifiers and support program understanding. The ontology is extracted by parsing sentences constructed from identifiers. In this work, in addition to the natural language dependency analyzer used in Abebe and Tonella (2010), we consider analyzers trained on a training set that closely resembles the structure of identifiers. The training set is constructed from sentences and phrases automatically extracted from the textual documentation of the system, which comes with the source code and is available online. The documentation we considered consists of source code comments, user manuals, system documentation, and, when available, FAQs describing how-tos. The extracted sentences and phrases are automatically converted into identifiers that resemble true code identifiers by applying a predefined set of rules for identifier construction from natural language sentences. For these artificially constructed identifiers we know the correct parse trees, obtained from parsing the original sentences or phrases; hence, we can use them as a training set.
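One plausible instance of such an identifier-construction rule is sketched below: drop determiners and camel-case the remaining words. This is only an assumed example of the kind of rule involved, not the paper's actual rule set:

```python
import re

# Assumption: determiners are dropped when turning a phrase into an
# identifier, mimicking how developers name code elements.
DETERMINERS = {"the", "a", "an"}

def phrase_to_identifier(phrase):
    """Convert a documentation phrase into a camelCase identifier-like token."""
    words = [w for w in re.findall(r"[A-Za-z]+", phrase.lower())
             if w not in DETERMINERS]
    if not words:
        return ""
    return words[0] + "".join(w.capitalize() for w in words[1:])

print(phrase_to_identifier("record the event"))  # 'recordEvent'
```

Because the original phrase has a known parse tree, the artificial identifier inherits a gold-standard parse that can be used for training.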

The natural language dependency trees generated by the analyzers are used to build ontologies, which can support program understanding. Based on the level of formality, an ontology can vary from a simple taxonomy with almost no formalization to one based on a rigorously formalized theory (see Uschold and Gruninger, 2004). Ontology in our context is a “lightweight ontology”, which lies between these two extremes: it does not include axioms supporting formal reasoning, but only concepts and the relations connecting them. A lightweight ontology built using only concepts and connecting relations, without any formalization, is sometimes referred to as a “concept map”. In this paper we refer to such a lightweight ontology simply as an ontology.

In this study, we have assessed the benefits of using ontologies extracted from identifiers in several subject applications and we have investigated the impact of using different analyzers to generate the ontologies. The assessment was conducted in the context of a program understanding task, namely concept location, which uses queries to narrow down the search space and identify the parts of a program that implement a concept of interest. The study we conducted answers three research questions: (RQ1) Do the extracted ontology concepts contribute to increasing the effectiveness of basic queries formulated for concept location? (RQ2) Do the ontologies produced by different analyzers differ from each other? (RQ3) Does the choice of the analyzer impact the effectiveness of concept location? The obtained results indicate that exploiting the concepts taken from the ontologies improves the quality of queries used in concept location, and that this improvement is independent of the analyzer employed, despite the fact that different analyzers generate quite different ontologies.

The main contributions of this paper as compared to our previous work (Abebe and Tonella, 2010) are:

  • 1. An approach to train natural language analyzers for use with identifiers.

  • 2. A thorough empirical assessment of the impact of ontologies on concept location, using two additional systems (FileZilla, JEdit) and computing additional metrics (e.g., average delta percentage, net improvements, mean reciprocal rank).

  • 3. An analysis of the impact of using ontologies on latent semantic indexing based concept location.

  • 4. A comparison among the different natural language analyzers investigated in this work, in terms of both the differences between the extracted ontologies (measured by the Jaccard index and the ratio of unique concepts) and their support to concept location.
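The two ontology-comparison metrics named in contribution 4 are straightforward set measures. The following sketch computes them on toy concept sets (the sets themselves are invented for illustration):

```python
def jaccard(a, b):
    """Jaccard index between two concept sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def unique_ratio(a, b):
    """Fraction of concepts in `a` not produced by the other analyzer."""
    return len(a - b) / len(a) if a else 0.0

onto1 = {"event", "record", "file", "transfer"}  # toy analyzer-1 concepts
onto2 = {"event", "record", "queue"}             # toy analyzer-2 concepts
print(round(jaccard(onto1, onto2), 2))       # 0.4
print(round(unique_ratio(onto1, onto2), 2))  # 0.5
```

A low Jaccard index combined with a high unique-concept ratio indicates that two analyzers produce substantially different ontologies, which is what the paper later reports.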

Section 2 presents related works that use identifiers as their main source of information to narrow the search space and facilitate program understanding. In Section 3, we describe the different types of natural language analyzers used in this study. The mapping of natural language dependencies to ontological relations is presented in Section 4, while the steps involved in the concept location task are described in Section 5. Section 6 presents the case study, including procedure, results and discussion. In the last section (Section 7), we provide our concluding remarks.

Section snippets

Related works

The concept location/assignment problem, as described by Biggerstaff et al. (1993), is the problem of discovering human-oriented concepts and assigning them to their implementation instances within a program. In the literature, various approaches that exploit different kinds of information, such as dynamic and textual information, have been proposed to address this problem. A comprehensive survey of the approaches can be found in Dit et al. (2012). In this section, we discuss approaches which exploit textual

Identifier parsing

Identifiers are one of the ways in which developers communicate the intention of the source code, by representing it with a carefully chosen name. To extract and exploit this information, we use NLP techniques (see Abebe and Tonella, 2010). The approach uses a natural language dependency analyzer to retrieve a set of dependencies between the different tokens composing each identifier name. These dependencies are then used to identify ontology concepts and relations among them. For the sake of

Ontology extraction

The ontology of a program is extracted by exploiting the linguistic information captured in its identifiers. The concepts of the ontology are retrieved from the nouns or noun phrases referred to in the parse trees of identifiers, or, in some cases, directly from the class or program names. The ontological relations are obtained from the natural language dependencies found in the parse trees of the identifiers and from the verbs used in method names. In our study, we have defined four types of
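The concept/relation derivation described above can be sketched as follows. The POS tags and the verb-to-object relation derivation are illustrative assumptions, not the paper's exact four-relation mapping:

```python
# Sketch: derive ontology concepts and relations from parsed identifier terms,
# given as lists of (term, pos_tag) pairs. Tags "NN" (noun) and "VB" (verb)
# follow common POS conventions; the mapping itself is a simplification.
def extract_ontology(parsed_identifiers):
    concepts, relations = set(), set()
    for terms in parsed_identifiers:
        nouns = [t for t, pos in terms if pos == "NN"]
        verbs = [t for t, pos in terms if pos == "VB"]
        concepts.update(nouns)            # concepts come from nouns
        for v in verbs:                   # a verb relates to its noun object
            for n in nouns:
                relations.add((v, n))
    return concepts, relations

parsed = [[("record", "VB"), ("event", "NN")],   # recordEvent: verb + object
          [("event", "NN"), ("record", "NN")]]   # eventRecord: noun phrase
concepts, relations = extract_ontology(parsed)
print(sorted(concepts))   # ['event', 'record']
print(sorted(relations))  # [('record', 'event')]
```

Note how the same two terms yield a concept pair in one identifier and a verb-object relation in the other, mirroring the recordEvent/eventRecord distinction discussed in the introduction.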

Concept location

Concept location is an activity where a programmer searches the source code to identify a specific part that implements a given concept (Rajlich and Wilde, 2002). It involves formulating a query composed of one or more keywords which the programmer thinks are related to, or refer to, the concept to be searched. While formulating a query, the programmer resorts to her prior knowledge, as well as any information associated with the concept to be searched.

After querying the code base with the initially
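The query-based search described above can be sketched as a minimal keyword-count ranking over code units; the corpus is a toy assumption, and the paper's actual retrieval machinery (e.g., latent semantic indexing) is more sophisticated:

```python
# Toy concept location: rank code units (name -> text of the unit) by how
# many occurrences of the query terms each contains.
def score(query_terms, text):
    tokens = text.lower().split()
    return sum(tokens.count(t) for t in query_terms)

def locate(query_terms, corpus):
    """Return unit names ranked by descending query-term count."""
    return sorted(corpus,
                  key=lambda name: score(query_terms, corpus[name]),
                  reverse=True)

corpus = {
    "EventLogger.log": "record event append event to log file",
    "FileTransfer.run": "open connection transfer file",
}
print(locate({"record", "event"}, corpus)[0])  # 'EventLogger.log'
```

In this setting, adding ontology concepts related to the query terms expands the query and can push the relevant unit higher in the ranking, which is the effect the study measures.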

Case studies

To investigate the support programmers can get through ontology extraction, we have conducted a case study. In this case study, we address the following research questions.

  • RQ1. Query effectiveness: Do the extracted ontology concepts contribute to increasing the effectiveness of basic queries formulated for concept location?

  • RQ2. Ontology comparison: Do the ontologies produced by different analyzers differ from each other?

  • RQ3. Analyzer impact: Does the choice of the analyzer impact the

Conclusion

We have presented the use of four types of natural language analyzers to parse identifiers of a system and extract ontologies. Two of the analyzers are adapted to directly work on identifiers through training while the other two are standard English analyzers. The training of the analyzers is conducted automatically using a training set constructed from the documentation of the corresponding system. To evaluate the benefits of using ontologies constructed from parse trees of identifiers, we

Surafel Lemma Abebe is currently a post-doctoral fellow at the Software Analysis and Intelligence Lab (SAIL) in the School of Computing at Queen's University. He received his B.Sc. and M.Sc. degrees in Computer Science from Addis Ababa University in 2003 and 2005, respectively, and a PhD degree from the University of Trento in 2013. His current research interests are program comprehension, software evolution, and source code analysis.

References (45)

  • K. Laitinen et al., Enhancing maintainability of source programs through disabbreviation, Journal of Systems and Software (1997)
  • S.L. Abebe et al., Natural language parsing of program element names for concept extraction
  • S.L. Abebe et al., Towards the extraction of domain concepts from the identifiers
  • S.L. Abebe et al., Analyzing the evolution of the source code vocabulary
  • Y. Benjamini et al., Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society Series B: Methodological (1995)
  • T. Biggerstaff et al., The concept assignment problem in program understanding
  • D. Binkley et al., Improving identifier informativeness using part of speech information
  • S. Butler et al., Mining Java class naming conventions
  • B. Cleary et al., An empirical analysis of information retrieval based concept location techniques in software comprehension, Empirical Software Engineering (2009)
  • M.L. Collard et al., Supporting document and data views of source code
  • A. Corazza et al., LINSEN: an efficient approach to split identifiers and expand abbreviations
  • B. Dit et al., Feature location in source code: a taxonomy and survey, Journal of Software: Evolution and Process (2012)
  • Z.P. Fry et al., Analysing source code: looking for useful verb-direct object pairs in all the right places, IET Software (2008)
  • G. Gay et al., On the use of relevance feedback in IR-based concept location
  • J. Giménez et al., Fast and accurate part-of-speech tagging: the SVM approach revisited
  • J. Giménez et al., SVMTool: a general POS tagger generator based on Support Vector Machines
  • S. Grant et al., Automated concept location using independent component analysis
  • S. Haiduc et al., Automatic query performance assessment during the retrieval of software artifacts
  • S. Haiduc et al., Evaluating the specificity of text retrieval queries to support software engineering tasks
  • E. Hill et al., AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools
  • E. Hill et al., Automatically capturing source code context of NL-queries for software maintenance and reuse

Anita Alicante is currently a research fellow at the Department of Electrical Engineering and Information Technologies at the University of Napoli Federico II in Italy. She received a degree (cum laude) in Computer Science from the University of Napoli Federico II in 2005 and a Ph.D. in Computer Science from the same university in 2013, with the thesis “Barrier and Syntactic features for Information Retrieval”. Her current research work is mainly focused on the definition and application of Information Retrieval, Information Extraction and Machine Learning methods in different fields (Ontology Learning, Software Engineering and Natural Language Processing).

Anna Corazza is assistant professor at the Department of Electrical Engineering and Information Technologies at the University of Naples Federico II in Italy. She obtained the Laurea Degree in Electronic Engineering and the PhD degree at the University of Padua. From 1990 to 2000, she worked as a researcher at FBK in Trento and afterwards at the University of Milan. Her research interests focus on statistical approaches to natural language processing, bioinformatics, software engineering and information retrieval.

Paolo Tonella is head of the Software Engineering Research Unit at Fondazione Bruno Kessler (FBK), in Trento, Italy. He received his PhD degree in Software Engineering from the University of Padova in 1999, with the thesis “Code Analysis in Support to Software Maintenance”. In 2011 he was awarded the ICSE MIP (Most Influential Paper) award for his ICSE 2001 paper “Analysis and Testing of Web Applications”. He is the author of “Reverse Engineering of Object Oriented Code”, Springer, 2005. He participated in several industrial and EU projects on software analysis and testing. His current research interests include code analysis, web and object oriented testing, and search based test case generation.
