Recovering UML class models from C++: A detailed explanation

https://doi.org/10.1016/j.infsof.2006.10.011

Abstract

An approach to recovering design-level UML class models from C++ source code to support program comprehension is presented. A set of mappings is given that focuses on accurately identifying such elements as relationship types, multiplicities, and aggregation semantics. These mappings are based on domain knowledge of the C++ language and common programming conventions and idioms. Additionally, formal concept analysis is used to detect design-level attributes of UML classes. An application implementing these mappings is used to reverse engineer a moderately sized, open-source application and the resultant class model is compared against those produced by other UML reverse engineering tools. This comparison shows that the presented mapping rules effectively produce meaningful and semantically accurate UML models.

Introduction

The software industry has widely accepted and often uses UML (Unified Modeling Language) [12] tools in forward engineering, but these tools are used less frequently during software maintenance and evolution. This is due to a number of reasons; foremost among these is that the manual recovery and maintenance of UML models is time-consuming and costly. As such, UML models become stale while the source code continues to evolve. Although many UML modeling tools allow us to reverse engineer UML models from source code, they often perform poorly at this task. A case study of reverse engineering tools [22] finds that many of these tools, despite advances in the research literature, continue to focus on producing the core elements of UML (i.e., simple class diagrams), but often fail to adequately represent design abstractions. This is problematic when recovered software models fail to accurately represent the abstract program semantics required for high-level program comprehension. This problem can be exacerbated by the fact that end users are typically unaware of the internal processes for producing the UML models. This is to say that these tools do not disclose their mechanisms for reverse engineering, which can lead to results that do not meet the end-user’s expectations.

Although this study [22] concludes that these tools provide reliable functionality, the resulting models are anything but consistent. For example, Microsoft Visio is incapable of reverse engineering associations, Visual Paradigm creates dependencies when associations are appropriate, and Rational Rose C++ Modeler creates only aggregate associations (open diamonds in UML). The primary reason for these inconsistencies is the sizeable semantic gap between UML and C++. Although this gap is quite wide, it is by no means unbridgeable. Unfortunately, commonly used reverse engineering tools are closed-source systems and provide little information about how UML models are created from C++. This leaves developers to speculate about the rationale behind each tool’s logic. As such, there is no standard “bridge” between C++ and UML, and each reverse engineering tool tends to build its own.

We address this problem by defining a set of mappings for the reverse engineering of UML class models from C++ source code [37], [38]. These mappings employ a combination of C++ syntactic and semantic information along with domain knowledge of programming conventions, idioms, and reuse libraries to produce semantically accurate UML class models. Many of these mappings extend and integrate techniques presented in the literature on this topic. As part of these mappings, a sophisticated information analysis technique (formal concept analysis) is applied to the UML model to recover design-level attributes of classes rather than re-document member variables.
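As a rough illustration of what formal concept analysis computes, the sketch below enumerates the formal concepts of a tiny binary context by brute force. The context itself (methods of a hypothetical class as objects, the member variables they touch as attributes) is an assumption made for this example; this excerpt does not show how the paper builds its context or which algorithm pilfer uses.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a small binary context.

    `context` maps each object to the set of attributes it has.
    Brute force over object subsets: suitable for toy inputs only.
    """
    objects = list(context)
    all_attrs = set().union(*context.values()) if context else set()

    def common_attrs(objs):
        # Attributes shared by every object in objs (all attributes if empty).
        shared = set(all_attrs)
        for obj in objs:
            shared &= context[obj]
        return shared

    def objects_with(attrs):
        # Objects that have every attribute in attrs.
        return {obj for obj in objects if attrs <= context[obj]}

    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attrs(subset)
            extent = objects_with(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical context: which member variables each method accesses.
CONTEXT = {
    "move": {"x_", "y_"},
    "get_x": {"x_"},
    "get_y": {"y_"},
    "set_color": {"color_"},
}

concepts = formal_concepts(CONTEXT)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Each printed pair groups a maximal set of methods with the maximal set of member variables they all share. Groupings of this kind are the raw material from which a higher-level attribute might be inferred.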

These mappings are implemented in a reverse engineering tool, pilfer, which is used to evaluate the relevance of the defined mappings by reverse engineering a moderately-sized C++ application. The model produced by pilfer is compared against those produced by other tools in order to validate the accuracy and completeness of the defined mappings. Because performance is an important aspect of reverse engineering tools, we also compare pilfer’s run time performance against these tools.

This paper is organized as follows. Section 2 provides a more detailed context of the problem being addressed. Section 3 describes rationale and tradeoffs for each mapping. Section 4 describes the implementation of pilfer. In Section 5, we present a comparison of models generated by pilfer and other reverse engineering tools. Section 6 describes work related to this topic and Section 7 presents our conclusions and future work.

Section snippets

Reverse engineering analysis

Design recovery is the process of recovering design decisions, abstractions, and rationale from a program’s source code [5]. Design recovery directly supports program comprehension through reverse engineering. Fig. 1 depicts the architecture of a technology stack used in reverse engineering to recover program designs. This technology stack is motivated in part by the Rigi reverse engineering environment [36], [45] and the DMS program analysis system [3]. It integrates the technologies used in

Mappings for reverse engineering

In this section, we define mappings for UML reverse engineering tasks with a degree of ambiguity of medium or higher in Table 1, or those that are often seen as difficult or having potential ambiguities in their mappings. The mappings defined herein are heuristics based on syntactic and semantic features rather than deterministic analyses.
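To make the idea of a heuristic mapping concrete, the following sketch classifies the declared type of a C++ data member into a UML relationship kind and multiplicity. The rules here (by-value members as composition, pointers as optional associations, standard containers as "many") are common conventions assumed for illustration. They are not the paper's actual mappings, which this excerpt does not list, and the names `classify_member`, `Relationship`, and `CONTAINERS` are hypothetical.

```python
import re
from typing import NamedTuple


class Relationship(NamedTuple):
    kind: str          # "composition", "aggregation", or "association"
    target: str        # name of the related class
    multiplicity: str  # UML multiplicity string


# Assumed set of container templates that imply a "many" multiplicity.
CONTAINERS = {"std::vector", "std::list", "std::set", "std::deque"}


def classify_member(type_text: str) -> Relationship:
    """Guess a UML relationship from the textual type of a data member."""
    # Drop const qualifiers; they do not affect these illustrative rules.
    text = re.sub(r"\bconst\b", "", type_text).strip()

    # Unwrap one level of a known container template, e.g. std::vector<T>.
    many = False
    match = re.fullmatch(r"([\w:]+)\s*<\s*(.+?)\s*>", text)
    if match and match.group(1) in CONTAINERS:
        many = True
        text = match.group(2).strip()

    if text.endswith("*"):
        # Pointer: the member does not own its target's storage by value.
        target = text.rstrip("* ").strip()
        kind = "aggregation" if many else "association"
        return Relationship(kind, target, "*" if many else "0..1")
    if text.endswith("&"):
        # Reference: must always be bound to exactly one object.
        return Relationship("association", text.rstrip("& ").strip(), "1")
    # By-value member: lifetime is tied to the containing object.
    return Relationship("composition", text, "*" if many else "1")


print(classify_member("std::vector<Passenger*>"))
```

A real tool would also need to separate primitive types such as int (which become attributes rather than relationships) from class types, and would consult more than the declaration text. This sketch ignores both concerns.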

Implementation

The pilfer reverse engineering tool is currently implemented in the Python programming language. This was chosen for a number of reasons. First, it allows the application to be built and modified quickly, allowing developers to modify or experiment with the given mappings. pilfer leverages two key technologies to implement its reverse engineering capabilities: srcML and the Open Modeling Framework (OMF).
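Because srcML represents source code as XML, member declarations can be pulled out with an ordinary XML parser. The sketch below does this with Python's standard library on a hand-written fragment shaped like srcML output. The fragment, the helper names, and the exact element layout are assumptions for illustration: element names and the namespace vary between srcML versions, and this is not pilfer's code.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of srcML output for:
#   class Car { Engine engine; Wheel* spare; };
SRCML_FRAGMENT = """<unit xmlns="http://www.srcML.org/srcML/src" language="C++">
<class>class <name>Car</name> <block>{<private type="default">
  <decl_stmt><decl><type><name>Engine</name></type> <name>engine</name></decl>;</decl_stmt>
  <decl_stmt><decl><type><name>Wheel</name><modifier>*</modifier></type> <name>spare</name></decl>;</decl_stmt>
</private>}</block>;</class>
</unit>"""


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_members(xml_text: str) -> dict:
    """Map each class name to a list of (member name, declared type) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for cls in root.iter():
        if local(cls.tag) != "class":
            continue
        # The class name is the <name> element directly under <class>.
        class_name = next(c.text for c in cls if local(c.tag) == "name")
        members = []
        for decl in cls.iter():
            if local(decl.tag) != "decl":
                continue
            type_el = next(c for c in decl if local(c.tag) == "type")
            name_el = next(c for c in decl if local(c.tag) == "name")
            # Concatenate all text under <type>, e.g. "Wheel" + "*".
            type_text = "".join(type_el.itertext()).strip()
            members.append((name_el.text, type_text))
        result[class_name] = members
    return result


print(extract_members(SRCML_FRAGMENT))
```

This sketch does not handle nested classes, member functions with local declarations, or qualified names. Matching on local tag names keeps it independent of the namespace URI, which is one of the details that differs across srcML releases.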

A comparison of results

In order to evaluate the effectiveness of our mappings and analyses, we used pilfer to reverse engineer HippoDraw (version 1.13.1), an open-source tool for information visualization. HippoDraw is a medium-sized C++ application containing about 230 classes and 88 KLOC. We compared the results produced by pilfer against those produced by Doxygen, Visual Paradigm for UML (2005), and Microsoft Visio 2003 (used as a

Related work

The prevailing method of integrating modeling and reverse engineering tools is to build reverse engineering parsers and analyses into existing UML modeling applications. Examples include Rational Rose, Together, Umbrello, Visual Paradigm, and ArgoUML. However, IDEs are beginning to realize the importance of providing a visual medium for source code and have begun to include UML modeling functionality. Both Microsoft’s Visual Studio 2005 and Apple’s XCode2 support the ability to model

Conclusions and future work

In this paper, we have discussed the inconsistency of reverse engineering tools due to the semantic gap between UML and C++ and the non-disclosure policy of those tools. In an effort to bridge the gap between the two languages and to provide a platform for common modeling problems, we have detailed a set of mappings from C++ to UML class models. These heuristic mappings are based primarily on easily accessible syntactic and semantic information in the program. These mappings are intended to

Acknowledgments

We thank the reviewers for their helpful and detailed comments in revising this paper. This work was supported in part by a grant from the United States National Science Foundation (C-CR 02-04175).

References (44)

  • N. Anquetil, A comparison of graphs of concept for reverse engineering, in: Proceedings of 8th International Workshop...
  • L.A. Barowski, J.H. Cross, Extraction and use of class dependency information in java, in: Proceedings of Ninth Working...
  • I.D. Baxter, C. Pidgeon, M. Mehlich, DMS: program transformations for practical scalable software evolution, in:...
  • G. Canfora, A. Cimitile, A. De Lucia, G.A. Di Lucca, A case study of applying an eclectic approach to identify objects...
  • E.J. Chikofsky et al., Reverse engineering and design recovery: a taxonomy, IEEE Software (1990)
  • R. Cole, T. Tilley, Conceptual analysis of software structure, in: Proceedings of 15th International Conference on...
  • M.L. Collard, H.H. Kagdi, J.I. Maletic, An XML-based lightweight C++ fact extractor, in: Proceedings of 11th IEEE...
  • M.L. Collard, J.I. Maletic, A. Marcus, Supporting document and data views of source code, in: Proceedings of ACM...
  • U. Dekel, Y. Gil, Revealing class structure with concept lattices, in: Proceedings of 10th Working Conference on...
  • T. Eisenbarth, R. Koschke, D. Simon, Aiding program comprehension by static and dynamic feature analysis, in:...
  • T. Eisenbarth et al., Locating features in source code, IEEE Transactions on Software Engineering (2003)
  • M. Fowler, UML Distilled Third Edition. A Brief Guide to the Standard Object Modeling Language (2000)
  • R. Godin, H. Mili, G. Mineau, R. Missaoui, A. Arfi, T.-T. Chau. Building and maintaining analysis-level class...
  • R. Godin et al., Design of class hierarchies based on concept (Galois) lattices, International Journal of Knowledge Engineering and Software Engineering (1998)
  • M. Gogolla, R. Kollman, Re-documentation of Java with UML class diagrams, in: Proceedings of 7th Reengineering Forum,...
  • Y.G. Guéhéneuc, H. Albin-Amiot, Recovering binary class relationships: putting icing on the UML cake, in: Proceedings...
  • R.C. Holt, A. Winter, A. Schürr, GXL: toward a standard exchange format, in: Proceedings of 7th Working Conference on...
  • D. Jackson, A. Waingold, Lightweight extraction of object models from Bytecode, in: Proceedings of 21st International...
  • J. Jiang, T. Systa, Exploring differences in exchange formats – tool support and case studies, in: Proceedings of...
  • M. Keschenau, Student research competition: reverse engineering of UML specifications from Java programs, in:...
  • R. Kollman, M. Gogolla, Application of UML associations and their adornments in design recovery, in: Proceedings of...
  • R. Kollman, P. Selonen, E. Stroulia, A. Zündorf, A study in the current state of the art in tool-supported UML-based...