Automated clustering to support the reflexion method

https://doi.org/10.1016/j.infsof.2006.10.015

Abstract

A significant aspect in applying the Reflexion Method is the mapping of components found in the source code onto the conceptual components defined in the hypothesized architecture. To date, this mapping is established manually, which requires a lot of work for large software systems. In this paper, we present a new approach, in which clustering techniques are applied to support the user in the mapping activity. The result is a semi-automated mapping technique that accommodates the automatic clustering of the source model with the user’s hypothesized knowledge about the system’s architecture.

This paper describes three case studies in which the semi-automated mapping technique, called HuGMe, has been applied successfully to extend a partial map of real-world software applications. In addition, we summarize the results of another case study from an earlier publication, which led to comparable results. We evaluated the extended versions of two automatic software clustering techniques, namely, MQAttract and CountAttract, with oracle mappings. We closely study the influence of the degree of completeness of the existing mapping and other controlling variables of the technique to make reliable suggestions.

Both clustering techniques were able to achieve a mapping quality where more than 90% of the automatic mapping decisions turned out to be correct. Moreover, the experiments indicate that the attraction function (CountAttract′) based on local coupling and cohesion is more suitable for semi-automated mapping than the approach MQAttract′ based on a global assessment of coupling and cohesion.

Introduction

Software architecture is described by many views. The most popular view addressed in research is the module view [1]. The module view describes the modules of a system, their layering and composition into subsystems, and the provided and required interfaces of these elements. The module view is required for many purposes such as allocating work packages to teams, global change impact analysis, and evaluating the maintainability of the system.

Far too often, the module view that was initially designed does not reflect the real implementation due to changes made in the source without updating the documented module view. Murphy and colleagues [2] developed the reflexion model technique to reconstruct the mapping from the specified or hypothesized decomposition to the concrete module view. The basic idea of the reflexion model is to create a hypothesized view from existing documentation or interviews with architects. Source entities are extracted from a system (global variables, routines, types, classes, interfaces, packages, files, subdirectories, etc.) along with their respective dependencies forming the concrete module view. These elements are mapped to the hypothesized view. A tool then computes resemblances and differences between the two views. Iteratively, the hypothesized and concrete views and/or the mapping are refined based on the findings.
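The comparison step described above can be illustrated with a minimal Python sketch. It assumes dependencies are plain (source, target) pairs and that the mapping is a dictionary from source entity to hypothesized module; the function name `reflexion_model` and the example entities are hypothetical. The labels convergence, divergence, and absence follow the terminology commonly used in the reflexion model literature for the resemblances and differences between the two views.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def reflexion_model(
    source_deps: Set[Edge],
    mapping: Dict[str, str],
    hypothesized_deps: Set[Edge],
) -> Dict[str, Set[Edge]]:
    """Compare lifted source dependencies with hypothesized dependencies."""
    # Lift each source dependency to the level of hypothesized modules.
    lifted: Set[Edge] = set()
    for src, dst in source_deps:
        if src in mapping and dst in mapping:
            a, b = mapping[src], mapping[dst]
            if a != b:  # assumption: dependencies inside one module are ignored
                lifted.add((a, b))
    return {
        # Expected and found in the source.
        "convergences": lifted & hypothesized_deps,
        # Found in the source but not expected.
        "divergences": lifted - hypothesized_deps,
        # Expected but not found in the source.
        "absences": hypothesized_deps - lifted,
    }


# Hypothetical three-file system and hypothesized view.
source_deps = {("ui.c", "core.c"), ("core.c", "db.c"), ("ui.c", "db.c")}
mapping = {"ui.c": "UI", "core.c": "Logic", "db.c": "Storage"}
hypothesized_deps = {("UI", "Logic"), ("Logic", "Storage"), ("Storage", "Logic")}

result = reflexion_model(source_deps, mapping, hypothesized_deps)
for kind, edges in result.items():
    print(kind, sorted(edges))
```

In this toy example the direct dependency from `ui.c` to `db.c` surfaces as a divergence, and the hypothesized dependency from Storage to Logic as an absence. Either finding would prompt the analyst to refine the views or the mapping in the next iteration.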

The technique was successfully used in several case studies. The most interesting case study – reported by Murphy and Notkin [3] – is the analysis of Microsoft Excel, which consists of about 1.2 MLOC of C code. Koschke and Simon extended the original reflexion model, so that hypothesized modules can be hierarchical, and applied it to two different compilers [4].

The most challenging part of the reflexion method is to determine the mapping of concrete source entities onto the hypothesized entities of the hypothesized model. The original reflexion method does not provide any support for this – although sometimes naming conventions may be leveraged. Unfortunately, naming conventions often do not exist or are used inconsistently.

The key point of the reflexion method is to start with an initial hypothesis on the expected module view and then to validate the hypothesis against the implementation. In contrast, software clustering techniques group source entities together – typically based on some notion of coupling and cohesion – to form hypothesized entities. The advantage of clustering techniques is that they can be completely automated. Yet, these techniques are not targeted towards the expectations of the analyst and often fail to find the components a human would find [5].

Contributions. In an earlier publication [6], we proposed combining the reflexion method with a Human-Guided Mapping Generation Method (HuGMe) to accommodate the automatic clustering of the source model with the user’s hypothesized knowledge about the system architecture. This paper extends that work by evaluating, in more depth, variations of automatic clustering techniques that turn the manual mapping activity into a semi-automated one. Two clustering techniques are adjusted to create additional candidate mappings based on a partial mapping and the targeted hypothesized model. We compare and evaluate these variations against an oracle mapping for three case studies and also summarize the results of the case study from the earlier publication [6]. We closely study the influence of the degree of completeness of the existing mapping to make reliable suggestions.

The remainder of the paper is organized as follows. Section 2 describes the original reflexion method and other related research on automated software clustering. Section 3 describes how to integrate automated clustering techniques into the reflexion method. The experimental setup to evaluate the support of two clustering techniques is introduced in Section 4. Section 5 uses this experimental setup for three case studies to investigate various factors of influence. Section 6 states known assumptions and limitations of the method, and Section 7 provides our concluding thoughts.

Section snippets

Related research

This section describes related research. We start with a detailed description of the reflexion technique and introduce concepts used in the description of our extension. We then summarize research in the wider area of software clustering.

Integration of reflexion method and clustering techniques

This section describes how we have integrated automated clustering techniques into the traditional reflexion method, resulting in HuGMe, our new combined approach.
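To give a feel for how a partial mapping can drive suggestions, the following Python sketch uses a deliberately simplified, count-based attraction. It is not the definition of CountAttract′ or MQAttract′ from the paper, which this snippet does not give. The function name `suggest_component`, the tie rule (leave the entity unmapped), and the example file names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

Edge = Tuple[str, str]


def suggest_component(
    entity: str,
    source_deps: Iterable[Edge],
    partial_mapping: Dict[str, str],
) -> Optional[str]:
    """Suggest a hypothesized component for an unmapped entity, or None."""
    # Count dependencies between the entity and already-mapped neighbours,
    # grouped by the component those neighbours are mapped to.
    counts: Counter = Counter()
    for src, dst in source_deps:
        if src == entity and dst in partial_mapping:
            counts[partial_mapping[dst]] += 1
        elif dst == entity and src in partial_mapping:
            counts[partial_mapping[src]] += 1
    if not counts:
        return None  # no mapped neighbours, so no evidence
    ranked = counts.most_common(2)
    # Assumption: on a tie the decision is left to the user.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical partial mapping supplied by the user.
deps = [
    ("parser.c", "lexer.c"),
    ("parser.c", "tokens.c"),
    ("parser.c", "log.c"),
    ("main.c", "parser.c"),
]
partial = {"lexer.c": "Frontend", "tokens.c": "Frontend", "log.c": "Util"}

print(suggest_component("parser.c", deps, partial))
print(suggest_component("main.c", deps, partial))
```

Here `parser.c` is attracted more strongly to Frontend than to Util, while `main.c` has no mapped neighbours and is left for the user to decide. Once `parser.c` is added to the mapping, a later round could also place `main.c`.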

Evaluation scheme

In this section we provide details on the evaluation scheme used for the following case studies. Before defining the variables used, we first provide an informal description of the basic evaluation approach. To compare the effectiveness of the two attraction functions, we need to determine:

  • Is a free concrete entity mapped at all?

  • If it is mapped, is the mapping correct?
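These two questions can be read as two ratios, as in the rough Python sketch below. It assumes that "free" entities are those not covered by the existing partial mapping and that a suggestion counts as correct only when it agrees exactly with the oracle mapping; the variables defined in the evaluation scheme itself may differ. The names `evaluate`, `coverage`, and `precision` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def evaluate(
    free_entities: Sequence[str],
    oracle: Dict[str, str],
    suggested: Dict[str, str],
) -> Tuple[float, float]:
    """Return (coverage, precision) of automatic mapping decisions."""
    # Question 1: which free entities were mapped at all?
    mapped = [e for e in free_entities if e in suggested]
    # Question 2: which of those decisions agree with the oracle?
    # Assumption: correctness means exact agreement with the oracle.
    correct = [e for e in mapped if suggested[e] == oracle.get(e)]
    coverage = len(mapped) / len(free_entities) if free_entities else 0.0
    precision = len(correct) / len(mapped) if mapped else 0.0
    return coverage, precision


# Hypothetical oracle and automatic suggestions for four free entities.
free_entities = ["a", "b", "c", "d"]
oracle = {"a": "A", "b": "B", "c": "A", "d": "B"}
suggested = {"a": "A", "b": "A", "c": "A"}  # "d" was left unmapped

coverage, precision = evaluate(free_entities, oracle, suggested)
print(f"coverage={coverage:.2f} precision={precision:.2f}")
```

Keeping the two ratios separate matters because an attraction function can look accurate simply by refusing to map most entities.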

Correctness of the mapping can be defined as follows. Let a be a concrete entity that is mapped onto conceptual entity A

Case studies

As a follow-on to our previous work [6], we performed three additional case studies of varying size, implementation language, and application domains to evaluate the semi-automated mapping of HuGMe and the underlying clustering alternatives. In this section, we describe the three new case studies and then provide a brief summary of the results from the earlier study of a Java program. We then compare the results and discuss the findings from the four studies.

To be able to replicate our study,

Limitations of the method

Both attraction functions derive the attraction values from source relationships between concrete entities and hypothesized dependencies between hypothesized entities. This approach shares the same drawbacks as other clustering techniques based on source dependencies. Similar to those techniques, our clustering algorithm yields hypothesized entities featuring high cohesion and low coupling. As stated by Andritsos et al. [42], this approach is problematic when the developers of the system did

Conclusions

Our case study demonstrates the supportive aspect of clustering techniques for establishing the reflexion mapping. The clustering technique of the HuGMe method was able to achieve a mapping quality where a very high fraction of the automatic mapping decisions turned out to be correct. Moreover, the existence of conceptual components and dependencies simplifies automated clustering as it has a more focused target (the conceptual components expected by a human analyst) and can leverage existing

Acknowledgements

We thank Chris Callendar for the hypothesized view and associated mappings for the Tetris case study, Ian Bull for technical support and Jody Ryall for editing assistance. We also thank the anonymous reviewers for their helpful comments and suggestions.

References (46)

  • A. Cimitile et al., Software salvaging and the call dominance tree, Journal of Systems and Software (1995)
  • C. Hofmeister et al., Applied Software Architecture, Object Technology Series (2000)
  • G. C. Murphy, D. Notkin, K. Sullivan, Software reflexion models: bridging the gap between source and high-level models,...
  • G. C. Murphy, D. Notkin, Reengineering with reflexion models: A case study, IEEE Computer 30 (8) (1997) 29–36,...
  • R. Koschke et al., Hierarchical reflexion models
  • R. Koschke, Atomic architectural component recovery for program understanding and evolution, Ph.D. thesis, University...
  • A. Christl et al., Equipping the reflexion method with automated clustering
  • G. Canfora et al., A case study of applying an eclectic approach to identify objects in code
  • S. Choi et al., Extracting and restructuring the design of large systems, IEEE Software (1990)
  • D.H. Hutchens et al., System structure analysis: clustering with data bindings, IEEE TSE (1985)
  • S.S. Liu et al., Identifying objects in a conventional procedural language: an example of data design recovery
  • P. Livadas et al., A new approach to finding objects in programs, Journal Software Maintenance and Evolution (1994)
  • R.M. Ogando et al., An object finder for program structure understanding in software maintenance, Journal Software Maintenance and Evolution (1994)
  • S. Patel et al., A measure for composite module cohesion
  • H. Sahraoui et al., Applying concept formation methods to object identification in procedural code
  • R.R. Valasareddi et al., A graph-based object identification process for procedural programs
  • J. Weidl, H. Gall, Binding object models to source code: An approach to object-oriented re-architecturing, in: Proc. of...
  • A. Yeh et al., Recovering abstract data types and object instances from a conventional procedural language
  • J.-F. Girard et al., Finding components in a hierarchy of modules: a step towards architectural understanding
  • R.W. Schwanke et al., Using neural networks to modularize software, Machine Learning (1994)
  • S. Mancoridis et al., Using automatic clustering to produce high-level system organizations of source code
  • N. Anquetil et al., Extracting concepts from file names: a new file clustering criterion
  • J.-F. Girard et al., A metric-based approach to detect abstract data types and state encapsulations, Journal Automated Software Engineering (1999)