Elsevier

Journal of Systems and Software

Volume 79, Issue 9, September 2006, Pages 1261-1279
Program restructuring using clustering techniques

https://doi.org/10.1016/j.jss.2006.02.037

Abstract

Program restructuring is a key method for improving the quality of ill-structured programs, thereby increasing understandability and reducing maintenance cost. It is a challenging task and a great deal of research is still ongoing. This paper presents an approach to program restructuring within a function, based on clustering techniques with cohesion as the major concern. Clustering has been widely used to group related entities together. The approach focuses on automated support for identifying ill-structured or low-cohesive functions and providing heuristic advice in both the development and evolution phases. A new similarity measure is defined and studied intensively from the function perspective. A comparative study of three different hierarchical agglomerative clustering algorithms is also conducted. The best algorithm is applied to the restructuring of functions of a real industrial system. The empirical observations show that the heuristic advice provided by the approach can help software designers make better decisions about why and how to restructure a program. Specific source code level software metrics are presented to demonstrate the value of the approach.

Introduction

Software evolves over time, primarily due to changes in requirements and technologies. As a result, a huge amount of effort is spent on maintenance and evolution. Software evolution usually accounts for more than 60% of total software costs (Sommerville, 1996). In today’s highly competitive era, software development is often driven by tight schedules, so software designers often emphasize the functional aspect of a system. Even if a software product is well designed, the code is often modified over time in response to the changing needs of customers. As a consequence, its original structure gradually drifts and its quality degrades. The program then becomes difficult to understand and costly to maintain.

Müller et al. indicate that 50–90% of software evolution work focuses on program comprehension or understanding (Müller et al., 1995). Program understanding can take place at various levels, including architecture, design, and code. At the implementation level, a large or poorly coded function usually involves multiple activities or has low functional cohesion, which makes the program difficult to understand and modify. Program restructuring (Chikofsky and Cross, 1990) or refactoring (Fowler, 1999) can transform these functions into functions that are better organized and easier to understand, without changing their behavior. The new functions will usually be of higher quality and less costly to evolve further. More importantly, a desirable restructuring should achieve high cohesion and low coupling (Briand et al., 1996; Munson, 2003; Pressman, 1997).

Cohesion, an important measure in restructuring, describes how tightly related the elements in a component are. The goal of clustering is to group similar or related elements together. It is therefore possible to use clustering analysis to measure the strength of the relationship between elements in a component. Previous articles on software clustering demonstrate the research potential of the field (Tzerpos and Holt, 1998) and conclude that clustering methods may be a very good starting point for the remodularization of software (Wiggerts, 1997).

However, existing research in the software clustering field has mainly been concerned with software remodularization at the architecture level and has not been applied to program restructuring at the source code level. Source code contains critical information regarding the behavior of a system. The understanding and manipulation of source code is a pressing issue for maintenance and evolution. This paper focuses on source code. Specifically, it deals with the restructuring of each individual function. One challenge of restructuring at this level is how to meaningfully and effectively group related code segments inside a large or poorly structured function to form small, cohesive functions, because unrelated fragments and functionally cohesive code segments are often interleaved in real software products. In addition, the approach should be easy to understand and effective in practice. Clustering techniques are suitable for this problem, because the objective of clustering is consistent with that of cohesion.

This paper presents an approach to program restructuring using clustering techniques at the function level. It focuses on automated support for identifying low-cohesive functions and making restructuring decisions, rather than on automating the restructuring process itself. The purpose is to help software designers identify ill-structured functions and to provide them with heuristic advice. In detail, this paper discusses how to select entities and how to select attributes that are important for distinguishing two different entities from the cohesion perspective. A new resemblance coefficient is defined as a similarity measure. Extensive experiments on the weights of different attributes are conducted. Three hierarchical agglomerative algorithms are chosen: the single linkage algorithm (SLINK), the complete linkage algorithm (CLINK) and the weighted pair-group method using arithmetic averages (WPGMA). An intensive comparative study of them is conducted. These algorithms are highlighted as follows:

  • SLINK: also called the nearest neighbor method. It defines the similarity measure between two clusters as the maximum resemblance coefficient among all pairs of entities drawn from the two clusters.

  • CLINK: also called the furthest neighbor method. It defines the similarity measure between two clusters as the minimum resemblance coefficient among all pairs of entities drawn from the two clusters.

  • WPGMA: it defines the similarity measure between two clusters as the simple arithmetic average of the resemblance coefficients between the two clusters, without considering the cluster size.
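The three linkage rules above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the entity labels s1 to s4 and the similarity values are hypothetical, chosen only for illustration. The sketch assumes a precomputed resemblance coefficient for every pair of entities, and it updates the similarity between a newly merged cluster and every other cluster with the rule of the chosen algorithm.

```python
from itertools import combinations


def merged_similarity(sim_ac, sim_bc, method):
    """Similarity between the merged cluster (A u B) and another cluster C."""
    if method == "SLINK":   # nearest neighbor: keep the maximum similarity
        return max(sim_ac, sim_bc)
    if method == "CLINK":   # furthest neighbor: keep the minimum similarity
        return min(sim_ac, sim_bc)
    if method == "WPGMA":   # simple average, cluster sizes are ignored
        return (sim_ac + sim_bc) / 2.0
    raise ValueError("unknown method: " + method)


def agglomerate(labels, pair_sim, method):
    """Merge the two most similar clusters until one remains.

    labels   -- entity names (hypothetical, e.g. statements of a function)
    pair_sim -- dict mapping (a, b) to a resemblance coefficient
    Returns a list of (merged_cluster, similarity_level) in merge order.
    """
    clusters = [frozenset([x]) for x in labels]
    sim = {frozenset([frozenset([a]), frozenset([b])]): v
           for (a, b), v in pair_sim.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest current similarity.
        a, b = max(combinations(clusters, 2), key=lambda p: sim[frozenset(p)])
        level = sim[frozenset((a, b))]
        merged = a | b
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            sim[frozenset((merged, c))] = merged_similarity(
                sim[frozenset((a, c))], sim[frozenset((b, c))], method)
        clusters = rest + [merged]
        merges.append((merged, level))
    return merges


# Illustrative data only: these coefficients are invented, not from the paper.
LABELS = ["s1", "s2", "s3", "s4"]
PAIR_SIM = {("s1", "s2"): 0.9, ("s1", "s3"): 0.4, ("s1", "s4"): 0.1,
            ("s2", "s3"): 0.6, ("s2", "s4"): 0.2, ("s3", "s4"): 0.3}

if __name__ == "__main__":
    for method in ("SLINK", "CLINK", "WPGMA"):
        print(method, [(sorted(c), round(lvl, 3))
                       for c, lvl in agglomerate(LABELS, PAIR_SIM, method)])
```

On this toy input all three rules merge the entities in the same order, but the similarity level at which each merge happens differs, which is what changes where a dendrogram would be cut into clusters.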

The algorithm that produces the best result will then be applied to program restructuring of an industrial system.

The structure of the rest of this paper is as follows. Section 2 reviews the related work in both program restructuring and software clustering areas. Section 3 proposes an approach to program restructuring using clustering techniques and discusses the issues involved in the approach. Section 4 provides an extensive study on the similarity measure by weighting attributes differently. Section 5 gives a comparative study of three clustering algorithms: SLINK, CLINK and WPGMA. Section 6 presents a case study of program restructuring using the clustering results on an industrial software system. Empirical observations are also summarized. Section 7 presents the conclusions and future work.

Related work

There has been extensive research on software restructuring. This section describes related research on restructuring at the function or the design level. It also presents related research on software clustering.

An approach to program restructuring using clustering techniques

This section presents an approach to code restructuring using clustering techniques and discusses key issues of clustering techniques.

Experiments on similarity measure

The resemblance coefficient has been defined, but how to decide the weights is still unsolved. Previous research did not study this issue systematically. Dhama (1995) uses a heuristic estimate that gives the data parameters twice as much weight as the control parameters. Schwanke (1991) estimates the significance of a feature using Shannon information content, which gives rarely used identifiers higher weights than frequently used identifiers. In this paper, the weights of attributes are

Experimental comparison of WPGMA, SLINK, and CLINK

The WPGMA, SLINK, and CLINK clustering algorithms have been applied to more than 60 functions from different areas, including functions that appeared in papers, student assignments and industrial programs. This section presents comparisons of these three algorithms.

Case study of program restructuring

So far, we have defined entities and attributes that are used in the similarity measure; devised a new algorithm to calculate the resemblance coefficient; and compared three clustering algorithms. In order to evaluate the effectiveness of the proposed approach, we have applied the approach to restructuring of a real industrial system in data networks.

Conclusions and future directions

This paper presented a program restructuring approach for C programs using clustering techniques. Specifically, we have discussed the selection of entities and attributes, the similarity measure, experiments on the resemblance coefficient, a comparison of hierarchical agglomerative algorithms, and the application of the approach to an industrial program. The main goal of the restructuring approach was to provide automated support to identify poorly designed or low-cohesive functions and give heuristic

Acknowledgements

The authors would like to thank Dr. M. Zaid and Dr. R. Crawhall of NCIT (National Capital Institute of Telecommunications), Ottawa and Dr. R. Munikoti and Dr. K. Kalaichelvan of EION Inc., for supporting this research. The authors also want to thank the anonymous referees for their helpful suggestions.

References (43)

  • J.M. Bieman et al. Measuring design-level cohesion. IEEE Trans. Softw. Eng. (1998)
  • Braden, R., Zhang, L., Berson, S., Herzog, S., Jamin, S., 1997. Resource ReSerVation Protocol (RSVP), RFC...
  • L. Briand et al. Property-based software engineering measurement. IEEE Trans. Softw. Eng. (1996)
  • E.J. Chikofsky et al. Reverse engineering and design recovery: a taxonomy. IEEE Softw. (1990)
  • A.C. Choi et al. Extracting and restructuring the design of large software systems. IEEE Softw. (1990)
  • Chu, W.C., Patel, S., 1992. Software restructuring by enforcing localization and information hiding. In: Proc. Conf....
  • B. Everitt. Cluster Analysis. (1974)
  • N.E. Fenton et al. Software Metrics: A Rigorous and Practical Approach. (1997)
  • M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley (1999)
  • J.-F. Girard et al. A metric-based approach to detect abstract data types and state encapsulations. Autom. Softw. Eng. (1999)
  • D. Hutchens et al. System structure analysis: clustering with data bindings. IEEE Trans. Softw. Eng. (1985)