Program restructuring using clustering techniques

doi:10.1016/j.jss.2006.02.037

Journal of Systems and Software

Volume 79, Issue 9, September 2006, Pages 1261-1279

Abstract

Program restructuring is a key method for improving the quality of ill-structured programs, thereby increasing understandability and reducing maintenance cost. It is a challenging task and a great deal of research is still ongoing. This paper presents an approach to restructuring within a function, based on clustering techniques with cohesion as the major concern. Clustering has been widely used to group related entities together. The approach focuses on automated support for identifying ill-structured or low-cohesion functions and on providing heuristic advice in both the development and evolution phases. A new similarity measure is defined and studied intensively from the function perspective. A comparative study of three hierarchical agglomerative clustering algorithms is also conducted. The best algorithm is applied to restructuring functions of a real industrial system. The empirical observations show that the heuristic advice provided by the approach can help software designers make better decisions about why and how to restructure a program. Specific source-code-level software metrics are presented to demonstrate the value of the approach.

Introduction

Software evolves over time, primarily because of changes in requirements and technologies. As a result, a huge amount of effort is spent on maintenance and evolution; software evolution usually accounts for more than 60% of total software costs (Sommerville, 1996). In today’s highly competitive era, software development is often driven by tight schedules, so software designers often emphasize the functional aspect of a system. Even if a software product is well designed, the code is often modified over time in response to the changing needs of customers. As a consequence, its original structure gradually drifts and its quality degrades. The program becomes difficult to understand and, in turn, costly to maintain.

Müller et al. indicate that 50–90% of software evolution work focuses on program comprehension or understanding (Müller et al., 1995). Program understanding can take place at various levels, including architecture, design, and code. At the implementation level, a large or poorly coded function usually involves multiple activities or has low functional cohesion, which makes the program difficult to understand and modify. Program restructuring (Chikofsky and Cross, 1990) or refactoring (Fowler, 1999) can transform such functions into functions that are better organized and easier to understand, without changing their behavior. The new functions will usually be of higher quality and less costly to evolve further. More importantly, a desirable restructuring should achieve high cohesion and low coupling (Briand et al., 1996; Munson, 2003; Pressman, 1997).

Cohesion, an important measure in restructuring, captures how tightly related the elements of a component are. The goal of clustering is to group similar or related elements together, so cluster analysis can be used to measure the strength of the relationships between elements in a component. Earlier work demonstrates the research potential of the software clustering field (Tzerpos and Holt, 1998) and concludes that clustering methods may be a very good starting point for the remodularization of software (Wiggerts, 1997).

However, existing research in the software clustering field has mainly been concerned with software remodularization at the architecture level and has not been applied to program restructuring at the source code level. Source code contains critical information about the behavior of a system, and understanding and manipulating it is a pressing issue for maintenance and evolution. This paper focuses on source code and, specifically, on the restructuring of each individual function. One challenge of restructuring at this level is how to group related code segments inside a large or poorly structured function meaningfully and effectively, so that they form small, cohesive functions; in real software products it is not uncommon for unrelated fragments and functionally cohesive code segments to be interleaved. In addition, the approach should be easy to understand and effective in practice. Clustering techniques are suitable for this problem because the objective of clustering is consistent with that of cohesion.
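To make the idea of grouping interleaved code segments concrete, the following is a minimal sketch rather than the paper's method. It assumes that each entity is a statement and each attribute is a variable the statement touches, and it uses the Jaccard coefficient as a stand-in for the resemblance coefficient the paper defines. The function `jaccard`, the `entities` table, and the example statements are all hypothetical.

```python
# Hypothetical illustration: entities are statements, attributes are the
# variables each statement reads or writes. Jaccard is a stand-in measure,
# not the resemblance coefficient defined in the paper.

def jaccard(a: set, b: set) -> float:
    """Shared attributes divided by all attributes of the two entities."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# An imaginary function that interleaves a cost computation with logging.
entities = {
    "s1: area = w * h":           {"area", "w", "h"},
    "s2: log_line = fmt(user)":   {"log_line", "user"},
    "s3: cost = area * rate":     {"cost", "area", "rate"},
    "s4: write(log_line)":        {"log_line"},
}

def similarity_matrix(ents: dict) -> tuple[list, list]:
    """Return entity names and the full pairwise similarity matrix."""
    names = list(ents)
    matrix = [[jaccard(ents[p], ents[q]) for q in names] for p in names]
    return names, matrix

if __name__ == "__main__":
    names, matrix = similarity_matrix(entities)
    for name, row in zip(names, matrix):
        print(name.ljust(28), [round(v, 2) for v in row])
```

In this toy input, statements s1 and s3 share `area`, statements s2 and s4 share `log_line`, and no attribute crosses the two groups, which is the kind of signal a clustering algorithm can pick up when suggesting how to split a function.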

This paper presents an approach to program restructuring at the function level using clustering techniques. It focuses on automated support for identifying low-cohesion functions and making restructuring decisions, rather than on automating the restructuring process itself. The purpose is to help software designers identify ill-structured functions and to provide them with heuristic advice. In detail, this paper discusses how to select entities and how to select the attributes that matter for distinguishing two entities from the cohesion perspective. A new resemblance coefficient is defined as the similarity measure, and extensive experiments on the weights of different attributes are conducted. Three hierarchical agglomerative algorithms are chosen and compared in an intensive study: the single linkage algorithm (SLINK), the complete linkage algorithm (CLINK), and the weighted pair-group method using arithmetic averages (WPGMA). These algorithms are summarized as follows:

• SLINK: also called the nearest neighbor method. It defines the similarity between two clusters as the maximum resemblance coefficient over all pairs of entities taken one from each cluster.
• CLINK: also called the furthest neighbor method. It defines the similarity between two clusters as the minimum resemblance coefficient over all pairs of entities taken one from each cluster.
• WPGMA: defines the similarity between two clusters as the simple arithmetic average of the resemblance coefficients between the two clusters, without considering cluster size.
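The three rules can be sketched as an update applied whenever two clusters A and B are merged: the similarity of the merged cluster to any other cluster C is derived from s(A, C) and s(B, C). The code below is an illustrative sketch under that standard formulation, not the paper's implementation. The names `merge_similarity` and `agglomerate` and the sample matrix are assumptions made for the example.

```python
from itertools import combinations

def merge_similarity(method: str, s_ac: float, s_bc: float) -> float:
    """Similarity between the merged cluster (A and B) and another cluster C."""
    if method == "SLINK":   # nearest neighbor: keep the strongest link
        return max(s_ac, s_bc)
    if method == "CLINK":   # furthest neighbor: keep the weakest link
        return min(s_ac, s_bc)
    if method == "WPGMA":   # plain average, cluster sizes ignored
        return (s_ac + s_bc) / 2
    raise ValueError(f"unknown method: {method}")

def agglomerate(names, sim, method):
    """Repeatedly merge the two most similar clusters.

    Returns the merge history as (members, similarity level) tuples.
    """
    clusters = {i: (name,) for i, name in enumerate(names)}
    s = {frozenset((i, j)): sim[i][j]
         for i, j in combinations(range(len(names)), 2)}
    history = []
    next_id = len(names)
    while len(clusters) > 1:
        pair = max(s, key=s.get)          # most similar pair of clusters
        a, b = sorted(pair)
        level = s.pop(pair)
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            s_ac = s.pop(frozenset((a, c)))
            s_bc = s.pop(frozenset((b, c)))
            updated[frozenset((next_id, c))] = merge_similarity(method, s_ac, s_bc)
        s.update(updated)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((clusters[next_id], level))
        next_id += 1
    return history

# Hypothetical symmetric similarity matrix for four entities.
names = ["a", "b", "c", "d"]
sim = [
    [1.0,   0.875, 0.25, 0.125],
    [0.875, 1.0,   0.5,  0.375],
    [0.25,  0.5,   1.0,  0.75],
    [0.125, 0.375, 0.75, 1.0],
]

if __name__ == "__main__":
    for method in ("SLINK", "CLINK", "WPGMA"):
        print(method, agglomerate(names, sim, method))
```

On this sample matrix all three methods first merge (a, b) and then (c, d). They differ only in the level of the final merge: 0.5 for SLINK, 0.125 for CLINK, and 0.3125 for WPGMA. This shows how the choice of linkage changes how closely related two groups of entities appear.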

The algorithm that produces the best result will then be applied to program restructuring of an industrial system.

The structure of the rest of this paper is as follows. Section 2 reviews the related work in both program restructuring and software clustering areas. Section 3 proposes an approach to program restructuring using clustering techniques and discusses the issues involved in the approach. Section 4 provides an extensive study on the similarity measure by weighting attributes differently. Section 5 gives a comparative study of three clustering algorithms: SLINK, CLINK and WPGMA. Section 6 presents a case study of program restructuring using the clustering results on an industrial software system. Empirical observations are also summarized. Section 7 presents the conclusions and future work.

Section snippets

Related work

There has been extensive research on software restructuring. This section describes related research on restructuring at the function or design level, and also presents related research on software clustering.

An approach to program restructuring using clustering techniques

This section presents an approach to code restructuring using clustering techniques and discusses key issues of clustering techniques.

Experiments on similarity measure

The resemblance coefficient has been defined, but how to decide the weights is still unresolved. Previous research did not study this issue systematically. Dhama (1995) uses a heuristic estimate that gives the data parameters twice as much weight as the control parameters. Schwanke (1991) estimates the significance of a feature using Shannon information content, which gives rarely used identifiers higher weights than frequently used ones. In this paper, the weights of attributes are

Experimental comparison of WPGMA, SLINK, and CLINK

The WPGMA, SLINK, and CLINK clustering algorithms have been applied to more than 60 functions from different areas, including functions that appeared in papers, student assignments, and industrial programs. This section compares the three algorithms.

Case study of program restructuring

So far, we have defined entities and attributes that are used in the similarity measure; devised a new algorithm to calculate the resemblance coefficient; and compared three clustering algorithms. In order to evaluate the effectiveness of the proposed approach, we have applied the approach to restructuring of a real industrial system in data networks.

Conclusions and future directions

This paper presented a program restructuring approach for C programs using clustering techniques. Specifically, we discussed the selection of entities and attributes, the similarity measure, the resemblance coefficient experiments, the comparison of hierarchical agglomerative algorithms, and the application of the approach to an industrial program. The main goal of the restructuring approach was to provide automated support to identify poorly designed or low-cohesive functions and give heuristic

Acknowledgements

The authors would like to thank Dr. M. Zaid and Dr. R. Crawhall of NCIT (National Capital Institute of Telecommunications), Ottawa and Dr. R. Munikoti and Dr. K. Kalaichelvan of EION Inc., for supporting this research. The authors also want to thank the anonymous referees for their helpful suggestions.

References (43)

H. Dhama, Quantitative models of cohesion and coupling in software, J. Syst. Softw. (1995)
B.-K. Kang et al., Using design abstractions to visualize, quantify, and restructure software, J. Syst. Softw. (1998)
A. Lakhotia, A unified framework for expressing software subsystem classification techniques, J. Syst. Softw. (1997)
A. Lakhotia et al., Restructuring programs by Tucking statements into functions, J. Inform. Softw. Technol. (1998)
C.-H. Lung et al., Applications of clustering techniques to software partitioning, recovery and restructuring, J. Syst. Softw. (2004)
N. Anquetil et al., Comparative study of clustering algorithms and abstract representations for software remodularisation, IEE Proc. Softw. (2003)
Anquetil, N., Fourrier, C., Lethbridge, T., 1999. Experiments with hierarchical clustering algorithms as software...
R.S. Arnold, Software restructuring, Proc. IEEE (1989)
Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., Swallow, G., 2001. RSVP-TE: Extensions to RSVP for LSP...
J.M. Bieman, Measuring functional cohesion, IEEE Trans. Softw. Eng. (1994)
J.M. Bieman et al., Measuring design-level cohesion, IEEE Trans. Softw. Eng. (1998)
Braden, R., Zhang, L., Berson, S., Herzog, S., Jamin, S., 1997. Resource ReSerVation Protocol (RSVP), RFC...
L. Briand et al., Property-based software engineering measurement, IEEE Trans. Softw. Eng. (1996)
E.J. Chikofsky et al., Reverse engineering and design recovery: a taxonomy, IEEE Softw. (1990)
A.C. Choi et al., Extracting and restructuring the design of large software systems, IEEE Softw. (1990)
Chu, W.C., Patel, S., 1992. Software restructuring by enforcing localization and information hiding. In: Proc. Conf....
B. Everitt, Cluster Analysis (1974)
N.E. Fenton et al., Software Metrics: A Rigorous and Practical Approach (1997)
M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley (1999)
J.-F. Girard et al., A metric-based approach to detect abstract data types and state encapsulations, Autom. Softw. Eng. (1999)
D. Hutchens et al., System structure analysis: clustering with data bindings, IEEE Trans. Softw. Eng. (1985)

Cited by (28)

A clustering-based model for class responsibility assignment problem in object-oriented analysis
2014, Journal of Systems and Software
Assigning responsibilities to classes is a vital task in object-oriented analysis and design, and it directly affects the maintainability and reusability of software systems. There are many methodologies to help recognize the responsibilities of a system and assign them to classes, but all of them depend greatly on human judgment and decision-making. In this paper, we propose a clustering-based model to solve the class responsibility assignment (CRA) problem. The proposed model employs a novel interactive graph-based method to find inheritance hierarchies, and two novel criteria to determine the appropriate number of classes. It reduces the dependency of CRA on human judgment and provides a decision-making support for CRA in class diagrams. To evaluate the proposed model, we apply three different hierarchical agglomerative clustering algorithms and two different types of similarity measures. By comparing the obtained results of clustering techniques with the models designed by multi-objective genetic algorithm (MOGA), it is revealed that clustering techniques yield promising results.
Efficient software clustering technique using an adaptive and preventive dendrogram cutting approach
2013, Information and Software Technology
Citation Excerpt :
The proposed similarity metric is used to measure the closeness among the system components to group them into subsystems using an optimization clustering technique. On the other hand, the work by Lung et al. [12] proposed a new similarity measure that is designed to improve the quality of a poorly structured program. The approach looks into the structure of a program in which functions that share common attributes are deemed to correlate among themselves.
Software clustering is a key technique that is used in reverse engineering to recover a high-level abstraction of the software in the case of limited resources. Very limited research has explicitly discussed the problem of finding the optimum set of clusters in the design and how to penalize for the formation of singleton clusters during clustering.
This paper attempts to enhance the existing agglomerative clustering algorithms by introducing a complementary mechanism. To solve the architecture recovery problem, the proposed approach focuses on minimizing redundant effort and penalizing for the formation of singleton clusters during clustering while maintaining the integrity of the results.
An automated solution for cutting a dendrogram that is based on least-squares regression is presented in order to find the best cut level. A dendrogram is a tree diagram that shows the taxonomic relationships of clusters of software entities. Moreover, a factor to penalize clusters that will form singletons is introduced in this paper. Simulations were performed on two open-source projects. The proposed approach was compared against the exhaustive and highest gap dendrogram cutting methods, as well as two well-known cluster validity indices, namely, Dunn’s index and the Davies-Bouldin index.
When comparing our clustering results against the original package diagram, our approach achieved an average accuracy rate of 90.07% from two simulations after the utility classes were removed. The utility classes in the source code affect the accuracy of the software clustering, owing to its omnipresent behavior. The proposed approach also successfully penalized the formation of singleton clusters during clustering.
The evaluation indicates that the proposed approach can enhance the quality of the clustering results by guiding software maintainers through the cutting point selection process. The proposed approach can be used as a complementary mechanism to improve the effectiveness of existing clustering algorithms.
SPAPE: A semantic-preserving amorphous procedure extraction method for near-miss clones
2013, Journal of Systems and Software
Citation Excerpt :
To make this project feasible, we left handling C pointers, a complex (Mark Harman et al., 2004; Hind et al., 1999; Qian et al., 2009; Thiessen, in press) side issue, to a later time when SPAPE is prepared to be adopted by actual development teams. Many earlier research studies in this area (e.g., (Harman et al., 2003, 2004; Alkhalid et al., 2010; Nikolaos Tsantalis and Alexander Chatzigeorgiou, 2009; Yang et al., 2009; Mark Harman et al., 2004; Lung et al., 2006; Komondoor and Horwitz, 2001)) did not handle pointers either for the sake of focusing on more immediate research objectives. Internal & external validity: Since our study does not establish cause–effect relationships, internal validity is not applicable.
Cloned code, also known as duplicated code, is among the bad “code smells”. Procedure extraction can be used to remove clones and to make a software system more maintainable. While the existing procedure extraction techniques can handle automatic extraction of exact clones effectively, they fail to do so for near-miss clones, which are the code fragments that are similar but not the same. To address this gap, we developed SPAPE, a novel semantic-preserving amorphous procedure extraction method to extract near-miss clones. SPAPE relaxes the constraint of having the same syntax and uses the structural semantic information. We evaluated the performance, effectiveness, and benefits of SPAPE. Our results show that SPAPE can extract more near-miss clones than the best applicable method for ten open-source-software products in an efficient and effective fashion. We conclude that SPAPE can be a useful contribution to the toolsets of software managers and developers, and it can help them improve code structure and reduce software maintenance and overall project costs.
Adjusting Fuzzy Similarity Functions for use with standard data mining tools
2011, Journal of Systems and Software
Data mining is crucial in many areas and there are ongoing efforts to improve its effectiveness in both the scientific and the business world. There is an obvious need to improve the outcomes of mining techniques such as clustering and other classifiers without abandoning the standard mining tools that are popular with researchers and practitioners alike. Currently, however, standard tools do not have the flexibility to control similarity relations between attribute values, a critical feature in improving mining-clustering results. The study presented here introduces the Similarity Adjustment Model (SAM) where adjusted Fuzzy Similarity Functions (FSF) control similarity relations between attribute values and hence ameliorate clustering results obtained with standard data mining tools such as SPSS and SAS. The SAM draws on principles of binary database representation models and employs FSF adjusted via an iterative learning process that yields improved segmentation regardless of the choice of mining-clustering algorithm. The SAM model is illustrated and evaluated on three common datasets with the standard SPSS package. The datasets were run with several clustering algorithms. Comparison of “Naïve” runs (which used original data) and “Fuzzy” runs (which used SAM) shows that the SAM improves segmentation in all cases.
Software refactoring at the function level using new Adaptive K-Nearest Neighbor algorithm
2010, Advances in Engineering Software
Citation Excerpt :
The executable statements are the statements which include assignment, operation, and iteration statements. Entities, of the executable statements, are divided into control entities and non-control entities [19]. A control entity refers to an entity that is either a predicate statement (such as If statement) or an iteration statement (such as for statement).
Improving the quality of software is a vital target of software engineering. Constantly evolving requirements force software developers to enhance, modify, or adapt software. Thus, increasing internal complexity, maintenance effort, and ultimately cost. In trying to balance between the needs to change software, maintain high quality, and keep the maintenance effort and cost low, refactoring comes up as a solution. Refactoring aims to improve a number of quality factors, among which is understandability. Enhancing understandability of ill-structured software decreases the maintenance effort and cost. To improve understandability, designers need to maximize cohesion and minimize coupling, which becomes more difficult to achieve as software evolves and internal complexity increases. In this paper, we propose a new Adaptive K-Nearest Neighbor (A-KNN) algorithm to perform clustering with different attribute weights. The technique is used to assist software developers in refactoring at the function/method level. This is achieved by identifying ill-structured software entities and providing suggestions to improve cohesion of such entities. We also compare the proposed technique with three function-level clustering techniques Single Linkage algorithm (SLINK), Complete Linkage algorithm (CLINK) and Weighted Pair-Group Method using Arithmetic averages (WPGMA). A-KNN showed competitive performance with the other three algorithms and required less computational complexity.
Clean Code in Practice: Challenges and Opportunities
2023, SSRN
