Program restructuring using clustering techniques
Introduction
Software evolves over time, primarily because of changes in requirements and technologies. As a result, a huge amount of effort is spent on maintenance and evolution. Software evolution usually accounts for more than 60% of total software costs (Sommerville, 1996). In today’s highly competitive era, software development is often driven by tight schedules, so software designers often emphasize the functional aspects of a system. Even if a software product is well designed, its code is often modified over time in response to the changing needs of customers. As a consequence, the original structure gradually drifts and quality degrades. The program becomes difficult to understand and, in turn, costly to maintain.
Müller et al. indicate that 50–90% of software evolution work focuses on program comprehension or understanding (Müller et al., 1995). Program understanding can take place at various levels, including architecture, design, and code. At the implementation level, a large or poorly coded function usually involves multiple activities or has low functional cohesion, which makes the program difficult to understand and modify. Program restructuring (Chikofsky and Cross, 1990) or refactoring (Fowler, 1999) can transform such functions into functions that are better organized and easier to understand, without changing their behavior. The new functions are usually of higher quality and less costly to evolve further. More importantly, a desirable restructuring should achieve high cohesion and low coupling (Briand et al., 1996; Munson, 2003; Pressman, 1997).
Cohesion, an important measure in restructuring, captures how tightly related the elements of a component are. The goal of clustering is to group similar or related elements together, so cluster analysis can be used to measure the strength of the relationships between elements in a component. Previous work demonstrates the research potential of the software clustering field (Tzerpos and Holt, 1998) and concludes that clustering methods may be a very good starting point for the remodularization of software (Wiggerts, 1997).
However, existing research in the software clustering field has mainly been concerned with software remodularization at the architecture level and has not been applied to program restructuring at the source code level. Source code contains critical information regarding the behavior of a system, and understanding and manipulating it is a pressing issue for maintenance and evolution. This paper focuses on source code. Specifically, it deals with the restructuring of individual functions. One challenge of restructuring at this level is how to meaningfully and effectively group related code segments inside a large or poorly structured function to form small or cohesive functions, because unrelated fragments and functionally cohesive code segments are often interleaved in real software products. In addition, the approach should be easy to understand and effective in practice. Clustering techniques are suitable for this problem, because the objective of clustering is consistent with that of cohesion.
This paper presents an approach to program restructuring using clustering techniques at the function level. It focuses on automated support for identifying low-cohesion functions and for making restructuring decisions, rather than on automating the restructuring process itself. The purpose is to help software designers identify ill-structured functions and to provide them with heuristic advice. In detail, this paper discusses how to select entities and how to select the attributes that are important for distinguishing two entities from the cohesion perspective. A new resemblance coefficient is defined as the similarity measure, and extensive experiments on the weights of different attributes are conducted. Three hierarchical agglomerative algorithms are chosen and compared in depth: the single linkage algorithm (SLINK), the complete linkage algorithm (CLINK), and the weighted pair-group method using arithmetic averages (WPGMA). These algorithms are summarized as follows:
- SLINK: also called the nearest neighbor method. It defines the similarity measure between two clusters as the maximum resemblance coefficient among all pairs of entities in the two clusters.
- CLINK: also called the furthest neighbor method. It defines the similarity measure between two clusters as the minimum resemblance coefficient among all pairs of entities in the two clusters.
- WPGMA: this defines the similarity measure between two clusters as the simple arithmetic average of resemblance coefficients between two clusters without considering the cluster size.
The algorithm that produces the best result will then be applied to program restructuring of an industrial system.
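The three linkage rules above can be sketched as update rules inside one agglomerative loop. The following is a minimal illustration, not the paper's implementation: the function name `agglomerate`, the `UPDATE` table, and the similarity values are all hypothetical. It assumes that pairwise resemblance coefficients are already available and that the most similar pair of clusters is merged at each step.

```python
from itertools import combinations

# Similarity between a merged cluster (A u B) and another cluster C,
# computed from sim(A, C) and sim(B, C) under each linkage rule.
UPDATE = {
    "SLINK": max,                         # nearest neighbor: keep the larger similarity
    "CLINK": min,                         # furthest neighbor: keep the smaller similarity
    "WPGMA": lambda x, y: (x + y) / 2.0,  # plain average, cluster sizes ignored
}

def agglomerate(entities, sim, method):
    """Merge the most similar pair of clusters until one cluster remains.

    entities: list of entity labels (hypothetical statement identifiers).
    sim: dict mapping (a, b) label pairs to a resemblance coefficient.
    Returns the merge history as (cluster_a, cluster_b, similarity) tuples.
    """
    combine = UPDATE[method]
    clusters = [frozenset([e]) for e in entities]
    # Key each similarity by the unordered pair of clusters it relates.
    s = {frozenset([frozenset([a]), frozenset([b])]): v
         for (a, b), v in sim.items()}
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: s[frozenset(p)])
        level = s[frozenset((a, b))]
        merged = a | b
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            s[frozenset((merged, c))] = combine(s[frozenset((a, c))],
                                                s[frozenset((b, c))])
        clusters = rest + [merged]
        merges.append((set(a), set(b), level))
    return merges

# Made-up coefficients for four statements of one function.
entities = ["s1", "s2", "s3", "s4"]
sim = {("s1", "s2"): 0.9, ("s1", "s3"): 0.2, ("s1", "s4"): 0.1,
       ("s2", "s3"): 0.6, ("s2", "s4"): 0.3, ("s3", "s4"): 0.5}

for method in ("SLINK", "CLINK", "WPGMA"):
    print(method, agglomerate(entities, sim, method))
```

On this toy input, SLINK chains s3 onto the first cluster, while CLINK and WPGMA pair s3 with s4 first, which illustrates how the choice of linkage can change the resulting grouping even when the pairwise coefficients are identical.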
The structure of the rest of this paper is as follows. Section 2 reviews the related work in both program restructuring and software clustering areas. Section 3 proposes an approach to program restructuring using clustering techniques and discusses the issues involved in the approach. Section 4 provides an extensive study on the similarity measure by weighting attributes differently. Section 5 gives a comparative study of three clustering algorithms: SLINK, CLINK and WPGMA. Section 6 presents a case study of program restructuring using the clustering results on an industrial software system. Empirical observations are also summarized. Section 7 presents the conclusions and future work.
Section snippets
Related work
There has been extensive research on software restructuring. This section describes related research on restructuring at the function or design level. It also presents related research on software clustering.
An approach to program restructuring using clustering techniques
This section presents an approach to code restructuring using clustering techniques and discusses key issues of clustering techniques.
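To make the idea of entities, attributes, and a similarity measure concrete, here is a small sketch. It is an assumption-laden illustration rather than the paper's method: it treats each statement of a function as an entity, treats the variables a statement uses as its attributes, and uses a weighted Jaccard-style ratio as a stand-in for the resemblance coefficient, whose actual definition is not given in this excerpt. The name `weighted_jaccard`, the example statements, and the weights are hypothetical.

```python
def weighted_jaccard(attrs_a, attrs_b, weights, default=1.0):
    """Weighted share of attributes that two entities have in common.

    attrs_a, attrs_b: sets of attribute names (here: variables used).
    weights: dict of per-attribute weights; unlisted attributes get `default`.
    This is a stand-in measure, not the coefficient defined in the paper.
    """
    either = attrs_a | attrs_b
    total = sum(weights.get(x, default) for x in either)
    if total == 0:
        return 0.0
    shared = attrs_a & attrs_b
    return sum(weights.get(x, default) for x in shared) / total

# Hypothetical function that interleaves a sum and a maximum computation.
statements = {
    "s1": {"sum"},                # sum = 0
    "s2": {"largest", "a"},       # largest = a[0]
    "s3": {"sum", "a", "i"},      # sum += a[i]
    "s4": {"largest", "a", "i"},  # if a[i] > largest: largest = a[i]
}

# Illustrative weights: the loop index and the shared input array count less,
# because nearly every statement touches them.
weights = {"i": 0.5, "a": 0.5}

for x, y in [("s1", "s3"), ("s2", "s4"), ("s3", "s4"), ("s1", "s2")]:
    print(x, y, round(weighted_jaccard(statements[x], statements[y], weights), 3))
```

With these made-up weights, the statements that belong to the same computation (s1 with s3, s2 with s4) score higher than the interleaved pair s3 and s4, which is the kind of separation a cohesion-oriented measure is meant to expose.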
Experiments on similarity measure
The resemblance coefficient has been defined, but how to choose the weights remains an open question. Previous research did not study this issue systematically. Dhama (1995) uses a heuristic estimate that gives data parameters twice as much weight as control parameters. Schwanke (1991) estimates the significance of a feature using Shannon information content, which gives rarely used identifiers higher weights than frequently used identifiers. In this paper, the weights of attributes are
Experimental comparison of WPGMA, SLINK, and CLINK
The WPGMA, SLINK, and CLINK clustering algorithms have been applied to more than 60 functions from different areas, including functions that appeared in papers, student assignments, and industrial programs. This section compares the three algorithms.
Case study of program restructuring
So far, we have defined entities and attributes that are used in the similarity measure; devised a new algorithm to calculate the resemblance coefficient; and compared three clustering algorithms. In order to evaluate the effectiveness of the proposed approach, we have applied the approach to restructuring of a real industrial system in data networks.
Conclusions and future directions
This paper presented an approach to restructuring C programs using clustering techniques. Specifically, we have discussed the selection of entities and attributes, the similarity measure, experiments on the resemblance coefficient, a comparison of hierarchical agglomerative algorithms, and the application of the approach to an industrial program. The main goal of the restructuring approach was to provide automated support to identify poorly designed or low-cohesive functions and give heuristic
Acknowledgements
The authors would like to thank Dr. M. Zaid and Dr. R. Crawhall of NCIT (National Capital Institute of Telecommunications), Ottawa and Dr. R. Munikoti and Dr. K. Kalaichelvan of EION Inc., for supporting this research. The authors also want to thank the anonymous referees for their helpful suggestions.
References (43)
- Quantitative models of cohesion and coupling in software. J. Syst. Softw. (1995)
- et al., Using design abstractions to visualize, quantify, and restructure software. J. Syst. Softw. (1998)
- A unified framework for expressing software subsystem classification techniques. J. Syst. Softw. (1997)
- et al., Restructuring programs by Tucking statements into functions. J. Inform. Softw. Technol. (1998)
- et al., Applications of clustering techniques to software partitioning, recovery and restructuring. J. Syst. Softw. (2004)
- et al., Comparative study of clustering algorithms and abstract representations for software remodularisation. IEE Proc. Softw. (2003)
- Anquetil, N., Fourrier, C., Lethbridge, T., 1999. Experiments with hierarchical clustering algorithms as software...
- Software restructuring. Proc. IEEE (1989)
- Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., Swallow, G., 2001. RSVP-TE: Extensions to RSVP for LSP...
- Measuring functional cohesion. IEEE Trans. Softw. Eng. (1994)
- Measuring design-level cohesion. IEEE Trans. Softw. Eng.
- Property-based software engineering measurement. IEEE Trans. Softw. Eng.
- Reverse engineering and design recovery: a taxonomy. IEEE Softw.
- Extracting and restructuring the design of large software systems. IEEE Softw.
- Cluster Analysis
- Software Metrics: A Rigorous and Practical Approach
- Refactoring: Improving the Design of Existing Code. Addison-Wesley
- A metric-based approach to detect abstract data types and state encapsulations. Autom. Softw. Eng.
- System structure analysis: clustering with data bindings. IEEE Trans. Softw. Eng.
Cited by (28)
- A clustering-based model for class responsibility assignment problem in object-oriented analysis. 2014, Journal of Systems and Software
- Efficient software clustering technique using an adaptive and preventive dendrogram cutting approach. 2013, Information and Software Technology
  Citation excerpt: The proposed similarity metric is used to measure the closeness among the system components to group them into subsystems using an optimization clustering technique. On the other hand, the work by Lung et al. [12] proposed a new similarity measure that is designed to improve the quality of a poorly structured program. The approach looks into the structure of a program in which functions that share common attributes are deemed to correlate among themselves.
- SPAPE: A semantic-preserving amorphous procedure extraction method for near-miss clones. 2013, Journal of Systems and Software
  Citation excerpt: To make this project feasible, we left handling C pointers, a complex (Mark Harman et al., 2004; Hind et al., 1999; Qian et al., 2009; Thiessen, in press) side issue, to a later time when SPAPE is prepared to be adopted by actual development teams. Many earlier research studies in this area (e.g., (Harman et al., 2003, 2004; Alkhalid et al., 2010; Nikolaos Tsantalis and Alexander Chatzigeorgiou, 2009; Yang et al., 2009; Mark Harman et al., 2004; Lung et al., 2006; Komondoor and Horwitz, 2001)) did not handle pointers either for the sake of focusing on more immediate research objectives. Internal & external validity: Since our study does not establish cause–effect relationships, internal validity is not applicable.
- Adjusting Fuzzy Similarity Functions for use with standard data mining tools. 2011, Journal of Systems and Software
- Software refactoring at the function level using new Adaptive K-Nearest Neighbor algorithm. 2010, Advances in Engineering Software
  Citation excerpt: The executable statements are the statements which include assignment, operation, and iteration statements. Entities, of the executable statements, are divided into control entities and non-control entities [19]. A control entity refers to an entity that is either a predicate statement (such as If statement) or an iteration statement (such as for statement).