1 Introduction

Software Remodularization: Software systems, in particular object-oriented systems, are initially designed and created in a modular way. However, over time, the modularity of a system often degrades due to improper placement of software elements into modules [1]. An ill-modular structure makes a software system difficult to understand, maintain and evolve [2]. To improve the modularity of a software system, maintainers reorganize the software components into modules; this process is generally known as software remodularization [3]. The software remodularization problem generally involves a large set of conflicting criteria, competing constraints, and ambiguous and vague information [4]. Hence, solving the remodularization problem is a very complex and time-consuming task. Consider the problem of finding a remodularization of a software system that satisfies a given modularity criterion. In such a problem, there can be a very large, even infinite, number of remodularization alternatives, and several different modularizations can be good solutions. Furthermore, a modularization needs to satisfy various competing constraints related to modularity. In such a case, identifying an overall good modularization solution is highly desirable for the maintainer, but not easy to obtain through an automated process.

Software remodularization techniques can be broadly categorized into analytical and search-based remodularization. Several analytical techniques have been presented in the research literature for software remodularization [3, 5–7]. These approaches guarantee an optimal solution, but require exponential computational time. Hence, such approaches are more suitable for small problems. On the other hand, search-based remodularization approaches do not guarantee an optimal solution, but are able to find a near-optimal solution in a reasonable time.

This paper uses the search-based remodularization approach and is primarily concerned with the problem of rearranging the source code classes into a set of modules for a system whose modularization has degraded due to maintenance. For such a system, some classes may no longer be in suitable modules, and thus remodularization of the system becomes highly useful for improving the software quality. Our aim is therefore to search the whole space of possible modularizations to see if there exists a better grouping of classes into modules.

Search-Based Multi-objective Software Remodularization: The term “Search-Based Software Engineering” (SBSE) was coined by Harman and Jones [8]. Since then, a lot of research has been carried out in this direction. As noted by Harman and Jones [8], the problems addressed by SBSE usually require considerable effort, the solution is highly intricate, and the software developer is ready to wait for the output. The search-based approach provides a solution at lower human cost, freeing the software developer to work on other issues that require creativity and imagination.

To solve any problem using a search-based technique, the problem needs to be reformulated as a search-based optimization problem. For that, it is necessary to define the following three key ingredients: (1) the representation of candidate solutions; (2) the fitness function; and (3) the manipulation operators. Proper design of these ingredients has a major influence on the performance of search-based software remodularization. The main contribution of this paper is a new and efficient representation of candidate solutions, which helps to improve human-computer interaction (HCI), since software developers (humans) interact with the computer-based optimization process through this representation. An efficient representation eases the process of supplying suitable data to the computing system, ensures the transfer of correct data from humans to computers, and has a major impact on the performance of processing this data at the machine end [9].

A good representation of the candidate solutions is critical to the convergence speed of the search-based technique and to the quality of the obtained results [10]. Most of the existing search-based remodularization approaches use an integer-based representation, namely GNE (Group Number Encoding), for candidate solutions [4, 10–14]. It is the most widely used solution representation in both single-objective and multi-objective search-based software remodularization. A solution is represented as an n-sized integer vector, where the value \( c_{i} \), with \( 0 < c_{i} \le n \), of the ith class indicates the cluster to which the ith class is assigned. A remodularization solution with the same value for all the classes means that all classes are placed in the same cluster, while a solution with all different values (from 1 to n) means that each class is in a separate module. The disadvantage of this representation is that it is highly redundant. Figure 1 illustrates an example of the GNE representation.

Fig. 1. Illustration of candidate solution using GNE representation

The above-mentioned integer-based representation has been used successfully in many research papers for solving search-based software remodularization problems. However, a major drawback of this representation is that it generates many redundant solutions, which increases the search space and hence the execution time. For example, consider software consisting of 7 classes grouped into 3 modules: the vectors (1,1,2,1,2,3,3), (3,3,1,3,1,2,2) and (2,2,3,2,3,1,1) all represent the same modularization solution, which places classes 1, 2 and 4 in one module, classes 3 and 5 in another module, and classes 6 and 7 in a third module. Such equivalent encodings of one modularization inflate the search space without adding any new solutions.
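The redundancy of GNE can be checked mechanically. The following Python sketch is an illustration only and is not taken from the study: the helper names `canonical_labels` and `as_partition` are hypothetical. It relabels modules in order of first appearance, so that two GNE vectors describing the same grouping become identical.

```python
def canonical_labels(gne):
    """Relabel modules in order of first appearance, starting at 1."""
    mapping = {}
    relabeled = []
    for label in gne:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
        relabeled.append(mapping[label])
    return tuple(relabeled)


def as_partition(gne):
    """Return the modularization as a set of frozensets of 1-based class ids."""
    modules = {}
    for cls, label in enumerate(gne, start=1):
        modules.setdefault(label, set()).add(cls)
    return {frozenset(members) for members in modules.values()}


# Two of the 7-class vectors from the example in the text.
a = (1, 1, 2, 1, 2, 3, 3)
b = (3, 3, 1, 3, 1, 2, 2)

print(canonical_labels(a) == canonical_labels(b))      # True
print(sorted(sorted(m) for m in as_partition(a)))      # [[1, 2, 4], [3, 5], [6, 7]]
```

With three modules labeled from {1, 2, 3}, each of the 3! = 6 permutations of the labels encodes the same partition, which is the kind of duplication a search over raw GNE vectors has to wade through.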

To minimize this redundancy, we propose a new representation technique within a search-based multi-objective software remodularization framework. Designing the representation is an important task, because it facilitates efficient human-computer interaction. Transforming the software structure into the new representation is usually a manual or semi-automated process, whereas the optimization is a computer-based processing task. The proposed representation requires a clear comprehension of the semantics of the structure, so that a suitable human-computer interaction (HCI) can be planned. The proposed approach to software remodularization is able to produce better solutions with faster convergence than the previous approach.

The rest of the paper is organized as follows. Section 2 presents relevant research literature. Section 3 describes a framework of the proposed search-based multi-objective software remodularization. Section 4 presents the experiments, results and analysis. Section 5 concludes with future work.

2 Related Works

A large number of research works have been proposed in the literature to support automatic software remodularization [3, 4, 7, 14–19]. Most of the existing software remodularization approaches are based on analytical techniques [3, 7, 15, 17] or search-based optimization techniques [4, 14, 16, 19]. In search-based approaches, once a software system is framed as a search problem, many search algorithms can be applied to solve it.

The encoding of the candidate solution strongly affects the performance of the search. In the research literature on search-based software remodularization, many representations have been used, such as binary code, floating-point numbers, Gray code and integer numbers. However, for the search-based software remodularization problem, the integer representation has been found to be more suitable than the other representations [4]. Here, we discuss the prevailing approaches that are close to our proposed work.

Mancoridis et al. [11] were the first to formulate the software remodularization problem as a search-based optimization problem and applied Hill-Climbing and Genetic Algorithms to solve it. They used an integer-based representation technique to represent the candidate solution. Thereafter, Mitchell and Mancoridis [16] used the same representation technique with Hill-Climbing and Genetic Algorithms in the development of Bunch, a tool supporting automatic software remodularization. Praditwong et al. [4] also used the same representation in a new evolutionary algorithm, named the two-archive multi-objective evolutionary algorithm, to address software remodularization problems. Abdeen et al. [20] also applied evolutionary algorithms with the same representation to modularize source code classes into packages by automatically minimizing the package dependencies of an object-oriented software system. Later, Abdeen et al. [13] extended this work by formulating the problem as a multi-objective optimization problem and solving it with a multi-objective evolutionary algorithm.

Apart from the above work on search-based software remodularization, there are further representations in the evolutionary clustering literature. Falkenauer [21, 22] proposed an encoding for a variable-length genetic algorithm. The encoding separates each individual into two parts, c = [l | g]: the first part is the element section, whereas the second part is called the group section of the individual. In [23], each vector is a sequence of real numbers representing the K cluster centers. For an N-dimensional search space, the length of a clustering solution is N*K words, where the first N positions represent the N dimensions of the first cluster center, the next N positions represent those of the second cluster center, and so on.

3 Proposed Approach

To improve the performance of search-based software remodularization, especially multi-objective search-based software remodularization, this paper proposes a new and efficient representation technique for the candidate solution. The multi-objective search technique considered here is based on the concept of non-dominated sorting [24]. The general structure of the proposed search-based multi-objective software remodularization is given in Fig. 2.

Fig. 2. Overview of search-based multi-objective remodularization process

The approach is divided into two main parts. In the first part, the software system to be remodularized is parsed, and the classes, packages and dependencies between all pairs of classes are extracted. Using the extracted information, the initial modularization solution is encoded with the proposed representation technique (see Sect. 3.1). In the second part, the multi-objective evolutionary algorithm (i.e., a non-dominated sorting based genetic algorithm) is applied to the initial remodularization solution. The algorithm starts by generating an initial population from the initial modularization solution. The child population is generated from the initial population by applying the crossover and mutation operators. Next, the parent and child populations are combined into a global population, which is evaluated with the associated quality measures. The candidate solutions in the global population are then categorized according to their dominance. Non-dominated solutions are given a rank of 1; candidate solutions dominated only by rank-1 solutions are given a rank of 2; candidate solutions dominated only by solutions of rank 1 and 2 are given a rank of 3, and so on. After ranking, the new parent population for the next generation is drawn from the best-ranked solutions. The algorithm evolves the population from generation to generation by applying crossover, mutation, and selection to the candidate solutions.
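The ranking step can be illustrated with a small Python sketch. It is a plain front-peeling procedure written for clarity, not the bookkeeping-optimized sorting of NSGA-II [24], and it assumes that every objective is to be minimized (an objective to be maximized can be negated first). The function names and the sample points are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def rank_by_dominance(points):
    """Rank 1 for non-dominated points, rank 2 for the next front, and so on."""
    ranks = [0] * len(points)
    remaining = set(range(len(points)))
    rank = 1
    while remaining:
        # A point belongs to the current front if no remaining point dominates it.
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


# Hypothetical two-objective values for five candidate solutions.
points = [(1, 5), (2, 2), (5, 1), (3, 4), (4, 4)]
print(rank_by_dominance(points))  # [1, 1, 1, 2, 3]
```

Here the first three points are mutually non-dominated, (3, 4) is dominated only by a rank-1 point, and (4, 4) is additionally dominated by (3, 4), so it falls into the third front.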

3.1 Representation of Candidate Solution

For search-based optimization algorithms in the software engineering domain, the first and most critical issue is the representation of candidate solutions [9]. In the software remodularization problem, there is a need to identify each possible valid way of remodularizing a software system. The representation must be chosen carefully so that, ideally, there is exactly one candidate representation per modularization.

The approach adopted in this paper is inspired by the work presented in [25]. In our search-based software remodularization, each modularization solution is represented as a vector of integers (c = [c1, c2, …, cn]), where n is the number of classes in the software dependency graph. In this representation, each ci is an index between 1 and the number of topological neighbors of class i. When ci = 1, class i is located in the same module as its first topological neighbor; when ci = 2, it is located in the same module as its second topological neighbor; and so on. Figure 3 illustrates the proposed neighbor-based encoding scheme.

Fig. 3. Illustration of candidate solution representation to an 8 class software remodularization problem.
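One way to decode such a vector into modules is sketched below in Python. The sketch assumes that modules are the connected components of the links "class i to its chosen neighbor", which is our reading of the encoding rather than a procedure spelled out in the text. The function `decode` and the 5-class neighbor lists are hypothetical and do not reproduce the 8-class example of Fig. 3.

```python
def decode(genes, neighbors):
    """Decode a neighbor-based vector into a sorted list of modules.

    genes[i - 1] = k means class i shares a module with its k-th
    neighbor (1-based). Modules are taken to be the connected
    components of these class-to-neighbor links (an assumption).
    """
    n = len(genes)
    parent = list(range(n + 1))  # union-find over class ids 1..n; slot 0 unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cls, k in enumerate(genes, start=1):
        chosen = neighbors[cls][k - 1]
        parent[find(cls)] = find(chosen)

    modules = {}
    for cls in range(1, n + 1):
        modules.setdefault(find(cls), []).append(cls)
    return sorted(modules.values())


# Hypothetical neighbor lists for a 5-class dependency graph.
neighbors = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}

print(decode([1, 1, 2, 2, 1], neighbors))  # [[1, 2, 3], [4, 5]]
print(decode([2, 2, 3, 1, 1], neighbors))  # [[1, 2, 3, 4, 5]]
```

Because each gene can only point to an actual neighbor in the dependency graph, every decoded module is connected in that graph, and no arbitrary module labels appear in the vector.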

Conceptually, representing the candidate solutions by means of topological neighbors, as discussed above, is usually more computationally efficient than using the GNE representation scheme described in Sect. 1. The main reason for the improved efficiency is that the search space formed by the proposed approach contains fewer redundant solutions than that of the GNE representation. The new representation also provides a good interface for human-computer interaction. Software developers (humans) prepare this representation by comprehending the semantics of the structure, and at the same time the representation helps to feed error-free, efficiently designed and less redundant data to the computing framework. Through this efficient HCI, relevant, correct and desired information is transferred to the optimization algorithm, which then produces near-optimal solutions of better quality in less time.

3.2 Population Initialization

The creation of the initial population has a major impact on the efficiency of search-based evolutionary algorithms, and a good initialization can improve it. If the initial population is generated in such a way that its solutions are well placed to reach an effective optimal solution, the evolutionary algorithm converges quickly [25]. The initial population for the proposed approach is generated using a combination of random initialization and modules generated by the K-means algorithm [26].

3.3 Crossover and Mutation

In our search-based software remodularization, parents for the crossover operation are selected using the standard tournament selection method. The crossover process has the following four steps: (1) two parents are selected using the tournament algorithm; (2) a single random point is selected at which both parents are split; (3) a new candidate solution is generated from the vector head of the first parent and the vector tail of the second parent; (4) an individual solution in the population is replaced with the new solution. In the mutation operation, a randomly chosen element of the vector is reset to an integer in its feasible set. Figure 4 illustrates the single-point crossover and mutation operators.

Fig. 4. Illustration of single-point crossover and single-point mutation operators
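The two operators can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the jMetal implementation used in the experiments: the feasible set of gene i is taken to be 1..neighbor_counts[i], following Sect. 3.1, and the function names, neighbor counts and parent vectors are hypothetical. Tournament selection and population replacement are left out.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Child takes the head of parent_a and the tail of parent_b."""
    # The split point lies strictly inside the vector, so both parents contribute.
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], point


def mutate(genes, neighbor_counts, rng):
    """Reset one randomly chosen gene to a value in 1..neighbor_counts[i]."""
    mutated = list(genes)
    i = rng.randrange(len(mutated))
    mutated[i] = rng.randint(1, neighbor_counts[i])
    return mutated


rng = random.Random(0)             # fixed seed for repeatable runs
neighbor_counts = [2, 2, 3, 2, 1]  # hypothetical number of neighbors per class
parent_a = [1, 1, 2, 2, 1]
parent_b = [2, 2, 3, 1, 1]

child, point = single_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child, neighbor_counts, rng)
print(point, child, mutant)
```

Since both parents hold, at every position, an index that is valid for that class, any head-tail combination is again a valid vector, so no repair step is needed after crossover in this sketch.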

3.4 Objective Functions

We use five objective functions in our search-based multi-objective remodularization: (1) coupling (to minimize), which corresponds to the number of inter-edges (edges between classes in different modules); (2) cohesion (to maximize), which corresponds to the number of intra-edges (edges between classes in the same module); (3) Intra-modular Coupling Density (ICD) [18] (to maximize); (4) number of classes per package (to minimize); (5) number of packages in the system (to minimize).
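Four of these objectives follow directly from the definitions above and can be sketched in Python as follows. ICD is left out because its formula is defined in [18] and not restated here. Reading "number of classes per package" as the average over packages is an assumption, and the function name, edge list and module assignment are hypothetical.

```python
def objectives(edges, module_of):
    """Return (coupling, cohesion, n_packages, avg_classes_per_package)."""
    # Coupling: edges whose endpoints lie in different modules (inter-edges).
    coupling = sum(1 for a, b in edges if module_of[a] != module_of[b])
    # Cohesion: the remaining edges lie inside a module (intra-edges).
    cohesion = len(edges) - coupling
    n_packages = len(set(module_of.values()))
    # Assumption: "classes per package" is taken as the average over packages.
    avg_classes = len(module_of) / n_packages
    return coupling, cohesion, n_packages, avg_classes


# Hypothetical dependencies between five classes and a two-module assignment.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
module_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}

print(objectives(edges, module_of))  # (1, 4, 2, 2.5)
```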

4 Experiments, Results and Analysis

This section presents the results obtained by experimenting with the proposed and the existing representation techniques, each incorporated into NSGA-II, a multi-objective evolutionary algorithm [24].

4.1 Collecting Results from the Experiments

Since the NSGA-II algorithm is a stochastic optimizer, it can generate different results for the same problem instance from one run to another. For this reason, we collect the results for analysis by performing 30 independent simulation runs for each problem instance. As the algorithm is a multi-objective evolutionary optimizer, each run produces a set of trade-off solutions instead of a single solution. However, the purpose of this paper is to demonstrate the usefulness of the proposed representation technique for the software remodularization problem. We therefore select, from the set of trade-off solutions of each run, the single solution with the highest ICD value.

4.2 Studied Software Systems

We performed an empirical study of the proposed representation scheme over six different object-oriented software systems of different sizes and characteristics. The main characteristics of the systems examined are given in Table 1. For each system, the version number, number of classes, and number of are mentioned in respective rows. All these software systems are written in the Java programming language and are open-source or free-software projects.

Table 1. Characteristics of software systems

For all software systems, we first analyzed the different types of modules and libraries and then removed omnipresent classes, such as common libraries, utilities and other domain-primitive modules, because they do not contribute to any main service.

4.3 Results and Analysis

In this section, we present the software remodularization results obtained by the NSGA-II algorithm with the proposed representation technique. The results are evaluated and compared with those of the existing representation in terms of Intra-modular Coupling Density (ICD), Number of Classes per Package (NCP) and execution time. The results are statistically analyzed with a two-tailed t-test at the 95 % confidence level (α = 5 %).

ICD as an assessment criterion: First, we present the results of the experiments that compare the ICD values obtained with the proposed approach and the existing approach. Table 2 presents the results for the proposed and the GNE representation schemes. The results indicate that the proposed representation technique performs significantly better than the GNE representation for all software systems. For example, for the Jstl system, the ICD value of the proposed representation (i.e., 0.8621) is significantly larger than the ICD value of the GNE representation. Hence, the remodularization solutions obtained with the proposed technique have better coupling and cohesion than those obtained with the existing GNE representation.

Table 2. Results obtained through proposed and GNE representation

NCP as an assessment criterion: The results for the NCP metric are given in Table 2. As with the ICD metric, the proposed approach also performs better in terms of the NCP metric. For example, the NCP value of the proposed approach, averaged over all software systems, is 4.5 (i.e., 4.5 classes per package), while it is 1.5 for GNE, which is very small. Hence, the proposed approach produces remodularization solutions with a better distribution of classes among the packages than the GNE representation.

Execution time as an assessment criterion: Execution time has a significant influence on the usability of a software remodularization algorithm, especially if the software developers follow an iterative and incremental approach to remodularizing the software system. For example, if the developer uses a slow algorithm to obtain a baseline remodularization solution, modifying the system accordingly may be a time-consuming process due to frequent remodularization. Therefore, we evaluated the execution time of the proposed representation technique for software remodularization and compared it with that of the existing GNE. We ran the algorithms on 32-bit Microsoft Windows 7 with an Intel Pentium 4 processor at 2.4 GHz and 2 GB of RAM; the algorithms are implemented in the jMetal 4.5 framework. Table 2 shows the running time of the remodularization techniques for all six software systems. The experimental results show that the proposed representation technique for search-based software remodularization is more time-efficient than the GNE representation.

ICD values vs. number of generations: The next experiment examines the growth trend of the ICD values with respect to the number of generations for each variant of search-based multi-objective software remodularization (i.e., the proposed and the GNE representation). Figure 5 shows the ICD growth with respect to the number of generations for the six problem instances. The vertical axis in these plots shows the ICD value of the modularized system, and the horizontal axis shows the number of generations. The minimum number of generations is set to 10*N and the maximum to 200*N, where N is the total number of classes in the system.

Fig. 5. ICD value vs. number of generation

The results shown in Fig. 5 indicate that the growth trend of the ICD value is better for the proposed representation scheme than for the GNE representation scheme on all software systems. The proposed representation scheme reaches a steady state within a small number of generations, whereas the GNE representation scheme reaches a steady state only after a large number of generations. For example, after generation 100*N the proposed representation keeps a steady ICD value in all problem instances, while the GNE representation scheme reaches a steady state only at generation 160*N or later.

5 Conclusion and Future Work

This paper presents a search-based multi-objective optimization approach to the software remodularization problem. The approach uses a new and efficient encoding scheme for candidate solutions, which supports a semantics-based, efficiently designed, less redundant and error-free human-computer interaction. The approach regroups the classes of the software system by optimizing five objective functions (i.e., coupling, cohesion, Intra-modular Coupling Density (ICD), number of classes per package, and number of packages in the system). The new neighbor-based candidate solution representation reduces the number of redundant remodularization solutions in the search space. The approach is evaluated over six real-world software systems and the results are compared with the existing integer-based representation approach. Experiments show that the proposed approach performs better than the existing integer-based representation (i.e., GNE representation) in terms of Intra-modular Coupling Density (ICD), number of classes per package, convergence speed and execution time.