1 Introduction

Virtualization [1] is a rapidly evolving technology that enables flexible allocation of resources in cloud data centers [2]. Virtual machines (VMs) are created according to the amount of required resources and then run on a physical machine (PM) to host applications that meet the requirements of customers [3]. However, the application load changes constantly in the cloud computing environment, which is likely to cause service level agreement (SLA) violations and degrade quality of service (QoS). Therefore, some VMs on an overloaded PM need to be migrated to ensure the stable operation of the cloud data center.

In recent years, the VM migration problem has received much attention. Many individual-based VM migration studies [4] have been presented to achieve optimal migration. Shrivastava [5] took the single VM as the migration object and remapped individual VMs to PMs according to the communication cost. The authors in [6, 7] proposed a multi-objective VM migration algorithm to optimize traffic between VMs while minimizing the frequency of migration, but they ignored the overhead of the migration itself. More importantly, these individual-based migration strategies result in higher communication cost because they disregard the association between VMs.

Some studies take into account the association between VMs, such as [8], which takes the entire associated VM group as the migration object. However, migrating the whole group is likely to cause ineffective migration and increase the network burden. Sun [9] focused on the efficient online live migration of multiple correlated VMs to optimize system performance. However, the VM groups to be migrated were not obtained according to the resource states of the data center, but were given as known conditions. When VM migration is performed, the appropriate VM group should be selected as the migration object to ensure low communication cost and migration time.

An excellent migration strategy should also provide users with better service. Although the VM that is being migrated does not suspend execution during live migration, its execution may slow down somewhat due to the migration. Many studies [10, 11] do not take into account the operating state of the VM during the migration, so the VM to be migrated may be handling high-intensity tasks, which not only results in a higher dirty page rate but also greatly affects the response time of the PM. We use the VM heat to reflect the operating state of the VM and take it into consideration to guarantee better service to users.

In this paper, a VM migration algorithm based on group selection (VMMAGS) is proposed. The association between VMs and the resource utilization of the VM are taken into account. According to the resource status of the overloaded PM and the degree of connectivity (DoC) of the remaining VMs, the algorithm selects the appropriate VM group as the migration object. The optimal migration scheme is obtained based on the integration cost of the partitions of selected VM groups.

The remainder of the paper is organized as follows. In Sect. 2, we investigate the problem of the migration of VMs and present the definition of objective functions. Section 3 describes the proposed algorithm. An empirical evaluation is presented in Sect. 4, and Sect. 5 concludes the paper.

2 Problem and Objectives Description

2.1 Problem Description

When the resources of a PM are tight, some VMs on the overloaded PM must be migrated so that both the remaining VMs and the migrated VMs can work properly. Different migration strategies produce different migration results, and the results directly affect the performance of the data center. Figure 1 shows two different migration solutions. PM 1 is overloaded, and some VMs on it need to be migrated. In Fig. 1(a), the optimal migration scheme is calculated for each single VM. First, VM 3 is migrated to PM 2, which is closer to PM 1. Next, VM 4 is selected for migration. Since PM 2 does not have enough resources to place VM 4, PM 3 is selected as its target PM. As can be seen from Fig. 1(a), this migration solution is likely to result in a higher communication cost between VM 3 and VM 4. In Fig. 1(b), VM 3 and VM 4 are both migrated to PM 3, which guarantees a lower communication cost. Based on the above analysis, we should take the VM group, rather than the individual VM, as the migration object. The work of this paper is to select the best VM group to migrate from an overloaded PM and to find the target PM for each VM in the group, so as to reduce the migration cost, communication cost and VM heat.

Fig. 1. Two different migration solutions.

2.2 The VM Model

In order to cooperate with each other in handling tasks, there may be frequent communication between VMs. Therefore, we model the associated VMs as an undirected graph G(V, E), in which vertices represent VMs and the edge value represents traffic between VM pairs. The attributes of VM i include its source PM, the requirements for CPU and RAM, denoted by \( \left( {PM_{i,src} ,vc_{i} ,vm_{i} } \right) \). Without loss of generality, it is assumed that VMs on the same PM are connected and that VMs on different PMs may also be associated. So the VM model is shown in Fig. 2.

Fig. 2. The VM model.

2.3 Objective Functions Definition

The focus of the paper is to reduce the migration cost, communication cost and the VM heat during the migration process, so we first quantify these objective functions.

Migration Cost.

We use the pre-copy strategy to migrate a VM. In the process of live migration, the dirty pages are transferred from the source PM to the target PM through continuous iterations. The longer the migration takes, the more resources it occupies and the greater its impact on network link communication. Therefore, we use the total migration time to reflect the migration cost.

Assume the pre-copy algorithm proceeds in \( n_{i} + 1 \) rounds. The amount of data transmitted and the transmission time for \( VM_{i} \) in round k are \( v_{i,k} \) and \( T_{i,k} \) (\( 0 \le k \le n_{i} \)), respectively. The entire memory of the VM needs to be transferred to the target PM in round 0, so \( v_{i,0} = vm_{i} \). For \( k \ne 0 \), \( v_{i,k} \) is determined by the dirty pages generated during the previous round, so \( v_{i,k} = r_{i} T_{i,k - 1} \) (\( 1 \le k \le n_{i} \)), where \( r_{i} \) is the page dirty rate. With \( B_{i} \) denoting the bandwidth used for the migration, \( T_{i,k} = v_{i,k} /B_{i} = (vm_{i} /B_{i} ) \cdot (r_{i} /B_{i} )^{k} \). Therefore, the total time \( T_{i} \) is calculated as:

$$ T_{i} = \sum\limits_{k = 0}^{{n_{i} }} {T_{i,k} } = \frac{{vm_{i} }}{{B_{i} }} \cdot \frac{{1 - \left( {\tfrac{{r_{i} }}{{B_{i} }}} \right)^{{n_{i} + 1}} }}{{1 - \frac{{r_{i} }}{{B_{i} }}}}. $$
(1)

\( n_{i} \) is calculated according to the memory threshold \( vm_{th} \). The iteration is stopped when the threshold is reached. So we can obtain:

$$ n_{i} = \left\lceil {\log_{{\frac{{r_{i} }}{{B_{i} }}}} \left( {\frac{{vm_{th} }}{{vm_{i} }}} \right)} \right\rceil . $$
(2)

The downtime of \( VM_{i} \) in the migration process is represented as \( T_{down\_i} = T_{d\_i} + T_{resume\_i} \). \( T_{d\_i} \) is the time of transferring the remaining dirty pages, and \( T_{resume\_i} \) denotes the time spent on resuming the VM on the target PM. Therefore, the migration cost of \( VM_{i} \) is calculated as:

$$ Cost\_mig(VM_{i} ) = \frac{{vm_{i} }}{{B_{i} }} \cdot \frac{{1 - \left( {\tfrac{{r_{i} }}{{B_{i} }}} \right)^{{n_{i} + 1}} }}{{1 - \frac{{r_{i} }}{{B_{i} }}}} + T_{down\_i} . $$
(3)
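As a rough illustration of Eqs. (1) to (3), the following Python sketch computes the number of pre-copy rounds and the resulting migration cost. The function and parameter names are hypothetical, the units (MB, MBps, seconds) are an assumption, and the downtime \( T_{down\_i} \) is passed in as a single value rather than modelled.

```python
import math


def precopy_rounds(vm_mb, vm_th_mb, dirty_rate, bandwidth):
    """n_i from Eq. (2): ceil(log_{r/B}(vm_th / vm))."""
    ratio = dirty_rate / bandwidth
    if not 0 < ratio < 1:
        # The geometric series in Eq. (1) only shrinks when r < B.
        raise ValueError("pre-copy converges only when 0 < r/B < 1")
    if vm_mb <= vm_th_mb:
        return 0  # assumption: memory already below the threshold
    return math.ceil(math.log(vm_th_mb / vm_mb) / math.log(ratio))


def migration_cost(vm_mb, vm_th_mb, dirty_rate, bandwidth, t_down):
    """Cost_mig from Eq. (3): total pre-copy time plus downtime."""
    n = precopy_rounds(vm_mb, vm_th_mb, dirty_rate, bandwidth)
    ratio = dirty_rate / bandwidth
    t_total = (vm_mb / bandwidth) * (1 - ratio ** (n + 1)) / (1 - ratio)
    return t_total + t_down


# Hypothetical VM: 4000 MB of RAM, 100 MB threshold,
# 100 MBps dirty rate, 1000 MBps link, 0.02 s downtime.
print(precopy_rounds(4000, 100, 100, 1000))
print(migration_cost(4000, 100, 100, 1000, 0.02))
```

Because \( r_{i} /B_{i} < 1 \), each round transfers less data than the previous one, so the total time is dominated by the first full-memory copy.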

Communication Cost.

The communication cost is mainly related to the distance and traffic between the migrated VM and other VMs. The communication cost of \( VM_{i} \) is represented by (4), where D denotes the distance between two PMs and f denotes the traffic between two VMs.

$$ Cost\_com\left( {VM_{i} } \right) = \sum\limits_{j \ne i} {D\left( {PM_{i,target} ,PM_{j,src} } \right) \cdot f\left( {VM_{i} ,VM_{j} } \right)} . $$
(4)
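A minimal sketch of Eq. (4) in Python follows. The data layout is an assumption made for illustration: traffic is a dictionary keyed by VM pairs (each pair listed once), `location` maps each VM to its current PM, and `distance` is any callable returning the distance between two PMs.

```python
def communication_cost(vm, target_pm, traffic, location, distance):
    """Eq. (4): sum over j != i of D(PM_target, PM_j) * f(VM_i, VM_j)."""
    total = 0.0
    for (a, b), rate in traffic.items():
        if vm not in (a, b):
            continue  # this edge does not touch the migrated VM
        other = b if a == vm else a
        total += distance(target_pm, location[other]) * rate
    return total


# Hypothetical placement and traffic, with a toy distance function.
location = {"vm3": "pm1", "vm4": "pm1", "vm5": "pm2"}
traffic = {("vm3", "vm4"): 10, ("vm3", "vm5"): 5, ("vm4", "vm5"): 2}
toy_distance = lambda p, q: 0 if p == q else 2

print(communication_cost("vm3", "pm2", traffic, location, toy_distance))
```

In this toy setting, moving vm3 to pm2 removes the cost toward vm5 but introduces a cost toward vm4, which is the trade-off that motivates migrating associated VMs together.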

VM Heat.

The VM heat represents the intensity with which the VM is handling tasks. We use the resource utilization of the VM to reflect its heat. The resource utilization of the VM varies with the dynamic application load. When it is necessary to migrate VMs from an overloaded PM, some VMs may be dealing with high-intensity tasks, and their resource utilization is likely to be high. The calculation of the VM heat depends not only on the resource utilization at the migration moment, but also on historical data. Higher historical utilization indicates that the VM generally handles many tasks and that its tasks are likely to remain intense in the future. Therefore, we use (5) and (6) to calculate the heat of a single VM and (7) to calculate the VM heat of the VM group.

$$ H\left( {VM_{i} } \right) = \left( {H\left( {CPU} \right)_{i} + H\left( {RAM} \right)_{i} } \right)/2. $$
(5)
$$ H\left( {CPU} \right)_{i} = \lambda \cdot AVG\left( {\sum\limits_{j \in T} {U\_CPU_{j} } } \right) + \left( {1 - \lambda } \right) \cdot U\_CPU_{t} ,H\left( {RAM} \right)_{i} = \lambda \cdot AVG\left( {\sum\limits_{j \in T} {U\_RAM_{j} } } \right) + \left( {1 - \lambda } \right) \cdot U\_RAM_{t} . $$
(6)
$$ H(VMgroup) = AVG\left( {\sum\limits_{{VM_{i} \in VMgroup}} {H\left( {VM_{i} } \right)} } \right). $$
(7)

In (6), \( U\_CPU_{j} \) and \( U\_RAM_{j} \) represent the CPU utilization and RAM utilization at time j, t represents the migration moment, and T is the total duration of historical data. We use the average of the utilization within T before t as the VM's historical resource utilization. Historical data is obtained by sampling once every \( \Delta t \) during T. The data at t should be given greater weight so that the VM heat is calculated more accurately, so we set \( \lambda \) = 0.3. In the experiments, T is set to one hour and \( \Delta t \) is set to 30 s.
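The heat computation in Eqs. (5) to (7) can be sketched in Python as follows. The function names are hypothetical, utilizations are assumed to be fractions in [0, 1], and the history is assumed to be a plain list of samples.

```python
from statistics import mean


def resource_heat(history, current, lam=0.3):
    """Eq. (6): weight lam on the historical average, 1 - lam on time t."""
    return lam * mean(history) + (1 - lam) * current


def vm_heat(cpu_history, cpu_now, ram_history, ram_now, lam=0.3):
    """Eq. (5): average of the CPU heat and the RAM heat."""
    cpu = resource_heat(cpu_history, cpu_now, lam)
    ram = resource_heat(ram_history, ram_now, lam)
    return (cpu + ram) / 2


def group_heat(vm_heats):
    """Eq. (7): average heat over the VMs in the group."""
    return mean(vm_heats)


# Hypothetical VM with two historical samples per resource.
h = vm_heat([0.2, 0.4], 0.8, [0.5, 0.5], 0.5)
print(h, group_heat([h, 0.3]))
```

With T set to one hour and \( \Delta t \) set to 30 s, each history list would hold 120 samples.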

3 Algorithm

3.1 VM Group Selection

In order to avoid frequent migration, it is necessary to set a resource safe range (SR). When the resources occupied on a PM after the migration fall within the SR, the migration on this PM ends. First, we select all appropriate VM groups as migration options. The selected VM groups should cover all possible scenarios so that the best solution is not lost, but each group should not be too large. Therefore, we traverse the VM association graph on the overloaded PM and select all VM groups for which the occupied resources of the PM after the migration are within the SR and the DoC of the remaining VMs reaches a certain value. A selected VM group may contain a single VM or multiple VMs. Lines 2 to 9 of Algorithm 1 show the selection of VM groups.

Algorithm 1. The VM migration algorithm based on group selection (VMMAGS).

A binary string Binary_set with the same length as the number of VMs on \( PM_{k} \) reflects the selected state of each VM: 1 indicates that the VM is selected, and 0 that it is not. checkAvailable() is used to determine whether the occupied resources of the PM after the migration are within the SR. \( \theta (0 \le \theta \le 1) \) represents the value of the DoC that needs to be reached. checkConnectNum() is used to check whether the number of connected VMs has reached \( \theta \) times the total number of remaining VMs. If both conditions are satisfied, the VM group consisting of the selected VMs is used for the next step. If no VM group meets the conditions, each VM on \( PM_{k} \) is treated as a selected VM group on its own. The optional VM groups on PM 1 in the VM model are circled in Fig. 2. The total resource of PM 1 is (16-core, 16000M). The SR is [0.5, 0.6], and \( \theta \) is 0.5.
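The selection step can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the SR check is applied to both CPU and RAM utilization, and the DoC check is interpreted as requiring the largest connected component of the remaining VMs to contain at least \( \theta \) times the number of remaining VMs. All function names and the data layout are hypothetical.

```python
from itertools import product


def largest_component(nodes, edges):
    """Size of the largest connected component among `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in edges.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def select_groups(vms, edges, capacity, sr, theta):
    """vms: {name: (cpu, ram)}; edges: adjacency dict; capacity: (cpu, ram)."""
    low, high = sr
    names = sorted(vms)
    groups = []
    # Each tuple of bits plays the role of Binary_set (1 = selected).
    for bits in product((0, 1), repeat=len(names)):
        if not any(bits):
            continue  # nothing selected, nothing to migrate
        remaining = [n for n, b in zip(names, bits) if not b]
        used = [sum(vms[n][d] for n in remaining) for d in (0, 1)]
        in_sr = all(low <= u / c <= high for u, c in zip(used, capacity))
        doc_ok = largest_component(remaining, edges) >= theta * len(remaining)
        if in_sr and doc_ok:
            groups.append({n for n, b in zip(names, bits) if b})
    # Fallback from the text: every VM becomes its own group.
    return groups or [{n} for n in names]


vms = {n: (4, 4000) for n in "abcd"}
edges = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
print(select_groups(vms, edges, (16, 16000), (0.5, 0.6), 1.0))
```

The enumeration is exponential in the number of VMs on the PM, which is acceptable only because a single PM hosts a small number of VMs.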

3.2 Objective Functions Integration

It is difficult to find the best migration scheme to meet these three goals. But if we integrate the three goals, the difficulty will be significantly reduced.

We define the weighted sum of Cost_mig and Cost_com as the total cost. A simple weighted sum is dominated by the term with the larger magnitude, so Cost_mig and Cost_com need to be normalized to eliminate the difference in magnitude. For \( VM_{i} \) on \( PM_{k} \), we use min-max normalization.

$$ Cost\_mig\_norm\left( {VM_{i} } \right) = \frac{{Cost\_mig\left( {VM_{i} } \right) - \hbox{min} \left( {Cost\_mig} \right)}}{{\hbox{max} (Cost\_mig) - \hbox{min} \left( {Cost\_mig} \right)}},\hbox{max} \left( {Cost\_mig} \right) = \frac{{\hbox{max} \left( {vm} \right)}}{\hbox{min} \left( B \right)} \cdot \frac{{1 - \left( {\tfrac{\hbox{max} (r)}{\hbox{min} (B)}} \right)^{n + 1} }}{{1 - \frac{\hbox{max} (r)}{\hbox{min} (B)}}}. $$
(8)
$$ Cost\_com\_norm\left( {VM_{i} } \right) = \frac{{Cost\_com\left( {VM_{i} } \right) - \hbox{min} \left( {Cost\_com} \right)}}{{\hbox{max} \left( {Cost\_com} \right) - \hbox{min} \left( {Cost\_com} \right)}},\hbox{max} \left( {Cost\_com} \right) = \hbox{max} \left( {degree} \right)\hbox{max} \left( D \right) \cdot \hbox{max} \left( f \right). $$
(9)

Cost_mig is normalized by (8) to obtain Cost_mig_norm. max(vm), min(B) and max(r) represent the maximum RAM of the VMs on \( PM_{k} \), the minimum bandwidth of the data center and the maximum dirty page rate, respectively. Cost_com is normalized in the same way, using (9) to obtain Cost_com_norm. max(D) represents the maximum distance between PMs, \( \hbox{max} (f) \) represents the maximum traffic between VMs on \( PM_{k} \), and max(degree) represents the maximum degree of VMs on \( PM_{k} \). The minimum values are calculated in the opposite way to the maximum values.

Cost_norm is calculated using (10), where \( \alpha + \beta = 1 \), and we will determine their values through experiments.

$$ Cost\_norm\left( {VM_{i} } \right) = \alpha \cdot Cost\_mig\_norm\left( {VM_{i} } \right) + \beta \cdot Cost\_com\_norm\left( {VM_{i} } \right). $$
(10)
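The normalization and weighting in Eqs. (8) to (10) can be illustrated with a short Python sketch. The function names are hypothetical and the bounds are passed in directly instead of being derived from max(vm), min(B) and the other quantities above; the default weight 0.3 is the value chosen in Sect. 4.2.

```python
def min_max(value, lo, hi):
    """Min-max normalization used in Eqs. (8) and (9)."""
    if hi == lo:
        return 0.0  # assumption: a degenerate range maps to zero
    return (value - lo) / (hi - lo)


def cost_norm(cost_mig, cost_com, mig_bounds, com_bounds, alpha=0.3):
    """Eq. (10) with beta = 1 - alpha."""
    beta = 1 - alpha
    return (alpha * min_max(cost_mig, *mig_bounds)
            + beta * min_max(cost_com, *com_bounds))


# Hypothetical costs and (min, max) bounds.
print(cost_norm(6.0, 50.0, (2.0, 10.0), (0.0, 100.0)))
```

After normalization both terms lie in [0, 1], so \( \alpha \) and \( \beta \) express relative importance rather than compensating for different units.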

Next, we integrate Cost_norm and the VM heat. Different migration schemes result in different Cost_norm values. The costs of many schemes may differ only slightly, while the heat of the VM groups in these schemes may be quite different. It is unreasonable to sacrifice the service of VMs in exchange for a small difference in Cost_norm. Lines 10 to 14 of Algorithm 1 show the specific steps to get the best migration scheme. \( \sigma \) represents the standard deviation of Cost_norm over all VM groups, and the integrated cost Cost_integrated is calculated as:

$$ Cost\_integrated\left( {VMgroup_{j} } \right) = \gamma \cdot Cost\_norm\left( {VMgroup_{j} } \right) + \left( {1 - \gamma } \right) \cdot H\left( {VMgroup_{j} } \right). $$
(11)

\( \gamma \) controls the weight of Cost_norm, \( \gamma \, \in \,[0,\,1] \). The migration scheme with the minimum value of Cost_integrated is the best solution.
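A small Python sketch of this integration step follows. It rests on an assumption that is not spelled out in the text: groups are first shortlisted to those whose Cost_norm lies in [min(Cost_norm), min(Cost_norm) + \( \sigma \)], with \( \sigma \) taken as the population standard deviation, and Eq. (11) is then minimized over the shortlist. The names are hypothetical.

```python
from statistics import pstdev


def cost_integrated(cost_norm, heat, gamma=0.5):
    """Eq. (11): gamma * Cost_norm + (1 - gamma) * H."""
    return gamma * cost_norm + (1 - gamma) * heat


def best_group(groups, gamma=0.5):
    """groups: {name: (cost_norm, heat)}; returns the chosen group name."""
    costs = [c for c, _ in groups.values()]
    # Assumed shortlist: Cost_norm within one sigma of the minimum.
    cutoff = min(costs) + pstdev(costs)
    shortlist = {g: v for g, v in groups.items() if v[0] <= cutoff}
    return min(shortlist, key=lambda g: cost_integrated(*shortlist[g], gamma))


# Hypothetical groups: g1 is cheapest but hot, g2 is nearly as cheap and cool.
groups = {"g1": (0.20, 0.9), "g2": (0.22, 0.3), "g3": (0.80, 0.1)}
print(best_group(groups))
```

The example reflects the motivation given above: g2 costs only slightly more than g1 but has much lower heat, so it is preferred.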

3.3 VM Migration Algorithm

In this section, we use the greedy strategy to determine the optimal migration scheme based on selected VM groups.

For the selected VM group, we can't guarantee that the cost of migrating them to the same target PM is less than the cost of individual migration. Moreover, a VM group has multiple partitions. All partitions of \( VMgroup_{2} \left( {VM_{3} ,\,VM_{4} ,\,VM_{5} } \right) \) on PM 1 in Fig. 2 are as follows: \( partition_{1} = \left\{ {\left\{ {VM_{3} } \right\},\left\{ {VM_{4} } \right\},\left\{ {VM_{5} } \right\}} \right\} \), \( partition_{2} = \left\{ {\left\{ {VM_{3} ,VM_{4} } \right\},\left\{ {VM_{5} } \right\}} \right\} \), \( partition_{3} = \left\{ {\left\{ {VM_{3} ,VM_{5} } \right\},\left\{ {VM_{4} } \right\}} \right\} \), \( partition_{4} = \left\{ {\left\{ {VM_{4} ,VM_{5} } \right\},\left\{ {VM_{3} } \right\}} \right\} \) and \( partition_{5} = \left\{ {\left\{ {VM_{3} ,VM_{4} ,VM_{5} } \right\}} \right\} \).

Therefore, we should evaluate all partitions of the VM group to get the best solution. Each partition consists of one or more VM collections. In order to guarantee a lower communication cost, the VMs in the same collection are required to be connected, and they are migrated to the same PM. This means that \( partition_{4} \) does not meet the condition. Algorithm 2 gives the specific steps to calculate the Cost_norm value and the migration scheme of \( VMgroup_{i} \). availableResource() is used to determine whether the resource exceeds the upper limit of the SR after the PM adds the migrated VM. checkConnected() is used to determine whether the VMs in the collections are connected. It should be noted that a target PM can host a collection only if it meets the resource requirements of all VMs in the collection. The minimum Cost_norm over all partitions is taken as the Cost_norm value of this VM group.
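The enumeration of partitions and the connectivity filter can be sketched in Python as follows. The function names are hypothetical, and the edges in the example are an assumption chosen to be consistent with the statement that only \( partition_{4} \) fails: VM 3 is linked to both VM 4 and VM 5, while VM 4 and VM 5 are not directly linked.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


def is_connected(block, edges):
    """True if the VMs in `block` form a connected subgraph."""
    block = set(block)
    start = next(iter(block))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v in block and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == block


def valid_partitions(group, edges):
    """Partitions whose every collection is connected."""
    return [p for p in set_partitions(group)
            if all(is_connected(b, edges) for b in p)]


edges = {"VM3": ["VM4", "VM5"], "VM4": ["VM3"], "VM5": ["VM3"]}
group = ["VM3", "VM4", "VM5"]
print(len(list(set_partitions(group))), len(valid_partitions(group, edges)))
```

The number of partitions grows as the Bell numbers (5 for three VMs, 15 for four, 52 for five), which is one reason the selected groups should stay small.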

Algorithm 2. Calculation of the Cost_norm value and the migration scheme of a VM group (CCMS).

The complete VM migration algorithm based on group selection (VMMAGS) is shown in Algorithm 1. For each \( VMgroup_{i} \) that satisfies the selection conditions, CCMS is used to calculate its \( \langle Cost\_norm\_VMgroup_{i} ,\,VM_{mig} ,\,PM_{dist} \rangle \). After obtaining Cost_norm of all groups, calculate their Cost_integrated according to the integration method mentioned in Sect. 3.2. Finally, we get the minimum value of Cost_integrated and the best VM migration scheme.

4 Experiments and Results

4.1 Experimental Setup

We use CloudSim [12] to carry out experimental tests in this section to verify the performance of VMMAGS. The performance of VMMAGS is evaluated by comparing it with the algorithms AppAware [5] and TAVMS [8] in terms of migration cost, communication cost and response time. AppAware takes the single VM as the migration object and uses a greedy strategy to find the migration scheme with the minimum communication cost. TAVMS solves the problem of migrating multiple VMs and migrates the VM group as a whole. However, their objectives are different from ours. For a fair comparison, we modify these two algorithms by replacing their objectives with Cost_norm defined in this paper.

In the Fat-tree topology [13], the parameter k defines the data center size. We use three common structures in real cloud data centers for experiments. Structure1: k = 12, there are 432 PMs and 156 switches; Structure2: k = 14, there are 686 PMs and 210 switches; Structure3: k = 16, there are 1024 PMs and 272 switches. The link capacities in the Fat-tree range from 1 GBps to 10 GBps. The distance between PMs is computed as shown in [14]. In addition, we model four instances of PMs with different capacities in the simulations, as shown in Table 1. Each PM belongs to one of the four instances, with each instance having probability 1/4. Each VM has a CPU requirement of 1, 2, 4 or 8 cores and a memory requirement of 1 to 16 GB, both generated randomly from discrete uniform distributions. We use the FCFS algorithm for VM placement. Each VM runs a web application with variable workload to generate different resource utilization, thus reflecting the different heat of the VM. The traffic between VMs is set according to what is suggested in [15]. If there is flow between VMs, a Gaussian distribution with a mean of 10 MBps and a standard deviation of 1 MBps is used to generate the transmission rate, and the probability is 0.75. In our experiments, the page dirty rate is set to 100 MBps. \( vm_{th} \) is set to 100 MB, which is a reasonable compromise based on other parameters, and \( T_{resume\_i} \) is set to 20 ms.
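The PM counts quoted for the three structures agree with the host count of a standard k-ary Fat-tree, which has k pods with \( (k/2)^{2} \) hosts each, that is \( k^{3} /4 \) hosts in total. The following Python check assumes that standard construction; the function name is hypothetical.

```python
def fat_tree_hosts(k):
    """Number of hosts in a standard k-ary Fat-tree: k^3 / 4."""
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even integer")
    return k ** 3 // 4


for k in (12, 14, 16):
    print(k, fat_tree_hosts(k))
```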

Table 1. Configuration information of PMs.

4.2 Parameters Analysis

VMMAGS involves some parameters, and different parameter settings will directly affect results. So we first experimentally analyze the best value of different parameters.

Two important parameters that affect VM group selection are the SR [low, high] and the DoC of the remaining VMs \( \theta \). Besides, these two parameters directly affect the total migration cost and communication cost of the data center. In order to control the number of VMs that need to be migrated, we set the minimum value of low to 0.5. We compare the total migration cost of the different SRs and the communication cost corresponding to different \( \theta \) values in Structure3 with 2400 VMs to get their best values. The results are shown in Figs. 3 and 4.

Fig. 3. The total migration cost of different SRs in Structure3.

Fig. 4. The total communication cost of different \( \theta \) in Structure3.

It can be seen from Fig. 3 that when high becomes larger, the migration cost increases. As the SR expands, that is, as the gap between high and low becomes larger, the migration cost decreases. This is because a wider SR yields more optional VM groups, which makes it easier to obtain the best migration scheme. When the SR is [0.5, 0.8], the migration cost is minimal, so we set the SR to [0.5, 0.8].

In Fig. 4, when \( \theta \) changes from 0 to 0.3, the communication cost is gradually reduced. This is because when the required DoC is low, the selected VM group is likely not to be the best choice and produces more communication cost than migrating a single VM. When \( \theta \) is in [0.3, 0.5], the corresponding communication cost is minimal and changes little. Then, as \( \theta \) becomes larger, the communication cost increases significantly. Taking into account the stability of the algorithm and the calculation time, we finally set \( \theta \) to 0.4.

The calculation of Cost_norm involves the weight parameter \( \alpha \). A well-chosen weight parameter guarantees the stability of the algorithm, so that its effect on the system performance is minimized. For all overloaded PMs in Structure1 with 800 VMs, we experimentally compared the average fluctuation of Cost_norm under different \( \alpha \) settings. The fluctuation is the difference between the maximum and the minimum values of Cost_norm. As shown in Fig. 5, the performance of the algorithm fluctuates as \( \alpha \) changes. When \( \alpha \) = 1, the fluctuation of Cost_norm reaches the maximum. When \( \alpha \) = 0.3, the performance of the algorithm is stable, and the value of Cost_norm varies within a small range. Therefore, considering the migration performance of the algorithm, \( \alpha \) is set to 0.3 in the following experiments.

Fig. 5. The fluctuation of Cost_norm under different \( \alpha \).

Next, we determine the optimal value of \( \gamma \) in (11) to get Cost_integrated. We choose the overloaded PM that hosts the most VMs in Structure1 to carry out the experiment, denoted by \( PM_{k} \). There are 148 selected VM groups. Figure 6 shows Cost_norm of all groups, Cost_norm in [min(Cost_norm), min(Cost_norm) + \( \sigma \)] and the VM heat. There are four groups with Cost_norm in the range. We have experimentally verified that when the value of \( \gamma \) changes from 0.1 to 0.9, Cost_integrated of group95 in Fig. 6 is always the minimum. Without loss of generality, we set \( \gamma \) to 0.5 in the following experiments.

Fig. 6. Cost_norm of all selected VM groups on \( PM_{k} \), Cost_norm in [min(Cost_norm), min(Cost_norm) + \( \sigma \)] and the VM heat of the group.

4.3 Results Analysis

Total Migration Cost. We compare the total migration cost of our proposed VMMAGS with that of the other two algorithms as the number of VMs varies in the three structures. The results are shown in Fig. 7.

Fig. 7. The total migration cost of all VMs in three structures.

It can be seen from Fig. 7 that VMMAGS and AppAware perform similarly and that TAVMS performs worst. This is because TAVMS migrates the entire VM group, which requires more memory to be transferred. When the number of VMs is small, the migration cost of AppAware is lower. However, the results show a pattern: when the number of VMs in the data center increases beyond a certain point, the migration cost of AppAware exceeds that of VMMAGS and even that of TAVMS. This is because when there is a large number of overloaded PMs in the data center, individual-based migration is prone to ineffective migration, which causes VMs to be migrated more frequently, so its migration cost exceeds that of the group-based migration strategies. In these three structures, the total migration cost of VMMAGS is about 27.4% less than that of TAVMS. Besides, when the number of VMs is large, the total migration cost of VMMAGS is about 18.8% less than that of AppAware. Overall, the performance of VMMAGS is more stable, and it can effectively control the migration cost.

Total Communication Cost.

The communication cost is another important metric to evaluate the performance of VM migration, so we compare the total communication cost of the three algorithms as the number of VMs varies in the three structures. In Fig. 8, we can observe that VMMAGS incurs less communication cost than the other algorithms in all cases. As the number of VMs increases, the total communication cost of VMMAGS increases almost linearly, whereas the cost of AppAware increases significantly. This is because as the number of VMs grows, the individual-based strategy cannot find the optimal solution, so associated VMs are migrated to different PMs and the communication cost increases. When there are enough VMs in the data center, the total communication cost of VMMAGS is about 14.5% less than that of TAVMS and about 36.2% less than that of AppAware.

Fig. 8. The total communication cost of all VMs in three structures.

Response Time.

The VM heat directly affects the system response time, so we observe the changes in the response time of a PM using different algorithms in Structure1. Figure 9 depicts the results. At t = 50 s, the response time surged, indicating that the PM resources were tight, and a migration occurred at this point. As we can see from Fig. 9(a), the PM carried out two migrations, and the response time fluctuated significantly. In Fig. 9(b), the response time was significantly reduced when TAVMS was used for migration, but it fluctuated greatly during the migration. With VMMAGS, the response time was relatively stable and could be maintained within 300 ms. These results show that VMMAGS can effectively guarantee the system service.

Fig. 9. The response time of the PM in Structure1.

5 Conclusions and Future Work

In this paper, we propose a multi-objective VM migration algorithm named VMMAGS, which takes into account the migration cost, communication cost and VM heat to optimize the performance of the data center. According to the SR and the DoC of the remaining VMs, the VM groups that satisfy the conditions are obtained as migration options. The optimal migration scheme is then obtained based on the integration cost of all partitions of the selected groups. We assess the performance of VMMAGS using simulation and compare it with AppAware and TAVMS. Experimental results show that the total migration cost of VMMAGS is about 27.4% less than that of TAVMS, and the total communication cost of VMMAGS is about 36.2% less than that of AppAware. Besides, our algorithm can better control the response time. In future work, we will consider the efficient migration of VMs across data centers.