1 Introduction

With the proliferation of smart handheld devices, user demand for mobile broadband has risen at an unprecedented rate. The drastic growth of bandwidth-hungry, media-rich applications is pushing current cellular systems to their limits. Device-to-Device (D2D) technology allows two user equipments (UEs) in close proximity to communicate with each other in the licensed band with very limited involvement of the evolved NodeB (eNB) [1]. D2D communication is therefore an effective way to improve the spectrum efficiency and capacity of 5G networks, and it can also reduce transmission delay. However, sharing resource blocks (RBs) between D2D user equipments (DUEs) and cellular user equipments (CUEs) in the underlay mode induces interference [2,3,4]. Power control is one of the key techniques for reducing this interference [5]. In recent years, advances in computing capability have made machine learning popular in many applications. In this paper, we propose the use of machine learning to mitigate the interference arising from resource sharing by controlling the transmission power of D2D pairs, and we evaluate the performance of the proposed method.

1.1 Related Works

In [6], the authors proposed a random network model for an underlay D2D cellular system and developed power control algorithms. The centralized power control ensures that CUEs have sufficient coverage probability by limiting the interference created by underlay DUEs while scheduling as many D2D links as possible. In [7], a multi-player Q-learning system is employed to maximize system capacity while maintaining the quality of service (QoS) of the CUEs. However, a limitation of that learning method is that no state is associated with a D2D pair that chooses to share RBs with a particular CUE and transmits at a certain power level.

The Multi-Armed Bandit (MAB) model provides a suitable way to handle such stateless problems [8]. In [9], the authors investigated the potential of the MAB model, a kind of stateless machine learning, to address RB allocation problems in 5G networks; they also provided a detailed example of using the MAB model for energy-efficient small cell activation in 5G networks. In this paper, we aim to allow some D2D pairs with controlled transmission power to reuse the RBs of CUEs so as to achieve higher sum rewards in spectral efficiency and capacity. Therefore, a multi-player extension of the MAB model, the Multi-Player Multi-Armed Bandit (MP-MAB), will be used.

MP-MAB has found many applications [10,11,12]. Power control is one of the less explored areas of MP-MAB. To model the power control problem as a bandit game, a finite discrete set of power levels is treated as the set of arms, and the reward is defined as a function of the SINR, where the interference represents the mutual impact of the players on each other. After the MP-MAB learning process, the eNB allocates as many D2D pairs as possible to reuse the RBs of CUEs based on the outcome of the learning. In this study, the learning process of the D2D pairs is executed on the eNB for the following reasons:

  1. There is no need to install the MP-MAB-based learning process on all UEs.

  2. UEs have inadequate computational capability to execute the learning process.

  3. The eNB can easily gather the information of all UEs.

  4. Unreliable transmissions between the eNB and UEs are avoided.

1.2 Contributions

The main contributions of this study are summarized as follows:

  1. We establish a cellular communication model in which both CUEs and D2D pairs can declare data rate requirements.

  2. We control the transmission power of each D2D pair with a novel approach.

  3. To include the power control function in the MP-MAB model, we propose an extended matrix Q that records the average rewards during the learning process.

  4. We propose a new criterion for judging the convergence of the learning process.

  5. We show that power control can reduce the ratio of unallocated D2D pairs and improve the energy efficiency and total throughput.

This paper is organized as follows. In Sect. 2, the D2D communication model and the relevant notations are described, and the MP-MAB model and the related strategies are briefly reviewed. Section 3 describes how we extend the MP-MAB model for power control and presents the proposed criterion for judging the convergence of the learning phase. In Sect. 4, the procedure for RB allocation is illustrated. We present the simulation results in Sect. 5 and conclude the paper in Sect. 6.

2 Model and Notations

In this section, the D2D communication model is introduced. To facilitate analysis, a number of notations are defined to characterize the links between different communicating entities. After that, three learning strategies for the MAB model are described. These strategies are also used with the MP-MAB model. Subsequently, the basic MP-MAB model for RB allocation without power control is presented.

2.1 D2D Communication Model

The system under consideration is an eNB serving some CUEs and D2D pairs, as illustrated in Fig. 1. Each D2D pair has a dedicated transmitter and a receiver. A certain number of RBs are allocated to each CUE based on the semi-persistent scheduling scheme. In order to improve the spectral efficiency, the eNB allocates some D2D pairs to reuse RBs with the CUEs while still meeting the QoS requirement of CUEs. Both the RB allocation and the learning process are executed on the eNB in a centralized fashion. Thus, all the computations involved in machine learning take place on the eNB, as explained in Sect. 1. Other assumptions are listed below.

Fig. 1 D2D communication in underlay mode

  1. Both the CUEs and the D2D pairs can declare data rate requirements, which are known to the eNB.

  2. For simplicity, the same number of RBs is allocated to each CUE.

  3. Each CUE can share its allocated RBs with multiple D2D pairs.

  4. Each D2D pair can reuse the RBs of only a single CUE.

  5. If a D2D pair reuses the RBs allocated to a specific CUE, it reuses all of those RBs.

  6. The eNB knows the locations of all the CUEs and DUEs.

  7. All the CUEs use the same transmission power.

The symbols shown in Fig. 1 are defined as follows.

  • \(\mathrm{D}_i\): the ith D2D pair with predetermined transmitting DUE and receiving DUE.

  • \(\mathrm{D}_{i,\mathrm{Tx}}\): the transmitting DUE of \(\mathrm{D}_i\).

  • \(\mathrm{D}_{i,\mathrm{Rx}}\): the receiving DUE of \(\mathrm{D}_i\).

  • \(\mathrm{C}_m\): the mth CUE.

  • \(P_{\mathrm{C}_m}\): the transmission power of \(\mathrm{C}_m\).

  • \(P_{\mathrm{D}_i}\): the transmission power of \(\mathrm{D}_{i,\mathrm{Tx}}\).

  • \(G_{\mathrm{C}_m,\mathrm{B}}\): the channel gain from \(\mathrm{C}_m\) to the eNB.

  • \(G_{\mathrm{D}_i,\mathrm{B}}\): the channel gain from \(\mathrm{D}_{i,\mathrm{Tx}}\) to the eNB.

  • \(G_{\mathrm{D}_i,\mathrm{D}_i}\): the channel gain from \(\mathrm{D}_{i,\mathrm{Tx}}\) to \(\mathrm{D}_{i,\mathrm{Rx}}\) of D2D pair i.

  • \(G_{\mathrm{D}_i,\mathrm{D}_j}\): the channel gain from \(\mathrm{D}_{i,\mathrm{Tx}}\) of pair i to \(\mathrm{D}_{j,\mathrm{Rx}}\) of another pair j.

  • \(G_{\mathrm{C}_m,\mathrm{D}_i}\): the channel gain from \(\mathrm{C}_m\) to \(\mathrm{D}_{i,\mathrm{Rx}}\).

2.2 Evaluation of Interference, Capacity, and SINR Threshold

2.2.1 SINR

To assess the performance of the system, we have to calculate two kinds of SINR. The first kind is the SINR from \(\mathrm{C}_m\) to the eNB,

$$\mathrm{SINR}_{\mathrm{C}_m} = \frac{P_{\mathrm{C}_m}\, G_{\mathrm{C}_m,\mathrm{B}}}{\sum_{i \in \boldsymbol{U}_m} P_{\mathrm{D}_i}\, G_{\mathrm{D}_i,\mathrm{B}} + \sigma^{2}},$$
(1)

where \(\sigma^{2}\) is the noise power and \(\boldsymbol{U}_m\) is the set of indices of the D2D pairs that reuse the RBs allocated to \(\mathrm{C}_m\). The second kind is the SINR from the transmitting DUE of \(\mathrm{D}_i\) to the receiving DUE of \(\mathrm{D}_i\),

$$\mathrm{SINR}_{\mathrm{D}_i} = \frac{P_{\mathrm{D}_i}\, G_{\mathrm{D}_i,\mathrm{D}_i}}{P_{\mathrm{C}_m}\, G_{\mathrm{C}_m,\mathrm{D}_i} + \sum_{j \in \boldsymbol{U}_m,\, j \ne i} P_{\mathrm{D}_j}\, G_{\mathrm{D}_j,\mathrm{D}_i} + \sigma^{2}}.$$
(2)

In this paper, we focus on power control. Therefore, we assume that the channel path loss (in dB) follows the distance-dependent formulas given below:

$$148 + 40\log_{10}(d), \quad d \text{ in km},$$
(3)
$$128.1 + 37.6\log_{10}(d), \quad d \text{ in km},$$
(4)

where d is the distance between the two devices. Equation (3) is for a link between two UEs, while (4) is for a link between a UE and the eNB [4].
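
To make these expressions concrete, the following Python sketch evaluates the path loss of (3) and (4) and the two SINR expressions of (1) and (2). The function and variable names (e.g., pathloss_db, reuse_set) are ours, and the inputs are illustrative placeholders rather than values used in the paper.

```python
import math

def pathloss_db(d_km: float, ue_to_ue: bool) -> float:
    """Path loss per Eq. (3) (UE-to-UE) or Eq. (4) (UE-to-eNB); d in km."""
    if ue_to_ue:
        return 148.0 + 40.0 * math.log10(d_km)
    return 128.1 + 37.6 * math.log10(d_km)

def gain(d_km: float, ue_to_ue: bool) -> float:
    """Linear channel gain derived from the path loss in dB."""
    return 10.0 ** (-pathloss_db(d_km, ue_to_ue) / 10.0)

def sinr_cue(p_c, g_c_b, p_d, g_d_b, reuse_set, noise):
    """Eq. (1): SINR of CUE m at the eNB, interfered by the D2D pairs in U_m."""
    interference = sum(p_d[i] * g_d_b[i] for i in reuse_set)
    return (p_c * g_c_b) / (interference + noise)

def sinr_d2d(i, p_d, g_dd, p_c, g_c_d, reuse_set, noise):
    """Eq. (2): SINR of D2D pair i, interfered by CUE m and the other pairs in U_m."""
    interference = p_c * g_c_d[i]
    interference += sum(p_d[j] * g_dd[j][i] for j in reuse_set if j != i)
    return (p_d[i] * g_dd[i][i]) / (interference + noise)
```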

2.2.2 Link Capacity

For a given SINR value, the capacity of a link can in theory be obtained from Shannon's channel capacity formula. To be more practical, in this paper the link capacity is evaluated using the procedures in 3GPP specification 36.213 [13] and the SINR values in [14]. Thus, the link SINR is mapped to a Channel Quality Indicator (CQI) to determine the Modulation and Coding Scheme (MCS), and the Transport Block Size (TBS) table is then used to obtain the capacity for a given number of RBs.

2.2.3 Finding the Threshold of SINR

By reversing the procedure described in the last paragraph, the minimum SINR requirement can be obtained when the number of RBs and the data rate requirement are given. For example, with 6 RBs, a minimum SINR of 10.3 dB is required to meet a data rate requirement of 1800 bits per Transmission Time Interval (TTI), i.e., 1.8 Mbps.
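
As a sketch of this reverse lookup, the snippet below scans a small SINR-to-capacity table for 6 RBs. Only the (10.3 dB, 1800 bits/TTI) entry comes from the example above; the other entries and the function name are hypothetical placeholders, since the actual mapping must be taken from [13, 14].

```python
# Hypothetical fragment of an SINR -> capacity mapping for 6 RBs; only the
# 10.3 dB -> 1800 bits/TTI entry comes from the text, the rest are placeholders.
SINR_TO_BITS_PER_TTI_6RB = [
    (-1.0, 400),    # placeholder
    (4.5, 1000),    # placeholder
    (10.3, 1800),   # from the example in Sect. 2.2.3
    (15.0, 2800),   # placeholder
]

def min_sinr_for_rate(rate_bits_per_tti: float) -> float:
    """Reverse lookup: smallest tabulated SINR whose capacity meets the requested rate."""
    for sinr_db, bits in SINR_TO_BITS_PER_TTI_6RB:
        if bits >= rate_bits_per_tti:
            return sinr_db
    raise ValueError("Requested rate exceeds the tabulated MCS range")

print(min_sinr_for_rate(1800))  # -> 10.3
```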

2.3 Strategies for MAB Model

This paper applies MP-MAB to solve the resource allocation problem. Before that, we briefly describe three strategies for selecting an arm (action) and maintaining the matrix Q, which records the average rewards over the trials.

2.3.1 Epsilon-First (EF) Strategy

In this strategy, a pure exploration phase is followed by a pure exploitation phase. For a total of K trials, the exploration phase accounts for εK trials, where \(0 < \varepsilon < 1\), and the exploitation phase accounts for (1 − ε)K trials. During the exploration phase, an arm is selected uniformly at random, whereas during the exploitation phase the arm with the best average reward is always selected. The main difficulty with EF is choosing a suitable value of ε so that the best selection can be found.

2.3.2 Epsilon-Greedy (EG) Strategy

In this strategy, the best arm is selected in a proportion (1 − ε) of the trials (exploitation), and an arm is selected at random in the remaining proportion ε of the trials (exploration). In [15], the strategy is formulated as

$$\text{Action} = \begin{cases} \mathop{\arg\max}\limits_{a} \boldsymbol{Q}[a], & \text{with probability } 1 - \varepsilon, \\ \text{random}, & \text{with probability } \varepsilon, \end{cases}$$
(5)

where a denotes an action and \(\boldsymbol{Q}[a]\) is the recorded average reward of action a.

2.3.3 Upper-Confidence-Bound (UCB) Strategy

To avoid inefficient exploration, one approach is to be optimistic about options with high uncertainty. This approach favors actions whose value estimates are not yet confident. As a consequence, UCB favors the exploration of actions with a strong potential to have an optimal value. This strategy selects actions as shown below:

$$\text{Action} = \mathop{\arg\max}\limits_{a}\left[ \boldsymbol{Q}[a] + c\sqrt{\frac{\ln t}{\boldsymbol{A}[a]}}\, \right],$$
(6)

where t is the number of trials so far, \(\boldsymbol{A}[a]\) is the number of times that action a has been selected, and c is a positive parameter. The computational load of UCB is much higher than that of the two strategies above; however, UCB is known to perform better than EF and EG.
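
For reference, the following sketch implements the three arm-selection rules as described above and in Eqs. (5) and (6). Here Q is a list of average rewards, A the per-arm selection counts, and all names are ours; this is an illustration of the selection rules, not the learning procedure of the paper.

```python
import math
import random

def select_ef(Q, trial, total_trials, eps):
    """Epsilon-first: explore uniformly for the first eps*K trials, then exploit."""
    if trial < eps * total_trials:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def select_eg(Q, eps):
    """Epsilon-greedy, Eq. (5): exploit with probability 1 - eps, explore otherwise."""
    if random.random() < eps:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def select_ucb(Q, A, t, c):
    """UCB, Eq. (6): add an exploration bonus that shrinks as an arm is tried more often."""
    def ucb(a):
        if A[a] == 0:
            return float("inf")   # force each arm to be tried at least once
        return Q[a] + c * math.sqrt(math.log(t) / A[a])
    return max(range(len(Q)), key=ucb)
```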

The goal of this paper is to allow more D2D pairs to reuse the RBs allocated to CUEs. As a result, we need to extend the MAB model to Multi-Player Multi-Armed Bandits (MP-MAB), where multiple players aim to maximize the sum of rewards of all players.

2.4 Basic MP-MAB Model for RB Allocation

In this paper, the D2D RB allocation problem is formulated as an MP-MAB model, as illustrated in Fig. 2. The players are the D2D pairs, and the arms of the bandit are the CUEs whose RBs can be chosen for reuse. The action of a D2D pair refers to the selection of a CUE. When a player Di selects an arm Cm, it obtains a reward, namely the capacity corresponding to that selection. The eNB can easily calculate the SINR of all links within its coverage because it has the location information of all the UEs and possesses high computing power. Hence, the learning procedure of the D2D pairs is executed on the eNB. This arrangement circumvents the limited computing capability of the UEs and the unreliable transmission between them and the eNB.

Fig. 2 System model as an example with 4 D2D pairs and 3 CUEs

Before describing the application of MP-MAB on RB allocation with power control, we first describe the method without power control. For the environment without power control on D2D pairs, an N × M matrix as shown in Fig. 3a is used to record the average rewards for the N D2D pairs that are to reuse the RBs of the M CUEs. This matrix is composed of N row vectors, where the ith row vector records the average rewards of Di with the selection of different CUEs. The matrix is updated after each trial. For the ith row vector, an element of higher value means that Di has a preference for reusing the RBs of the corresponding CUE.
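
The paper does not spell out how the average rewards are maintained; a common choice, assumed in the sketch below, is an incremental running mean. The 4 × 3 setting follows the example of Fig. 2, and the names are ours.

```python
import numpy as np

N, M = 4, 3                       # 4 D2D pairs and 3 CUEs, as in Fig. 2
Q = np.zeros((N, M))              # average reward of pair i reusing the RBs of CUE m
counts = np.zeros((N, M), dtype=int)

def update(i: int, m: int, reward: float) -> None:
    """Incremental running mean; one possible way to keep Q as 'average rewards'."""
    counts[i, m] += 1
    Q[i, m] += (reward - Q[i, m]) / counts[i, m]
```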

Fig. 3 a The N × M matrix of MP-MAB used to record the average rewards for N D2D pairs to reuse the RBs of M CUEs. b The matrix in a extended to L arrays corresponding to the L power levels (color figure online)

3 Proposed MP-MAB Model and Convergence Criterion

In this section, we extend the MP-MAB model for RB allocation with power control. As the learning process is iterative, the criterion for the determination of convergence is elaborated and illustrated with examples.

3.1 Extended MP-MAB Model for RB Allocation

In order to perform power control, we assume that L power levels \(\mathbf{P}_\mathrm{D} = \{P_\mathrm{D}^1, P_\mathrm{D}^2, P_\mathrm{D}^3, \ldots, P_\mathrm{D}^L\}\) are available to the D2D pairs. Based on this assumption, we extend the two-dimensional matrix to a three-dimensional N × M × L matrix Q, which has N rows, M columns, and L arrays, as shown in Fig. 3b. Each array corresponds to a power level. In addition, we define qi as the ith row plane of Q. For example, q1, q3, and qN are the row planes shown in red, blue, and green, respectively, in Fig. 3b. A row plane qi indicates which CUEs' RBs Di prefers to reuse and at which power level.
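
A minimal sketch of the extended data structure is given below, assuming the same incremental-mean bookkeeping as before. The dimensions and the power-level values (borrowed from the simulation setting in Sect. 5) are illustrative.

```python
import numpy as np

N, M, L = 20, 5, 3                   # numbers of D2D pairs, CUEs, and power levels (illustrative)
P_D = [0.2, 0.1, 0.05]               # the three power levels of Sect. 5, in watts
Q = np.zeros((N, M, L))              # Q[i, m, l]: average reward of D_i reusing C_m at power level l
counts = np.zeros((N, M, L), dtype=int)

def update(i, m, l, reward):
    """Incremental-mean update, now indexed by (pair, CUE, power level)."""
    counts[i, m, l] += 1
    Q[i, m, l] += (reward - Q[i, m, l]) / counts[i, m, l]

q_1 = Q[0]                           # the row plane q_1 of D2D pair 1 (an M x L slice)
```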

3.2 Judging Convergence

Before allocating RBs to D2D pairs according to the matrix Q, the eNB needs to finalize the learning trials. Each time the matrix Q is updated after a trial, the eNB checks whether the process has converged, i.e., whether the matrix Q changes only slightly. It is difficult to decide how small this change should be. However, inspired by the policy improvement theorem in [15], we propose a convergence criterion described by the pseudo-code shown in Algorithm 1.

Algorithm 1: Judging convergence of the learning process

A matrix O of the same size as Q is created. The elements of oi give the descending-order ranks of the corresponding elements of qi, as illustrated in Fig. 4. Thus, oi can be viewed as the preference order of Di over the various combinations of CUE and power level. Figure 4 gives an example of converting the reward plane q2 to the preference-order plane o2, where the highest value 100 in q2 corresponds to the preference index 1 in o2.

Fig. 4 Mapping a row plane q2 to o2 as an example

After every trial, the eNB updates the matrix Q and then the matrix O. If the matrix O remains the same for a preset number of consecutive trials, convergence_threshold, convergence is considered to be achieved. As shown in Algorithm 1, if the matrix O is the same as in the previous trial, convergence_count is increased by 1; otherwise, it is reset to zero. The learning stops when convergence_count reaches the preset convergence_threshold or when the number of trials reaches the preset maximum Tmax. The eNB then moves to the resource allocation phase.
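
To make the criterion concrete, the following sketch implements it as described in the text: each row plane qi is mapped to a preference-order plane oi by descending rank, and learning stops once the whole matrix O stays unchanged for convergence_threshold consecutive trials or Tmax trials are reached. The helper run_one_trial stands in for one MP-MAB trial; all names are ours.

```python
import numpy as np

def preference_order(Q: np.ndarray) -> np.ndarray:
    """Map each row plane q_i to o_i: rank 1 for the largest average reward, as in Fig. 4."""
    O = np.empty_like(Q, dtype=int)
    for i in range(Q.shape[0]):
        flat = Q[i].ravel()
        ranks = np.empty(flat.size, dtype=int)
        ranks[np.argsort(-flat)] = np.arange(1, flat.size + 1)
        O[i] = ranks.reshape(Q[i].shape)
    return O

def learn(run_one_trial, Q, convergence_threshold, T_max):
    """Convergence loop following the description of Algorithm 1."""
    convergence_count, prev_O = 0, None
    for t in range(T_max):
        run_one_trial(Q)                     # one MP-MAB trial; updates Q in place
        O = preference_order(Q)
        if prev_O is not None and np.array_equal(O, prev_O):
            convergence_count += 1
        else:
            convergence_count = 0
        prev_O = O
        if convergence_count >= convergence_threshold:
            break
    return Q
```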

4 Resource Allocation Procedure

After the learning phase is completed, the eNB can allocate RBs based on the matrix Q. For the system under consideration, the CUEs and DUEs can declare their expected data rates. As mentioned in Sect. 2, of the three quantities of a communication link, namely the SINR, the requested data rate, and the number of allocated RBs, knowing any two allows the third to be derived by following the procedure specified in [13]. As a result, for a CUE, the minimum SINR requirement can be derived from the requested data rate and the number of RBs allocated by the eNB using the semi-persistent scheduling scheme. Similarly, if the requested data rate of a D2D pair is known, and the pair is to share a known number of RBs with a specific CUE, then the minimum SINR requirement of this D2D pair can also be derived. For one or more D2D pairs to reuse the RBs allocated to a CUE, the constraint is that the CUE and all of the sharing D2D pairs must meet their individual SINR requirements.

In this paper, a greedy strategy is adopted for allocating RBs: the selection with the best average reward is always tried first when allocating RBs to DUEs. The pseudo-code is shown in Algorithm 2. Finding the maximum element of Q and its indices [n, m, l] is equivalent to designating Dn to reuse the RBs of Cm with transmission power level \(P_\mathrm{D}^l\). It is then necessary to check whether the following devices can meet their respective SINR requirements: (a) Dn, (b) Cm, and (c) the other D2D pairs already designated to reuse the RBs of Cm. The formulas for calculating the SINR are given in (1)–(4). If all requirements are met, the power level l is written to \(\boldsymbol{S}[n,m]\) to confirm the arrangement, and all the elements in qn are reset because a D2D pair can reuse the RBs of only one CUE. Otherwise, this maximum element is reset to give up the arrangement. These steps are repeated until every element of Q is zero. Figure 5 illustrates an example of a 6 × 5 matrix S, where D2 and D4 are designated to reuse the RBs of C3 with power levels 1 and 3, respectively.

Algorithm 2: Greedy RB allocation

Fig. 5 An example of a 6 × 5 matrix S indicating that D2 and D4 are designated to reuse the RBs of C3 with power levels 1 and 3, respectively
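
The sketch below follows the textual description of Algorithm 2, with the SINR-feasibility check abstracted into a callback sinr_ok, since its details depend on Eqs. (1)-(4) and the per-device thresholds. The names and the callback interface are ours, and the rewards are assumed to be non-negative capacities.

```python
import numpy as np

def greedy_allocate(Q: np.ndarray, sinr_ok) -> np.ndarray:
    """Greedy RB allocation following the description of Algorithm 2.

    Q       : N x M x L matrix of average rewards (a working copy is consumed).
    sinr_ok : callback (n, m, l, S) -> bool checking whether D_n, C_m, and the pairs
              already recorded in column m of S all still meet their SINR thresholds.
    Returns S: N x M matrix; S[n, m] = l + 1 means D_n reuses the RBs of C_m at power level l + 1.
    """
    Q = Q.copy()
    N, M, L = Q.shape
    S = np.zeros((N, M), dtype=int)
    while Q.max() > 0:
        n, m, l = np.unravel_index(np.argmax(Q), Q.shape)
        if sinr_ok(n, m, l, S):
            S[n, m] = l + 1          # confirm the arrangement (power levels numbered from 1)
            Q[n, :, :] = 0           # D_n may reuse the RBs of only one CUE
        else:
            Q[n, m, l] = 0           # give up this particular arrangement
    return S
```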

5 Simulation Results

In this section, the performance of the proposed power control scheme and the convergence criterion is investigated by simulation. First, the parameters and their values in the simulation are specified. Then the simulation results and analysis are presented.

5.1 Parameters of Simulation

The system for simulation includes one eNB, M CUEs, and N D2D pairs. In order to assess the effect of power control, the three strategies, namely EF, EG, and UCB, are employed both with and without power control. The value of ε is set to 0.2 for EF and EG. For the case with power control, three power levels are available for D2D transmission, \(\mathbf{P}_\mathrm{D} = \{200, 100, 50\}\) mW. For the case without power control, a fixed power of 200 mW is used by each D2D transmitter. According to [16], the minimum distance between a DUE and the eNB is set to 35 m. The values of the other parameters are listed in Table 1.
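
For convenience, the settings stated above can be collected in a configuration structure such as the hypothetical one below; parameters listed only in Table 1 are not reproduced.

```python
# Simulation settings stated in Sect. 5.1 (names are ours; Table 1 values omitted).
SIM_PARAMS = {
    "strategies": ["EF", "EG", "UCB"],
    "epsilon": 0.2,                          # for EF and EG
    "d2d_power_levels_mW": [200, 100, 50],   # with power control
    "d2d_fixed_power_mW": 200,               # without power control
    "min_dist_due_enb_m": 35,                # minimum DUE-eNB distance [16]
}
```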

Table 1 Parameters of simulation

5.2 Results

The simulation applies the EF, EG, and UCB learning strategies to MP-MAB learning. The results are shown in Figs. 6, 7, 8 and 9. We compare the performance with and without power control, indicated as W and W/O in the legends and shown as solid and dashed lines, respectively.

Fig. 6 Average transmission power of a D2D pair

Fig. 7 The ratio of D2D pairs that are not allocated to reuse RBs with any CUE

Fig. 8 Energy efficiency of D2D pairs

Fig. 9 Total throughput of D2D pairs

Figure 6 shows the average transmission power selected by each D2D transmitter. Without power control, the average power corresponding to the three strategies is fixed at 200 mW, as indicated by the dashed line. Power control with any of the three strategies significantly reduces power consumption. UCB achieves the largest reduction, lowering the power to only 34% of that without power control, because it prefers actions with the potential to yield an optimal reward. EG comes next. EF performs the worst of the three because its exploration phase may fail to find the selection with the best reward.

Figure 7 shows the ratio of D2D pairs that are not allowed to reuse RBs with any CUE. Power control reduces this ratio. For example, with 20 D2D pairs, power control reduces the ratio to 77.4%, 75.3%, and 75.5% of the corresponding values without power control for EF, EG, and UCB, respectively.

Figure 8 shows the energy efficiency, i.e., the average throughput per unit power. The trend reveals that the power reduction achieved by EG and UCB is positively correlated with the energy efficiency. As the number of D2D pairs grows, EG and UCB accommodate more pairs by effectively reducing the transmission power, and thus attain much higher energy efficiency than EF. These observations are consistent with those in Fig. 6 for the simulated parameter values.

Figure 9 presents the total throughput of the D2D pairs. The strategies with power control invariably achieve a much higher throughput than the corresponding strategies without power control.

6 Conclusions and Future Works

D2D resource allocation can be formulated as an MP-MAB problem. In order to further improve its efficiency, this paper proposes controlling the transmission power of D2D pairs. To do so, the matrix Q of the MP-MAB model is extended to account for the different power levels available to a D2D pair when it chooses to reuse the RBs of a certain CUE. Three learning strategies are employed to evaluate the effect of power control. Simulation results reveal that power control improves the performance in terms of the average transmission power, the ratio of unallocated D2D pairs, the energy efficiency, and the total throughput.

The parameters of the three strategies, such as ε for EF and EG and c for UCB, affect the learning performance, and different parameter values can be compared. After the learning process, the resources are allocated to the D2D pairs based on a greedy strategy in this paper. Although the performance improvement is evident, other allocation strategies may achieve even better performance in some respects; this will be studied in the future. Moreover, as a pilot investigation, only three power levels are considered in the simulation. More power levels can be considered in the future; nevertheless, there is always a tradeoff between the accuracy and the computational complexity of the learning process, which has to be taken into account.