Full length article
Delay-aware packet scheduling for massive MIMO beamforming transmission using large-scale reinforcement learning
Introduction
Massive multiple-input multiple-output (MIMO) technology is widely acknowledged as a foundation of the physical layer in future wireless communication systems, since it can achieve high spectral efficiency and energy efficiency [1]. Beamforming in the massive MIMO scenario is of particular research interest because it is one of the most important ways of exploiting the large number of antennas to achieve high throughput [2], [3]. In traditional massive MIMO beamforming research, the common approach is to formulate an optimization problem whose solution yields the beamformer and the scheduling policy. Substantial work has addressed these problems using convex optimization theory and model-based MDP theory [4], [5], [6], [7], [8], [9]. However, these methods suffer from the well-known pilot contamination in multi-cell massive MIMO systems, which degrades channel estimation and yields imperfect channel state information (CSI). Since CSI is the input to convex-optimization-based algorithms, the inaccurate CSI caused by pilot contamination renders these methods ineffective.
Motivated by these challenges and by recent advances in machine learning [10], we adopt large-scale Markov decision process (MDP) theory [11], [12] and reinforcement learning (RL) theory [13] to model the beamforming scheduling problem. The motivations are as follows. First, model-free RL makes it possible to obtain the optimal beamformer without knowing the exact transition dynamics of the environment and without fully observing the system state, i.e., the CSI and the other system states of interest. Instead, it observes the practical performance of the current beamforming strategy and improves it gradually over time, so it is not directly affected by pilot contamination in the massive MIMO system. Second, since channel estimation is also performed gradually over discrete time slots, the gradually optimizing RL paradigm is naturally consistent with the channel estimation process and does not require significant changes to the system architecture. Third, most convex-optimization-based approaches are myopic, i.e., they consider only the current time slot and ignore the dynamics of the wireless system, whereas RL methods can exploit the dynamics of the environment to obtain the optimal policy.
To the best of our knowledge, many existing MDP and RL works that address the MIMO transmission problem have the following drawbacks. One is the naive assumption about system dynamics made by explicitly modeling the environment's behavior, known as the model-based method from the RL perspective. Another is the use of dynamic programming (DP) based methods, which are computationally expensive and impractical given the dimensionality issue and the continuous-action case. Furthermore, in the massive MIMO scenario it is possible to decompose the large-scale RL problem into parallel subproblems to overcome the dimensionality issue, since the channels become asymptotically orthogonal and hence independent.
Our main contributions are as follows. First, we propose a general RL architecture for packet scheduling in massive MIMO beamforming systems under different types of formulation. Based on that, we analyze the asymptotic behavior in the massive MIMO scenario and decompose the high-dimensional scheduling problem into sub-problems. We then propose model-free packet scheduling algorithms with low computational complexity. By utilizing large-scale RL algorithms, the proposed method can dynamically optimize the system delay without using CSI; hence it can complement traditional convex-optimization-based methods when CSI is unavailable. The rest of the paper is organized as follows. Section 2 presents the massive MIMO beamforming signal model. Section 3 presents the RL-based scheduling architecture for the massive MIMO beamforming system. Section 4 presents the detailed analysis, with asymptotic results and a low-complexity algorithm for solving the proposed RL problem. Section 5 presents the numerical results, and Section 6 concludes the paper.
Notation: Boldface uppercase and lowercase letters denote matrices and vectors, respectively. and denote the sets of positive integers and complex numbers, respectively. denotes the stochastic expectation, and denotes the Hermitian transpose. and denote the identity matrix and the zero matrix. denotes that is a circularly symmetric complex Gaussian random vector with mean vector and covariance matrix .
Section snippets
Massive MIMO multicast beamforming transmission model
Consider a single base station equipped with a massive MIMO antenna array that uses wireless multicast beamforming to transmit information simultaneously to users within its coverage area. The multicast information belongs to groups, which are stored and scheduled in parallel caches. The arrival rate of multicast packets in each cache is denoted by , as shown in Fig. 1. Each group uses a specific beamformer to perform multicast beamforming. The beamformed vector information of each
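The parallel caches described above can be pictured as independent queues, one per multicast group. The following is a minimal sketch rather than the paper's model: the Poisson arrival process, the fixed number of packets served per slot, the numerical rates, and the helper names `poisson_sample` and `step_queues` are all assumptions introduced for illustration.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def step_queues(queues, served, arrival_rates, rng):
    """Advance every cache by one slot: serve first, then admit new arrivals."""
    next_queues = []
    for backlog, service, rate in zip(queues, served, arrival_rates):
        remaining = max(backlog - service, 0)  # cannot serve more than is stored
        next_queues.append(remaining + poisson_sample(rate, rng))
    return next_queues


rng = random.Random(0)
arrival_rates = [0.5, 0.8]   # hypothetical packets per slot for two groups
served = [1, 1]              # hypothetical packets delivered per slot
queues = [0, 0]
total_backlog = 0
num_slots = 1000
for _ in range(num_slots):
    queues = step_queues(queues, served, arrival_rates, rng)
    total_backlog += sum(queues)

print("mean total backlog per slot:", total_backlog / num_slots)
```

In a delay-aware formulation, a backlog of this kind is the natural quantity for a scheduler to observe and penalize, since a longer queue means packets wait longer before transmission.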
Large-scale RL based scheduling structure
To address the proposed problem and derive the optimal multicast beamforming strategy, we apply RL theory, modeling the beamformer and the power ratio as RL policies. For convenience of analysis, we consider only the average-cost model for finite-state MDPs, in which the optimal policy is stationary for all initial states [12]. To set up the problem, we formulate the following RL components.
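Under the average-cost criterion, a policy is judged by its long-run cost per slot rather than by a discounted sum. As a rough illustration of how a model-free learner can target that criterion, the sketch below runs tabular R-learning on a toy single-cache queue. R-learning is a classical average-reward method swapped in here for simplicity; it is not the algorithm used in the paper. Every constant (`CAPACITY`, `ARRIVAL_PROB`, `SERVE_PROB`, `POWER_COST`) and the cost "backlog plus power cost" are hypothetical.

```python
import random

CAPACITY = 5                   # hypothetical cache size (states 0..CAPACITY)
ARRIVAL_PROB = 0.5             # hypothetical Bernoulli arrival probability
SERVE_PROB = (0.3, 0.9)        # hypothetical delivery probability per action
POWER_COST = (0.0, 1.0)        # hypothetical cost of low / high power


def env_step(backlog, action, rng):
    """Simulate one slot; the learner only sees the returned cost and state."""
    cost = backlog + POWER_COST[action]
    if backlog > 0 and rng.random() < SERVE_PROB[action]:
        backlog -= 1
    if rng.random() < ARRIVAL_PROB:
        backlog = min(backlog + 1, CAPACITY)
    return cost, backlog


def r_learning(num_steps=100_000, alpha=0.05, beta=0.005, epsilon=0.1, seed=0):
    """Tabular R-learning for the average-cost criterion (cost minimisation)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(CAPACITY + 1)]
    avg_cost = 0.0
    state = 0
    for _ in range(num_steps):
        greedy = min((0, 1), key=lambda a: q[state][a])
        action = rng.randrange(2) if rng.random() < epsilon else greedy
        cost, next_state = env_step(state, action, rng)
        # Differential TD error: cost relative to the running average cost.
        td_error = cost - avg_cost + min(q[next_state]) - q[state][action]
        q[state][action] += alpha * td_error
        if action == greedy:   # update the average-cost estimate on greedy steps only
            avg_cost += beta * td_error
        state = next_state
    policy = [min((0, 1), key=lambda a: q[s][a]) for s in range(CAPACITY + 1)]
    return avg_cost, policy


avg_cost, policy = r_learning()
print("estimated average cost:", round(avg_cost, 3))
print("greedy action per backlog level:", policy)
```

The learner never reads `ARRIVAL_PROB` or `SERVE_PROB`; it only observes costs and next states, which is the sense in which the approach is model-free.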
Low-complexity method for model-free RL based scheduling problem
We simplify the problem above in two steps. First, we use the asymptotic characteristics of the massive MIMO channel to derive the decomposed RL problems; then we solve them using the theory of large-scale RL.
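The first step relies on the channels of different users becoming asymptotically orthogonal as the number of antennas grows. A small numerical sketch of that effect follows, assuming independent channel vectors with i.i.d. CN(0, 1) entries; this channel model, the trial count, and the function names are illustrative assumptions, not taken from the paper.

```python
import random


def complex_gaussian_vector(num_antennas, rng):
    """Draw i.i.d. CN(0, 1) entries (real and imaginary parts have variance 1/2)."""
    scale = 0.5 ** 0.5
    return [complex(rng.gauss(0.0, scale), rng.gauss(0.0, scale))
            for _ in range(num_antennas)]


def normalized_inner_product(h1, h2):
    """Return |h1^H h2| / M, where M is the number of antennas."""
    total = sum(a.conjugate() * b for a, b in zip(h1, h2))
    return abs(total) / len(h1)


def mean_cross_correlation(num_antennas, trials=50, seed=0):
    """Average the normalised cross-correlation over independent channel pairs."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        h1 = complex_gaussian_vector(num_antennas, rng)
        h2 = complex_gaussian_vector(num_antennas, rng)
        values.append(normalized_inner_product(h1, h2))
    return sum(values) / trials


for m in (4, 64, 1024):
    print(m, round(mean_cross_correlation(m), 4))
```

Under these assumptions the printed values shrink roughly in proportion to one over the square root of the antenna count, which is why inter-group coupling becomes negligible for large arrays and the joint problem can be treated as parallel sub-problems.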
Numerical experiments
We use the Asynchronous Advantage Actor-Critic (A3C) algorithm [14] with TensorFlow [15] and OpenAI Gym [16] to provide numerical results for the proposed model. For the observation environment, we set 4 users in 2 multicast groups, with distances from the base station being in kilometers. The total transmission power of the base station is set to 10 dBW. The wireless channel fast-fading vector is assumed to be Rayleigh distributed with unit variance. The
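Two of the simulation settings above can be made concrete with a short sketch: the 10 dBW power budget expressed in watts, and Rayleigh-distributed fading magnitudes. The sketch assumes that "unit variance" means the complex fading coefficient h satisfies E|h|^2 = 1; that reading, the power ratios `[0.4, 0.6]`, and the helper names are assumptions made for illustration only.

```python
import math
import random


def dbw_to_watts(power_dbw):
    """Convert a power level in dBW to watts."""
    return 10.0 ** (power_dbw / 10.0)


def rayleigh_magnitude(rng, mean_square=1.0):
    """Sample |h| for a complex Gaussian h with E|h|^2 = mean_square."""
    u = rng.random()
    # Inverse-CDF sampling of the Rayleigh distribution.
    return math.sqrt(-mean_square * math.log(1.0 - u))


def split_power(total_watts, ratios):
    """Share the power budget across multicast groups by normalised ratios."""
    norm = sum(ratios)
    return [total_watts * r / norm for r in ratios]


rng = random.Random(0)
total_watts = dbw_to_watts(10.0)                     # 10 dBW from the text
group_power = split_power(total_watts, [0.4, 0.6])   # hypothetical ratios
gains = [rayleigh_magnitude(rng) ** 2 for _ in range(100_000)]
print("total power (W):", total_watts)
print("per-group power (W):", group_power)
print("empirical E|h|^2:", round(sum(gains) / len(gains), 3))
```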
Conclusion
In this paper we propose a delay-aware packet scheduling structure for massive MIMO systems using large-scale RL. We present the modeling process of a series of RL problems that solve the multicast beamforming scheduling problem and provide theoretical analysis for these models. We show that by using model-free large-scale RL algorithms, the proposed model can dynamically solve the beamforming problem without using CSI, which can overcome the imperfect CSI issue in traditional signal process
Acknowledgments
This work is supported by the National Science Foundation of China Project (Grant No. 61471066) and the Open Project Fund (Grant No. 201600017) of the National Key Laboratory of Electromagnetic Environment, China.
References (16)
- et al., Massive MIMO channel models: A survey, Int. J. Antennas Propag. (2014)
- et al., Massive MIMO multicasting in noncooperative cellular networks, IEEE J. Sel. Areas Commun. (2014)
- et al., Joint multicast beamforming and user grouping in massive MIMO systems
- et al., A fast algorithm for multi-group multicast beamforming in large-scale wireless systems
- et al., A CMDP-based approach for energy efficient power allocation in massive MIMO systems
- et al., Q-learning algorithms for constrained Markov decision processes with randomized monotone policies: Application to MIMO transmission control, IEEE Trans. Signal Process. (2007)
- et al., On stochastic feedback control for multi-antenna beamforming: Formulation and low-complexity algorithms, IEEE Trans. Wireless Commun. (2014)
- et al., Delay-aware cooperative multipoint transmission with backhaul limitation in Cloud-RAN
Xinran Zhang received B.S. and M.S. degrees from Beijing University of Posts and Telecommunications (BUPT) in 2010 and 2013, respectively. He has been an assistant professor at the Beijing Electronic Science and Technology Institute, Beijing, China, since 2013. He has been working toward his Ph.D. degree at BUPT since 2015. His research interests include large-scale reinforcement learning theory and wireless signal processing theory.
Songlin Sun received B.S. and M.S. degrees from Shandong University of Technology in 1997 and 2000, respectively, and his Ph.D. degree from Beijing University of Posts and Telecommunications in 2003. He is currently a professor at Beijing University of Posts and Telecommunications. He has long been engaged in research, development, and teaching in wireless communications, multimedia communication, the Internet of Things, and embedded systems.