ABSTRACT
We study adaptive video streaming for multiple users in wireless access edge networks with unreliable channels. The key challenge is to jointly optimize video bitrate adaptation and resource allocation so that the users' cumulative quality of experience is maximized. This problem is a finite-horizon restless multi-armed multi-action bandit problem and is provably hard to solve. To overcome this challenge, we propose a computationally appealing index policy called the Quality Index Policy, which is well-defined without the Whittle indexability condition and is provably asymptotically optimal without the global attractor condition. Most existing index policies require these two conditions, which are difficult to establish in general. Since the wireless access edge network environment is highly dynamic, with system parameters that are unknown and time-varying, we further develop an index-aware reinforcement learning (RL) algorithm dubbed QA-UCB. We show that QA-UCB achieves sub-linear regret with low complexity, since it fully exploits the structure of the Quality Index Policy when making decisions. Extensive simulations using real-world traces demonstrate significant gains of the proposed policies over conventional approaches. We note that the proposed framework for designing index policies and index-aware RL algorithms is of independent interest and could be useful for other large-scale multi-user problems.
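To make the notion of an index policy with optimistic learning concrete, the following minimal sketch ranks users by an optimistic index and serves the top-ranked users within a resource budget. It is not the Quality Index Policy or QA-UCB, whose definitions are not given in this abstract; it assumes a generic UCB1-style exploration bonus, and the names `ucb_index`, `allocate_by_index`, and `budget` are hypothetical.

```python
import math


def ucb_index(mean_reward, pulls, t):
    """Generic UCB1-style index (an assumption, not the paper's index):
    empirical mean plus an exploration bonus that shrinks with more samples."""
    if pulls == 0:
        return math.inf  # users never served yet get top priority
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)


def allocate_by_index(indices, budget):
    """Serve the `budget` users with the largest indices.
    Ties are broken by the smaller user id; returns sorted user ids."""
    ranked = sorted(range(len(indices)), key=lambda u: (-indices[u], u))
    return sorted(ranked[:budget])


# Hypothetical example: three users, two resource units, decision epoch t = 20.
means = [0.9, 0.5, 0.7]
pulls = [10, 10, 0]
indices = [ucb_index(m, n, 20) for m, n in zip(means, pulls)]
served = allocate_by_index(indices, budget=2)
```

Here the unexplored user and the user with the highest optimistic estimate are served, which illustrates how a priority ranking turns per-user indices into a joint allocation decision without solving the coupled problem directly.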