ABSTRACT
Deep learning has achieved significant success in video applications such as classification, analytics, and self-supervised learning. However, when scaling out to a large volume of videos, existing approaches suffer from a fundamental limitation: they cannot efficiently utilize GPUs for training deep neural networks (DNNs). This is because video decoding during data preparation incurs a prohibitive amount of computing overhead, leaving GPUs idle for the majority of training time. The alternative, caching raw videos in memory or storage to bypass decoding, does not scale, because such data amounts to tens to hundreds of terabytes.
This paper proposes SAND, a system that enables deep learning frameworks to directly access training data through a storage abstraction. This abstraction effectively hides the data preprocessing delay, enabling GPUs to be fully utilized for DNN training. To accomplish this, SAND operates an in-storage cache and manages it with ahead-of-time scheduling, which guarantees that requested training data can always be retrieved immediately from the cache. The scheduling takes the future data accesses of deep learning frameworks into account when choosing what to replace in the cache. Our evaluation using emulated environments shows that, compared to the existing approach, SAND improves GPU utilization by 6.0× and reduces training time by 75.9% on average.
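The idea of using future data accesses for cache replacement can be illustrated with a small sketch. This is not SAND's scheduler, whose details the abstract does not give. It substitutes the textbook farthest-next-use (Belady) policy as a stand-in and compares it with LRU. The function names, the access trace, and the assumption that the full access order is known in advance are all hypothetical.

```python
from collections import OrderedDict


def future_aware_hits(accesses, capacity):
    """Count cache hits when eviction uses the known future access order.

    Stand-in policy (Belady): evict the entry whose next access is farthest away.
    Assumes the whole access sequence is known ahead of time.
    """
    if capacity <= 0:
        return 0
    # For each position, record the next position at which the same key recurs.
    next_use = [float("inf")] * len(accesses)
    last_seen = {}
    for i in range(len(accesses) - 1, -1, -1):
        next_use[i] = last_seen.get(accesses[i], float("inf"))
        last_seen[accesses[i]] = i

    cache = {}  # key -> position of its next access
    hits = 0
    for i, key in enumerate(accesses):
        if key in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # farthest next use
            del cache[victim]
        cache[key] = next_use[i]
    return hits


def lru_hits(accesses, capacity):
    """Count cache hits under least-recently-used replacement, for comparison."""
    if capacity <= 0:
        return 0
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used key
            cache[key] = True
    return hits


# Hypothetical trace: three videos visited in the same order over three epochs.
trace = ["a", "b", "c"] * 3
print("future-aware hits:", future_aware_hits(trace, capacity=2))
print("LRU hits:", lru_hits(trace, capacity=2))
```

On a cyclic trace like this one, with a cache smaller than the working set, LRU always evicts the item that is needed next and never hits, while a policy that knows the upcoming accesses still retains useful entries.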
Index Terms
- SAND: A Storage Abstraction for Video-based Deep Learning