ABSTRACT
Video frame synthesis is an important task in computer vision and has drawn great interest across a wide range of applications. However, existing neural network methods do not explicitly impose tensor low-rankness on videos to capture spatiotemporal correlations in a high-dimensional space, while existing iterative algorithms require hand-crafted parameters and have relatively long running times. In this paper, we propose a novel multi-phase deep neural network, Transform-Based Tensor-Net, that exploits the low-rank structure of video data in a learned transform domain by unfolding an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. Our design is based on two observations: (i) both linear and nonlinear transforms can be implemented by a neural network layer, and (ii) the soft-thresholding operator corresponds to an activation function. Further, such an unfolding design is able to run in nearly real time at the cost of training time and, as a byproduct, is interpretable. Experimental results on the KTH and UCF-101 datasets show that, compared with the state-of-the-art methods DVF and Super SloMo, the proposed scheme improves the Peak Signal-to-Noise Ratio (PSNR) of video interpolation and prediction by 4.13 dB and 4.26 dB, respectively.
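To make the two observations concrete, the following is a minimal sketch of the classical vector ISTA for sparse recovery, not the tensor network proposed in the paper. It assumes a fixed measurement matrix `A`, a hand-set weight `lam`, and a fixed step size; the function names (`soft_threshold`, `ista_phase`, `unfolded_ista`) are hypothetical and chosen only for illustration. Each loop iteration plays the role of one "phase": a linear step followed by soft-thresholding, which behaves like an activation function.

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-thresholding: shrink every entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista_phase(x, y, A, step, tau):
    """One ISTA iteration: gradient step on 0.5*||A x - y||^2, then shrinkage."""
    grad = A.T @ (A @ x - y)
    return soft_threshold(x - step * grad, tau)

def unfolded_ista(y, A, num_phases=200, lam=0.01):
    """Run a fixed number of phases with hand-set parameters.

    Assumption for illustration: step and threshold are fixed here, whereas an
    unfolded network would treat them (and the transforms) as learnable per phase.
    """
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_phases):
        x = ista_phase(x, y, A, step, lam * step)
    return x
```

In this sketch the number of phases, the step size, and the threshold are exactly the hand-crafted parameters that an unfolded design would instead learn from data.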