research-article

Video Synthesis via Transform-Based Tensor Neural Network

Authors:
Yimeng Zhang (Tensor&Deep Learning Lab & Columbia University, New York, NY, USA)
Xiao-Yang Liu (Tensor&Deep Learning Lab & Columbia University, New York, NY, USA)
Bo Wu (MIT-IBM Watson AI Lab, Cambridge, MA, USA)
Anwar Walid (Nokia-Bell Labs, Murray Hill, NJ, USA)

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, October 2020, Pages 2454–2462. https://doi.org/10.1145/3394171.3413527

Published: 12 October 2020

ABSTRACT

Video frame synthesis is an important task in computer vision and has drawn great interest across a wide range of applications. However, existing neural network methods do not explicitly impose tensor low-rankness on videos to capture spatiotemporal correlations in a high-dimensional space, while existing iterative algorithms require hand-crafted parameters and have relatively long running times. In this paper, we propose a novel multi-phase deep neural network, Transform-Based Tensor-Net, that exploits the low-rank structure of video data in a learned transform domain and unfolds an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. Our design is based on two observations: (i) both linear and nonlinear transforms can be implemented by a neural network layer, and (ii) the soft-thresholding operator corresponds to an activation function. Further, such an unfolding design achieves nearly real-time inference at the cost of training time and, as a byproduct, is interpretable. Experimental results on the KTH and UCF-101 datasets show that, compared with the state-of-the-art methods DVF and Super SloMo, the proposed scheme improves the Peak Signal-to-Noise Ratio (PSNR) of video interpolation and prediction by 4.13 dB and 4.26 dB, respectively.
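The two observations in the abstract can be illustrated with a minimal sketch of classical ISTA on a plain vector problem, min_x 0.5*||Ax - y||^2 + lam*||x||_1. This is not the Transform-Based Tensor-Net itself: the matrix `A`, the fixed step size, and the function names are assumptions chosen for illustration, whereas an unfolded network would learn the transforms and thresholds in each phase. The sketch shows that the gradient step is a linear map (a layer) and that soft-thresholding can be written with two ReLU activations.

```python
import numpy as np

def soft_threshold(x, theta):
    """Elementwise soft-thresholding: sign(x) * max(|x| - theta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def relu(x):
    return np.maximum(x, 0.0)

def soft_threshold_via_relu(x, theta):
    """The same operator written as a difference of two ReLU activations."""
    return relu(x - theta) - relu(-x - theta)

def ista(A, y, lam, num_phases):
    """Run num_phases ISTA iterations for 0.5*||Ax - y||^2 + lam*||x||_1.

    Each iteration plays the role of one 'phase' of an unfolded network,
    here with fixed (not learned) weights and threshold.
    """
    # Step size 1/L, where L is the squared largest singular value of A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_phases):
        r = x - step * A.T @ (A @ x - y)   # gradient step: a linear layer
        x = soft_threshold(r, step * lam)  # shrinkage step: an activation
    return x
```

A typical call would be `x_hat = ista(A, y, lam=0.1, num_phases=50)`. Fixing the number of phases in advance is what makes the running time predictable once the iteration is unfolded into a network.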

Supplemental Material

3394171.3413527.mp4 (mp4, 10.5 MB)

Index Terms

Video Synthesis via Transform-Based Tensor Neural Network
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems


Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
General Chairs:
Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China)
Rita Cucchiara (UNIMORE, Italy)
Xian-Sheng Hua (Alibaba Group, China)
Program Chairs:
Guo-Jun Qi (Futurewei Technologies, USA)
Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy)
Zhengyou Zhang (Tencent, China)
Roger Zimmermann (National University of Singapore, Singapore)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deep unfolding
interpolation and prediction
tensor neural network
transform-based tensor
video synthesis
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%