ABSTRACT
Despite remarkable progress in instance segmentation, predicting future instance segmentation remains challenging because future frames are unobserved. Existing methods mainly address this challenge by forecasting pyramid features to represent the unobserved future frames. However, they predict the features of each pyramid level independently and ignore the underlying structural relationships between features at different levels.
In this paper, we propose a novel framework called Contextual Pyramid ConvLSTMs, which contains a set of ConvLSTMs that exploit intra-level spatio-temporal context to predict the features of each individual level. We also add pathway connections among the ConvLSTMs to transmit information between them, which allows our system to capture more inter-level spatio-temporal contextual information. We experimentally show that the proposed method achieves state-of-the-art performance on two video instance segmentation benchmarks for future instance segmentation prediction.
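The idea of one ConvLSTM per pyramid level with connections between levels can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the pathway connections are wired, so the sketch assumes each level receives the previous-step hidden state of the next coarser level, upsampled by a factor of two, as an extra input. All names (`ConvLSTMCell`, `forecast_pyramid`, `upsample2x`), the channel sizes, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3(x, w):
    """Zero-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for di in range(3):
        for dj in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, di, dj],
                             xp[:, di:di + H, dj:dj + W])
    return out

class ConvLSTMCell:
    """ConvLSTM cell with an extra context input (assumed pathway design)."""
    def __init__(self, in_ch, hid_ch, ctx_ch, rng):
        # One convolution produces all four gates at once.
        self.w = rng.normal(0.0, 0.1, (4 * hid_ch, in_ch + hid_ch + ctx_ch, 3, 3))

    def step(self, x, h, c, ctx):
        gates = conv3x3(np.concatenate([x, h, ctx], axis=0), self.w)
        i, f, o, g = np.split(gates, 4, axis=0)
        c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_next = sigmoid(o) * np.tanh(c_next)
        return h_next, c_next

def upsample2x(x):
    """Nearest-neighbour upsampling of a (C, H, W) array to (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def forecast_pyramid(frames, cells, hid_ch):
    """frames[t][l] is the (C, H, W) feature map of level l (fine to coarse)
    at time t. Returns the final hidden state of every level."""
    n_levels = len(cells)
    h = [np.zeros((hid_ch,) + frames[0][l].shape[1:]) for l in range(n_levels)]
    c = [np.zeros_like(s) for s in h]
    for feats in frames:
        h_prev, h, c_new = h, [], []
        for l in range(n_levels):
            # Assumed pathway: previous hidden state of the coarser level.
            if l + 1 < n_levels:
                ctx = upsample2x(h_prev[l + 1])
            else:
                ctx = np.zeros_like(h_prev[l])  # coarsest level has no source
            h_l, c_l = cells[l].step(feats[l], h_prev[l], c[l], ctx)
            h.append(h_l)
            c_new.append(c_l)
        c = c_new
    return h

# Toy usage: three pyramid levels, four observed time steps.
rng = np.random.default_rng(0)
in_ch, hid_ch = 4, 8
sizes = [(16, 16), (8, 8), (4, 4)]
cells = [ConvLSTMCell(in_ch, hid_ch, hid_ch, rng) for _ in sizes]
frames = [[rng.normal(size=(in_ch,) + s) for s in sizes] for _ in range(4)]
states = forecast_pyramid(frames, cells, hid_ch)
```

In this sketch, dropping the `ctx` input reduces the model to independent per-level ConvLSTMs, which corresponds to the per-level prediction that the abstract contrasts against. A real system would learn the weights and decode the forecast features with an instance segmentation head, neither of which is shown here.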
Index Terms
- Predicting Future Instance Segmentation with Contextual Pyramid ConvLSTMs