DOI: 10.1145/3343031.3350949
research-article

Predicting Future Instance Segmentation with Contextual Pyramid ConvLSTMs

Published: 15 October 2019

ABSTRACT

Despite remarkable progress in instance segmentation, predicting future instance segmentation remains challenging because future frames are unobserved. Existing methods mainly address this challenge by forecasting pyramid features to represent the unobserved future frames. However, they predict the features of each pyramid level independently and ignore the underlying structural relationship between features of different levels.

In this paper, we propose a novel framework called Contextual Pyramid ConvLSTMs, which contains a set of ConvLSTMs that exploit intra-level spatio-temporal contexts to predict the features of each individual level. We also add pathway connections among the ConvLSTMs to transmit information between them, which allows our system to capture more inter-level spatio-temporal contextual information. We experimentally show that the proposed method achieves state-of-the-art performance on two video instance segmentation benchmarks for future instance segmentation prediction.
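
The idea of one ConvLSTM per pyramid level, linked by pathway connections, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the form of the pathways, so the sketch assumes that each level also receives the previous-step hidden state of the next coarser level, upsampled by a factor of two. The cell omits the peephole terms of the original ConvLSTM formulation, and all names (`ConvLSTMCell`, `predict_pyramid`, `upsample2`, `conv2d_same`) are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_same(x, w):
    """Zero-padded 'same' cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k), k odd."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    win = sliding_window_view(xp, (k, k), axis=(1, 2))  # (C_in, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", win, w)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class ConvLSTMCell:
    """ConvLSTM cell without peephole terms (a simplification)."""

    def __init__(self, in_ch, hid_ch, k=3, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        # One kernel bank produces all four gates at once.
        self.w = rng.normal(0.0, 0.1, (4 * hid_ch, in_ch + hid_ch, k, k))
        self.b = np.zeros((4 * hid_ch, 1, 1))

    def step(self, x, h, c):
        gates = conv2d_same(np.concatenate([x, h], axis=0), self.w) + self.b
        i, f, o, g = np.split(gates, 4, axis=0)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # intra-level memory update
        h = sigmoid(o) * np.tanh(c)
        return h, c


def upsample2(x):
    """Nearest-neighbour upsampling by 2 along height and width."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def predict_pyramid(seqs, cells, hid_ch):
    """seqs[l]: (T, C, H_l, W_l), ordered fine to coarse, each level half the size of the previous.

    Returns the final hidden state of every level.
    """
    h = [np.zeros((hid_ch,) + s.shape[2:]) for s in seqs]
    c = [np.zeros_like(hl) for hl in h]
    for t in range(seqs[0].shape[0]):
        prev_h = list(h)  # hidden states from the previous time step
        for l, cell in enumerate(cells):
            # Assumed pathway: context from the next coarser level (zeros at the coarsest).
            if l + 1 < len(cells):
                ctx = upsample2(prev_h[l + 1])
            else:
                ctx = np.zeros_like(prev_h[l])
            x = np.concatenate([seqs[l][t], ctx], axis=0)
            h[l], c[l] = cell.step(x, h[l], c[l])
    return h
```

Each cell here is built with `in_ch = C + hid_ch` so that it can take the level's own features plus the pathway context. In a full model, a readout convolution would map the final hidden states to the predicted pyramid features, which a segmentation head would then decode; that part is left out of the sketch.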


    • Published in

      MM '19: Proceedings of the 27th ACM International Conference on Multimedia
      October 2019
      2794 pages
      ISBN: 9781450368896
      DOI: 10.1145/3343031

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%. Overall Acceptance Rate: 995 of 4,171 submissions, 24%.
