ABSTRACT
Recent progress in few-shot segmentation usually aims at performing novel object segmentation using a few annotated examples as guidance. In this work, we advance this few-shot segmentation paradigm towards a more challenging yet general scenario, i.e., Generalized Few-shot Scene Parsing (GFSP). In this task, we take a fully annotated image as guidance to segment all pixels in a query image. Our mission is to study a generalizable and robust segmentation network from the meta-learning perspective so that both seen and unseen categories can be correctly recognized. Different from previous practices, this task performs segmentation on a joint label space consisting of both previously seen and novel categories. Moreover, pixels from these multiple categories need to be simultaneously taken into account, which is actually not well explored before. Accordingly, we present Meta Parsing Networks (MPNet) to better exploit the guidance information in the support set. Our MPNet contains two basic modules, i.e., the Adaptive Deep Metric Learning (ADML) module and the Contrastive Inter-class Distraction (CID) module. Specially, the ADML takes the annotated pixels from the support image as the guidance and adaptively produces high-quality prototypes for learning a deep comparison metric. In addition, MPNet further introduces the CID module learning to enlarge the feature discrepancy of different categories in the embedding space, leading the MPNet to generate more discriminative feature embeddings. We conduct experiments on two newly constructed benchmarks, i.e., GFSP-Cityscapes and GFSP-Pascal-Context. Extensive ablation studies well demonstrate the effectiveness and generalization ability of our MPNet.
Supplemental Material
- Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).Google Scholar
- Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV). 801--818.Google ScholarCross Ref
- Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas S Huang, Wen-Mei Hwu, and Honghui Shi. 2019. SPGNet: Semantic Prediction Guidance for Scene Parsing. In IEEE International Conference on Computer Vision (ICCV). 5218--5228.Google Scholar
- Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3213--3223.Google ScholarCross Ref
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 248--255.Google ScholarCross Ref
- Nanqing Dong and Eric Xing. 2018. Few-Shot Semantic Segmentation with Prototype Learning. In The British Machine Vision Conference (BMVC), Vol. 3.Google Scholar
- Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, Vol. 88, 2 (2010), 303--338.Google ScholarDigital Library
- Qianyu Feng, Guoliang Kang, Hehe Fan, and Yi Yang. 2019 a. Attract or Distract: Exploit the Margin of Open Set. In Proceedings of the IEEE International Conference on Computer Vision. 7990--7999.Google ScholarCross Ref
- Qianyu Feng, Yu Wu, Hehe Fan, Chenggang Yan, Mingliang Xu, and Yi Yang. 2020. Cascaded revision network for novel object captioning. IEEE Transactions on Circuits and Systems for Video Technology (2020).Google Scholar
- Qianyu Feng, Zongxin Yang, Peike Li, Yunchao Wei, and Yi Yang. 2019 b. Dual embedding learning for video instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops.Google ScholarCross Ref
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML). 1126--1135.Google Scholar
- Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. 2007. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems (NeurIPS). 513--520.Google Scholar
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In IEEE International Conference on Computer Vision (ICCV). 2961--2969.Google ScholarCross Ref
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770--778.Google ScholarCross Ref
- Tao Hu, Pengwan Yang, Chiliang Zhang, Gang Yu, Yadong Mu, and Cees GM Snoek. 2019. Attention-based multi-context guiding for few-shot semantic segmentation. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 33. 8441--8448.Google ScholarCross Ref
- Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision (ICCV). 1501--1510.Google ScholarCross Ref
- Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2019. Ccnet: Criss-cross attention for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV). 603--612.Google ScholarCross Ref
- Jianbo Jiao, Yunchao Wei, Zequn Jie, Honghui Shi, Rynson WH Lau, and Thomas S Huang. 2019. Geometry-aware distillation for indoor semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2869--2878.Google ScholarCross Ref
- Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Lille.Google Scholar
- Peike Li, Xuanyi Dong, Xin Yu, and Yi Yang. 2020 a. When Humans Meet Machines: Towards Efficient Segmentation Networks. Proceedings of the British Machine Vision Conference (BMVC) (2020).Google Scholar
- Peike Li, Pingbo Pan, Ping Liu, Mingliang Xu, and Yi Yang. 2020 b. Hierarchical Temporal Modeling with Mutual Distance Matching for Video Based Person Re-Identification. IEEE Transactions on Circuits and Systems for Video Technology (2020).Google Scholar
- Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. 2019. Self-Correction for Human Parsing. arXiv preprint arXiv:1910.09777 (2019).Google Scholar
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV). 740--755.Google ScholarCross Ref
- Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3431--3440.Google ScholarCross Ref
- Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research, Vol. 9, Nov (2008), 2579--2605.Google Scholar
- Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google Scholar
- Khoi Nguyen and Sinisa Todorovic. 2019. Feature weighting and boosting for few-shot segmentation. In IEEE International Conference on Computer Vision (ICCV). 622--631.Google ScholarCross Ref
- Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alyosha Efros, and Sergey Levine. 2018. Conditional networks for few-shot semantic segmentation. ICLR Workshop.Google Scholar
- Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. 2017. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410 (2017).Google Scholar
- Mennatullah Siam, Boris N Oreshkin, and Martin Jagersand. 2019. AMP: Adaptive masked proxies for few-shot segmentation. In IEEE International Conference on Computer Vision (ICCV). 5249--5258.Google ScholarCross Ref
- Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision (ECCV). 746--760.Google ScholarDigital Library
- Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS). 4077--4087.Google Scholar
- Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. 2015. Sun rgb-d: A rgb-d scene understanding benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 567--576.Google ScholarCross Ref
- Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1199--1208.Google ScholarCross Ref
- Pinzhuo Tian, Zhangkai Wu, Lei Qi, Lei Wang, Yinghuan Shi, and Yang Gao. 2020. Differentiable Meta-learning Model for Few-shot Semantic Segmentation. In AAAI Conference on Artificial Intelligence (AAAI).Google Scholar
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). 5998--6008.Google Scholar
- Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NeurIPS). 3630--3638.Google Scholar
- Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. 2019. Panet: Few-shot image semantic segmentation with prototype alignment. In IEEE International Conference on Computer Vision (ICCV). 9197--9206.Google ScholarCross Ref
- Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7794--7803.Google ScholarCross Ref
- Zongxin Yang, Peike Li, Qianyu Feng, Yunchao Wei, and Yi Yang. 2019. Going deeper into embedding learning for video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 0--0.Google ScholarCross Ref
- Yuhui Yuan and Jingdong Wang. 2018. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018).Google Scholar
- Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. 2019 a. Pyramid Graph Networks with Connection Attentions for Region-Based One-Shot Semantic Segmentation. In IEEE International Conference on Computer Vision (ICCV). 9587--9595.Google Scholar
- Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. 2019 b. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5217--5226.Google ScholarCross Ref
- Xiaolin Zhang, Yunchao Wei, Yi Yang, and Thomas Huang. 2018. Sg-one: Similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091 (2018).Google Scholar
- Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2881--2890.Google ScholarCross Ref
- Zhedong Zheng, Yunchao Wei, and Yi Yang. 2020. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the 28th ACM international conference on Multimedia.Google ScholarDigital Library
- Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2017. Scene parsing through ade20k dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 633--641.Google ScholarCross Ref
Index Terms
- Meta Parsing Networks: Towards Generalized Few-shot Scene Parsing with Adaptive Metric Learning
Recommendations
Weakly-supervised scene parsing with multiple contextual cues
Scene parsing, fully labeling an image with each region corresponding to a label, is one of the core problems of computer vision. Previous methods to this problem usually rely on patch-level models trained from well labeled data. In this paper, we ...
Reference-Limited Compositional Zero-Shot Learning
ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia RetrievalCompositional zero-shot learning (CZSL) refers to recognizing unseen compositions of known visual primitives, which is an essential ability for artificial intelligence systems to learn and understand the world. While considerable progress has been made ...
Learning Adaptive Classifiers Synthesis for Generalized Few-Shot Learning
AbstractObject recognition in the real-world requires handling long-tailed or even open-ended data. An ideal visual system needs to recognize the populated head visual concepts reliably and meanwhile efficiently learn about emerging new tail categories ...
Comments