research-article

Meta Parsing Networks: Towards Generalized Few-shot Scene Parsing with Adaptive Metric Learning

Authors:
Peike Li, University of Technology Sydney, Sydney, Australia
Yunchao Wei, University of Technology Sydney, Sydney, Australia
Yi Yang, University of Technology Sydney, Sydney, Australia

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, October 2020, Pages 64–72
https://doi.org/10.1145/3394171.3413944

Published: 12 October 2020

ABSTRACT

Recent work on few-shot segmentation typically aims to segment novel objects using a few annotated examples as guidance. In this work, we advance this paradigm towards a more challenging yet more general scenario: Generalized Few-shot Scene Parsing (GFSP). In this task, a fully annotated image serves as guidance for segmenting all pixels in a query image. Our goal is to learn a generalizable and robust segmentation network from the meta-learning perspective so that both seen and unseen categories are correctly recognized. Unlike previous practice, this task performs segmentation on a joint label space consisting of both previously seen and novel categories. Moreover, pixels from these multiple categories must be taken into account simultaneously, a setting that has not been well explored. Accordingly, we present Meta Parsing Networks (MPNet) to better exploit the guidance information in the support set. MPNet contains two basic modules: the Adaptive Deep Metric Learning (ADML) module and the Contrastive Inter-class Distraction (CID) module. Specifically, the ADML module takes the annotated pixels from the support image as guidance and adaptively produces high-quality prototypes for learning a deep comparison metric. In addition, the CID module learns to enlarge the feature discrepancy between different categories in the embedding space, leading MPNet to generate more discriminative feature embeddings. We conduct experiments on two newly constructed benchmarks, GFSP-Cityscapes and GFSP-Pascal-Context. Extensive ablation studies demonstrate the effectiveness and generalization ability of MPNet.
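The abstract does not spell out how prototypes are built or compared. As a rough illustration of the general prototype-based idea only, the sketch below averages the support-pixel embeddings of each class into one prototype and labels each query pixel by its most similar prototype under cosine similarity. This is a generic baseline, not the ADML or CID module; the function names, the fixed cosine metric, and the toy 2-D embeddings are all assumptions made for the example.

```python
import math
from collections import defaultdict


def class_prototypes(features, labels):
    """Average the support feature vectors of each class into one prototype."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def parse_query(query_features, prototypes):
    """Label every query pixel with the class of its most similar prototype."""
    return [
        max(prototypes, key=lambda lab: cosine(vec, prototypes[lab]))
        for vec in query_features
    ]


# Toy data (hypothetical): one embedding per flattened pixel.
support_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
support_labels = ["road", "road", "rider", "rider"]

prototypes = class_prototypes(support_features, support_labels)
query_features = [[0.8, 0.2], [0.2, 0.9]]
predicted = parse_query(query_features, prototypes)
print(predicted)
```

In a real system the embeddings would come from a trained network and the comparison metric would be learned rather than fixed, which is the part the abstract attributes to ADML.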

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
General Chairs:
Chang Wen Chen, Chinese University of Hong Kong, Shenzhen, China
Rita Cucchiara, UNIMORE, Italy
Xian-Sheng Hua, Alibaba Group, China
Program Chairs:
Guo-Jun Qi, Futurewei Technologies, USA
Elisa Ricci, UNITN & Fondazione Bruno Kessler, Italy
Zhengyou Zhang, Tencent, China
Roger Zimmermann, National University of Singapore, Singapore
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
few-shot learning
meta learning
scene parsing
semantic segmentation
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%