research-article

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

Authors:
Guiyu Tian

Peking University, Beijing, China

Peking University, Beijing, China
View Profile

,
Shuai Wang

BOE Technology Group Co., Ltd., Beijing, China

BOE Technology Group Co., Ltd., Beijing, China
View Profile

,
Jie Feng

BOE Technology Group Co., Ltd., Beijing, China

BOE Technology Group Co., Ltd., Beijing, China
View Profile

,
Li Zhou

BOE Technology Group Co., Ltd., Beijing, China

BOE Technology Group Co., Ltd., Beijing, China
View Profile

,
Yadong Mu

Peking University, Beijing, China

Peking University, Beijing, China
View Profile

MM '20: Proceedings of the 28th ACM International Conference on MultimediaOctober 2020Pages 4125–4134https://doi.org/10.1145/3394171.3413990

Published:12 October 2020Publication History

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 4125–4134

ABSTRACT

Zero-shot image segmentation refers to the task of segmenting pixels from specific unseen semantic class. Previous methods mainly rely on historic segmentation tasks, such as using semantic embedding or word embedding of class names to infer a new segmentation model. In this work we describe Cap2Seg, a novel solution of zero-shot image segmentation that harnesses accompanying image captions for intelligently inferring spatial and semantic context for the zero-shot image segmentation task. As our main insight, image captions often implicitly entail the occurrence of a new class in an image and its most-confident spatial distribution. We define a contextual entailment question (CEQ) that tailors BERT-like text models. In specific, the proposed networks for inferring unseen classes consists of three branches (global / local / semi-global), which infer labels of unseen class from image level, image-stripe level or pixel level respectively. Comprehensive experiments and ablation studies are conducted on two image benchmarks, COCO-stuff and Pascal VOC. All clearly demonstrate the effectiveness of the proposed Cap2Seg, including a set of hardest unseen classes (i.e., image captions do not literally contain the class names and direct matching for inference fails).

Supplemental Material

3394171.3413990.mp4

mp4

5.9 MB

Download

References

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2016. Label-Embedding for Image Classification. TPAMI, Vol. 38, 7 (2016), 1425--1438.Google ScholarCross Ref
Zeynep Akata, Scott E. Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. 2015. Evaluation of output embeddings for fine-grained image classification. In CVPR. 2927--2936.Google Scholar
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. TPAMI, Vol. 39, 12 (2017), 2481--2495.Google Scholar
Amy L. Bearman, Olga Russakovsky, Vittorio Ferrari, and Fei-Fei Li. 2016. What's the Point: Semantic Segmentation with Point Supervision. In ECCV. 549--565.Google Scholar
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML. 41--48.Google Scholar
Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. 2019. Zero-Shot Semantic Segmentation. In NIPS .Google Scholar
Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. 2018. COCO-Stuff: Thing and Stuff Classes in Context. In CVPR. 1209--1218.Google Scholar
Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. 2016. Synthesized Classifiers for Zero-Shot Learning. In CVPR. 5327--5336.Google Scholar
Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. 2016. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild. In ECCV. 52--68.Google Scholar
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2018a. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI, Vol. 40, 4 (2018), 834--848.Google ScholarCross Ref
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2018b. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI, Vol. 40, 4 (2018), 834--848.Google ScholarCross Ref
Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking Atrous Convolution for Semantic Image Segmentation. CoRR, Vol. abs/1706.05587 (2017).Google Scholar
Ido Dagan and Oren Glickman. 2004. PROBABILISTIC TEXTUAL ENTAILMENT: GENERIC APPLIED MODELING OF LANGUAGE VARIABILITY.Google Scholar
Jifeng Dai, Kaiming He, and Jian Sun. 2015. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. In ICCV. 1635--1643.Google Scholar
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2--7, 2019, Volume 1 (Long and Short Papers). 4171--4186.Google Scholar
Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. 2015. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, Vol. 111, 1 (2015), 98--136.Google ScholarDigital Library
Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In NIPS. 2121--2129.Google Scholar
Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. 2018. Recent Advances in Zero-Shot Recognition: Toward Data-Efficient Understanding of Visual Content. IEEE Signal Process. Mag., Vol. 35, 1 (2018), 112--125.Google ScholarCross Ref
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. 770--778.Google Scholar
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jé gou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. CoRR, Vol. abs/1612.03651 (2016).Google Scholar
Naoki Kato, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2019. Zero-Shot Semantic Segmentation via Variational Mapping. In ICCV Workshops .Google ScholarCross Ref
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR .Google Scholar
Elyor Kodirov, Tao Xiang, and Shaogang Gong. 2017. Semantic Autoencoder for Zero-Shot Learning. In CVPR. 4447--4456.Google Scholar
Alexander Kolesnikov and Christoph H. Lampert. 2016. Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation. In ECCV. 695--711.Google Scholar
Suha Kwak, Seunghoon Hong, and Bohyung Han. 2017. Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network. In AAAI. 4111--4117.Google Scholar
Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-Based Classification for Zero-Shot Visual Object Categorization. TPAMI, Vol. 36, 3 (2014), 453--465.Google ScholarDigital Library
Yanan Li, Donghui Wang, Huanhang Hu, Yuetan Lin, and Yueting Zhuang. 2017. Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths. In CVPR. 5207--5215.Google Scholar
Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. 2016. ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation. In CVPR. 3159--3167.Google Scholar
Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. 2017. Feature Pyramid Networks for Object Detection. In CVPR. 936--944.Google Scholar
Shichen Liu, Mingsheng Long, Jianmin Wang, and Michael I. Jordan. 2018. Generalized Zero-Shot Learning with Deep Calibration Network. In NIPS. 2009--2019.Google Scholar
Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In CVPR. 3431--3440.Google Scholar
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111--3119.Google Scholar
Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. 2014. Zero-Shot Learning by Convex Combination of Semantic Embeddings. In ICLR .Google Scholar
George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L. Yuille. 2015. Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation. CoRR, Vol. abs/1502.02734 (2015).Google Scholar
Pedro H. O. Pinheiro and Ronan Collobert. 2015. From image-level to pixel-level labeling with Convolutional Networks. In CVPR. 1713--1721.Google Scholar
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting Image Annotations Using Amazon's Mechanical Turk. In Proceedings of the 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Los Angeles, USA, June 6, 2010. 139--147.Google ScholarDigital Library
Bernardino Romera-Paredes and Philip H. S. Torr. 2015. An embarrassingly simple approach to zero-shot learning. In ICML. 2152--2161.Google Scholar
Anirban Roy and Sinisa Todorovic. 2017. Combining Bottom-Up, Top-Down, and Smoothness Cues for Weakly Supervised Image Segmentation. In CVPR. 7282--7291.Google Scholar
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. [n.d.]. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision ( [n.,d.]).Google Scholar
Johann Sawatzky, Debayan Banerjee, and Juergen Gall. 2019. Harvesting Information from Captions for Weakly Supervised Semantic Segmentation. CoRR, Vol. abs/1905.06784 (2019).Google Scholar
Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. 2013. Zero-Shot Learning Through Cross-Modal Transfer. In NIPS. 935--943.Google Scholar
Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. 2018. Generalized Zero-Shot Learning via Synthesized Examples. In CVPR. 4281--4289.Google Scholar
Vinay Kumar Verma and Piyush Rai. 2017. A Simple Exponential Family Framework for Zero-Shot Learning. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18--22, 2017, Proceedings, Part II. 792--808.Google ScholarCross Ref
Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché -Buc, Emily B. Fox, and Roman Garnett (Eds.). 2019. NeurIPS .Google Scholar
Wei Wang, Vincent W. Zheng, Han Yu, and Chunyan Miao. 2019. A Survey of Zero-Shot Learning: Settings, Methods, and Applications. ACM TIST, Vol. 10, 2 (2019), 13:1--13:37.Google ScholarDigital Library
Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S. Huang. 2018. Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation. In CVPR. 7268--7277.Google Scholar
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R'emi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv, Vol. abs/1910.03771 (2019).Google Scholar
Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh N. Nguyen, Matthias Hein, and Bernt Schiele. 2016. Latent Embeddings for Zero-Shot Classification. In CVPR. 69--77.Google Scholar
Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. 2019 a. Semantic Projection Network for Zero- and Few-Label Semantic Segmentation. In CVPR. 8256--8265.Google Scholar
Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. 2019 b. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly. TPAMI, Vol. 41, 9 (2019), 2251--2265.Google ScholarCross Ref
Jia Xu, Alexander G. Schwing, and Raquel Urtasun. 2015. Learning to segment under various forms of weak supervision. In CVPR. 3781--3790.Google Scholar
Keren Ye, Mingda Zhang, Adriana Kovashka, Wei Li, Danfeng Qin, and Jesse Berent. 2019. Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection. In ICCV .Google Scholar
Meng Ye and Yuhong Guo. 2017. Zero-Shot Classification with Discriminative Semantic Representation Learning. In CVPR. 5103--5111.Google Scholar
Li Zhang, Tao Xiang, and Shaogang Gong. 2017. Learning a Deep Embedding Model for Zero-Shot Learning. In CVPR. 3010--3019.Google Scholar
Ziming Zhang and Venkatesh Saligrama. 2015. Zero-Shot Learning via Semantic Similarity Embedding. In ICCV. 4166--4174.Google Scholar
Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning Deep Features for Discriminative Localization. In CVPR. 2921--2929.Google Scholar
Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. 2018. Weakly Supervised Instance Segmentation Using Class Peak Response. In CVPR. 3791--3800.Google Scholar

Index Terms

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation

Recommendations

Zero-shot semantic segmentation via spatial and multi-scale aware visual class embedding
Highlights
- We proposed new Spatial and Multi-scale Visual Class Embedding NETwork (SMVCENet) for zero-shot semantic segmentation.
Abstract
Fully supervised semantic segmentation technologies bring a paradigm shift in scene understanding. However, the burden of expensive labeling cost remains as a challenge. To solve the cost problem, recent studies proposed language model ...
Read More
Transductive Visual-Semantic Embedding for Zero-shot Learning
ICMR '17: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval

Zero-shot learning (ZSL) aims to bridge the knowledge transfer via available semantic representations (e.g., attributes) between labeled source instances of seen classes and unlabelled target instances of unseen classes. Most existing ZSL approaches ...
Read More
Zero-shot Image Categorization by Image Correlation Exploration
ICMR '15: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval

The problem of image categorization from zero or only a few training examples, called zero-shot learning, occurs frequently, but it has hardly been studied in computer vision research. To tackle this problem, mid-level semantic attributes are introduced ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
General Chairs:
Chang Wen Chen
Chinese University of Hong Kong, Shenzhen, China
,
Rita Cucchiara
UNIMORE, Italy
,
Xian-Sheng Hua
Alibaba Group, China
,
Program Chairs:
Guo-Jun Qi
Futurewei Technologies, USA
,
Elisa Ricci
UNITN & Fondazione Bruno Kessler, Italy
,
Zhengyou Zhang
Tencent, China
,
Roger Zimmermann
National University of Singapore, Singapore
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 October 2020
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
image segmentation
neural networks
zero-shot learning
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate995of4,171submissions,24%
Upcoming Conference
MM '24

Sponsor:

sigmm

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne , VIC , Australia
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 8
  Total Citations
  View Citations
- 246
  Total Downloads
- Downloads (Last 12 months)31
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Zero-shot semantic segmentation via spatial and multi-scale aware visual class embedding

Transductive Visual-Semantic Embedding for Zero-shot Learning

Zero-shot Image Categorization by Image Correlation Exploration

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media