DOI: 10.1145/3394171.3413990
Research Article

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

Published: 12 October 2020

ABSTRACT

Zero-shot image segmentation refers to the task of segmenting pixels of specific unseen semantic classes. Existing methods mainly rely on previously learned segmentation tasks, for example using semantic or word embeddings of class names to infer a new segmentation model. In this work we describe Cap2Seg, a novel solution for zero-shot image segmentation that harnesses accompanying image captions to intelligently infer spatial and semantic context for the zero-shot segmentation task. Our main insight is that image captions often implicitly entail the occurrence of a new class in an image and its most-confident spatial distribution. We define a contextual entailment question (CEQ) that tailors BERT-like text models to this inference. Specifically, the proposed network for inferring unseen classes consists of three branches (global, semi-global, and local), which infer labels of unseen classes at the image level, image-stripe level, and pixel level, respectively. Comprehensive experiments and ablation studies are conducted on two image benchmarks, COCO-stuff and Pascal VOC. Both clearly demonstrate the effectiveness of the proposed Cap2Seg, including on a set of hardest unseen classes (i.e., classes whose names do not literally appear in the image captions, so that direct matching for inference fails).
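The core idea described above, treating caption-to-class inference as an entailment question answered by a BERT-like text model, can be illustrated with an off-the-shelf natural language inference (NLI) pipeline. The sketch below is only an approximation under stated assumptions: the paper defines its own CEQ and a three-branch network, whereas here the HuggingFace zero-shot-classification pipeline, the facebook/bart-large-mnli checkpoint, the hypothesis template, and the 0.5 threshold are all illustrative choices, not the authors' method.

```python
# A minimal sketch of scoring caption-to-class entailment with an NLI model.
# This is NOT the paper's CEQ formulation or its three-branch network: the
# checkpoint, hypothesis template, and 0.5 threshold are illustrative
# assumptions only.
from transformers import pipeline

# Off-the-shelf NLI-backed zero-shot classifier (the paper instead tailors a
# BERT-like model to its contextual entailment question).
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

caption = "A man throws a frisbee to his dog in the park."
unseen_classes = ["frisbee", "grass", "cow"]  # hypothetical unseen label set

# Ask, for each unseen class, whether the caption entails its presence.
result = nli(
    caption,
    candidate_labels=unseen_classes,
    hypothesis_template="This image contains a {}.",
    multi_label=True,
)

# Image-level decisions such as these would drive a global branch; stripe- and
# pixel-level branches would then localize where the class likely appears.
present = [l for l, s in zip(result["labels"], result["scores"]) if s > 0.5]
print(present)  # e.g. ['frisbee'] if entailment is confident
```

An entailment-style query of this kind can fire even when a class name never appears verbatim in the caption (e.g., inferring "grass" from "in the park"), which is precisely the hard case highlighted in the abstract where direct string matching fails.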


Supplemental Material

3394171.3413990.mp4 (mp4, 5.9 MB)


Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States




