Webly Supervised Concept Expansion for General Purpose Vision Models

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13696)


Abstract

General Purpose Vision (GPV) systems are models designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and concepts from large, fully supervised datasets. Scaling GPVs to tens of thousands of concepts by acquiring data to learn each concept for every skill quickly becomes prohibitive. This work presents an effective and inexpensive alternative: learn skills from supervised datasets, learn concepts from web image search, and leverage a key characteristic of GPVs, namely the ability to transfer visual knowledge across skills. We use a dataset of 1M+ images spanning 10k+ visual concepts to demonstrate webly supervised concept expansion for two existing GPVs (GPV-1 and VL-T5) on 3 benchmarks: 5 COCO-based datasets (80 primary concepts), a newly curated series of 5 datasets based on the OpenImages and VisualGenome repositories (\(\sim \)500 concepts), and the web-derived dataset (10k+ concepts). We also propose a new architecture, GPV-2, which supports a variety of tasks, from vision tasks like classification and localization, to vision+language tasks like QA and captioning, to more niche ones like human-object interaction detection. GPV-2 benefits hugely from web data and outperforms GPV-1 and VL-T5 across these benchmarks. Our data, code, and web demo are available at https://prior.allenai.org/projects/gpv2.

A. Kamath, C. Clark and T. Gupta—Equal contribution.
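
The recipe summarized in the abstract, learning skills from supervised data while expanding concepts from web image search, can be illustrated with a small data-pipeline sketch. The code below is a minimal, hypothetical illustration rather than the authors' implementation; every name in it (Example, web_search_examples, mixed_batches, the file layouts and prompts) is assumed for the example, and the actual GPV-2 training pipeline is described in the paper itself.

```python
# Hypothetical sketch (not code from the paper): interleaving supervised
# "skill" examples with webly supervised "concept" examples, as a
# prompt-driven GPV-style model might consume them during joint training.
import random
from dataclasses import dataclass


@dataclass
class Example:
    image_path: str   # local path or URL of the image
    task_prompt: str  # text prompt that selects the skill
    target_text: str  # expected output (label, caption, box string, ...)


def web_search_examples(concepts, images_per_concept=2):
    """Turn web image-search results into noisy classification examples.

    The search query itself serves as the (weak) label, which is the core
    idea behind webly supervised concept expansion.
    """
    examples = []
    for concept in concepts:
        for i in range(images_per_concept):
            examples.append(Example(
                image_path=f"web_images/{concept}/{i}.jpg",  # assumed layout
                task_prompt="What entity is this?",
                target_text=concept,
            ))
    return examples


def mixed_batches(supervised, web, batch_size=4, web_fraction=0.5, seed=0):
    """Yield batches that mix supervised skill data with web concept data."""
    rng = random.Random(seed)
    n_web = int(batch_size * web_fraction)
    while True:
        batch = (rng.sample(supervised, batch_size - n_web)
                 + rng.sample(web, n_web))
        rng.shuffle(batch)
        yield batch


if __name__ == "__main__":
    supervised = [
        Example("coco/000001.jpg", "Describe the image.", "a dog on a couch"),
        Example("coco/000002.jpg", "Locate the cat.", "10 20 50 80"),
    ]
    web = web_search_examples(["axolotl", "theremin", "pagoda"])
    for ex in next(mixed_batches(supervised, web)):
        print(ex.task_prompt, "->", ex.target_text)
```

The essential design choice captured here is that the web search query doubles as a noisy label, so concept knowledge acquired through a classification-style prompt can later transfer to other skills such as captioning or localization.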


Notes

  1. Since [24] takes a long time to train when using the web data (over 3 weeks), results for GPV-1 with and without web data are reported after 20 epochs of training.

References

  1. Agrawal, H., et al.: Nocaps: novel object captioning at scale. In: International Conference on Computer Vision, pp. 8947–8956 (2019)

  2. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)

  3. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)

  4. Brown, T., et al.: Language models are few-shot learners. arXiv:2005.14165 (2020)

  5. Brysbaert, M., Warriner, A., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 46(3), 904–911 (2013). https://doi.org/10.3758/s13428-013-0403-5

  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. ECCV arXiv:2005.12872 (2020)

  7. Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (2018)

  8. Chao, Y.W., Wang, Z., He, Y., Wang, J., Deng, J.: HICO: a benchmark for recognizing human-object interactions in images. In: Proceedings of the IEEE International Conference on Computer Vision (2015)

  9. Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 609–623. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_44

  10. Chen, X., Gupta, A.: Webly supervised learning of convolutional networks. In: ICCV (2015)

  11. Chen, X., Shrivastava, A., Gupta, A.: NEIL: extracting visual knowledge from web data. In: ICCV (2013)

  12. Chen, Y.C., et al.: Uniter: learning universal image-text representations. arXiv:1909.11740 (2019)

  13. Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779 (2021)

  14. Divvala, S., Farhadi, A., Guestrin, C.: Learning everything about anything: webly-supervised visual concept learning. In: CVPR (2014)

  15. Dong, W., Socher, R., Li-Jia, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)

  16. Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale. ICLR arXiv:2010.11929 (2021)

  17. Fang, H., Xie, Y., Shao, D., Lu, C.: Dirv: dense interaction region voting for end-to-end human-object interaction detection. In: AAAI (2021)

  18. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.A.: Describing objects by their attributes. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785 (2009)

  19. Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google’s image search. In: Tenth IEEE International Conference on Computer Vision (ICCV 2005), vol. 2, pp. 1816–1823 (2005)

  20. Golge, E., Duygulu, P.: ConceptMap: mining noisy web data for concept learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 439–455. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_29

  21. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In: CVPR (2017)

  22. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation (2021)

  23. Guo, S., et al.: Curriculumnet: weakly supervised learning from large-scale web images. arXiv:1808.01097 (2018)

  24. Gupta, T., Kamath, A., Kembhavi, A., Hoiem, D.: Towards general purpose vision systems: an end-to-end task-agnostic vision-language architecture. In: CVPR (2022)

  25. Gupta, T., Marten, R., Kembhavi, A., Hoiem, D.: Grit: general robust image task benchmark. arXiv preprint arXiv:2204.13653 (2022)

  26. Gupta, T., Schwing, A.G., Hoiem, D.: No-frills human-object interaction detection: factorization, layout encodings, and training techniques. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9676–9684 (2019)

  27. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)

  28. Hoffman, J., et al.: LSDA: large scale detection through adaptation. In: NIPS (2014)

  29. Jaegle, A., et al.: Perceiver IO: a general architecture for structured inputs & outputs. arXiv:2107.14795 (2021)

  30. Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O., Carreira, J.: Perceiver: general perception with iterative attention. In: ICML (2021)

  31. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)

  32. Jin, B., Segovia, M.V.O., Süsstrunk, S.: Webly supervised semantic segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1705–1714 (2017)

  33. Kim, B., Lee, J., Kang, J., Kim, E.S., Kim, H.J.: HOTR: end-to-end human-object interaction detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)

  34. Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. arXiv:2102.03334 (2021)

  35. Krause, J., et al.: The unreasonable effectiveness of noisy data for fine-grained recognition. arXiv:1511.06789 (2016)

  36. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123, 32–73 (2017)

  37. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 365–372 (2009)

  38. Kuznetsova, A., et al.: The Open Images Dataset V4: unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982 (2018)

  39. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958 (2009)

  40. Li, L.J., Fei-Fei, L.: Optimol: automatic online picture collection via incremental model learning. Int. J. Comput. Vision 88, 147–168 (2007)

  41. Li, L.H., Yatskar, M., Yin, D., Hsieh, C., Chang, K.W.: Visualbert: a simple and performant baseline for vision and language. arXiv:1908.03557 (2019)

  42. Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale internet images. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 851–858 (2013)

  43. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  44. Liang, K., Guo, Y., Chang, H., Chen, X.: Visual relationship detection with deep structural ranking. In: AAAI (2018)

  45. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. ICCV arXiv:2103.14030 (2021)

  46. Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019)

  47. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10434–10443 (2020)

  48. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: CVPR (2020)

  49. Luo, A., Li, X., Yang, F., Jiao, Z., Cheng, H.: Webly-supervised learning for salient object detection. Pattern Recognit. 103, 107308 (2020)

  50. van der Maaten, L., Hinton, G.E.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  51. McCann, B., Keskar, N., Xiong, C., Socher, R.: The natural language decathlon: multitask learning as question answering. arXiv:1806.08730 (2018)

  52. Niu, L., Tang, Q., Veeraraghavan, A., Sabharwal, A.: Learning from noisy web data with category-level supervision. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7689–7698 (2018)

  53. Niu, L., Veeraraghavan, A., Sabharwal, A.: Webly supervised learning meets zero-shot learning: a hybrid approach for fine-grained classification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7171–7180 (2018)

  54. Parikh, D., Grauman, K.: Relative attributes. In: 2011 International Conference on Computer Vision, pp. 503–510 (2011)

  55. Patterson, G., Hays, J.: Sun attribute database: discovering, annotating, and recognizing scene attributes. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2751–2758 (2012)

  56. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  57. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  58. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1-140:67 (2020)

  59. Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. arXiv preprint arXiv:1810.03649 (2018)

  60. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525 (2017)

  61. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)

  62. Shen, S., et al.: How much can clip benefit vision-and-language tasks? arXiv:2107.06383 (2021)

  63. Shen, T., Lin, G., Shen, C., Reid, I.D.: Bootstrapping the performance of webly supervised semantic segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1363–1371 (2018)

  64. Shen, Y., et al.: Noise-aware fully webly supervised object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11323–11332 (2020)

  65. Sun, G., Wang, W., Dai, J., Van Gool, L.: Mining cross-image semantics for weakly supervised semantic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 347–365. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_21

  66. Tan, H.H., Bansal, M.: Lxmert: learning cross-modality encoder representations from transformers. In: EMNLP/IJCNLP (2019)

  67. Uijlings, J.R.R., Popov, S., Ferrari, V.: Revisiting knowledge transfer for training object class detectors. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1101–1110 (2018)

  68. Ulutan, O., Iftekhar, A.S.M., Manjunath, B.S.: Vsgnet: spatial attention network for detecting human object interactions using graph convolutions. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13614–13623 (2020)

  69. Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575 (2015)

  70. Vijayanarasimhan, S., Grauman, K.: Keywords to visual categories: multiple-instance learning for weakly supervised object categorization. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)

  71. Wang, S., Thompson, L., Iyyer, M.: Phrase-Bert: improved phrase embeddings from Bert with an application to corpus exploration. In: EMNLP (2021)

  72. Wang, S., Joo, J., Wang, Y., Zhu, S.C.: Weakly supervised learning for attribute localization in outdoor scenes. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3111–3118 (2013)

  73. Wang, X.J., Zhang, L., Li, X., Ma, W.Y.: Annotating images by mining image search results. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1919–1932 (2008)

  74. Whitehead, S., Wu, H., Ji, H., Feris, R.S., Saenko, K.: Separating skills and concepts for novel visual question answering. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5628–5637 (2021)

  75. Wu, Z., Tao, Q., Lin, G., Cai, J.: Exploring bottom-up and top-down cues with attentive learning for webly supervised object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12933–12942 (2020)

  76. Xu, H., Yan, M., Li, C., Bi, B., Huang, S., Xiao, W., Huang, F.: E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning (2021)

  77. Yang, J., et al.: Webly supervised image classification with self-contained confidence. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 779–795. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_46

  78. Yatskar, M., Zettlemoyer, L., Farhadi, A.: Situation recognition: visual semantic role labeling for image understanding. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5534–5542 (2016)

  79. Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14388–14397 (2021)

  80. Zhang, A., et al.: Mining the benefits of two-stage and one-stage hoi detection. arXiv preprint arXiv:2108.05077 (2021)

  81. Zhang, P., et al.: Vinvl: making visual representations matter in vision-language models. arXiv:2101.00529 (2021)

  82. Zheng, W., Yan, L., Gou, C., Wang, F.: Webly supervised knowledge embedding model for visual reasoning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12442–12451 (2020)

  83. Zhong, X., Ding, C., Qu, X., Tao, D.: Polysemy deciphering network for human-object interaction detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 69–85. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_5

  84. Zou, C., et al.: End-to-end human object interaction detection with hoi transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)

Acknowledgements

This work is partially supported by ONR award N00014-21-1-2705.

Author information

Corresponding author

Correspondence to Amita Kamath.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3007 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D., Kembhavi, A. (2022). Webly Supervised Concept Expansion for General Purpose Vision Models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_38

  • DOI: https://doi.org/10.1007/978-3-031-20059-5_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science, Computer Science (R0)
