
Zero-Shot Learning for Computer Vision Applications

Published: 27 October 2023

Abstract

Human beings possess the remarkable ability to recognize unseen concepts by integrating their visual perception of known concepts with high-level descriptions. However, today's best-performing deep learning frameworks are supervised learners that struggle to recognize concepts without training on their labeled visual samples. Zero-shot learning (ZSL) has recently emerged as a solution that mimics humans and leverages multimodal information to transfer knowledge from seen to unseen concepts. This study emphasizes the practicality of ZSL, unlocking its potential across four computer vision applications: object recognition, object detection, action recognition, and human-object interaction detection. Several task-specific challenges are identified and addressed in the presented research hypotheses. Zero-shot frameworks are proposed that attain state-of-the-art performance, and some future research directions are elucidated as well.
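The seen-to-unseen knowledge transfer described above is commonly realized by mapping visual features into a shared semantic space (e.g. attribute vectors or word embeddings) and matching against unseen-class embeddings. The sketch below is a hypothetical illustration of that general recipe, not the authors' proposed frameworks; the class names, dimensions, and the random stand-in projection are all assumptions.

```python
import numpy as np

# Hypothetical zero-shot classification sketch (not the paper's method):
# project a visual feature into a semantic space learned on *seen*
# classes, then pick the nearest *unseen* class embedding by cosine
# similarity.

rng = np.random.default_rng(0)

SEMANTIC_DIM = 8
VISUAL_DIM = 16
unseen_class_names = ["zebra", "okapi", "tapir"]  # illustrative names

# Per-class semantic vectors (in practice: attribute annotations or
# word embeddings such as word2vec); random stand-ins here.
class_embeddings = rng.normal(size=(len(unseen_class_names), SEMANTIC_DIM))
class_embeddings /= np.linalg.norm(class_embeddings, axis=1, keepdims=True)

# A visual-to-semantic projection that would be learned on seen classes;
# a random matrix stands in for the trained weights.
W = rng.normal(size=(VISUAL_DIM, SEMANTIC_DIM))

def zero_shot_classify(visual_feature: np.ndarray) -> str:
    """Predict an unseen class by nearest neighbor in semantic space."""
    z = visual_feature @ W                 # project into semantic space
    z = z / np.linalg.norm(z)              # unit-normalize for cosine sim
    scores = class_embeddings @ z          # cosine similarity per class
    return unseen_class_names[int(np.argmax(scores))]

image_feature = rng.normal(size=VISUAL_DIM)
print(zero_shot_classify(image_feature))
```

Because the class embeddings encode shared semantics (attributes, language), the same projection generalizes to classes with no labeled visual samples, which is the core idea the paper builds on across its four applications.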

Supplemental Material

MP4 File
This video presents a doctoral study exploring zero-shot learning for four different computer vision applications, ranging from simple tasks like object recognition to complex ones like human-object interaction detection. After a brief introduction to the emerging field of zero-shot learning, several challenges are identified in each of the four vision applications, and corresponding research questions are posed. A few outcomes of the study (both published and unpublished) are presented, and their impact on multimedia as a domain is discussed.


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. action recognition
  2. CLIP
  3. GAN
  4. human-object interaction detection
  5. object detection
  6. object recognition
  7. seed construction
  8. transformer
  9. triplet loss
  10. visual-semantic mining
  11. zero-shot learning

Qualifiers

  • Short-paper

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
