
Zero-Shot Learning for Computer Vision Applications

Published: 27 October 2023

Abstract

Human beings possess the remarkable ability to recognize unseen concepts by integrating their visual perception of known concepts with high-level descriptions. However, today's best-performing deep learning frameworks are supervised learners that struggle to recognize concepts without training on their labeled visual samples. Zero-shot learning (ZSL) has recently emerged as a solution that mimics humans and leverages multimodal information to transfer knowledge from seen to unseen concepts. This study emphasizes the practicality of ZSL, unlocking its potential across four computer vision applications: object recognition, object detection, action recognition, and human-object interaction detection. Several task-specific challenges are identified and addressed in the presented research hypotheses. Zero-shot frameworks are proposed that attain state-of-the-art performance, and some future research directions are elucidated as well.
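The seen-to-unseen knowledge transfer described above is commonly realized by mapping visual features into a shared semantic space (e.g. attribute vectors or word embeddings) and matching against unseen-class embeddings. The sketch below is a hypothetical illustration of that general recipe, not the authors' proposed frameworks; the class names, dimensions, and the random stand-in projection are all assumptions.

```python
import numpy as np

# Hypothetical zero-shot classification sketch (not the paper's method):
# project a visual feature into a semantic space learned on *seen*
# classes, then pick the nearest *unseen* class embedding by cosine
# similarity.

rng = np.random.default_rng(0)

SEMANTIC_DIM = 8
VISUAL_DIM = 16
unseen_class_names = ["zebra", "okapi", "tapir"]  # illustrative names

# Per-class semantic vectors (in practice: attribute annotations or
# word embeddings such as word2vec); random stand-ins here.
class_embeddings = rng.normal(size=(len(unseen_class_names), SEMANTIC_DIM))
class_embeddings /= np.linalg.norm(class_embeddings, axis=1, keepdims=True)

# A visual-to-semantic projection that would be learned on seen classes;
# a random matrix stands in for the trained weights.
W = rng.normal(size=(VISUAL_DIM, SEMANTIC_DIM))

def zero_shot_classify(visual_feature: np.ndarray) -> str:
    """Predict an unseen class by nearest neighbor in semantic space."""
    z = visual_feature @ W                 # project into semantic space
    z = z / np.linalg.norm(z)              # unit-normalize for cosine sim
    scores = class_embeddings @ z          # cosine similarity per class
    return unseen_class_names[int(np.argmax(scores))]

image_feature = rng.normal(size=VISUAL_DIM)
print(zero_shot_classify(image_feature))
```

Because the class embeddings encode shared semantics (attributes, language), the same projection generalizes to classes with no labeled visual samples, which is the core idea the paper builds on across its four applications.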

Supplemental Material

MP4 File
This video presents a doctoral study exploring zero-shot learning for four different computer vision applications, ranging from simple tasks like object recognition to complex ones like human-object interaction detection. After a brief introduction to the emerging field of zero-shot learning, several challenges are identified in each of the four vision applications, and corresponding research questions are posed. A few outcomes of the study (both published and unpublished) are presented, and their impact on multimedia as a domain is discussed.


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. action recognition
  2. CLIP
  3. GAN
  4. human-object interaction detection
  5. object detection
  6. object recognition
  7. seed construction
  8. transformer
  9. triplet loss
  10. visual-semantic mining
  11. zero-shot learning

Qualifiers

  • Short-paper

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
