Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery

  • Conference paper

Advanced Data Mining and Applications (ADMA 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14177)


Abstract

In this paper, we propose a two-branch multimodal conditional variational auto-encoder (MC-VAE) that learns a unified real-world event embedding space for zero-shot event discovery. More specifically, given multimodal data, a Vision Transformer is exploited to extract global and local visual features, and BERT is adopted to obtain high-level semantic textual features. On top of these features, a textual MC-VAE and a visual MC-VAE are designed to learn complementary multimodal representations. Using textual features as conditions, the textual MC-VAE encodes visual features so that they conform to textual semantics; symmetrically, the visual MC-VAE encodes textual features in accordance with visual semantics, using visual features as conditions. In particular, the textual and visual MC-VAEs exploit an MSE loss to preserve visual and textual semantics, respectively, while learning the complementary representations. Finally, the complementary multimodal representations produced by the two branches are integrated to predict real-world event labels in embedding form, and this prediction in turn provides feedback to fine-tune the Vision Transformer. Experiments conducted on real-world and zero-shot datasets demonstrate the superior performance of the proposed MC-VAE.
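
To make the two-branch design concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: each branch is a conditional VAE that encodes one modality conditioned on the other, is trained with an MSE reconstruction loss plus a KL term, and the two latent codes are fused to predict an event-label embedding. The feature dimensions (768-d ViT/BERT features), latent size, MLP widths, KL weight, and module names are illustrative assumptions rather than the paper's exact configuration, and the feedback loop used to fine-tune the Vision Transformer is not reproduced here.

```python
# Minimal sketch of a two-branch multimodal conditional VAE (PyTorch).
# All dimensions, layer sizes, and loss weights below are illustrative
# assumptions; the paper's exact architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalVAE(nn.Module):
    """Encodes an input feature conditioned on the other modality's feature."""

    def __init__(self, in_dim, cond_dim, latent_dim=64, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim + cond_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, x, cond):
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar, z


def cvae_loss(recon, target, mu, logvar):
    # MSE reconstruction keeps the semantics of the encoded modality;
    # the KL term regularizes the latent space (standard VAE objective).
    mse = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + 1e-3 * kld  # the 1e-3 KL weight is an assumption


class MCVAE(nn.Module):
    """Two-branch MC-VAE: one branch conditions on text, the other on vision."""

    def __init__(self, vis_dim=768, txt_dim=768, latent_dim=64, embed_dim=300):
        super().__init__()
        # Textual MC-VAE: encodes visual features conditioned on textual features.
        self.textual_branch = ConditionalVAE(vis_dim, txt_dim, latent_dim)
        # Visual MC-VAE: encodes textual features conditioned on visual features.
        self.visual_branch = ConditionalVAE(txt_dim, vis_dim, latent_dim)
        # Fuses the two latent codes to predict the event-label embedding.
        self.head = nn.Linear(2 * latent_dim, embed_dim)

    def forward(self, vis_feat, txt_feat):
        recon_v, mu_v, lv_v, z_t = self.textual_branch(vis_feat, txt_feat)
        recon_t, mu_t, lv_t, z_v = self.visual_branch(txt_feat, vis_feat)
        pred_embed = self.head(torch.cat([z_t, z_v], dim=-1))
        loss = cvae_loss(recon_v, vis_feat, mu_v, lv_v) + cvae_loss(recon_t, txt_feat, mu_t, lv_t)
        return pred_embed, loss


if __name__ == "__main__":
    # vis_feat would come from a Vision Transformer and txt_feat from BERT (both 768-d here).
    vis_feat, txt_feat = torch.randn(8, 768), torch.randn(8, 768)
    model = MCVAE()
    pred_embed, loss = model(vis_feat, txt_feat)
    print(pred_embed.shape, loss.item())  # torch.Size([8, 300]) and a scalar loss
```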

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 62076073), and the Youth Talent Support Programme of Guangdong Provincial Association for Science and Technology (No. SKXRC202305).

Author information

Corresponding author

Correspondence to Zhenguo Yang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Z., Luo, D., You, J., Guo, Z., Yang, Z. (2023). Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol. 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_43

  • DOI: https://doi.org/10.1007/978-3-031-46664-9_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46663-2

  • Online ISBN: 978-3-031-46664-9

  • eBook Packages: Computer Science, Computer Science (R0)
