Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery

  • Conference paper

Advanced Data Mining and Applications (ADMA 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14177)


Abstract

In this paper, we propose a two-branch multimodal conditional variational auto-encoder (MC-VAE) that learns a unified real-world event embedding space for zero-shot event discovery. More specifically, given multimodal data, a Vision Transformer is exploited to extract global and local visual features, and BERT is adopted to obtain high-level semantic textual features. On top of these features, a textual MC-VAE and a visual MC-VAE are designed to learn complementary multimodal representations. Using textual features as conditions, the textual MC-VAE encodes visual features so that they conform to textual semantics; symmetrically, the visual MC-VAE encodes textual features in accordance with visual semantics, using visual features as conditions. In particular, the textual and visual MC-VAEs exploit an MSE loss to preserve visual and textual semantics, respectively, while learning the complementary representations. Finally, the complementary multimodal representations produced by the two branches are integrated to predict real-world event labels in embedding form, and this prediction in turn provides feedback to fine-tune the Vision Transformer. Experiments conducted on real-world and zero-shot datasets demonstrate the superior performance of the proposed MC-VAE.
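
To make the two-branch design concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: each branch is a conditional VAE that encodes one modality conditioned on the other, is trained with an MSE reconstruction loss plus a KL term, and the two latent codes are fused to predict an event-label embedding. The feature dimensions (768-d ViT/BERT features), latent size, MLP widths, KL weight, and module names are illustrative assumptions rather than the paper's exact configuration, and the feedback loop used to fine-tune the Vision Transformer is not reproduced here.

```python
# Minimal sketch of a two-branch multimodal conditional VAE (PyTorch).
# All dimensions, layer sizes, and loss weights below are illustrative
# assumptions; the paper's exact architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalVAE(nn.Module):
    """Encodes an input feature conditioned on the other modality's feature."""

    def __init__(self, in_dim, cond_dim, latent_dim=64, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim + cond_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, x, cond):
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar, z


def cvae_loss(recon, target, mu, logvar):
    # MSE reconstruction keeps the semantics of the encoded modality;
    # the KL term regularizes the latent space (standard VAE objective).
    mse = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + 1e-3 * kld  # the 1e-3 KL weight is an assumption


class MCVAE(nn.Module):
    """Two-branch MC-VAE: one branch conditions on text, the other on vision."""

    def __init__(self, vis_dim=768, txt_dim=768, latent_dim=64, embed_dim=300):
        super().__init__()
        # Textual MC-VAE: encodes visual features conditioned on textual features.
        self.textual_branch = ConditionalVAE(vis_dim, txt_dim, latent_dim)
        # Visual MC-VAE: encodes textual features conditioned on visual features.
        self.visual_branch = ConditionalVAE(txt_dim, vis_dim, latent_dim)
        # Fuses the two latent codes to predict the event-label embedding.
        self.head = nn.Linear(2 * latent_dim, embed_dim)

    def forward(self, vis_feat, txt_feat):
        recon_v, mu_v, lv_v, z_t = self.textual_branch(vis_feat, txt_feat)
        recon_t, mu_t, lv_t, z_v = self.visual_branch(txt_feat, vis_feat)
        pred_embed = self.head(torch.cat([z_t, z_v], dim=-1))
        loss = cvae_loss(recon_v, vis_feat, mu_v, lv_v) + cvae_loss(recon_t, txt_feat, mu_t, lv_t)
        return pred_embed, loss


if __name__ == "__main__":
    # vis_feat would come from a Vision Transformer and txt_feat from BERT (both 768-d here).
    vis_feat, txt_feat = torch.randn(8, 768), torch.randn(8, 768)
    model = MCVAE()
    pred_embed, loss = model(vis_feat, txt_feat)
    print(pred_embed.shape, loss.item())  # torch.Size([8, 300]) and a scalar loss
```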

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 62076073), and the Youth Talent Support Programme of Guangdong Provincial Association for Science and Technology (No. SKXRC202305).

Author information

Corresponding author

Correspondence to Zhenguo Yang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Z., Luo, D., You, J., Guo, Z., Yang, Z. (2023). Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol. 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_43

  • DOI: https://doi.org/10.1007/978-3-031-46664-9_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46663-2

  • Online ISBN: 978-3-031-46664-9

  • eBook Packages: Computer Science, Computer Science (R0)
