Abstract
In this paper, we propose a two-branch multimodal conditional variational auto-encoder (MC-VAE) that learns a unified embedding space of real-world events for zero-shot event discovery. Specifically, given multimodal data, a Vision Transformer is exploited to extract global and local visual features, and BERT is adopted to obtain high-level semantic textual features. A textual MC-VAE and a visual MC-VAE are then designed to learn complementary multimodal representations: using textual features as conditions, the textual MC-VAE encodes visual features so that they conform to textual semantics; symmetrically, using visual features as conditions, the visual MC-VAE encodes textual features in accordance with visual semantics. Both branches employ an MSE loss to preserve the respective visual and textual semantics while learning these complementary representations. Finally, the complementary representations produced by the two branches are integrated to predict real-world event labels in embedding form, and this prediction in turn provides feedback for fine-tuning the Vision Transformer. Experiments on real-world datasets and zero-shot datasets demonstrate that the proposed MC-VAE outperforms existing methods.
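The cross-conditioned encoding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature dimensions (768 for ViT/BERT features, 64 for the latent space), the single-linear-layer encoder/decoder, and the concatenation-based conditioning and fusion are all simplifying assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained linear layer."""
    return rng.normal(0.0, 0.02, (in_dim, out_dim))

class ConditionalVAEBranch:
    """Sketch of one MC-VAE branch: encodes features x conditioned on c."""

    def __init__(self, x_dim, c_dim, z_dim):
        self.W_mu = linear(x_dim + c_dim, z_dim)
        self.W_logvar = linear(x_dim + c_dim, z_dim)
        self.W_dec = linear(z_dim + c_dim, x_dim)

    def encode(self, x, c):
        # Condition by concatenating the feature with its conditioning signal.
        h = np.concatenate([x, c], axis=-1)
        return h @ self.W_mu, h @ self.W_logvar

    def reparameterize(self, mu, logvar):
        # Standard VAE reparameterization trick: z = mu + sigma * eps.
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps

    def forward(self, x, c):
        mu, logvar = self.encode(x, c)
        z = self.reparameterize(mu, logvar)
        # Decode z, again conditioned on c, to reconstruct x.
        x_rec = np.concatenate([z, c], axis=-1) @ self.W_dec
        return z, x_rec

# Hypothetical batch of ViT visual features and BERT textual features.
vis = rng.standard_normal((4, 768))
txt = rng.standard_normal((4, 768))

# Cross-conditioning as in the abstract: the textual MC-VAE encodes visual
# features conditioned on text; the visual MC-VAE does the reverse.
textual_branch = ConditionalVAEBranch(768, 768, 64)
visual_branch = ConditionalVAEBranch(768, 768, 64)

z_vis, vis_rec = textual_branch.forward(vis, txt)
z_txt, txt_rec = visual_branch.forward(txt, vis)

# MSE terms that would keep each branch faithful to its source semantics.
mse_vis = float(np.mean((vis - vis_rec) ** 2))
mse_txt = float(np.mean((txt - txt_rec) ** 2))

# Complementary multimodal representation: integrate the two latents.
fused = np.concatenate([z_vis, z_txt], axis=-1)
```

In a trained model the fused representation would be matched against event label embeddings; here the fusion is a plain concatenation chosen only to show the data flow.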
Acknowledgment
This work is supported by the National Natural Science Foundation of China (No. 62076073), and the Youth Talent Support Programme of Guangdong Provincial Association for Science and Technology (No. SKXRC202305).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yang, Z., Luo, D., You, J., Guo, Z., Yang, Z. (2023). Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery. In: Yang, X., et al. (eds.) Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46663-2
Online ISBN: 978-3-031-46664-9
eBook Packages: Computer Science, Computer Science (R0)