HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot Classification with Unimodal Cues

  • Conference paper
  • In: Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2023)

Abstract

Recognition of remote sensing (RS) or aerial images is currently of great interest, and advances in deep learning have accelerated progress in recent years. However, challenges such as occlusion, intra-class variance, and lighting variations arise when training neural networks on unimodal RS visual input. Although joint training of audio-visual modalities improves classification performance in a low-data regime, it has yet to be thoroughly investigated in the RS domain. Here, we address a novel problem in which both the audio and visual modalities are present during meta-training of a few-shot learning (FSL) classifier, but one of the modalities may be missing during the meta-testing stage. This problem formulation is pertinent to the RS domain, given the difficulties of data acquisition and the possibility of sensor malfunction. To mitigate this, we propose a novel few-shot generative framework, Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), which meta-trains cross-modal features from limited unimodal data. Specifically, these hallucinated features are meta-learned from the base classes and used for few-shot classification on the novel classes during the inference phase. Experimental results on the benchmark ADVANCE and AudioSetZSL datasets show that our hallucinated-modality augmentation strategy for few-shot classification outperforms a classifier trained with the real multimodal information by at least 0.8–2%.
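
A rough reading of the framework described above: a generator hallucinates the missing modality's embedding from the available one, and classification then follows a nearest-prototype scheme in the style of prototypical networks [15]. The PyTorch sketch below illustrates only this idea; the two-layer MLP hallucinator, concatenation-based fusion, embedding sizes, and toy episode data are all illustrative assumptions, not the authors' actual HAVE-Net architecture.

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """Maps an embedding of one modality to a hallucinated embedding of the
    other (e.g. visual -> audio). A two-layer MLP is an assumed stand-in for
    the paper's generator."""
    def __init__(self, in_dim=512, out_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, z):
        return self.net(z)

def fuse(real, hallucinated):
    # Assumed fusion: concatenate the real and hallucinated embeddings.
    return torch.cat([real, hallucinated], dim=-1)

def prototypes(feats, labels, n_way):
    # Class prototype = mean of the fused support embeddings per class,
    # as in prototypical networks [15].
    return torch.stack([feats[labels == c].mean(0) for c in range(n_way)])

def classify(query_feats, protos):
    # Nearest-prototype assignment under Euclidean distance.
    return torch.cdist(query_feats, protos).argmin(dim=-1)

# Toy 5-way 3-shot episode where only the visual modality is available at
# meta-test time, so the audio embedding is hallucinated from it.
n_way, k_shot, n_query, d = 5, 3, 10, 512
v2a = Hallucinator(d, d)                      # visual -> hallucinated audio
vis_support = torch.randn(n_way * k_shot, d)  # placeholder visual embeddings
y_support = torch.arange(n_way).repeat_interleave(k_shot)
vis_query = torch.randn(n_query, d)

s_feats = fuse(vis_support, v2a(vis_support))
q_feats = fuse(vis_query, v2a(vis_query))
preds = classify(q_feats, prototypes(s_feats, y_support, n_way))
```

In the paper's setting such a hallucinator would be meta-trained on the base classes (the abstract describes a generative framework, and conditional GANs [9] are cited) so that hallucinated embeddings align with real paired embeddings; the sketch omits that training loop.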

Notes

  1. To the best of our knowledge, we compare our proposed method with the most relevant one; [10] concentrates on image-text pairs as its modalities, so a direct comparison with our problem of joint audio-visual learning would not be fair.

References

  1. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: benchmark and state of the art. arXiv preprint arXiv:1703.00121 (2017)

  2. Finn, C., Xu, K., Levine, S.: Probabilistic model-agnostic meta-learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)

  3. Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: Proceedings of IEEE ICASSP, New Orleans, LA (2017)

  4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)

  5. Heidler, K., et al.: Self-supervised audiovisual representation learning for remote sensing data. arXiv preprint arXiv:2108.00688 (2021)

  6. Hu, D., et al.: Cross-task transfer for geotagged audiovisual aerial scene recognition. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  7. Koch, G., Zemel, R., Salakhutdinov, R., et al.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2, Lille (2015)

  8. Mao, G., Yuan, Y., Lu, X.: Deep cross-modal retrieval for remote sensing image and audio. In: 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), pp. 1–7 (2018). https://doi.org/10.1109/PRRS.2018.8486338

  9. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)

  10. Pahde, F., Puscas, M., Klein, T., Nabi, M.: Multimodal prototypical networks for few-shot learning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2644–2653 (2021)

  11. Pal, D., Bundele, V., Banerjee, B., Jeppu, Y.: SPN: stable prototypical network for few-shot learning-based hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022). https://doi.org/10.1109/LGRS.2021.3085522

  12. Parida, K.K., Matiyali, N., Guha, T., Sharma, G.: Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2020)

  13. Ren, M., et al.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)

  14. Salem, T., Zhai, M., Workman, S., Jacobs, N.: A multimodal approach to mapping soundscapes. In: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 3477–3480 (2018). https://doi.org/10.1109/IGARSS.2018.8517977

  15. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems (2017)

  16. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

  17. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 29 (2016)

Author information

Corresponding author

Correspondence to Ankit Jha.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jha, A., Pal, D., Singha, M., Agarwal, N., Banerjee, B. (2025). HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot Classification with Unimodal Cues. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_32

  • DOI: https://doi.org/10.1007/978-3-031-74640-6_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74639-0

  • Online ISBN: 978-3-031-74640-6

  • eBook Packages: Artificial Intelligence (R0)
