Abstract
Recognition of remote sensing (RS) or aerial images is currently of great interest, and recent advances in deep learning have accelerated progress further. However, issues such as occlusion, intra-class variance, and lighting changes can hinder neural networks trained on unimodal RS visual input alone. Although joint training of audio-visual modalities improves classification performance in a low-data regime, it has yet to be thoroughly investigated in the RS domain. Here, we address a novel problem in which both the audio and visual modalities are available during meta-training of a few-shot learning (FSL) classifier, but one of the modalities may be missing during meta-testing. This formulation is pertinent in the RS domain, given the difficulties of data acquisition and the possibility of sensor malfunction. To mitigate this, we propose a novel few-shot generative framework, Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), which meta-trains cross-modal features from limited unimodal data. Specifically, these hallucinated features are meta-learned from base classes and used for few-shot classification on novel classes during the inference phase. Experimental results on the benchmark ADVANCE and AudioSetZSL datasets show that our hallucinated modality augmentation strategy for few-shot classification outperforms a classifier trained with real multimodal information by at least 0.8–2%.
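To make the idea of hallucinated cross-modal features concrete, the following is a minimal, illustrative PyTorch sketch. It is not the authors' implementation: the module names (e.g., Hallucinator), embedding dimensions, and the prototypical-network classification head are assumptions chosen for illustration; the sketch only shows the general pattern of generating a missing audio embedding from a visual one (in the spirit of a conditional generator) and classifying queries against multimodal prototypes in a few-shot episode.

```python
# Minimal, illustrative sketch of cross-modal "hallucination" for few-shot
# classification. NOT the authors' implementation: module names, dimensions,
# and the prototypical-network head are assumptions for illustration only.
import torch
import torch.nn as nn


class Hallucinator(nn.Module):
    """Maps an embedding of one modality (+ noise) to a hallucinated
    embedding of the other modality, in the spirit of a conditional
    generator (cf. Mirza & Osindero, 2014)."""

    def __init__(self, in_dim=512, out_dim=128, noise_dim=64, hidden=256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim + noise_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        z = torch.randn(x.size(0), self.noise_dim, device=x.device)
        return self.net(torch.cat([x, z], dim=1))


def prototypes(features, labels, n_way):
    """Class prototypes = mean embedding per class (Snell et al., 2017)."""
    return torch.stack([features[labels == c].mean(dim=0) for c in range(n_way)])


def classify(query, protos):
    """Nearest-prototype classification via negative Euclidean distance."""
    return (-torch.cdist(query, protos)).argmax(dim=1)


if __name__ == "__main__":
    n_way, k_shot, q = 5, 1, 15           # a 5-way 1-shot episode
    vis_dim, aud_dim = 512, 128           # assumed backbone embedding sizes

    v2a = Hallucinator(vis_dim, aud_dim)  # visual -> hallucinated audio

    # Toy support/query visual embeddings; audio is assumed missing at test time.
    sup_v = torch.randn(n_way * k_shot, vis_dim)
    sup_y = torch.arange(n_way).repeat_interleave(k_shot)
    qry_v = torch.randn(n_way * q, vis_dim)

    # Augment visual features with hallucinated audio features, then classify
    # queries against the resulting multimodal prototypes.
    sup_f = torch.cat([sup_v, v2a(sup_v)], dim=1)
    qry_f = torch.cat([qry_v, v2a(qry_v)], dim=1)
    preds = classify(qry_f, prototypes(sup_f, sup_y, n_way))
    print(preds.shape)  # torch.Size([75])
```

In the paper's setting, the hallucinator would be meta-trained on base classes where both modalities are available, so that at meta-test time a single available modality suffices to build multimodal prototypes.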
Notes
- 1.
To the best of our knowledge, we compare our proposed method with the most relevant method. [10] concentrates on image-text pairs as its modalities, so a direct comparison with our problem of joint audio-visual learning would not be fair.
References
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: benchmark and state of the art. arXiv preprint arXiv:1703.00121 (2017). http://arxiv.org/abs/1703.00121
Finn, C., Xu, K., Levine, S.: Probabilistic model-agnostic meta-learning. In: Neural Information Processing Systems NeurIPS (2018)
Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: Proceedings of IEEE ICASSP, New Orleans, LA (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
Heidler, K., et al.: Self-supervised audiovisual representation learning for remote sensing data. arXiv preprint arXiv:2108.00688 (2021)
Hu, D., et al.: Cross-task transfer for geotagged audiovisual aerial scene recognition (2020)
Koch, G., Zemel, R., Salakhutdinov, R., et al.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2. Lille (2015)
Mao, G., Yuan, Y., Xiaoqiang, L.: Deep cross-modal retrieval for remote sensing image and audio. In: 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), pp. 1–7 (2018). https://doi.org/10.1109/PRRS.2018.8486338
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Pahde, F., Puscas, M., Klein, T., Nabi, M.: Multimodal prototypical networks for few-shot learning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2644–2653 (2021)
Pal, D., Bundele, V., Banerjee, B., Jeppu, Y.: SPN: stable prototypical network for few-shot learning-based hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022). https://doi.org/10.1109/LGRS.2021.3085522
Parida, K.K., Matiyali, N., Guha, T., Sharma, G.: Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2020)
Ren, M., et al.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)
Salem, T., Zhai, M., Workman, S., Jacobs, N.: A multimodal approach to mapping soundscapes. In: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 3477–3480 (2018). https://doi.org/10.1109/IGARSS.2018.8517977
Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 29 (2016)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jha, A., Pal, D., Singha, M., Agarwal, N., Banerjee, B. (2025). HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot Classification with Unimodal Cues. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_32
DOI: https://doi.org/10.1007/978-3-031-74640-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74639-0
Online ISBN: 978-3-031-74640-6