Abstract:
Audio and visual are two important modalities in video content understanding. However, one modality may be absent in practical applications due to real-world environmental factors, which leads to information loss. Audio-visual fusion therefore focuses on exploiting the shared and complementary information between modalities to recover a missing modality from the available ones. In this paper, an Adversarial Hierarchical Variational Auto-Encoder (Adv-HVAE) model is proposed to address this problem of missing modality data. A multimodal representation is first learned with a hierarchical Variational Autoencoder (VAE) that enables the generation of missing modal data from any subset of available modalities. In addition, to obtain a more robust multimodal representation, a feature generation network is used to approximate the latent distribution of the missing modalities. Finally, an adversarial training network is shown to be effective in improving the quality of the data generated by the Adv-HVAE framework. Experimental results demonstrate that Adv-HVAE achieves the best generation results on two benchmark datasets, avMNIST and Sub-URMP.
Date of Conference: 19-22 May 2024
Date Added to IEEE Xplore: 02 July 2024