Abstract:
Cross-modal Retrieval (CMR) is formulated for scenarios where the queries and retrieval results are of different modalities. Existing CMR studies mainly focus on the common contextualized information between text transcripts and images, and on the synchronized event information in audio-visual recordings. Unlike previous works, in this article we investigate the geometric correspondence between images and speech recordings captured in the same space and formulate a novel CMR task, called Spatial Image-Acoustic Retrieval (SIAR). To this end, we first design a novel speech encoder, consisting of convolutional neural networks and transformer layers, to learn space-aware speech representations. Then, to eliminate the inherent cross-modal discrepancy, we propose the Contrastive Speech Image Retrieval (CSIR) method, which uses supervised contrastive learning to attract same-space cross-modal features while repelling those from different spaces. Finally, image and speech features are compared directly, and the SIAR result is predicted as the candidate with the maximum similarity. Extensive experiments demonstrate that our proposed speech encoder recognizes the recording space from human speech with superior performance over other prevailing networks; it also accomplishes our penultimate goal of speech-to-speech retrieval. Furthermore, our CSIR method successfully performs bi-directional SIAR between spatial images and reverberant speech with promising results. Code and data will be available.
Published in: IEEE Transactions on Multimedia ( Volume: 26)
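The abstract describes two computational steps: a supervised contrastive objective that pulls same-space image and speech features together, and retrieval by maximum similarity between the two feature sets. The sketch below illustrates both in NumPy; the function names, the temperature value, and the exact form of the loss are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project feature vectors onto the unit sphere for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def supervised_contrastive_loss(img_feats, speech_feats, space_ids, temperature=0.1):
    """Cross-modal supervised contrastive loss (illustrative form).

    For each image anchor, speech features recorded in the same space
    (same space_id) are positives; all other speech features are negatives.
    """
    img = l2_normalize(img_feats)                      # (N, D)
    spe = l2_normalize(speech_feats)                   # (N, D)
    logits = img @ spe.T / temperature                 # pairwise similarities
    pos_mask = (space_ids[:, None] == space_ids[None, :]).astype(float)
    # log-softmax over each anchor's row of candidates
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # average log-likelihood of the positives for each anchor
    loss = -(pos_mask * log_prob).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()

def retrieve(query_feats, gallery_feats):
    """Predict the retrieval result as the gallery index with max cosine similarity."""
    sims = l2_normalize(query_feats) @ l2_normalize(gallery_feats).T
    return sims.argmax(axis=1)
```

Once training has aligned the two modalities, the same `retrieve` function serves both directions of SIAR: image features querying a speech gallery, or speech features querying an image gallery.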