Abstract:
Training under the supervision of a triplet ranking loss is the dominant methodology for cross-modal matching models, yet well-performing losses in this domain remain largely under-explored, since most advanced metric-learning losses are inapplicable due to the particularities of the cross-modal setting. Current prominent metric-learning approaches have developed various weighting schemes that assign weights to individual positive or negative samples. It is the interclass relative order within a triplet, however, that matters. In this work, we propose a new Interclass-Relativity-Adaptive (IRA) loss that assigns weights to the relative similarities between positive and negative pairs rather than to separate pairs, which allows us to treat a whole triplet as a weighable entity and to make maximal use of the sole positive pair available in the cross-modal setting. Our method outperforms the baselines by a large margin and obtains competitive results on two video-text matching benchmarks and two image-text matching benchmarks. We further extend our method to two unimodal image retrieval benchmarks to test its generality and achieve new state-of-the-art results.
Published in: IEEE Transactions on Multimedia (Volume: 23)
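The abstract does not give the loss formulation; the exact IRA weighting scheme is defined in the paper. As an illustrative sketch only, the snippet below contrasts a standard hinge-based triplet ranking loss with a hypothetical weighting applied to the whole triplet through its relative similarity (s_neg - s_pos). The function names, the sigmoid weighting, and the temperature parameter are assumptions for illustration, not the authors' formulation.

```python
import torch

def triplet_ranking_loss(s_pos, s_neg, margin=0.2):
    """Standard hinge-based triplet ranking loss, the common cross-modal baseline.
    s_pos: similarity of each (anchor, positive) pair, shape (B,)
    s_neg: similarity of each (anchor, negative) pair, shape (B,)
    """
    return torch.clamp(margin + s_neg - s_pos, min=0).mean()

def relative_weighted_triplet_loss(s_pos, s_neg, margin=0.2, temperature=10.0):
    """Illustrative sketch (not the paper's IRA loss): weight each triplet as a
    whole by its interclass relative similarity (s_neg - s_pos), so that harder
    triplets, where the negative is nearly as close as the positive, receive
    larger weights. The sigmoid weighting and temperature are hypothetical."""
    relative = s_neg - s_pos                        # interclass relative order per triplet
    weight = torch.sigmoid(temperature * relative)  # one weight per whole triplet, not per pair
    weight = weight.detach()                        # weights act as coefficients, not as gradient paths
    return (weight * torch.clamp(margin + relative, min=0)).mean()

# Usage with random similarities for a batch of 8 triplets
if __name__ == "__main__":
    s_pos = torch.rand(8)
    s_neg = torch.rand(8)
    print(triplet_ranking_loss(s_pos, s_neg))
    print(relative_weighted_triplet_loss(s_pos, s_neg))
```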