Abstract:
Weakly supervised video moment retrieval or weakly supervised language moment retrieval aims to search the most relevant moment given a language query. In order to guide ...Show MoreMetadata
Abstract:
Weakly supervised video moment retrieval or weakly supervised language moment retrieval aims to search the most relevant moment given a language query. In order to guide the model to capture the most matching video segments with the text description, we design a two-granularity loss function that simultaneously considers both video-level and instance-level relationships. Specifically, we first generate coarse video segments and regard each video segment as an instance. For video-level regularized multiple instance loss (MIL), we leverage the latent alignment between all intra-video segments (ie., positive bag) and text descriptions. Then, we classify these segments by regarding this procedure as a supervised learning task under noisy labels. With the instance-level regularized loss function, our model can learn to correct noisy instance-level labels so as to locate the more accurate frame boundary from all the positive instances. Comprehensive experimental results on ActivityNet and DiDeMo demonstrate that the proposed loss function sets a new state-of-the-art.
Published in: IEEE Transactions on Multimedia ( Volume: 24)