Abstract:
Video moment localization, as an important branch of video content analysis, has attracted extensive attention in recent years. However, it is still in its infancy due to...Show MoreMetadata
Abstract:
Video moment localization, as an important branch of video content analysis, has attracted extensive attention in recent years. However, it is still in its infancy due to the following challenges: cross-modal semantic alignment and localization efficiency. To address these impediments, we present a cross-modal semantic alignment network. To be specific, we first design a video encoder to generate moment candidates, learn their representations, as well as model their semantic relevance. Meanwhile, we design a query encoder for diverse query intention understanding. Thereafter, we introduce a multi-granularity interaction module to deeply explore the semantic correlation between multi-modalities. Thereby, we can effectively complete target moment localization via sufficient cross-modal semantic understanding. Moreover, we introduce a semantic pruning strategy to reduce cross-modal retrieval overhead, improving localization efficiency. Experimental results on two benchmark datasets have justified the superiority of our model over several state-of-the-art competitors.
Published in: IEEE Transactions on Image Processing ( Volume: 30)