Abstract:
Despite the great progress in temporal action localization (TAL), most existing methods directly use video encoders trained on the trimmed Kinetics400 dataset to obtain clip-level visual features, ignoring the cross-dataset bias between Kinetics400 and TAL benchmarks. This dataset bias leads to poor visual representations, potentially hindering both temporal detection and action recognition performance in TAL. In this paper, we propose a novel TAL method, termed feature refinement with masked cascaded network (FR-MCN), to tackle this problem. Specifically, FR-MCN presents a new feature refinement strategy that introduces a clip-level feature classification task for both action and background clips, improving the temporal sensitivity and action semantics of the visual features. Moreover, FR-MCN employs a masked cascaded paradigm for refinement to learn the semantic disparities between action and background clips near action boundaries, enabling starting and ending instants to be detected accurately. Extensive experimental results on THUMOS14 and ActivityNet v1.3 demonstrate that FR-MCN significantly improves action localization performance.
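The core refinement idea in the abstract, classifying clip-level features over both action and background clips so the features become more temporally sensitive, can be illustrated with a minimal PyTorch sketch. This is not the paper's actual FR-MCN architecture; the refinement head, feature dimension, and the convention of adding one extra background class are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class ClipFeatureRefiner(nn.Module):
    """Hypothetical sketch: refine clip-level features and classify each
    clip as one of C action classes or background, so supervision on both
    action and background clips sharpens temporal sensitivity."""
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
        )
        # C action classes plus 1 background class (an assumed convention)
        self.classifier = nn.Linear(feat_dim, num_classes + 1)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) clip-level visual features
        refined = self.refine(clip_feats)
        logits = self.classifier(refined)  # (batch, num_clips, num_classes + 1)
        return refined, logits

# Training-step sketch: each clip is labeled with its action class,
# or with index num_classes if it is a background clip.
model = ClipFeatureRefiner(feat_dim=2048, num_classes=20)
feats = torch.randn(2, 100, 2048)        # e.g. 100 clips per video
labels = torch.randint(0, 21, (2, 100))  # 20 = background index
refined, logits = model(feats)
loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), labels.flatten())
loss.backward()
```

In this sketch, the classification loss drives the refined features to separate action from background clips; the paper's masked cascaded paradigm would then build on such refined features to resolve boundaries, but its details are beyond what the abstract specifies.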
Published in: 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP)
Date of Conference: 04-07 December 2023
Date Added to IEEE Xplore: 29 January 2024