Impact Statement:
Robust video fingerprinting, which maps the perceptual content of an input video into a fixed-size fingerprint, has proven effective for video content authentication. However, most current video fingerprinting schemes treat the spatial and temporal features of videos symmetrically, ignoring the different characteristics of the two dimensions. In this paper, we propose a network that employs two branches to focus on the temporal and spatial features, respectively. To better simulate actual scenarios, a large-scale and diverse video dataset is constructed for the robust video fingerprinting task. The proposed scheme ensures the robustness of generated video fingerprints against various common content-preserving video manipulations, while effectively distinguishing perceptually distinct videos. Compared to the state of the art, our scheme provides a significant boost in content authentication performance.
Abstract:
With the increasing number of edited videos, many robust video fingerprinting schemes have been proposed to solve the problem of video content authentication. However, most of them either deal with the temporal and spatial features symmetrically or insufficiently consider the temporal information. In this work, an end-to-end two-branch network toward robust video fingerprinting (RVFNet) is proposed, where the two branches focus on the temporal and spatial information, respectively. The temporal branch aims to comprehensively capture complex motion patterns by combining subtle motion changes with the overall motion trend. The spatial branch exploits the pixel-level information obtained by multiple receptive fields while preserving significant structural features. Deep metric learning is employed in the training process, and we adopt hard triplet loss to constrain the generation of fingerprints. Furthermore, we construct a large-scale and complex dataset for the robust video fingerprinti...
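The abstract mentions a hard triplet loss without giving its exact form here. As a rough illustration only, the sketch below implements a common "batch-hard" variant: for each anchor fingerprint, it takes the farthest fingerprint with the same label (the hardest positive) and the closest fingerprint with a different label (the hardest negative). The function names, the Euclidean distance, the margin value, and the mining strategy are all assumptions for illustration and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(fingerprints, labels, margin=0.5):
    """Mean batch-hard triplet loss over a batch (illustrative sketch).

    fingerprints: list of equal-length float vectors, one per video.
    labels: list of ids; videos derived from the same original share an id.
    margin: hypothetical margin value, not taken from the paper.
    """
    losses = []
    for i, anchor in enumerate(fingerprints):
        # Distances to other fingerprints of the same source video.
        positives = [
            euclidean(anchor, f)
            for j, f in enumerate(fingerprints)
            if j != i and labels[j] == labels[i]
        ]
        # Distances to fingerprints of perceptually different videos.
        negatives = [
            euclidean(anchor, f)
            for j, f in enumerate(fingerprints)
            if labels[j] != labels[i]
        ]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Two toy "videos" (labels 0 and 1), two fingerprints each.
    batch = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
    print(batch_hard_triplet_loss(batch, [0, 0, 1, 1], margin=0.5))
```

In this toy batch, copies of the same video lie farther apart than different videos do, so the loss is positive. A well-trained fingerprint extractor would drive it toward zero.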
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 5, May 2024)