Abstract:
Video surveillance systems are now widely deployed in public areas. However, in areas that surveillance cameras cannot reach, finding suspects from eyewitness memory alone still seems impossible. Therefore, technology that can find particular pedestrians using only text-based attributes, known as text-attribute person search, has attracted considerable attention from academia. Most existing text-attribute person search methods focus on learning better feature representations by designing better network structures or using local information, but lack direct constraints between modalities. This paper proposes a model motivated by feature embedding and based on a graph attention network, whose attention mechanism optimizes the feature extraction process. This paper also studies the effectiveness of the attention mechanism in feature alignment and accordingly redesigns the cross-attention module, reducing model complexity and constraining the inter-modality gap as far as possible through the self-attention mechanism of the graph attention network. In this way, the method simultaneously offsets the influence of modality-specific features and reduces the number of parameters, thereby improving performance and reducing time costs. In addition, based on the inherent characteristics of attributes, this paper introduces a novel embedding space that effectively enhances the discrimination ability of the model. Extensive experiments on two widely used text-attribute person search benchmarks demonstrate the superiority of our model over state-of-the-art methods.
Published in: IEEE Transactions on Multimedia (Volume: 26)