Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance

Published: 08 March 2024

Abstract

Pedestrian attribute recognition (PAR) aims to predict the visual attributes of a pedestrian from an image. PAR has been used as a soft biometric for visual surveillance and IoT security. Most current PAR methods operate on individual still images. However, image-based methods struggle with occlusion and action-related attributes in real-world applications. Recently, video-based PAR has attracted much attention because it can exploit the temporal cues in video sequences. Unfortunately, existing methods usually ignore the correlations among different attributes and the relations between attributes and spatial regions. To address this problem, we propose a novel method for video-based PAR that explores the relationships among different attributes in both the spatial and temporal domains. More specifically, a spatio-temporal saliency module (STSM) is introduced to capture the key visual patterns from the video sequences, and a module for spatio-temporal attribute relationship learning (STARL) is proposed to mine the correlations among these patterns. In addition, a large-scale benchmark for video-based PAR, RAP-Video, is built by extending the image-based dataset RAP-2; RAP-Video contains 83,216 tracklets from 25 scenes. To the best of our knowledge, this is the largest dataset for video-based PAR. Extensive experiments are performed on the proposed benchmark as well as on MARS Attribute and DukeMTMC-Video Attribute. The superior performance on these datasets demonstrates the effectiveness of the proposed method.

Cited By

  • (2024) Introduction to the Special Issue on Integrity of Multimedia and Multimodal Data in Internet of Things. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-4). DOI: 10.1145/3643040. Online publication date: 8-Mar-2024

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
    June 2024
    715 pages
    EISSN: 1551-6865
    DOI: 10.1145/3613638
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 March 2024
    Online AM: 13 November 2023
    Accepted: 08 November 2023
    Revised: 21 September 2023
    Received: 11 June 2023
    Published in TOMM Volume 20, Issue 6

    Author Tags

    1. Video-based pedestrian attribute recognition
    2. spatio-temporal relationship learning
    3. IoT security

    Qualifiers

    • Research-article

    Funding Sources

    • Talent Introduction Program for Youth Innovation Teams of Shandong Province
