Abstract
Person search by natural language description is a challenging problem because it requires modelling and learning a visual-text semantic embedding. While several works have addressed person search by English description, very few attempts have been made for other languages. This paper presents the first work towards person search by Vietnamese description. The contribution of the paper is threefold. First, we build 3000VnPersonSearch, the first large-scale dataset for person search by Vietnamese natural language. Second, we employ for Vietnamese description-based person search the dual-path architecture of Zheng et al. (ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23, 2020), in which single loss for intra-modal learning and triplet loss for cross-modal learning of text and image data distributions were considered. However, as Vietnamese is an under-resourced language, its existing word embedding models are still modest compared to those for English. Therefore, instead of using the word2vec model as in Zheng et al. (ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23, 2020), we modify the initialization process of the first convolution layer of the text-CNN path. In addition, we investigate in detail two online triplet mining strategies: batch all and batch hard. Extensive experiments have been conducted on benchmark datasets as well as on 3000VnPersonSearch. Experimental results show that the proposed method improves on the baseline method by 2.42% on the CUHK-PEDES dataset and achieves state-of-the-art results on the VnPersonSearch dataset by a significant margin over the method in Pham et al. (2020). Finally, to illustrate the practical use of person search by Vietnamese description, a web-based person search application is implemented and deployed.
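The two online triplet mining strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the distance function, the margin, or the anchor direction, so the Euclidean distance, the margin of 0.2, the choice of images as anchors and texts as positives and negatives, and all function names below are assumptions made for illustration. Batch all is written in its common form, which averages over every valid triplet that violates the margin, while batch hard keeps only the farthest positive and the closest negative for each anchor.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cross_modal_distances(image_embs, text_embs):
    """dist[i][j] is the distance between image i and text j."""
    return [[euclidean(img, txt) for txt in text_embs] for img in image_embs]


def batch_all_triplet_loss(dist, image_ids, text_ids, margin=0.2):
    """Mean hinge loss over all valid triplets that violate the margin.

    Anchors are images; positives are texts with the same person ID,
    negatives are texts with a different person ID (an assumption).
    """
    losses = []
    for a, anchor_id in enumerate(image_ids):
        for p, pos_id in enumerate(text_ids):
            if pos_id != anchor_id:
                continue
            for n, neg_id in enumerate(text_ids):
                if neg_id == anchor_id:
                    continue
                loss = dist[a][p] - dist[a][n] + margin
                if loss > 0:
                    losses.append(loss)
    return sum(losses) / len(losses) if losses else 0.0


def batch_hard_triplet_loss(dist, image_ids, text_ids, margin=0.2):
    """Mean hinge loss using the hardest positive and negative per anchor."""
    losses = []
    for a, anchor_id in enumerate(image_ids):
        pos = [dist[a][j] for j, t in enumerate(text_ids) if t == anchor_id]
        neg = [dist[a][j] for j, t in enumerate(text_ids) if t != anchor_id]
        if not pos or not neg:
            continue  # this anchor cannot form a triplet in the batch
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two images and three descriptions of two persons (IDs 0 and 1).
image_embs = [[0.0, 0.0], [1.0, 0.0]]
image_ids = [0, 1]
text_embs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
text_ids = [0, 0, 1]

dist = cross_modal_distances(image_embs, text_embs)
print(batch_all_triplet_loss(dist, image_ids, text_ids))
print(batch_hard_triplet_loss(dist, image_ids, text_ids))
```

In this toy batch only one triplet violates the margin, so batch all averages over that single triplet, whereas batch hard averages over both anchors, one of which contributes zero loss.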
References
Bochkovskiy A, Wang C-Y, Liao H-Y M (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934
Carneiro G, Chan A B, Moreno P J, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
Chen D, Zhang S, Yang J, Schiele B (2020) Norm-aware embedding for efficient person search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12615–12624
Chen T, Xu C, Luo J (2018) Improving text-based person search by spatial matching and adaptive threshold. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 1879–1887
Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10578–10587
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
Dubey S, Olimov F, Rafique M A, Kim J, Jeon M (2021) Label-attention transformer with geometrically coherent objects for image captioning. arXiv:2109.07799
Gao S, Chia L-T, Tsang I W-H, Ren Z (2014) Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding. IEEE Trans Multimed 16(3):762–771
Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643
Islam K (2020) Person search: New paradigm of person re-identification: a survey and outlook of recent works. Image Vis Comput 101:103970. https://doi.org/10.1016/j.imavis.2020.103970, https://www.sciencedirect.com/science/article/pii/S0262885620301025
Iyengar G, Duygulu P, Feng S, Ircing P, Khudanpur SP, Klakow D, Krause MR, Manmatha R, Nock H J, Petkova D et al (2005) Joint visual-text modeling for automatic retrieval of multimedia documents. In: Proceedings of the 13th annual ACM international conference on Multimedia, pp 21–30
Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp 119–126
Jin L, Li Z, Tang J (2020) Deep semantic multimodal hashing network for scalable image-text and video-text retrievals. IEEE Transactions on Neural Networks and Learning Systems
Jing X-Y, Wu F, Li Z, Hu R, Zhang D (2016) Multi-label dictionary learning for image annotation. IEEE Trans Image Process 25(6):2712–2725
King D E (2009) Dlib-ml: A machine learning toolkit. J Mach Learn Res 10:1755–1758
Lan X, Zhu X, Gong S (2018) Person search by multi-scale matching. In: Proceedings of the European conference on computer vision (ECCV), pp 536–552
Lavrenko V, Manmatha R, Jeon J et al (2003) A model for learning the semantics of pictures. In: NIPS, vol 1. Citeseer
Le T L, Boucher A, Thonnat M, Bremond F (2010) Surveillance video retrieval: what we have already done? In: Third International Conference on Communications and Electronics (ICCE 2010), Nha Trang. https://hal.inria.fr/inria-00515574
Li G, Zhu L, Liu P, Yang Y (2019) Entangled transformer for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8928–8937
Li S, Xiao T, Li H, Zhou B, Yue D, Wang X (2017) Person search with natural language description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1970–1979
Li Z, Tang J, Mei T (2018) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083
Li Z, Tang J, Zhang L, Yang J (2020) Weakly-supervised semantic guided hashing for social image retrieval. Int J Comput Vis 128
Lin D, Fidler S, Kong C, Urtasun R (2014) Visual semantic search: Retrieving videos via complex textual queries. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2657–2664
Lin X, Ren P, Xiao Y, Chang X, Hauptmann A (2021) Person search challenges and solutions: A survey. CoRR, arXiv:2105.01605
Lu Z, Han P, Wang L, Wen J-R (2014) Semantic sparse recoding of visual content for image applications. IEEE Trans Image Process 24(1):176–188
Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: European conference on computer vision. Springer, pp 316–329
Moran S, Lavrenko V (2014) Sparse kernel learning for image annotation. In: Proceedings of international conference on multimedia retrieval, pp 113–120
Narayanaswamy S, Barbu A, Siskind J M (2014) Seeing what you’re told: Sentence-guided activity recognition in video. In: Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014)
Nguyen D Q, Nguyen D Q, Vu T, Dras M, Johnson M (2017) A fast and accurate vietnamese word segmenter. arXiv:1709.06307
Nguyen T-B, Le T-L, Devillaine L, Pham T T T, Ngoc N P (2019) Effective multi-shot person re-identification through representative frames selection and temporal feature pooling. Multimed Tools Appl:1–29
Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra r-cnn: Towards balanced learning for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 821–830
Pham T T T, Nguyen D-D, Ta B H P, Nguyen T-B, Le T-L et al (2020) Person search by queried description in vietnamese natural language. In: Asian conference on intelligent information and database systems. Springer, pp 469–480
Qian X, Fu Y, Jiang Y-G, Xiang T, Xue X (2017) Multi-scale deep learning architectures for person re-identification. In: Proceedings of the IEEE international conference on computer vision, pp 5399–5408
Quan N H, Binh N T, Long T D et al (2020) A unified framework for automated person re-identification. Transport Commun Sci J 71(7):868–880
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Rennie S J, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024
Sarafianos N, Xu X, Kakadiaris I A (2019) Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5814–5824
Shree V, Chao W-L, Campbell M (2019) An empirical study of person re-identification with attributes. In: 2019 28th IEEE international conference on robot and human interactive communication (RO-MAN). IEEE, pp 1–8
Si J, Zhang H, Li C-G, Kuen J, Kong X, Kot A C, Wang G (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5363–5372
Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos. In: Computer vision, IEEE international conference on, vol 3. IEEE Computer Society, pp 1470–1470
Socher R, Fei-Fei L (2010) Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 966–973
Song G, Liu Y, Wang X (2020) Revisiting the sibling head in object detector. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11563–11572
Tian Z, Shen C, Chen H, He T (2019) Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9627–9636
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Verma Y, Jawahar CV (2012) Image annotation using metric learning in semantic neighbourhoods. In: European conference on computer vision. Springer, pp 836–849
Verma Y, Jawahar CV (2013) Exploring svm for image annotation in presence of confusing labels. In: BMVC
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Vu T, Nguyen D Q, Nguyen D Q, Dras M, Johnson M (2018) Vncorenlp: A vietnamese natural language processing toolkit. arXiv:1801.01331
Wang Z, Fang Z, Wang J, Yang Y (2020) Vitaa: Visual-textual attributes alignment in person search by natural language. In: ECCV
Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE, pp 3645–3649
Xiao T, Li S, Wang B, Lin L, Wang X (2016) End-to-end deep learning for person search. CoRR, arXiv:1604.01850
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. PMLR, pp 2048–2057
Xu Y, Ma B, Huang R, Lin L (2014) Person search in a scene by jointly modeling people commonness and person uniqueness. In: Proceedings of the 22nd ACM international conference on multimedia, pp 937–940
Yamaguchi M, Saito K, Ushiku Y, Harada T (2017) Spatio-temporal person retrieval via natural language queries. In: Proceedings of the IEEE international conference on computer vision, pp 1453–1462
Yan Y, Li J, Qin J, Bai S, Liao S, Liu L, Zhu F, Shao L (2021) Anchor-free person search. CoRR, arXiv:2103.11617
Yan Y, Qin J, Ni B, Chen J, Liu L, Zhu F, Zheng W-S, Yang X, Shao L (2020) Learning multi-attention context graph for group-based re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence
Yan Y, Zhang Q, Ni B, Zhang W, Xu M, Yang X (2019) Learning context graph for person search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2158–2167
Yang Z, Liu S, Hu H, Wang L, Lin S (2019) Reppoints: Point set representation for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9657–9666
Yao B Z, Yang X, Lin L, Lee M W, Zhu S-C (2010) I2t: Image parsing to text description. Proc IEEE 98(8):1485–1508
Zheng D, Xiao J, Huang K, Zhao Y (2020) Segmentation mask guided end-to-end person search. Signal Process Image Commun 86:115876
Zheng L, Yang Y, Hauptmann A G (2016) Person re-identification: Past, present and future. arXiv:1610.02984
Zheng L, Zhang H, Sun S, Chandraker M, Yang Y, Tian Q (2017) Person re-identification in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1367–1376
Zheng M, Karanam S, Wu Z, Radke R J (2019) Re-identification with consistent attentive siamese networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5735–5744
Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23
Zhong Y, Wang X, Zhang S (2020) Robust partial matching for person search in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6827–6835
Zhou S, Wang F, Huang Z, Wang J (2019) Discriminative feature learning with consistent attention regularization for person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8040–8049
Zhou T, Chen M, Yu J, Terzopoulos D (2017) Attention-based natural language person retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 27–34
Acknowledgements
This research is funded by the Vietnam Ministry of Education and Training under grant number CT2020.02.BKA.02. Dr. Thanh-Thuy Pham is supported by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 11/2020/STS01.
Ethics declarations
Conflict of Interests
The authors declare no conflicts of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Pham, T.T.T., Nguyen, HQ., Phan, H. et al. Towards a large-scale person search by vietnamese natural language: dataset and methods. Multimed Tools Appl 81, 27569–27600 (2022). https://doi.org/10.1007/s11042-022-12138-1
DOI: https://doi.org/10.1007/s11042-022-12138-1