Towards a large-scale person search by Vietnamese natural language: dataset and methods

Multimedia Tools and Applications

Abstract

Person search by natural language description is challenging because it requires modelling and learning a joint visual-text semantic embedding. While several works address person search by English description, very few attempts have been made for other languages. This paper presents the first work towards person search by Vietnamese description. Its contribution is threefold. First, we build 3000VnPersonSearch, the first large-scale dataset for person search by Vietnamese natural language. Second, for Vietnamese description-based person search we adopt the dual-path architecture of Zheng et al. (ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23, 2020), which uses a single loss for intra-modal learning and a triplet loss for cross-modal learning of the text and image data distributions. Because Vietnamese is an under-resourced language, the existing word embedding models are still modest compared with those for English. Therefore, instead of using a word2vec model as in Zheng et al. (ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23, 2020), we modify the initialization process of the first convolution layer of the text-CNN path. In addition, we investigate in detail two online triplet mining strategies: batch all and batch hard. Extensive experiments have been conducted on benchmark datasets as well as on 3000VnPersonSearch. Experimental results show that the proposed method improves on the baseline method by 2.42% on the CUHK-PEDES dataset and achieves state-of-the-art results on the VnPersonSearch dataset, outperforming the method of Pham et al. (2020) by a significant margin. Finally, to illustrate the practical use of person search by Vietnamese description, a web-based person search application is implemented and deployed.
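The two online triplet mining strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the margin value, the choice of images as anchors and texts as candidates, the averaging over only the margin-violating triplets in the batch-all variant, and all function and variable names are assumptions made for illustration. Plain Python is used in place of a deep learning framework to keep the example self-contained.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(anchors, candidates):
    """dist[i][j] is the distance between anchors[i] and candidates[j]."""
    return [[euclidean(a, c) for c in candidates] for a in anchors]


def batch_all_triplet_loss(dist, anchor_ids, cand_ids, margin):
    """Batch all: enumerate every valid (anchor, positive, negative) triplet
    in the batch and average the hinge loss over the triplets that still
    violate the margin (an assumed, commonly used averaging choice)."""
    active = []
    for i, aid in enumerate(anchor_ids):
        for p, pid in enumerate(cand_ids):
            if pid != aid:
                continue  # p must share the anchor's identity
            for n, nid in enumerate(cand_ids):
                if nid == aid:
                    continue  # n must have a different identity
                loss = margin + dist[i][p] - dist[i][n]
                if loss > 0.0:
                    active.append(loss)
    return sum(active) / len(active) if active else 0.0


def batch_hard_triplet_loss(dist, anchor_ids, cand_ids, margin):
    """Batch hard: for each anchor keep only the farthest positive and the
    closest negative, then average the hinge loss over the anchors."""
    losses = []
    for i, aid in enumerate(anchor_ids):
        pos = [dist[i][j] for j, cid in enumerate(cand_ids) if cid == aid]
        neg = [dist[i][j] for j, cid in enumerate(cand_ids) if cid != aid]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch (made-up values): two image embeddings, four text embeddings,
# with person identities 1 and 2.
img_emb = [[0.0, 0.0], [4.0, 0.0]]
img_ids = [1, 2]
txt_emb = [[1.0, 0.0], [3.0, 0.0], [4.0, 0.0], [2.0, 0.0]]
txt_ids = [1, 1, 2, 2]

dist = pairwise_distances(img_emb, txt_emb)
print("batch all :", batch_all_triplet_loss(dist, img_ids, txt_ids, margin=1.5))
print("batch hard:", batch_hard_triplet_loss(dist, img_ids, txt_ids, margin=1.5))
```

In this sketch the batch-hard variant concentrates the loss on the single most difficult positive and negative per anchor, whereas the batch-all variant spreads it over every triplet that violates the margin, so the batch-hard value is at least as large on the same batch.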

Notes

  1. https://cloud.google.com/translate/docs/

  2. https://code.google.com/archive/p/word2vec/

  3. https://github.com/sonvx/word2vecVN

  4. http://mica.edu.vn:8007/psearch/en/index.php

References

  1. Bochkovskiy A, Wang C-Y, Liao H-Y M (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934

  2. Carneiro G, Chan A B, Moreno P J, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410

  3. Chen D, Zhang S, Yang J, Schiele B (2020) Norm-aware embedding for efficient person search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12615–12624

  4. Chen T, Xu C, Luo J (2018) Improving text-based person search by spatial matching and adaptive threshold. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 1879–1887

  5. Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10578–10587

  6. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  7. Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634

  8. Dubey S, Olimov F, Rafique M A, Kim J, Jeon M (2021) Label-attention transformer with geometrically coherent objects for image captioning. arXiv:2109.07799

  9. Gao S, Chia L-T, Tsang I W-H, Ren Z (2014) Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding. IEEE Trans Multimed 16(3):762–771

  10. Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643

  11. Islam K (2020) Person search: New paradigm of person re-identification: a survey and outlook of recent works. Image Vis Comput 101:103970. https://doi.org/10.1016/j.imavis.2020.103970, https://www.sciencedirect.com/science/article/pii/S0262885620301025

  12. Iyengar G, Duygulu P, Feng S, Ircing P, Khudanpur SP, Klakow D, Krause MR, Manmatha R, Nock H J, Petkova D et al (2005) Joint visual-text modeling for automatic retrieval of multimedia documents. In: Proceedings of the 13th annual ACM international conference on Multimedia, pp 21–30

  13. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp 119–126

  14. Jin L, Li Z, Tang J (2020) Deep semantic multimodal hashing network for scalable image-text and video-text retrievals. IEEE Transactions on Neural Networks and Learning Systems

  15. Jing X-Y, Wu F, Li Z, Hu R, Zhang D (2016) Multi-label dictionary learning for image annotation. IEEE Trans Image Process 25(6):2712–2725

  16. King D E (2009) Dlib-ml: A machine learning toolkit. J Mach Learn Res 10:1755–1758

  17. Lan X, Zhu X, Gong S (2018) Person search by multi-scale matching. In: Proceedings of the European conference on computer vision (ECCV), pp 536–552

  18. Lavrenko V, Manmatha R, Jeon J et al (2003) A model for learning the semantics of pictures. In: NIPS, vol 1. Citeseer

  19. Le T L, Boucher A, Thonnat M, Bremond F (2010) Surveillance video retrieval: what we have already done? In: Third International Conference on Communications and Electronics (ICCE 2010), Nha Trang. https://hal.inria.fr/inria-00515574

  20. Li G, Zhu L, Liu P, Yang Y (2019) Entangled transformer for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8928–8937

  21. Li S, Xiao T, Li H, Zhou B, Yue D, Wang X (2017) Person search with natural language description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1970–1979

  22. Li Z, Tang J, Mei T (2018) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083

  23. Li Z, Tang J, Zhang L, Yang J (2020) Weakly-supervised semantic guided hashing for social image retrieval. Int J Comput Vis 128

  24. Lin D, Fidler S, Kong C, Urtasun R (2014) Visual semantic search: Retrieving videos via complex textual queries. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2657–2664

  25. Lin X, Ren P, Xiao Y, Chang X, Hauptmann A (2021) Person search challenges and solutions: A survey. CoRR, arXiv:2105.01605

  26. Lu Z, Han P, Wang L, Wen J-R (2014) Semantic sparse recoding of visual content for image applications. IEEE Trans Image Process 24(1):176–188

  27. Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: European conference on computer vision. Springer, pp 316–329

  28. Moran S, Lavrenko V (2014) Sparse kernel learning for image annotation. In: Proceedings of international conference on multimedia retrieval, pp 113–120

  29. Narayanaswamy S, Barbu A, Siskind J M (2014) Seeing what you’re told: Sentence-guided activity recognition in video. In: Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014)

  30. Nguyen D Q, Nguyen D Q, Vu T, Dras M, Johnson M (2017) A fast and accurate vietnamese word segmenter. arXiv:1709.06307

  31. Nguyen T-B, Le T-L, Devillaine L, Pham T T T, Ngoc N P (2019) Effective multi-shot person re-identification through representative frames selection and temporal feature pooling. Multimed Tools Appl:1–29

  32. Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra r-cnn: Towards balanced learning for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 821–830

  33. Pham T T T, Nguyen D-D, Ta B H P, Nguyen T-B, Le T-L et al (2020) Person search by queried description in vietnamese natural language. In: Asian conference on intelligent information and database systems. Springer, pp 469–480

  34. Qian X, Fu Y, Jiang Y-G, Xiang T, Xue X (2017) Multi-scale deep learning architectures for person re-identification. In: Proceedings of the IEEE international conference on computer vision, pp 5399–5408

  35. Quan N H, Binh N T, Long T D et al (2020) A unified framework for automated person re-identification. Transport Commun Sci J 71(7):868–880

  36. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788

  37. Ren S, He K, Girshick R, Sun J (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

  38. Rennie S J, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024

  39. Sarafianos N, Xu X, Kakadiaris I A (2019) Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5814–5824

  40. Shree V, Chao W-L, Campbell M (2019) An empirical study of person re-identification with attributes. In: 2019 28th IEEE international conference on robot and human interactive communication (RO-MAN). IEEE, pp 1–8

  41. Si J, Zhang H, Li C-G, Kuen J, Kong X, Kot A C, Wang G (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5363–5372

  42. Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos. In: Computer vision, IEEE international conference on, vol 3. IEEE Computer Society, pp 1470–1470

  43. Socher R, Fei-Fei L (2010) Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 966–973

  44. Song G, Liu Y, Wang X (2020) Revisiting the sibling head in object detector. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11563–11572

  45. Tian Z, Shen C, Chen H, He T (2019) Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9627–9636

  46. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  47. Verma Y, Jawahar CV (2012) Image annotation using metric learning in semantic neighbourhoods. In: European conference on computer vision. Springer, pp 836–849

  48. Verma Y, Jawahar CV (2013) Exploring svm for image annotation in presence of confusing labels. In: BMVC

  49. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  50. Vu T, Nguyen D Q, Nguyen D Q, Dras M, Johnson M (2018) Vncorenlp: A vietnamese natural language processing toolkit. arXiv:1801.01331

  51. Wang Z, Fang Z, Wang J, Yang Y (2020) Vitaa: Visual-textual attributes alignment in person search by natural language. In: ECCV

  52. Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE, pp 3645–3649

  53. Xiao T, Li S, Wang B, Lin L, Wang X (2016) End-to-end deep learning for person search. CoRR, arXiv:1604.01850

  54. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. PMLR, pp 2048–2057

  55. Xu Y, Ma B, Huang R, Lin L (2014) Person search in a scene by jointly modeling people commonness and person uniqueness. In: Proceedings of the 22nd ACM international conference on multimedia, pp 937–940

  56. Yamaguchi M, Saito K, Ushiku Y, Harada T (2017) Spatio-temporal person retrieval via natural language queries. In: Proceedings of the IEEE international conference on computer vision, pp 1453–1462

  57. Yan Y, Li J, Qin J, Bai S, Liao S, Liu L, Zhu F, Shao L (2021) Anchor-free person search. CoRR, arXiv:2103.11617

  58. Yan Y, Qin J, Ni B, Chen J, Liu L, Zhu F, Zheng W-S, Yang X, Shao L (2020) Learning multi-attention context graph for group-based re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence

  59. Yan Y, Zhang Q, Ni B, Zhang W, Xu M, Yang X (2019) Learning context graph for person search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2158–2167

  60. Yang Z, Liu S, Hu H, Wang L, Lin S (2019) Reppoints: Point set representation for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9657–9666

  61. Yao B Z, Yang X, Lin L, Lee M W, Zhu S-C (2010) I2t: Image parsing to text description. Proc IEEE 98(8):1485–1508

  62. Zheng D, Xiao J, Huang K, Zhao Y (2020) Segmentation mask guided end-to-end person search. Signal Process Image Commun 86:115876

  63. Zheng L, Yang Y, Hauptmann A G (2016) Person re-identification: Past, present and future. arXiv:1610.02984

  64. Zheng L, Zhang H, Sun S, Chandraker M, Yang Y, Tian Q (2017) Person re-identification in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1367–1376

  65. Zheng M, Karanam S, Wu Z, Radke R J (2019) Re-identification with consistent attentive siamese networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5735–5744

  66. Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23

  67. Zhong Y, Wang X, Zhang S (2020) Robust partial matching for person search in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6827–6835

  68. Zhou S, Wang F, Huang Z, Wang J (2019) Discriminative feature learning with consistent attention regularization for person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8040–8049

  69. Zhou T, Chen M, Yu J, Terzopoulos D (2017) Attention-based natural language person retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 27–34

Acknowledgements

This research is funded by the Vietnam Ministry of Education and Training under grant number CT2020.02.BKA.02. Dr. Thanh-Thuy Pham is supported by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 11/2020/STS01.

Author information

Corresponding author

Correspondence to Thi-Lan Le.

Ethics declarations

Conflict of Interests

The authors declare no conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pham, T.T.T., Nguyen, HQ., Phan, H. et al. Towards a large-scale person search by Vietnamese natural language: dataset and methods. Multimed Tools Appl 81, 27569–27600 (2022). https://doi.org/10.1007/s11042-022-12138-1

  • DOI: https://doi.org/10.1007/s11042-022-12138-1
