Towards a large-scale person search by vietnamese natural language: dataset and methods

Pham, Thi Thanh Thuy; Nguyen, Hong-Quan; Phan, Hoai; Do, Thi-Ngoc-Diep; Nguyen, Thuy-Binh; Tran, Thanh-Hai; Le, Thi-Lan

doi:10.1007/s11042-022-12138-1

Published: 28 March 2022

Volume 81, pages 27569–27600, (2022)
Multimedia Tools and Applications

Thi Thanh Thuy Pham^1,2,
Hong-Quan Nguyen^3,
Hoai Phan^1,
Thi-Ngoc-Diep Do^2,
Thuy-Binh Nguyen^4,
Thanh-Hai Tran^2,5 &
Thi-Lan Le^2,5 (ORCID: orcid.org/0000-0001-9541-3905)

Abstract

Person search by natural language description is a challenging problem because it requires modelling and learning a visual-text semantic embedding. While several works have addressed person search by English description, very few attempts have been made for other languages. This paper presents the first work towards person search by Vietnamese description. The contribution of the paper is threefold. First, we build 3000VnPersonSearch, the first large-scale dataset for person search by Vietnamese natural language. Second, for Vietnamese description-based person search we adopt the dual-path architecture of Zheng et al. (ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23, 2020), which uses a single loss for intra-modal learning and a triplet loss for cross-modal learning of the text and image data distributions. However, because Vietnamese is an under-resourced language, existing Vietnamese word embedding models are still modest compared with those for English. Therefore, instead of using a word2vec model as in Zheng et al. (ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23, 2020), we modify the initialization process of the first convolution layer of the text-CNN path. In addition, we investigate in detail two online triplet mining strategies: batch all and batch hard. Extensive experiments have been conducted on benchmark datasets as well as on 3000VnPersonSearch. Experimental results show that the proposed method obtains a 2.42% improvement over the baseline method on the CUHK-PEDES dataset and achieves state-of-the-art results on the VnPersonSearch dataset, with a significant margin over the method in Pham et al. (2020). Finally, to illustrate the practical use of person search by Vietnamese description, a web-based person search application is implemented and deployed.
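The two online triplet mining strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Euclidean distance, the margin value of 0.3, the text-to-image direction, and the choice to average batch-all losses over active (non-zero) triplets only are all assumptions made for illustration. Text embeddings act as anchors and image embeddings as candidates, with person identities deciding which pairs are positive or negative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def _pos_neg_distances(anchor, anchor_id, candidates, candidate_ids):
    """Split anchor-to-candidate distances by identity match."""
    pos, neg = [], []
    for cand, cand_id in zip(candidates, candidate_ids):
        d = euclidean(anchor, cand)
        (pos if cand_id == anchor_id else neg).append(d)
    return pos, neg


def batch_hard_loss(anchors, anchor_ids, candidates, candidate_ids, margin=0.3):
    """Batch hard: one triplet per anchor, built from its farthest
    positive and its closest negative in the batch."""
    losses = []
    for anchor, anchor_id in zip(anchors, anchor_ids):
        pos, neg = _pos_neg_distances(anchor, anchor_id, candidates, candidate_ids)
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0


def batch_all_loss(anchors, anchor_ids, candidates, candidate_ids, margin=0.3):
    """Batch all: every valid (anchor, positive, negative) triplet is
    scored; here the mean is taken over active triplets (assumed convention)."""
    losses = []
    for anchor, anchor_id in zip(anchors, anchor_ids):
        pos, neg = _pos_neg_distances(anchor, anchor_id, candidates, candidate_ids)
        for d_pos in pos:
            for d_neg in neg:
                losses.append(max(0.0, margin + d_pos - d_neg))
    active = [loss for loss in losses if loss > 0.0]
    return sum(active) / len(active) if active else 0.0
```

Batch hard keeps only the most difficult triplet per anchor, which concentrates the gradient on hard cases, while batch all uses every violating triplet in the batch. In a cross-modal setting the same functions could be called a second time with the roles of text and image swapped.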

References

Bochkovskiy A, Wang C-Y, Liao H-Y M (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934
Carneiro G, Chan A B, Moreno P J, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
Chen D, Zhang S, Yang J, Schiele B (2020) Norm-aware embedding for efficient person search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12615–12624
Chen T, Xu C, Luo J (2018) Improving text-based person search by spatial matching and adaptive threshold. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 1879–1887
Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10578–10587
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
Dubey S, Olimov F, Rafique M A, Kim J, Jeon M (2021) Label-attention transformer with geometrically coherent objects for image captioning. arXiv:2109.07799
Gao S, Chia L-T, Tsang I W-H, Ren Z (2014) Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding. IEEE Trans Multimed 16(3):762–771
Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643
Islam K (2020) Person search: New paradigm of person re-identification: a survey and outlook of recent works. Image Vis Comput 101:103970. https://doi.org/10.1016/j.imavis.2020.103970, https://www.sciencedirect.com/science/article/pii/S0262885620301025
Iyengar G, Duygulu P, Feng S, Ircing P, Khudanpur SP, Klakow D, Krause MR, Manmatha R, Nock H J, Petkova D et al (2005) Joint visual-text modeling for automatic retrieval of multimedia documents. In: Proceedings of the 13th annual ACM international conference on Multimedia, pp 21–30
Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp 119–126
Jin L, Li Z, Tang J (2020) Deep semantic multimodal hashing network for scalable image-text and video-text retrievals. IEEE Transactions on Neural Networks and Learning Systems
Jing X-Y, Wu F, Li Z, Hu R, Zhang D (2016) Multi-label dictionary learning for image annotation. IEEE Trans Image Process 25(6):2712–2725
King D E (2009) Dlib-ml: A machine learning toolkit. J Mach Learn Res 10:1755–1758
Lan X, Zhu X, Gong S (2018) Person search by multi-scale matching. In: Proceedings of the European conference on computer vision (ECCV), pp 536–552
Lavrenko V, Manmatha R, Jeon J et al (2003) A model for learning the semantics of pictures. In: NIPS, vol 1. Citeseer
Le T L, Boucher A, Thonnat M, Bremond F (2010) Surveillance video retrieval: what we have already done? In: Third International Conference on Communications and Electronics (ICCE 2010), Nha Trang. https://hal.inria.fr/inria-00515574
Li G, Zhu L, Liu P, Yang Y (2019) Entangled transformer for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8928–8937
Li S, Xiao T, Li H, Zhou B, Yue D, Wang X (2017) Person search with natural language description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1970–1979
Li Z, Tang J, Mei T (2018) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083
Li Z, Tang J, Zhang L, Yang J (2020) Weakly-supervised semantic guided hashing for social image retrieval. Int J Comput Vis 128
Lin D, Fidler S, Kong C, Urtasun R (2014) Visual semantic search: Retrieving videos via complex textual queries. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2657–2664
Lin X, Ren P, Xiao Y, Chang X, Hauptmann A (2021) Person search challenges and solutions: A survey. CoRR, arXiv:2105.01605
Lu Z, Han P, Wang L, Wen J-R (2014) Semantic sparse recoding of visual content for image applications. IEEE Trans Image Process 24(1):176–188
Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: European conference on computer vision. Springer, pp 316–329
Moran S, Lavrenko V (2014) Sparse kernel learning for image annotation. In: Proceedings of international conference on multimedia retrieval, pp 113–120
Narayanaswamy S, Barbu A, Siskind J M (2014) Seeing what you’re told: Sentence-guided activity recognition in video. In: Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014)
Nguyen D Q, Nguyen D Q, Vu T, Dras M, Johnson M (2017) A fast and accurate vietnamese word segmenter. arXiv:1709.06307
Nguyen T-B, Le T-L, Devillaine L, Pham T T T, Ngoc N P (2019) Effective multi-shot person re-identification through representative frames selection and temporal feature pooling. Multimed Tools Appl:1–29
Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra r-cnn: Towards balanced learning for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 821–830
Pham T T T, Nguyen D-D, Ta B H P, Nguyen T-B, Le T-L et al (2020) Person search by queried description in vietnamese natural language. In: Asian conference on intelligent information and database systems. Springer, pp 469–480
Qian X, Fu Y, Jiang Y-G, Xiang T, Xue X (2017) Multi-scale deep learning architectures for person re-identification. In: Proceedings of the IEEE international conference on computer vision, pp 5399–5408
Quan N H, Binh N T, Long T D et al (2020) A unified framework for automated person re-identification. Transport Commun Sci J 71(7):868–880
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Rennie S J, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024
Sarafianos N, Xu X, Kakadiaris I A (2019) Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5814–5824
Shree V, Chao W-L, Campbell M (2019) An empirical study of person re-identification with attributes. In: 2019 28th IEEE international conference on robot and human interactive communication (RO-MAN). IEEE, pp 1–8
Si J, Zhang H, Li C-G, Kuen J, Kong X, Kot A C, Wang G (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5363–5372
Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos. In: Computer vision, IEEE international conference on, vol 3. IEEE Computer Society, pp 1470–1470
Socher R, Fei-Fei L (2010) Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 966–973
Song G, Liu Y, Wang X (2020) Revisiting the sibling head in object detector. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11563–11572
Tian Z, Shen C, Chen H, He T (2019) Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9627–9636
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Verma Y, Jawahar CV (2012) Image annotation using metric learning in semantic neighbourhoods. In: European conference on computer vision. Springer, pp 836–849
Verma Y, Jawahar CV (2013) Exploring svm for image annotation in presence of confusing labels. In: BMVC
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Vu T, Nguyen D Q, Nguyen D Q, Dras M, Johnson M (2018) Vncorenlp: A vietnamese natural language processing toolkit. arXiv:1801.01331
Wang Z, Fang Z, Wang J, Yang Y (2020) Vitaa: Visual-textual attributes alignment in person search by natural language. In: ECCV
Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE, pp 3645–3649
Xiao T, Li S, Wang B, Lin L, Wang X (2016) End-to-end deep learning for person search. CoRR, arXiv:1604.01850
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. PMLR, pp 2048–2057
Xu Y, Ma B, Huang R, Lin L (2014) Person search in a scene by jointly modeling people commonness and person uniqueness. In: Proceedings of the 22nd ACM international conference on multimedia, pp 937–940
Yamaguchi M, Saito K, Ushiku Y, Harada T (2017) Spatio-temporal person retrieval via natural language queries. In: Proceedings of the IEEE international conference on computer vision, pp 1453–1462
Yan Y, Li J, Qin J, Bai S, Liao S, Liu L, Zhu F, Shao L (2021) Anchor-free person search. CoRR, arXiv:2103.11617
Yan Y, Qin J, Ni B, Chen J, Liu L, Zhu F, Zheng W-S, Yang X, Shao L (2020) Learning multi-attention context graph for group-based re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence
Yan Y, Zhang Q, Ni B, Zhang W, Xu M, Yang X (2019) Learning context graph for person search. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2158–2167
Yang Z, Liu S, Hu H, Wang L, Lin S (2019) Reppoints: Point set representation for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9657–9666
Yao B Z, Yang X, Lin L, Lee M W, Zhu S-C (2010) I2t: Image parsing to text description. Proc IEEE 98(8):1485–1508
Zheng D, Xiao J, Huang K, Zhao Y (2020) Segmentation mask guided end-to-end person search. Signal Process Image Commun 86:115876
Zheng L, Yang Y, Hauptmann A G (2016) Person re-identification: Past, present and future. arXiv:1610.02984
Zheng L, Zhang H, Sun S, Chandraker M, Yang Y, Tian Q (2017) Person re-identification in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1367–1376
Zheng M, Karanam S, Wu Z, Radke R J (2019) Re-identification with consistent attentive siamese networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5735–5744
Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimed Comput Commun Appl (TOMM) 16(2):1–23
Zhong Y, Wang X, Zhang S (2020) Robust partial matching for person search in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6827–6835
Zhou S, Wang F, Huang Z, Wang J (2019) Discriminative feature learning with consistent attention regularization for person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8040–8049
Zhou T, Chen M, Yu J, Terzopoulos D (2017) Attention-based natural language person retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 27–34

Acknowledgements

This research is funded by the Vietnam Ministry of Education and Training under grant number CT2020.02.BKA.02. Dr. Thanh-Thuy Pham is supported by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 11/2020/STS01.

Author information

Authors and Affiliations

Faculty of Cyber Security, Academy of People Security, Hanoi, Vietnam
Thi Thanh Thuy Pham & Hoai Phan
MICA International Research Institute, Hanoi University of Science and Technology, Hanoi, Vietnam
Thi Thanh Thuy Pham, Thi-Ngoc-Diep Do, Thanh-Hai Tran & Thi-Lan Le
Faculty of Information Technology, Viet-Hung Industrial University, Hanoi, Vietnam
Hong-Quan Nguyen
Faculty of Electrical and Electronics Engineering, University of Transport and Communications, Hanoi, Vietnam
Thuy-Binh Nguyen
School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam
Thanh-Hai Tran & Thi-Lan Le

Corresponding author

Correspondence to Thi-Lan Le.

Ethics declarations

Conflict of Interests

The authors declare no conflicts of interest.

About this article

Cite this article

Pham, T.T.T., Nguyen, HQ., Phan, H. et al. Towards a large-scale person search by vietnamese natural language: dataset and methods. Multimed Tools Appl 81, 27569–27600 (2022). https://doi.org/10.1007/s11042-022-12138-1

Received: 16 May 2021
Revised: 22 October 2021
Accepted: 03 January 2022
Published: 28 March 2022
Issue Date: August 2022
DOI: https://doi.org/10.1007/s11042-022-12138-1
