Listen, Look, and Find the One: Robust Person Search with Multimodality Index

Published: 22 May 2020

Abstract

Person search with one portrait, which attempts to search the targets in arbitrary scenes using one portrait image at a time, is an essential yet unexplored problem in the multimedia field. Existing approaches, which predominantly depend on the visual information of persons, cannot solve problems when there are variations in the person’s appearance caused by complex environments and changes in pose, makeup, and clothing. In contrast to existing methods, in this article, we propose an associative multimodality index for person search with face, body, and voice information. In the offline stage, an associative network is proposed to learn the relationships among face, body, and voice information. It can adaptively estimate the weights of each embedding to construct an appropriate representation. The multimodality index can be built by using these representations, which exploit the face and voice as long-term keys and the body appearance as a short-term connection. In the online stage, through the multimodality association in the index, we can retrieve all targets depending only on the facial features of the query portrait. Furthermore, to evaluate our multimodality search framework and facilitate related research, we construct the Cast Search in Movies with Voice (CSM-V) dataset, a large-scale benchmark that contains 127K annotated voices corresponding to tracklets from 192 movies. According to extensive experiments on the CSM-V dataset, the proposed multimodality person search framework outperforms the state-of-the-art methods.
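The offline/online split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the learned associative network and its adaptive weights are replaced here by plain cosine similarity with fixed thresholds, and every name (`build_index`, `search`, the `scene` field, the threshold values) is a hypothetical choice made for illustration. It only shows the indexing idea: face and voice act as long-term keys that link tracklets anywhere, body appearance links tracklets only within one scene, and a face-only query is expanded through those links.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar(a, b, key, threshold):
    """True when both tracklets carry modality `key` and their embeddings match."""
    va, vb = a.get(key), b.get(key)
    return bool(va) and bool(vb) and cosine(va, vb) >= threshold


def build_index(tracklets, long_thr=0.8, short_thr=0.8):
    """Offline stage (sketch): link tracklets that likely show the same person.

    Thresholds are assumed values, standing in for the learned weighting.
    """
    links = {t["id"]: set() for t in tracklets}
    for i, a in enumerate(tracklets):
        for b in tracklets[i + 1:]:
            # Long-term keys: face or voice may link tracklets across scenes.
            long_term = (similar(a, b, "face", long_thr)
                         or similar(a, b, "voice", long_thr))
            # Short-term connection: body appearance is trusted only within a scene.
            short_term = (a["scene"] == b["scene"]
                          and similar(a, b, "body", short_thr))
            if long_term or short_term:
                links[a["id"]].add(b["id"])
                links[b["id"]].add(a["id"])
    return links


def search(query_face, tracklets, links, face_thr=0.8):
    """Online stage (sketch): seed with face matches, then follow index links."""
    seeds = [t["id"] for t in tracklets
             if t.get("face") and cosine(query_face, t["face"]) >= face_thr]
    seen, queue = set(seeds), deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbour in links[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(seen)


# Toy data: t1 shows the target's face; t2 shares t1's clothing in the same
# scene and has a voice; t3 is in another scene with different clothing but
# the same voice as t2; t4 is a different person.
tracklets = [
    {"id": "t1", "scene": 1, "face": [1.0, 0.0], "body": [0.0, 1.0], "voice": None},
    {"id": "t2", "scene": 1, "face": None, "body": [0.0, 1.0], "voice": [1.0, 0.0]},
    {"id": "t3", "scene": 2, "face": None, "body": [1.0, 0.0], "voice": [1.0, 0.0]},
    {"id": "t4", "scene": 2, "face": [0.0, 1.0], "body": [0.0, -1.0], "voice": [0.0, 1.0]},
]
links = build_index(tracklets)
print(search([1.0, 0.0], tracklets, links))  # ['t1', 't2', 't3']
```

In this toy example the query face matches only `t1` directly; `t2` is reached through the body link within scene 1, and `t3` through the voice link from `t2`, even though neither shows a usable face.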

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2
May 2020
390 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3401894
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 May 2020
Online AM: 07 May 2020
Accepted: 01 January 2020
Revised: 01 December 2019
Received: 01 October 2019
Published in TOMM Volume 16, Issue 2

Author Tags

  1. Person search
  2. associative network
  3. multimodality index

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Fundamental Research Funds for the Central Universities
  • National Natural Science Foundation of China
  • National Key R&D Program of China
  • Hubei Province Technological Innovation Major Project
