DOI: 10.1145/3394171.3414455

A Cross-modality and Progressive Person Search System

Published: 12 October 2020

Abstract

This demonstration presents CMPS, an instant and progressive cross-modality person search system. With it, users can quickly find lost children or elderly persons simply by describing their appearance through speech. Unlike most existing person search applications, which spend considerable time finding probe images, CMPS saves valuable time in the early stage after a person goes missing. CMPS is one of the first attempts at instant and progressive person search that leverages the audio, text, and visual modalities together. The system first takes speech describing a person's appearance as input and obtains a textual description by speech-to-text conversion. It then performs cross-modal search by matching the textual embedding against the visual representations of images in the learned latent space. The retrieved images serve as candidates for query expansion. If none of the candidates is correct, the user can quickly adjust the description through speech. Once a correct image is found, the user can click it to use it as a new query. Finally, the system returns the complete track of the lost person with one click. On the purpose-built CUHK-PEDES-AUDIOS dataset, the system achieves 82.46% rank-1 accuracy at real-time speed. The code of CMPS is available at https://github.com/SheldongChen/Search-People-With-Audio.
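The matching step and the rank-1 metric mentioned above can be illustrated with a minimal sketch. It is not the CMPS implementation: it assumes that text and image embeddings already live in one shared space, that similarity is measured by cosine (the abstract does not name the metric), and it uses hypothetical toy vectors and names (`rank_gallery`, `rank1_accuracy`, `identity_of`).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_gallery(text_emb, gallery):
    """Return gallery image ids sorted from most to least similar to the text embedding.

    gallery maps image id -> image embedding (assumed to share the text embedding's space).
    """
    return sorted(gallery, key=lambda img_id: cosine(text_emb, gallery[img_id]), reverse=True)

def rank1_accuracy(queries, gallery, identity_of):
    """Fraction of queries whose top-ranked image shows the queried person.

    queries is a list of (text_embedding, person_id) pairs.
    """
    hits = 0
    for text_emb, person_id in queries:
        top_image = rank_gallery(text_emb, gallery)[0]
        hits += identity_of[top_image] == person_id
    return hits / len(queries)

# Hypothetical toy data: three gallery images of three different people.
gallery = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.7, 0.7]}
identity_of = {"img_a": "p1", "img_b": "p2", "img_c": "p3"}
queries = [([0.9, 0.1], "p1"), ([0.1, 0.9], "p2"), ([0.6, 0.8], "p2")]

ranking = rank_gallery([0.9, 0.1], gallery)
accuracy = rank1_accuracy(queries, gallery, identity_of)
```

In this toy setup the third query is closest to an image of the wrong person, so two of the three queries count as rank-1 hits.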

Supplementary Material

MP4 File (3394171.3414455.mp4)
Person search, or re-identification (Re-ID), is an important and challenging task in the multimedia and computer vision communities. With wide real-world applications such as intelligent video surveillance and smart retailing, the task aims to find the same person across multiple non-overlapping cameras. However, existing person search and Re-ID methods usually use images of a specific person as the probe, which limits them in urgent real-world scenarios. In this paper, we develop a simple, convenient, and real-time person search system. The system has two main properties: 1) It provides a convenient input and interaction mode, taking speech audio as the input to search for a target person captured by cameras. 2) It performs person search progressively to maintain both accuracy and speed: users can interactively enter new queries and apply query expansion, which yields more accurate results in less time.
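The progressive interaction described above can be sketched as a two-stage loop: a text query returns candidates, and a confirmed candidate becomes an image query. This is a rough illustration under stated assumptions, not the system's actual code: text and image embeddings are assumed to be directly comparable by cosine similarity, and the names `search`, `progressive_search`, and `pick_match` (a callback standing in for the user's click) are hypothetical, as are the toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_emb, gallery, exclude=None, top_k=2):
    """Return the top_k gallery ids most similar to query_emb, skipping `exclude`."""
    ids = [img_id for img_id in gallery if img_id != exclude]
    ids.sort(key=lambda img_id: cosine(query_emb, gallery[img_id]), reverse=True)
    return ids[:top_k]

def progressive_search(text_emb, gallery, pick_match, top_k=2):
    """Stage 1: search with the text embedding.
    Stage 2: if the user confirms a candidate, search again with that image's embedding.

    pick_match receives the candidate ids and returns the confirmed id,
    or None when no candidate is right and the user should describe again.
    """
    candidates = search(text_emb, gallery, top_k=top_k)
    clicked = pick_match(candidates)
    if clicked is None:
        return candidates, None
    expanded = search(gallery[clicked], gallery, exclude=clicked, top_k=top_k)
    return candidates, [clicked] + expanded

# Hypothetical gallery: image ids encode camera and time only for readability.
gallery = {
    "cam1_t0": [0.9, 0.1],
    "cam2_t5": [0.8, 0.3],
    "cam3_t9": [0.1, 0.9],
    "cam4_t2": [0.85, 0.2],
}

# The user accepts the first candidate.
candidates, track = progressive_search([0.7, 0.4], gallery, lambda ids: ids[0])

# The user rejects all candidates and would re-describe the person.
_, no_track = progressive_search([0.7, 0.4], gallery, lambda ids: None)
```

Excluding the clicked image from the second search keeps it from matching itself; a real system would also order the resulting images by camera and timestamp to form a track.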



Information

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modality search
  2. person search
  3. progressive search

Qualifiers

  • Abstract

Funding Sources

  • National Natural Science Foundation of China

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
