DOI: 10.1145/2502081.2502084
research-article

Listen, look, and gotcha: instant video search with mobile phones by layered audio-video indexing

Published: 21 October 2013

Abstract

Mobile video is quickly becoming a mass consumer phenomenon. More and more people use their smartphones to search and browse video content while on the move. In this paper, we present an instant mobile video search system through which users can discover videos by simply pointing their phones at a screen and capturing a few seconds of what they are watching. The system indexes large-scale video data in the cloud using a new layered audio-video indexing approach, and it extracts light-weight joint audio-video signatures in real time and performs progressive search on the mobile device. Unlike most existing mobile video search applications, which simply send the original video query to the cloud, the proposed system is one of the first attempts at instant and progressive video search that leverages the light-weight computing capacity of mobile devices. The system is characterized by four properties: 1) a joint audio-video signature to deal with the large aural and visual variances associated with the query video captured by the mobile phone, 2) layered audio-video indexing to holistically exploit the complementary nature of audio and video signals, 3) light-weight fingerprinting to comply with mobile processing capacity, and 4) a progressive query process to significantly reduce computational costs and improve the user experience, since the search can stop as soon as a confident result is achieved. We collected 1,400 query videos captured by 25 mobile users from a dataset of 600 hours of video. The experiments show that our system outperforms state-of-the-art methods, achieving 90.79% precision when the query video is shorter than 10 seconds and 70.07% even when it is shorter than 5 seconds.
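The abstract does not spell out how the progressive query process decides that a result is confident. The following is only a minimal sketch of the general idea, assuming a simple vote-share stopping rule. The function `progressive_search`, the parameters `min_share` and `min_votes`, and the toy index are all hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def progressive_search(
    chunks: Iterable[Hashable],
    lookup: Callable[[Hashable], List[str]],
    min_share: float = 0.6,   # assumed confidence threshold (hypothetical)
    min_votes: int = 3,       # assumed minimum evidence before stopping (hypothetical)
) -> Tuple[Optional[str], int]:
    """Consume query signatures one chunk at a time and stop early.

    Each chunk stands for the signature of a short slice of the captured
    query. `lookup` returns the candidate video ids matching that chunk.
    Returns (best video id or None, number of chunks consumed).
    """
    votes: Counter = Counter()
    used = 0
    for used, signature in enumerate(chunks, start=1):
        votes.update(lookup(signature))
        total = sum(votes.values())
        if total >= min_votes:
            best, count = votes.most_common(1)[0]
            # Stop as soon as one candidate holds a large enough share of votes.
            if count / total >= min_share:
                return best, used
    return None, used


# Toy inverted index: signature -> videos containing it (made-up data).
TOY_INDEX = {"a1": ["v1"], "a2": ["v1", "v2"], "a3": ["v1"], "a4": ["v2"]}


def toy_lookup(signature: Hashable) -> List[str]:
    return TOY_INDEX.get(signature, [])


if __name__ == "__main__":
    print(progressive_search(["a2", "a1", "a3", "a4"], toy_lookup))
```

In this toy run the loop returns after the second chunk, so the remaining chunks are never looked up. That is the cost saving the abstract attributes to the progressive design, although the actual confidence criterion used by the authors may differ.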

Published In

MM '13: Proceedings of the 21st ACM international conference on Multimedia
October 2013
1166 pages
ISBN:9781450324045
DOI:10.1145/2502081
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio-video signatures
  2. instant search
  3. layered audio-video indexing
  4. mobile video search
  5. progressive query process

Qualifiers

  • Research-article

Conference

MM '13: ACM Multimedia Conference
October 21 - 25, 2013
Barcelona, Spain

Acceptance Rates

MM '13 Paper Acceptance Rate: 47 of 235 submissions (20%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 1
Reflects downloads up to 28 Feb 2025

Cited By

  • (2022) Search-oriented Micro-video Captioning. Proceedings of the 30th ACM International Conference on Multimedia, pages 3234-3243. DOI: 10.1145/3503161.3548180. Online publication date: 10-Oct-2022
  • (2021) A Real-Time Action Representation With Temporal Encoding and Deep Compression. IEEE Transactions on Circuits and Systems for Video Technology, 31:2, pages 647-660. DOI: 10.1109/TCSVT.2020.2984569. Online publication date: Feb-2021
  • (2021) Rich Common Crucial Feature for Crowdsourcing-Based Mobile Visual Location Recognition. IEEE Access, 9, pages 103627-103636. DOI: 10.1109/ACCESS.2021.3093462. Online publication date: 2021
  • (2020) Hierarchical Gumbel Attention Network for Text-based Person Search. Proceedings of the 28th ACM International Conference on Multimedia, pages 3441-3449. DOI: 10.1145/3394171.3413864. Online publication date: 12-Oct-2020
  • (2019) Fine-grained Cross-media Representation Learning with Deep Quantization Attention Network. Proceedings of the 27th ACM International Conference on Multimedia, pages 1313-1321. DOI: 10.1145/3343031.3350892. Online publication date: 15-Oct-2019
  • (2018) Common Crucial Feature for Crowdsourcing Based Mobile Visual Location Recognition. 2018 25th IEEE International Conference on Image Processing (ICIP), pages 908-912. DOI: 10.1109/ICIP.2018.8451477. Online publication date: Oct-2018
  • (2015) A Hybrid Mobile Visual Search System With Compact Global Signatures. IEEE Transactions on Multimedia, 17:7, pages 1019-1030. DOI: 10.1109/TMM.2015.2427744. Online publication date: Jul-2015
  • (2014) Me-link. Proceedings of the 23rd International Conference on World Wide Web, pages 147-150. DOI: 10.1145/2567948.2577018. Online publication date: 7-Apr-2014