DOI: 10.1145/3444685.3446317

Attention feature matching for weakly-supervised video relocalization

Published: 03 May 2021

Abstract

Localizing a desired clip for a given query in an untrimmed video has been a popular research topic in multimedia understanding. Recently, a new task named video relocalization, in which the query is itself a video clip, has been introduced. Several methods have been developed for this task; however, they often require dense annotations of temporal boundaries inside long videos for training. A more practical alternative is the weakly-supervised setting, which needs only the matching information between the query and the video.
Motivated by this, we propose a weakly-supervised video relocalization approach built on attention-based feature matching. Specifically, it localizes the target clip by finding the segment whose frames are most relevant to the query frames, based on matching their frame embeddings. In addition, an attention module is introduced to identify the frames of the query video that carry rich semantic correlations. Extensive experiments on the ActivityNet dataset demonstrate that our method consistently outperforms several weakly-supervised methods and even achieves performance competitive with supervised baselines.
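To make the matching scheme concrete, below is a minimal NumPy sketch of attention-weighted frame-embedding matching with a fixed-length sliding window. It is an illustration under simplifying assumptions, not the model from the paper: the similarity-based `attention_weights` heuristic merely stands in for the paper's learned attention module, and the frame embeddings are assumed to be L2-normalized features from a pretrained video backbone.

```python
import numpy as np

def attention_weights(query_emb):
    """Weight each query frame by its mean similarity to the other query
    frames, then softmax. This heuristic only stands in for the paper's
    learned attention module, which is trained end-to-end."""
    sim = query_emb @ query_emb.T              # (m, m) frame-to-frame similarity
    scores = sim.mean(axis=1)                  # salience of each query frame
    w = np.exp(scores - scores.max())          # numerically stable softmax
    return w / w.sum()

def clip_score(query_emb, clip_emb, attn):
    """Match every query frame to its best candidate frame and aggregate
    the match scores with the attention weights."""
    sim = query_emb @ clip_emb.T               # (m, n) cross-similarity
    best = sim.max(axis=1)                     # best candidate match per query frame
    return float(attn @ best)                  # attention-weighted sum

def relocalize(query_emb, video_emb):
    """Slide a query-length window over the untrimmed video and return
    the [start, end) frame indices of the best-scoring segment."""
    m = len(query_emb)
    attn = attention_weights(query_emb)
    scores = [clip_score(query_emb, video_emb[s:s + m], attn)
              for s in range(len(video_emb) - m + 1)]
    start = int(np.argmax(scores))
    return start, start + m

# Toy usage with random L2-normalized "frame embeddings".
rng = np.random.default_rng(0)
l2 = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
video = l2(rng.standard_normal((200, 512)))    # 200-frame untrimmed video
query = l2(rng.standard_normal((16, 512)))     # 16-frame query clip
print(relocalize(query, video))                # (start, start + 16)
```

Fixing the window to the query length is the strongest simplification here; handling variable-length target clips would require scoring windows of several lengths or predicting boundaries, as supervised baselines do.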




Published In

MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
March 2021
512 pages
ISBN: 9781450383080
DOI: 10.1145/3444685
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. content-based video retrieval
  2. fine-grained feature matching
  3. weakly-supervised video relocalization

Qualifiers

  • Research-article

Conference

MMAsia '20: ACM Multimedia Asia
March 7, 2021
Virtual Event, Singapore

Acceptance Rates

Overall acceptance rate: 59 of 204 submissions (29%)


Article Metrics

  • Downloads (last 12 months): 57
  • Downloads (last 6 weeks): 4
Reflects downloads up to 28 Feb 2025


Cited By

  • (2025) Skim-and-scan transformer: A new transformer-inspired architecture for video-query based video moment retrieval. Expert Systems with Applications 270, 126525. DOI: 10.1016/j.eswa.2025.126525
  • (2024) Weakly Supervised Video Re-Localization Through Multi-Agent-Reinforced Switchable Network. IEEE Transactions on Circuits and Systems for Video Technology 34, 7, 6116-6127. DOI: 10.1109/TCSVT.2023.3347970
  • (2023) Semantic Relevance Learning for Video-Query Based Video Moment Retrieval. IEEE Transactions on Multimedia 25, 9290-9301. DOI: 10.1109/TMM.2023.3250088
  • (2023) Deep Temporal State Perception Toward Artificial Cyber-Physical Systems. IEEE Internet of Things Journal 10, 21, 18782-18789. DOI: 10.1109/JIOT.2023.3239413
  • (2022) Video action re-localization using spatio-temporal correlation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 192-201. DOI: 10.1109/WACVW54805.2022.00025
