DOI: 10.1145/3444685.3446317

Attention feature matching for weakly-supervised video relocalization

Published: 03 May 2021

Abstract

Localizing a desired clip for a given query in an untrimmed video has been a popular research topic in multimedia understanding. Recently, a new task named video relocalization, in which the query is itself a video clip, has been introduced. Several methods have been developed for this task; however, they often require dense annotations of temporal boundaries inside long videos for training. A more practical alternative is the weakly-supervised setting, which needs only the matching information between the query and the video.
Motivated by this, we propose a weakly-supervised video relocalization approach built on attention-based feature matching. Specifically, it localizes the target clip by finding the segment whose frames are most relevant to the query frames, based on matching their frame embeddings. In addition, an attention module is introduced to identify the frames of the query video that carry rich semantic correlations. Extensive experiments on the ActivityNet dataset demonstrate that our method consistently outperforms several weakly-supervised methods and even achieves performance competitive with supervised baselines.
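To make the matching scheme concrete, below is a minimal NumPy sketch of attention-weighted frame-embedding matching with a fixed-length sliding window. It is an illustration under simplifying assumptions, not the model from the paper: the similarity-based `attention_weights` heuristic merely stands in for the paper's learned attention module, and the frame embeddings are assumed to be L2-normalized features from a pretrained video backbone.

```python
import numpy as np

def attention_weights(query_emb):
    """Weight each query frame by its mean similarity to the other query
    frames, then softmax. This heuristic only stands in for the paper's
    learned attention module, which is trained end-to-end."""
    sim = query_emb @ query_emb.T              # (m, m) frame-to-frame similarity
    scores = sim.mean(axis=1)                  # salience of each query frame
    w = np.exp(scores - scores.max())          # numerically stable softmax
    return w / w.sum()

def clip_score(query_emb, clip_emb, attn):
    """Match every query frame to its best candidate frame and aggregate
    the match scores with the attention weights."""
    sim = query_emb @ clip_emb.T               # (m, n) cross-similarity
    best = sim.max(axis=1)                     # best candidate match per query frame
    return float(attn @ best)                  # attention-weighted sum

def relocalize(query_emb, video_emb):
    """Slide a query-length window over the untrimmed video and return
    the [start, end) frame indices of the best-scoring segment."""
    m = len(query_emb)
    attn = attention_weights(query_emb)
    scores = [clip_score(query_emb, video_emb[s:s + m], attn)
              for s in range(len(video_emb) - m + 1)]
    start = int(np.argmax(scores))
    return start, start + m

# Toy usage with random L2-normalized "frame embeddings".
rng = np.random.default_rng(0)
l2 = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
video = l2(rng.standard_normal((200, 512)))    # 200-frame untrimmed video
query = l2(rng.standard_normal((16, 512)))     # 16-frame query clip
print(relocalize(query, video))                # (start, start + 16)
```

Fixing the window to the query length is the strongest simplification here; handling variable-length target clips would require scoring windows of several lengths or predicting boundaries, as supervised baselines do.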




Published In

MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
March 2021
512 pages
ISBN: 9781450383080
DOI: 10.1145/3444685
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. content-based video retrieval
  2. fine-grained feature matching
  3. weakly-supervised video relocalization

Qualifiers

  • Research-article

Conference

MMAsia '20: ACM Multimedia Asia
March 7, 2021
Virtual Event, Singapore

Acceptance Rates

Overall acceptance rate: 59 of 204 submissions (29%)


Article Metrics

  • Downloads (last 12 months): 57
  • Downloads (last 6 weeks): 4
Reflects downloads up to 28 Feb 2025


Cited By

  • (2025) Skim-and-scan transformer: A new transformer-inspired architecture for video-query based video moment retrieval. Expert Systems with Applications 270, 126525. DOI: 10.1016/j.eswa.2025.126525
  • (2024) Weakly Supervised Video Re-Localization Through Multi-Agent-Reinforced Switchable Network. IEEE Transactions on Circuits and Systems for Video Technology 34, 7, 6116-6127. DOI: 10.1109/TCSVT.2023.3347970
  • (2023) Semantic Relevance Learning for Video-Query Based Video Moment Retrieval. IEEE Transactions on Multimedia 25, 9290-9301. DOI: 10.1109/TMM.2023.3250088
  • (2023) Deep Temporal State Perception Toward Artificial Cyber-Physical Systems. IEEE Internet of Things Journal 10, 21, 18782-18789. DOI: 10.1109/JIOT.2023.3239413
  • (2022) Video action re-localization using spatio-temporal correlation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 192-201. DOI: 10.1109/WACVW54805.2022.00025
