ABSTRACT
In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. Given a video and its descriptive sentence, spatio-temporal grounding aims to predict the temporal occurrence and spatial locations of each query object across frames. Our goal is to learn a grounding model in a weakly-supervised fashion, without supervision from either spatial bounding boxes or temporal occurrences during training. Existing methods address grounding in trimmed videos, but their reliance on object tracking fails easily under the frequent camera shot cuts of untrimmed videos. To address this, we propose a novel spatio-temporal multiple instance learning (MIL) framework for untrimmed video grounding. Spatial MIL and temporal MIL are mutually guided to ground each query to specific spatial regions and to the frames in which it occurs. Furthermore, the activity described in the sentence is captured to provide informative contextual cues for region-proposal refinement and text representation. We conduct extensive evaluations on the YouCookII and RoboWatch datasets and demonstrate that our method outperforms state-of-the-art methods.
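To make the two-level MIL formulation concrete, the sketch below scores a query phrase against a video: each frame is a bag of region proposals (spatial MIL) and the video is a bag of frames (temporal MIL), with only a video-level weak label assumed at training time. This is an illustrative simplification, not the paper's actual architecture; the cosine-similarity scoring, the softmax temporal attention, and all function and variable names are assumptions introduced here for illustration.

```python
import numpy as np

def spatio_temporal_mil_score(region_feats, query_feat):
    """Score a query phrase against a video under two-level MIL.

    region_feats: (T, R, D) array of region-proposal features
                  (T frames, R proposals per frame, D dims).
    query_feat:   (D,) embedding of the query phrase.

    Returns the video-level score, the best region index per frame
    (spatial grounding), and attention weights over frames
    (temporal grounding).
    """
    # Cosine similarity between every region proposal and the query.
    r = region_feats / (np.linalg.norm(region_feats, axis=-1, keepdims=True) + 1e-8)
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    sim = r @ q                       # (T, R)

    # Spatial MIL: each frame is a bag of regions; keep the best region.
    frame_scores = sim.max(axis=1)    # (T,)
    best_regions = sim.argmax(axis=1)

    # Temporal MIL: the video is a bag of frames; softmax attention
    # pools frame scores into one score matchable to the weak
    # video-sentence label, while the weights localize the occurrence.
    attn = np.exp(frame_scores) / np.exp(frame_scores).sum()
    video_score = float((attn * frame_scores).sum())
    return video_score, best_regions, attn
```

Because the video score is an attention-weighted average of per-frame cosine similarities, gradients from the weak video-level loss flow back through both pooling steps, which is what lets the spatial and temporal branches guide each other.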