
An Adaptive Video Clip Sampling Approach for Enhancing Query-Based Moment Retrieval in Videos

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13945)


Abstract

Query-based moment retrieval aims to localize the moment in an untrimmed video that is most relevant to a given natural language query. Existing retrieval models require input videos of the same length for ease of training and inference, so videos of different lengths are pre-processed with a fixed sampling method. As a result, the longer the video, the more video clips are lost, which degrades retrieval accuracy. We observe that fixed sampling causes two accuracy issues: missing clips and sparse clips. In this paper, we propose an adaptive video clip sampling method that resamples missing clips and enhances sparsely sampled clips to increase retrieval accuracy. Resampling missing clips addresses the case in which annotated clips are lost entirely during fixed sampling. Enhancing sparsely sampled clips prevents clips that share the same semantics from being sampled too sparsely. Our approach first obtains multiple video features through adaptive sampling on top of the backbone networks. We then propose a consistency loss to maintain the semantics of the adaptively sampled features. Extensive experiments on three real datasets demonstrate the effectiveness of our proposed method, especially on long videos.
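Only the abstract is openly accessible, so the following is a minimal, self-contained sketch of the idea it describes rather than the authors' implementation: the function names, the annotated_spans input, the min_per_span density threshold, and the plain mean-squared-error stand-in for the paper's consistency loss are all assumptions made for illustration.

```python
import numpy as np

def fixed_sample(num_frames: int, target_len: int) -> np.ndarray:
    """Baseline pre-processing: target_len evenly spaced clip indices."""
    return np.linspace(0, num_frames - 1, target_len).round().astype(int)

def adaptive_sample(num_frames: int, target_len: int,
                    annotated_spans, min_per_span: int = 3) -> np.ndarray:
    """Sketch of the two fixes named in the abstract.

    annotated_spans -- list of (start, end) frame ranges, one per annotated
    moment; min_per_span is a hypothetical density threshold.
    """
    idx = set(fixed_sample(num_frames, target_len).tolist())
    for start, end in annotated_spans:
        hits = [i for i in idx if start <= i <= end]
        if not hits:
            # Missing clips: fixed sampling skipped the span entirely,
            # so resample its midpoint to keep the moment represented.
            idx.add((start + end) // 2)
        elif len(hits) < min_per_span:
            # Sparse clips: too few samples share the span's semantics,
            # so densify the span with its boundary frames.
            idx.update((start, end))
    # Trim back to the fixed length the downstream model expects.
    return np.array(sorted(idx)[:target_len])

def consistency_loss(fixed_feats: np.ndarray, adaptive_feats: np.ndarray) -> float:
    """Stand-in consistency term (plain MSE) pulling the clip features
    of the two samplings toward the same semantics."""
    return float(np.mean((fixed_feats - adaptive_feats) ** 2))

if __name__ == "__main__":
    # A 10,000-frame video reduced to 128 clips (stride ~78 frames): a
    # 60-frame annotated moment gets at most one uniform sample, so the
    # sparse branch densifies it.
    idx = adaptive_sample(10_000, 128, annotated_spans=[(4_000, 4_060)])
    print(len(idx), [i for i in idx if 4_000 <= i <= 4_060])
```

In this toy run, the 60-frame moment receives a single uniformly sampled clip, so the sparse branch adds its boundary frames, and the final trim keeps the output at the fixed length that the retrieval model expects.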


Notes

  1. http://activity-net.org/challenges/2020/tasks/anet_captioning.html.

  2. https://www.coli.uni-saarland.de/projects/smile/page.php?id=tacos.


Acknowledgments

The work is partially supported by the National Key Research and Development Program of China (No. 2020YFB1707901), National Natural Science Foundation of China (Nos. U22A2025, 62072088, 62232007), Ten Thousand Talent Program (No. ZX20200035), Liaoning Distinguished Professor (No. XLYC1902057), and 111 Project (B16009).

Author information


Corresponding author

Correspondence to Xiaochun Yang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kong, L., Li, T., Yang, X., Han, S., Wang, B. (2023). An Adaptive Video Clip Sampling Approach for Enhancing Query-Based Moment Retrieval in Videos. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13945. Springer, Cham. https://doi.org/10.1007/978-3-031-30675-4_28


  • DOI: https://doi.org/10.1007/978-3-031-30675-4_28


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30674-7

  • Online ISBN: 978-3-031-30675-4

  • eBook Packages: Computer Science, Computer Science (R0)
