Abstract
In this paper, we focus on a new task called TALL (Temporal Activity Localization via Language Query), whose goal is to localize actions in long, untrimmed videos using natural language queries. We propose a new model, VAL (Visual-attention Action Localizer), to address it. Specifically, VAL applies voxel-wise attention and channel-wise attention to the feature maps of the last convolutional layer. These two visual attention mechanisms are designed to match the characteristics of the feature maps: they enhance the visual representations and strengthen the extraction of cross-modal correlations. Experimental results on the TACoS and Charades-STA datasets both demonstrate the effectiveness of our model.
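Since the full method is not reproduced on this page, the following PyTorch sketch only illustrates one plausible reading of the two attention mechanisms named in the abstract: a channel-wise gate in the spirit of SE-Net/SCA-CNN [8, 14] and a voxel-wise (per spatio-temporal location) weighting over a C3D-style feature map [15], both conditioned on a sentence embedding [19]. The class name, the gating forms, and the fusion with the query are our assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Hypothetical sketch of query-guided channel-wise and voxel-wise
    attention over a last-conv-layer feature map of shape [B, C, T, H, W].
    This is an illustrative reading of the abstract, not the paper's code."""

    def __init__(self, channels, query_dim):
        super().__init__()
        # Channel-wise attention: pool away (T, H, W), fuse with the
        # query, and emit one gate per channel (SE-Net / SCA-CNN style).
        self.channel_fc = nn.Linear(channels + query_dim, channels)
        # Voxel-wise attention: one weight per (t, h, w) location,
        # computed from the local feature fused with the query.
        self.voxel_fc = nn.Linear(channels + query_dim, 1)

    def forward(self, fmap, query):
        # fmap: [B, C, T, H, W] conv features; query: [B, Q] sentence vector.
        B, C, T, H, W = fmap.shape

        # --- channel-wise attention ---
        pooled = fmap.mean(dim=(2, 3, 4))                          # [B, C]
        ch = torch.sigmoid(self.channel_fc(torch.cat([pooled, query], dim=1)))
        fmap = fmap * ch.view(B, C, 1, 1, 1)                       # gate channels

        # --- voxel-wise attention ---
        vox = fmap.permute(0, 2, 3, 4, 1).reshape(B, -1, C)        # [B, THW, C]
        q = query.unsqueeze(1).expand(-1, vox.size(1), -1)         # [B, THW, Q]
        w = torch.softmax(
            self.voxel_fc(torch.cat([vox, q], dim=-1)).squeeze(-1), dim=-1
        )                                                          # [B, THW]
        vox = vox * w.unsqueeze(-1)                                # weight voxels
        return vox.reshape(B, T, H, W, C).permute(0, 4, 1, 2, 3)   # [B, C, T, H, W]

# Example (shapes only): C3D conv5b gives 512 channels; skip-thought
# sentence vectors are commonly 2400-d. Both sizes are assumptions here.
# att = VisualAttention(channels=512, query_dim=2400)
# out = att(torch.randn(2, 512, 4, 7, 7), torch.randn(2, 2400))
```

The key design point the abstract suggests is that the two mechanisms are complementary: the channel gate re-weights *what* kind of feature is emphasized, while the voxel weights re-weight *where and when* in the clip the query-relevant evidence lies.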
References
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: ICCV, pp. 5277–5285 (2017)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)
Yang, Z.W., Han, Y.H., Wang, Z.: Catching the temporal regions-of-interest for video captioning. In: ACM MM, pp. 146–153 (2017)
Hong, R., Zhang, L., Tao, D.: Unified photo enhancement by discovering aesthetic communities from Flickr. IEEE Trans. Image Process. 25, 1124–1135 (2016)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
Wang, B., Xu, Y.J., Han, Y.H., Hong, R.C.: Movie question answering: remembering the textual cues for layered visual contents. In: AAAI (2018)
Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR, pp. 1049–1058 (2016)
Chen, L., et al.: SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: CVPR, pp. 6298–6306 (2017)
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp. 4489–4497 (2015)
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV, pp. 2933–2942 (2017)
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S.: Grounding action descriptions in videos. TACL 1, 25–36 (2013)
Hong, R., Hu, Z., Wang, R., Wang, M., Tao, D.: Multi-view object retrieval via multi-scale topic models. IEEE Trans. Image Process. 25, 5814–5827 (2016)
Kiros, R., et al.: Skip-thought vectors. In: NIPS, pp. 3294–3302 (2015)
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR, pp. 961–970 (2015)
Hong, R., Zhang, L., Zhang, C., Zimmermann, R.: Flickr circles: aesthetic tendency discovery by multi-view regularized topic modeling. IEEE Trans. Multimedia 18, 1555–1567 (2016)
Xu, Y.J., Han, Y.H., Hong, R.C., Tian, Q.: Sequential video VLAD: training the aggregation locally and temporally. IEEE Trans. Image Process. 27, 4933–4944 (2018)
Acknowledgments
This work is supported by the NSFC (under Grants U1509206 and 61472276) and the Tianjin Natural Science Foundation (No. 15JCYBJC15400).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Song, X., Han, Y.: VAL: Visual-Attention Action Localizer. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds.) Advances in Multimedia Information Processing – PCM 2018. LNCS, vol. 11165. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00767-6_32
Print ISBN: 978-3-030-00766-9
Online ISBN: 978-3-030-00767-6