Abstract
In this paper, we focus on a new task called TALL (Temporal Activity Localization via Language Query), whose goal is to localize actions in long, untrimmed videos using natural language queries. We propose a new model, VAL (Visual-attention Action Localizer), to address it. Specifically, VAL applies voxel-wise attention and channel-wise attention to the feature maps of the last convolutional layer. These two visual attention mechanisms are designed to match the characteristics of the feature maps: they enhance the visual representations and strengthen the extraction of cross-modal correlations. Experimental results on the TACoS and Charades-STA datasets both demonstrate the effectiveness of our model.
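Since the full method is not reproduced on this page, the following PyTorch sketch only illustrates one plausible reading of the two attention mechanisms named in the abstract: a channel-wise gate in the spirit of SE-Net/SCA-CNN [8, 14] and a voxel-wise (per spatio-temporal location) weighting over a C3D-style feature map [15], both conditioned on a sentence embedding [19]. The class name, the gating forms, and the fusion with the query are our assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Hypothetical sketch of query-guided channel-wise and voxel-wise
    attention over a last-conv-layer feature map of shape [B, C, T, H, W].
    This is an illustrative reading of the abstract, not the paper's code."""

    def __init__(self, channels, query_dim):
        super().__init__()
        # Channel-wise attention: pool away (T, H, W), fuse with the
        # query, and emit one gate per channel (SE-Net / SCA-CNN style).
        self.channel_fc = nn.Linear(channels + query_dim, channels)
        # Voxel-wise attention: one weight per (t, h, w) location,
        # computed from the local feature fused with the query.
        self.voxel_fc = nn.Linear(channels + query_dim, 1)

    def forward(self, fmap, query):
        # fmap: [B, C, T, H, W] conv features; query: [B, Q] sentence vector.
        B, C, T, H, W = fmap.shape

        # --- channel-wise attention ---
        pooled = fmap.mean(dim=(2, 3, 4))                          # [B, C]
        ch = torch.sigmoid(self.channel_fc(torch.cat([pooled, query], dim=1)))
        fmap = fmap * ch.view(B, C, 1, 1, 1)                       # gate channels

        # --- voxel-wise attention ---
        vox = fmap.permute(0, 2, 3, 4, 1).reshape(B, -1, C)        # [B, THW, C]
        q = query.unsqueeze(1).expand(-1, vox.size(1), -1)         # [B, THW, Q]
        w = torch.softmax(
            self.voxel_fc(torch.cat([vox, q], dim=-1)).squeeze(-1), dim=-1
        )                                                          # [B, THW]
        vox = vox * w.unsqueeze(-1)                                # weight voxels
        return vox.reshape(B, T, H, W, C).permute(0, 4, 1, 2, 3)   # [B, C, T, H, W]

# Example (shapes only): C3D conv5b gives 512 channels; skip-thought
# sentence vectors are commonly 2400-d. Both sizes are assumptions here.
# att = VisualAttention(channels=512, query_dim=2400)
# out = att(torch.randn(2, 512, 4, 7, 7), torch.randn(2, 2400))
```

The key design point the abstract suggests is that the two mechanisms are complementary: the channel gate re-weights *what* kind of feature is emphasized, while the voxel weights re-weight *where and when* in the clip the query-relevant evidence lies.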
References
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: ICCV, pp. 5277–5285 (2017)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)
Yang, Z.W., Han, Y.H., Wang, Z.: Catching the temporal regions-of-interest for video captioning. In: ACM MM, pp. 146–153 (2017)
Hong, R., Zhang, L., Tao, D.: Unified photo enhancement by discovering aesthetic communities from Flickr. IEEE Trans. Image Process. 25, 1124–1135 (2016)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
Wang, B., Xu, Y.J., Han, Y.H., Hong, R.C.: Movie question answering: remembering the textual cues for layered visual contents. In: AAAI (2018)
Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR, pp. 1049–1058 (2016)
Chen, L., et al.: SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: CVPR, pp. 6298–6306 (2017)
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp. 4489–4497 (2015)
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV, pp. 2933–2942 (2017)
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S.: Grounding action descriptions in videos. TACL 1, 25–36 (2013)
Hong, R., Hu, Z., Wang, R., Wang, M., Tao, D.: Multi-view object retrieval via multi-scale topic models. IEEE Trans. Image Process. 25, 5814–5827 (2016)
Kiros, R., et al.: Skip-thought vectors. In: NIPS, pp. 3294–3302 (2015)
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR, pp. 961–970 (2015)
Hong, R., Zhang, L., Zhang, C., Zimmermann, R.: Flickr circles: aesthetic tendency discovery by multi-view regularized topic modeling. IEEE Trans. Multimedia 18, 1555–1567 (2016)
Xu, Y.J., Han, Y.H., Hong, R.C., Tian, Q.: Sequential video VLAD: training the aggregation locally and temporally. IEEE Trans. Image Process. 27, 4933–4944 (2018)
Acknowledgments
This work is supported by the NSFC (under Grants U1509206 and 61472276) and the Tianjin Natural Science Foundation (No. 15JCYBJC15400).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Song, X., Han, Y.: VAL: Visual-Attention Action Localizer. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds.) Advances in Multimedia Information Processing – PCM 2018. LNCS, vol. 11165. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00767-6_32
Print ISBN: 978-3-030-00766-9
Online ISBN: 978-3-030-00767-6