
Text to Region: Visual-Word Guided Saliency Detection

  • Conference paper in: Advances in Multimedia Information Processing – PCM 2018 (PCM 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11166)


Abstract

Neural-network-based image/video captioning can generate accurate descriptions, but how visual information is converted into its natural language representation remains poorly understood. Existing caption-guided saliency methods take the entire sentence as input to generate a saliency map, exposing a region-to-word mapping. However, visual information is not related to every word in a caption, so we eliminate meaningless stop words such as ‘the’ and ‘of’ to avoid misleading the model. We also utilize MFB (Multi-modal Factorized Bilinear pooling) to fuse C3D features, which provide richer spatiotemporal information for exposing visual-word guided saliency. The resulting system produces better spatiotemporal heatmaps for both predicted captions and arbitrary query sentences without introducing attention layers. Experimental results on the MSR-VTT and Flickr30K datasets surpass the state of the art by a significant margin.
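To make the pipeline concrete, below is a minimal sketch (in PyTorch, not the authors' released code) of the two ideas the abstract names: filtering stop words out of a query caption, and MFB fusion of a word embedding with a C3D clip feature via low-rank bilinear pooling [18]. The stop-word list, feature dimensions, and layer names are illustrative assumptions, not values taken from the paper.

    import torch
    import torch.nn as nn

    # Illustrative subset; the paper's exact stop-word list is not given here.
    STOP_WORDS = {"the", "of", "a", "an", "is", "to", "in", "on"}

    def content_words(caption):
        """Keep only words that can plausibly be grounded in a region."""
        return [w for w in caption.lower().split() if w not in STOP_WORDS]

    class MFBFusion(nn.Module):
        """Multi-modal Factorized Bilinear pooling (Yu et al. [18]):
        project both modalities into a shared k*o space, multiply
        elementwise, sum-pool over the factor dimension k, then apply
        power (signed sqrt) and L2 normalization."""
        def __init__(self, dim_text=300, dim_visual=4096, k=5, o=1000):
            super().__init__()
            self.k, self.o = k, o
            self.proj_t = nn.Linear(dim_text, k * o)    # text projection
            self.proj_v = nn.Linear(dim_visual, k * o)  # visual projection

        def forward(self, t, v):
            joint = self.proj_t(t) * self.proj_v(v)            # (B, k*o)
            joint = joint.view(-1, self.o, self.k).sum(dim=2)  # sum-pool over k
            joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)  # power norm
            return nn.functional.normalize(joint, dim=1)       # L2 normalize

    # Usage: fuse one content word's embedding with a C3D feature of a clip.
    fusion = MFBFusion()
    word_emb = torch.randn(1, 300)      # assumed word-embedding size
    c3d_feat = torch.randn(1, 4096)     # assumed C3D fc-layer feature size
    fused = fusion(word_emb, c3d_feat)  # (1, 1000) joint representation
    print(content_words("a man is riding the horse"), fused.shape)

The factorization keeps the bilinear interaction expressive while avoiding the full dim_text × dim_visual × o parameter count of unfactorized bilinear pooling; per-word saliency maps can then be derived by scoring each spatiotemporal region's fused feature against each content word, without extra attention layers.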


References

  1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  2. Cao, C., et al.: Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964 (2015)


  3. Chen, X., Lawrence Zitnick, C.: Mind’s eye: a recurrent visual representation for image caption generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2422–2431 (2015)


  4. Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 569–582 (2015)


  5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255. IEEE, Washington (2009)


  6. Lu, J., Xiong, C., Parikh, D., Socher, R.: Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 6 (2017)


  7. Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)


  8. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2641–2649. IEEE (2015)


  9. Ramanishka, V., Das, A., Zhang, J., Saenko, K.: Top-down visual saliency guided by captions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, p. 7 (2017)


  10. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)

  11. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)


  12. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014)

  13. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. IEEE, Washington (2015)


  14. Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296. IEEE, Washington (2016)


  15. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)


  16. Yao, L., et al.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)


  17. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)


  18. Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)


  19. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53


  20. Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 543–559. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_33


  21. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. IEEE, Washington (2016)


  22. Zhou, L., Yang, Z., Yuan, Q., Zhou, Z., Hu, D.: Salient region detection via integrating diffusion-based compactness and local contrast. IEEE Trans. Image Process. 24(11), 3308–3320 (2015)


  23. Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821 (2014)



Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (NSFC Grant No. 61773272, 61272258, 61301299, 61572085, 61170124, 61272005), the Provincial Natural Science Foundation of Jiangsu (Grant No. BK20151254, BK20151260), the Science and Education Innovation based Cloud Data fusion Foundation of the Science and Technology Development Center of the Education Ministry (2017B03112), the Six Talent Peaks Project in Jiangsu Province (DZXX-027), the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (Grant No. 93K172016K08), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information


Correspondence to Yi Ji or Chunping Liu.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Xing, T., Wang, Z., Yang, J., Ji, Y., Liu, C. (2018). Text to Region: Visual-Word Guided Saliency Detection. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science, vol 11166. Springer, Cham. https://doi.org/10.1007/978-3-030-00764-5_68


  • DOI: https://doi.org/10.1007/978-3-030-00764-5_68


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00763-8

  • Online ISBN: 978-3-030-00764-5
