
Joint Visual Context for Pedestrian Captioning

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 819)

Abstract

Image captioning is a fundamental task connecting computer vision and natural language processing. Recent research usually concentrates on generic image or video captioning across thousands of object classes. However, such methods cannot effectively deal with a specific class of objects, such as pedestrians. Pedestrian captioning is critical for analysis, identification, and retrieval in massive data collections. Therefore, in this paper, we propose a novel approach for pedestrian captioning with joint visual context. First, a deep convolutional neural network (CNN) is employed to obtain the global attributes of a pedestrian (e.g., gender, age, and actions), and a Faster R-CNN is utilized to detect local parts of interest for identifying the local attributes of a pedestrian (e.g., clothing type, color, and belongings). Then, we concatenate the global and local attributes into a fixed-length vector and feed it into a Long Short-Term Memory (LSTM) network to generate descriptions. Finally, a dataset of 5000 pedestrian images is collected to evaluate the performance of pedestrian captioning. Experimental results show the superiority of the proposed approach.
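The sketch below is a minimal illustration of the pipeline described in the abstract, assuming a PyTorch implementation: a CNN predicts global pedestrian attributes, a second network predicts local attributes from part crops (the Faster R-CNN detection step is omitted), the two attribute vectors are concatenated into a fixed-length vector, and an LSTM decoder generates the caption. All module names, dimensions, and attribute vocabularies here are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the joint visual-context captioning pipeline:
# global attributes + local attributes -> fixed-length vector -> LSTM decoder.
import torch
import torch.nn as nn

class GlobalAttributeNet(nn.Module):
    """Predicts global pedestrian attributes (e.g., gender, age, action) from the full image."""
    def __init__(self, num_global_attrs=10):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a pretrained CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_global_attrs)

    def forward(self, images):                  # images: (B, 3, H, W)
        return torch.sigmoid(self.head(self.backbone(images)))  # attribute probabilities

class LocalAttributeNet(nn.Module):
    """Predicts local attributes (e.g., clothing type, color, belongings) from part crops
    that a detector such as Faster R-CNN would provide (detection itself is omitted)."""
    def __init__(self, num_local_attrs=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_local_attrs)

    def forward(self, part_crops):              # part_crops: (B, 3, h, w)
        return torch.sigmoid(self.head(self.backbone(part_crops)))

class AttributeCaptioner(nn.Module):
    """LSTM decoder conditioned on the concatenated global + local attribute vector."""
    def __init__(self, vocab_size, attr_dim, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.attr_proj = nn.Linear(attr_dim, embed_dim)   # map attributes into the word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, attrs, captions):
        # Feed the attribute vector as the first "token", then the caption tokens.
        first = self.attr_proj(attrs).unsqueeze(1)          # (B, 1, embed_dim)
        words = self.embed(captions)                        # (B, T, embed_dim)
        hidden, _ = self.lstm(torch.cat([first, words], dim=1))
        return self.out(hidden)                             # next-word logits

# Toy forward pass with random tensors.
global_net, local_net = GlobalAttributeNet(), LocalAttributeNet()
captioner = AttributeCaptioner(vocab_size=1000, attr_dim=10 + 20)
images = torch.randn(2, 3, 224, 224)
crops = torch.randn(2, 3, 64, 64)
attrs = torch.cat([global_net(images), local_net(crops)], dim=1)   # fixed-length joint vector
logits = captioner(attrs, torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # (2, 13, 1000)
```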


Notes

  1. Dataset can be downloaded at: www.nlpr.ia.ac.cn/iva/homepage/jqwang/pedestrian_caption_dataset.zip.


Author information


Corresponding author

Correspondence to Quan Liu.


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Liu, Q., Chen, Y., Wang, J., Zhang, S. (2018). Joint Visual Context for Pedestrian Captioning. In: Huet, B., Nie, L., Hong, R. (eds) Internet Multimedia Computing and Service. ICIMCS 2017. Communications in Computer and Information Science, vol 819. Springer, Singapore. https://doi.org/10.1007/978-981-10-8530-7_5


  • DOI: https://doi.org/10.1007/978-981-10-8530-7_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8529-1

  • Online ISBN: 978-981-10-8530-7

  • eBook Packages: Computer Science, Computer Science (R0)
