Paragraph Generation Network with Visual Relationship Detection

ABSTRACT
Image paragraph generation is a recent task that aims to produce multiple sentences describing a given image. In this paper, we propose a paragraph generation network that incorporates visual relationship detection. We first detect regions that may contain important visual objects and then predict the relationships between them. Paragraphs are generated from object regions that hold valid relationships with other regions. Compared with previous works, which generate sentences from region features alone, we explicitly explore and utilize visual relationships to improve the final captions. Experimental results show that this strategy improves paragraph generation in two respects: more details about object relations are captured, and more accurate sentences are produced. Furthermore, our model is more robust to fluctuations in region detection.
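The pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the authors' actual model: the region labels, the pairwise relationship predictor, the confidence threshold, and the template-based sentence generator are all assumptions introduced only to show how regions without valid relationships are excluded from the paragraph.

```python
# Hedged sketch of the three-stage pipeline from the abstract:
# (1) detect candidate object regions, (2) predict <subject, predicate,
# object> triplets between region pairs, (3) generate sentences only from
# regions that participate in at least one valid relationship.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    label: str
    score: float  # detector confidence for this region

def predict_relationships(regions: List[Region],
                          threshold: float = 0.5) -> List[Tuple[int, str, int]]:
    """Toy stand-in for a relationship-detection module: pair up
    sufficiently confident regions with a placeholder predicate."""
    triplets = []
    for i, subj in enumerate(regions):
        for j, obj in enumerate(regions):
            if i != j and subj.score > threshold and obj.score > threshold:
                triplets.append((i, "near", j))
    return triplets

def generate_paragraph(regions: List[Region]) -> str:
    """Emit one template sentence per relationship; regions without any
    valid relationship contribute nothing to the paragraph."""
    triplets = predict_relationships(regions)
    sentences = [f"A {regions[i].label} is {pred} a {regions[j].label}."
                 for (i, pred, j) in triplets]
    return " ".join(sentences)

if __name__ == "__main__":
    regions = [Region("dog", 0.9), Region("ball", 0.8), Region("noise", 0.2)]
    # The low-confidence "noise" region joins no relationship,
    # so no sentence mentions it.
    print(generate_paragraph(regions))
```

In the actual paper the relationship predictor and sentence generator would be learned networks; the point of the sketch is only the filtering step, i.e. that generation is conditioned on regions with valid relationships rather than on raw region features.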
Index Terms: Paragraph Generation Network with Visual Relationship Detection