
Multi-task Compositional Network for Visual Relationship Detection

Published in: International Journal of Computer Vision

Abstract

Previous methods treat visual relationship detection as a combination of object detection and predicate detection. However, a natural image may contain hundreds of objects and thousands of object pairs, so relying on object detection and predicate detection alone is insufficient: the few significant relationships are easily overwhelmed by the far more numerous insignificant ones. In this paper, we propose significance detection, a novel subtask of visual relationship detection that complements object detection and predicate detection. Significance detection is the task of identifying object pairs that hold significant relationships. We further propose a novel multi-task compositional network (MCN) that performs object detection, predicate detection, and significance detection simultaneously. MCN consists of three modules: an object detector, a relationship generator, and a relationship predictor. The object detector localizes and classifies objects, the relationship generator generates useful candidate relationships, and the relationship predictor produces significance scores and predicts predicates. Furthermore, MCN adopts a multimodal feature fusion strategy over visual, spatial, and label features, together with a novel correlated loss function that deeply combines object detection, predicate detection, and significance detection. MCN is validated on two datasets: the Visual Relationship Detection (VRD) dataset and the Visual Genome dataset. Comparisons with state-of-the-art methods verify the competitiveness of MCN and the usefulness of significance detection for visual relationship detection.
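To make the pipeline described above concrete, below is a minimal, illustrative sketch of a relationship-predictor head that fuses visual, spatial, and label features for a candidate object pair and outputs both a significance score and predicate logits. All module names, feature dimensions, and the simple concatenation-based fusion are assumptions made for illustration only; the authors' actual architecture and correlated loss are described in the paper and in the released code at https://github.com/Atmegal/MCN.

```python
# Illustrative sketch only: fuses visual, spatial, and label features for a
# candidate object pair and predicts (a) a significance score and (b) predicate
# logits. Feature sizes, layer shapes, and the concatenation-based fusion are
# assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class RelationshipPredictorSketch(nn.Module):
    def __init__(self, vis_dim=512, spa_dim=64, lab_dim=300,
                 hidden_dim=512, num_predicates=70):
        super().__init__()
        fused_dim = vis_dim + spa_dim + lab_dim
        self.fuse = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        # Significance head: is this object pair worth keeping at all?
        self.significance = nn.Linear(hidden_dim, 1)
        # Predicate head: which relationship holds between subject and object?
        self.predicate = nn.Linear(hidden_dim, num_predicates)

    def forward(self, vis_feat, spa_feat, lab_feat):
        # Simple multimodal fusion by concatenation followed by a shared MLP.
        h = self.fuse(torch.cat([vis_feat, spa_feat, lab_feat], dim=-1))
        sig_score = torch.sigmoid(self.significance(h)).squeeze(-1)
        pred_logits = self.predicate(h)
        return sig_score, pred_logits

# Toy usage: 8 candidate object pairs.
if __name__ == "__main__":
    model = RelationshipPredictorSketch()
    sig, pred = model(torch.randn(8, 512), torch.randn(8, 64), torch.randn(8, 300))
    print(sig.shape, pred.shape)  # torch.Size([8]) torch.Size([8, 70])
```

In a full system, such a significance score would be used to rank or filter the thousands of candidate object pairs so that predicate prediction concentrates on significant relationships, matching the role of significance detection described in the abstract.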


Notes

  1. The source code is provided at https://github.com/Atmegal/MCN.


Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant No. 2018AAA0100603, in part by the National Natural Science Foundation of China under Grant No. 61836002, and in part by the Australian Research Council under Project FL-170100117.

Author information


Corresponding author

Correspondence to Jun Yu.

Additional information

Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhan, Y., Yu, J., Yu, T. et al. Multi-task Compositional Network for Visual Relationship Detection. Int J Comput Vis 128, 2146–2165 (2020). https://doi.org/10.1007/s11263-020-01353-8
