
Evaluation of visual relationship classifiers with partially annotated datasets

Multimedia Tools and Applications

Abstract

In this work, we investigate neural networks as visual relationship classifiers for precision-constrained applications on partially annotated datasets. The classifier is a convolutional neural network, which we benchmark on three visual relationship datasets. We discuss how partial annotation affects precision and why precision-based metrics are inadequate under partial annotation. So far, this topic has not been explored in the context of visual relationship classification. We introduce a threshold tuning method that imposes a soft constraint on precision while being less sensitive to the degree of annotation than a regular precision-recall trade-off method. Performance can then be measured as the recall of predictions computed with thresholds tuned by the proposed method. Our previously introduced negative sample mining method is extended to partially annotated datasets (namely Visual Relationship Detection, VRD, and Visual Genome, VG) by sampling from unlabeled pairs instead of unrelated pairs. When thresholds are tuned with our method, negative sample mining improves recall from \(24.1\%\) to \(30.6\%\) on VRD and from \(36.7\%\) to \(41.3\%\) on VG. The neural networks also maintain the ability to discriminate correctly between predicates: when only ground-truth relationships are considered for threshold tuning, recall decreases only slightly (from \(45.1\%\) to \(43.8\%\) on VRD and from \(60.5\%\) to \(58.7\%\) on VG) compared to neural networks trained only on ground-truth samples.
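The abstract summarizes two procedures without detail. As a rough illustration only (not the authors' exact method; all function names and the `target_precision` parameter are hypothetical), the sketch below shows the conventional per-predicate precision-recall trade-off tuning that the proposed soft-constraint method is designed to improve on, together with naive negative sampling from unlabeled pairs:

```python
# Illustrative sketch only; not the procedure from the paper.
import numpy as np

def tune_thresholds(scores, labels, target_precision=0.7):
    """Per-predicate threshold tuning via a plain precision-recall trade-off.

    scores: (N, P) array of classifier confidences for N object pairs
            and P predicates.
    labels: (N, P) binary array of annotated positives. Annotation is
            partial: an entry of 0 may be an unlabeled positive, which is
            why measured precision is pessimistic on such data.
    """
    n, p = scores.shape
    thresholds = np.ones(p)  # default: predict nothing for this predicate
    for j in range(p):
        order = np.argsort(-scores[:, j])        # rank pairs by score
        tp = np.cumsum(labels[order, j])         # true positives at each cut
        prec = tp / np.arange(1, n + 1)          # precision at each cut
        ok = np.nonzero(prec >= target_precision)[0]
        if ok.size > 0:
            thresholds[j] = scores[order[ok.max()], j]  # deepest valid cut
    return thresholds

def recall_of_predictions(scores, labels, thresholds):
    """Recall of thresholded predictions against annotated positives."""
    predicted = scores >= thresholds             # broadcast over predicates
    hits = np.logical_and(predicted, labels == 1).sum()
    return hits / max(labels.sum(), 1)

def sample_unlabeled_negatives(num_pairs, labeled_pair_ids, k, rng):
    """Draw k training negatives from pairs that carry no annotation at
    all, mirroring the idea of mining negatives from unlabeled (rather
    than provably unrelated) pairs."""
    unlabeled = np.setdiff1d(np.arange(num_pairs),
                             np.asarray(labeled_pair_ids))
    return rng.choice(unlabeled, size=min(k, unlabeled.size), replace=False)
```

Because unlabeled pairs may hide true positives, the precision measured inside `tune_thresholds` under-estimates the true precision, and the estimate degrades as annotation becomes sparser; this sensitivity is what motivates the soft precision constraint described in the abstract.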



Data Availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

  1. More information on the challenge is available at https://storage.googleapis.com/openimages/web/challenge2019.html.

  2. Instead of using the conv_5 layers of deeper backbones, we use a randomly initialized convolutional stack with the same architecture as the conv_5 layers of the ResNet-18 backbone; a minimal sketch follows this list.
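For concreteness, here is a minimal PyTorch sketch of such a stack, assuming torchvision's BasicBlock; the input size and variable names are illustrative, not taken from the paper:

```python
# Minimal sketch: a freshly (randomly) initialized stack with the same
# architecture as ResNet-18's conv_5 stage (torchvision's "layer4").
import torch
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

def make_conv5_stack(in_channels=256, out_channels=512):
    # The first block downsamples (stride 2) and widens 256 -> 512
    # channels, matching ResNet-18's conv_5; the 1x1 projection aligns
    # the residual branch with the new shape.
    downsample = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2,
                  bias=False),
        nn.BatchNorm2d(out_channels),
    )
    return nn.Sequential(
        BasicBlock(in_channels, out_channels, stride=2, downsample=downsample),
        BasicBlock(out_channels, out_channels),
    )

stack = make_conv5_stack()
x = torch.randn(1, 256, 14, 14)   # e.g. a pooled feature map (illustrative)
print(stack(x).shape)             # torch.Size([1, 512, 7, 7])
```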


Funding

This work has been supported in part by Microsoft ATL in Rio de Janeiro, and in part by Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq - Brazil.

Author information

Corresponding author

Correspondence to Roberto de Moura Estevão Filho.

Ethics declarations

Conflicts of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

de Moura Estevão Filho, R., Rodríguez Carneiro Gomes, J.G. & Oliveira Nunes, L. Evaluation of visual relationship classifiers with partially annotated datasets. Multimed Tools Appl 83, 18333–18352 (2024). https://doi.org/10.1007/s11042-023-15967-w

