
Cleaner Categories Improve Object Detection and Visual-Textual Grounding

  • Conference paper

Image Analysis (SCIA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13885)


Abstract

Object detectors are core components of multimodal models, enabling them to locate regions of interest in images, which are then used to solve many multimodal tasks. Among the many extant object detectors, the Bottom-Up Faster R-CNN [39] (BUA) object detector is the most commonly used by the multimodal language-and-vision community, usually as a black-box visual feature generator for solving downstream multimodal tasks. It is trained on the Visual Genome dataset [25] to detect 1600 different objects. However, those object categories are defined using automatically processed image region descriptions from the Visual Genome dataset. The automatic process introduces some unexpected near-duplicate categories (e.g. “watch” and “wristwatch”, “tree” and “trees”, and “motorcycle” and “motorbike”) that may result in a sub-optimal representational space and likely impair the ability of the model to classify objects correctly. In this paper, we manually merge near-duplicate labels to create a cleaner label set, which is used to retrain the object detector. We investigate the effect of using the cleaner label set in terms of: (i) performance on the original object detection task, (ii) the properties of the embedding space learned by the detector, and (iii) the utility of the features in a visual grounding task on the Flickr30K Entities dataset. We find that the BUA model trained with the cleaner categories learns a better-clustered embedding space than the model trained with the noisy categories. The new embedding space improves performance on the object detection task and also yields better bounding-box feature representations, which help to solve the visual grounding task.
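
The label-merging step described above can be illustrated with a minimal sketch. The merge map and annotation format below are hypothetical; the paper's actual 1600-to-878 mapping was built manually (see Appendix 4), and this only shows the kind of remapping involved.

```python
# Minimal sketch of the label-merging step, assuming a hand-built merge map.
# The entries below are illustrative examples from the abstract, not the
# authors' full mapping.

MERGE_MAP = {
    "wristwatch": "watch",
    "trees": "tree",
    "motorbike": "motorcycle",
}


def clean_label(label: str) -> str:
    """Map a noisy Visual Genome category to its canonical form."""
    return MERGE_MAP.get(label, label)


def clean_annotations(annotations):
    """Relabel object annotations with the cleaned categories.

    `annotations` is assumed to be a list of dicts with a "category" key,
    e.g. as loaded from the Visual Genome object annotations.
    """
    for ann in annotations:
        ann["category"] = clean_label(ann["category"])
    return annotations


if __name__ == "__main__":
    anns = [{"category": "wristwatch"}, {"category": "tree"}, {"category": "trees"}]
    print(clean_annotations(anns))
    # [{'category': 'watch'}, {'category': 'tree'}, {'category': 'tree'}]
```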


Notes

  1. https://github.com/MILVLG/bottom-up-attention.pytorch.

  2. https://github.com/peteanderson80/bottom-up-attention.

  3. https://github.com/facebookresearch/detectron2.

  4. https://github.com/drigoni/bottom-up-attention.pytorch.

  5. In V&L pretraining, it is common to use the 10-100 most confident regions [16] detected in each image (see the sketch after these notes).

  6. https://github.com/jnhwkim/ban-vqa.

  7. The extracted features used in the BAN paper are not made available by the authors. However, some “reproducibility” features (slightly different) were made available by third-party users (https://github.com/jnhwkim/ban-vqa/issues/44) who successfully reproduced the main paper results.
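
As mentioned in note 5, V&L pretraining pipelines commonly keep only the most confident detected regions per image. The sketch below shows one plausible way to implement such a selection; the array layout, threshold value, and function name are assumptions for illustration, not the exact BUA output interface.

```python
import numpy as np


def top_k_regions(features, scores, k_min=10, k_max=100, threshold=0.2):
    """Keep the most confident detected regions for one image.

    Sketch of the convention from note 5: keep the regions scoring above
    `threshold`, but never fewer than `k_min` nor more than `k_max`.
    `features` has shape (N, D) and `scores` has shape (N,); this layout is
    an assumption, not the exact BUA output format.
    """
    order = np.argsort(-scores)                       # most confident first
    keep = int((scores > threshold).sum())            # regions above threshold
    keep = min(max(keep, k_min), k_max, len(scores))  # clamp to [k_min, k_max]
    idx = order[:keep]
    return features[idx], scores[idx]


# Example with random arrays standing in for detector outputs:
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 2048)).astype(np.float32)
confs = rng.random(300).astype(np.float32)
kept_feats, kept_confs = top_k_regions(feats, confs)
print(kept_feats.shape)  # at most (100, 2048)
```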

References

  1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)


  2. Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)


  3. Beyer, L., Hénaff, O.J., Kolesnikov, A., Zhai, X., van den Oord, A.: Are we done with ImageNet? arXiv preprint arXiv:2006.07159 (2020). https://doi.org/10.48550/ARXIV.2006.07159

  4. Bugliarello, E., Cotterell, R., Okazaki, N., Elliott, D.: Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language BERTs. arXiv preprint arXiv:2011.15124 (2020)

  5. Cadene, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1989–1998 (2019)


  6. Cadene, R., Dancette, C., Cord, M., Parikh, D., et al.: RUBi: reducing unimodal biases for visual question answering. In: Advances in Neural Information Processing Systems 32 (2019)


  7. Chen, K., Gao, J., Nevatia, R.: Knowledge aided consistency for weakly supervised phrase grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4042–4050 (2018)


  8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  9. Dai, X., et al.: Dynamic head: unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7373–7382 (2021)


  10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)


  11. Dost, S., Serafini, L., Rospocher, M., Ballan, L., Sperduti, A.: Jointly linking visual and textual entity mentions with background knowledge. In: Métais, E., Meziane, F., Horacek, H., Cimiano, P. (eds.) NLDB 2020. LNCS, vol. 12089, pp. 264–276. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51310-8_24


  12. Dost, S., Serafini, L., Rospocher, M., Ballan, L., Sperduti, A.: On visual-textual-knowledge entity linking. In: ICSC, pp. 190–193. IEEE (2020)


  13. Dost, S., Serafini, L., Rospocher, M., Ballan, L., Sperduti, A.: VTKEL: a resource for visual-textual-knowledge entity linking. In: ACM, pp. 2021–2028 (2020)


  14. Dou, Z.Y., et al.: Coarse-to-fine vision-language pre-training with fusion in the backbone. arXiv preprint arXiv:2206.07643 (2022)

  15. Frank, S., Bugliarello, E., Elliott, D.: Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021). https://doi.org/10.18653/v1/2021.emnlp-main.775

  16. Gan, Z., Li, L., Li, C., Wang, L., Liu, Z., Gao, J., et al.: Vision-language pre-training: basics, recent advances, and future trends. Found. Trends® Comput. Graph. Vis. 14(3–4), 163–352 (2022)


  17. Gupta, T., Vahdat, A., Chechik, G., Yang, X., Kautz, J., Hoiem, D.: Contrastive learning for weakly supervised phrase grounding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 752–768. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_44


  18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2016). https://doi.org/10.1109/cvpr.2016.90

  19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016.90

  20. Jing, C., Jia, Y., Wu, Y., Liu, X., Wu, Q.: Maintaining reasoning consistency in compositional visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5099–5108 (2022)


  21. Kafle, K., Shrestha, R., Kanan, C.: Challenges and prospects in vision and language research. Front. Artif. Intell. 2, 28 (2019)


  22. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp. 1760–1770. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00180

  23. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems 31 (2018)


  24. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)

  25. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)


  26. Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10313–10322 (2019)


  27. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  28. Li, L.H., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975 (2022)


  29. Li, W.H., Yang, S., Wang, Y., Song, D., Li, X.Y.: Multi-level similarity learning for image-text retrieval. Inf. Process. Manage. 58(1), 102432 (2021)


  30. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8


  31. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


  32. Liu, L., et al.: Deep learning for generic object detection: a survey. Int. J. Comput. Vision 128(2), 261–318 (2020)


  33. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (December), pp. 8–14, 2019. Vancouver, BC, Canada, pp. 13–23 (2019). https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html

  34. Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.L.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015)


  35. Mogadala, A., Kalimuthu, M., Klakow, D.: Trends in integration of vision and language research: a survey of tasks, datasets, and methods. J. Artif. Intell. Res. 71, 1183–1317 (2021)


  36. Northcutt, C., Jiang, L., Chuang, I.: Confident learning: estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021)


  37. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)


  38. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017)


  39. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28 (2015)


  40. Rigoni, D., Serafini, L., Sperduti, A.: A better loss for visual-textual grounding. In: Hong, J., Bures, M., Park, J.W., Cerný, T. (eds.) SAC 2022: The 37th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, 25–29 April 2022, pp. 49–57. ACM (2022). https://doi.org/10.1145/3477314.3507047

  41. Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49


  42. Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: CVPR, pp. 4613–4621 (2016)


  43. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  44. Singh, A., et al.: Towards VQA models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326 (2019)


  45. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst., pp. 1–19 (2022). https://doi.org/10.1109/TNNLS.2022.3152527

  46. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)

  47. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100–5111 (2019)


  48. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)


  49. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022)

  50. Wang, H., Wang, H., Xu, K.: Categorizing concepts with basic level for vision-to-language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)


  51. Wang, L., Huang, J., Li, Y., Xu, K., Yang, Z., Yu, D.: Improving weakly supervised visual grounding by contrastive knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14090–14100 (2021)


  52. Wang, Q., Tan, H., Shen, S., Mahoney, M.W., Yao, Z.: MAF: multimodal alignment framework for weakly-supervised phrase grounding. arXiv preprint arXiv:2010.05379 (2020)

  53. Wang, R., Qian, Y., Feng, F., Wang, X., Jiang, H.: Co-VQA: answering by interactive sub question sequence. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 2396–2408 (2022)


  54. Wang, S., Wang, R., Yao, Z., Shan, S., Chen, X.: Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2020)


  55. Wang, X., Zhang, S., Yu, Z., Feng, L., Zhang, W.: Scale-equalizing pyramid convolution for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13359–13368 (2020)


  56. Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069 (2021)


  57. Yang, J., Li, C., Gao, J.: Focal modulation networks. arXiv preprint arXiv:2203.11926 (2022)

  58. Yang, Z., Liu, S., Hu, H., Wang, L., Lin, S.: RepPoints: point set representation for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9657–9666 (2019)


  59. Yao, Y., et al.: PEVL: position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169 (2022)

  60. Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6281–6290 (2019)


  61. Yu, Z., Yu, J., Xiang, C., Zhao, Z., Tian, Q., Tao, D.: Rethinking diversified and discriminative proposal generation for visual grounding. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1114–1120 (2018)


  62. Zhang, H., Niu, Y., Chang, S.F.: Grounding referring expressions in images by variational context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4158–4166 (2018)


  63. Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)

  64. Zhang, H., et al.: GLIPv2: unifying localization and vision-language understanding. arXiv preprint arXiv:2206.05836 (2022)

  65. Zhang, Q., Lei, Z., Zhang, Z., Li, S.Z.: Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)


  66. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768 (2020)


  67. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167 (2015)

  68. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13041–13049 (2020)



Acknowledgements

This work was supported in part by the Pioneer Centre for AI, DNRF grant number P1. Davide Rigoni was supported by a STSM Grant from Multi-task, Multilingual, Multi-modal Language Generation COST Action CA18231. We acknowledge EuroHPC Joint Undertaking for awarding us access to Vega at IZUM, Slovenia.

Author information

Correspondence to Davide Rigoni.

Appendices

Appendix 1: Frequencies by Categories

We introduced both the clean and the random category sets, each derived from the original one. The original label set contains 1600 categories, while the new clean set and the random set each contain 878 categories. Figure 3 shows the frequencies of objects appearing in the Visual Genome training split, where objects are labeled according to the original label set (in blue), the new cleaned label set (in orange), or the random label set (in brown). The new label sets mostly remove many low-frequency categories in the long tail, rather than creating new very frequent categories. Surprisingly, the random procedure that generated the random label set also removed the long tail of low-frequency categories.
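
A minimal sketch of how the per-category frequencies behind Fig. 3 could be computed and plotted on log-log axes. The input format (a flat list of object labels per label set) is an assumption for illustration, not the paper's exact data pipeline.

```python
from collections import Counter

import matplotlib.pyplot as plt


def plot_loglog_frequencies(labels_by_set):
    """Log-log plot of per-category object frequencies, as in Fig. 3.

    `labels_by_set` maps a label-set name (e.g. "original", "clean", "random")
    to the flat list of category labels of all annotated objects in the
    training split; this input format is an assumption for illustration.
    """
    for name, labels in labels_by_set.items():
        counts = sorted(Counter(labels).values(), reverse=True)
        ranks = range(1, len(counts) + 1)
        plt.loglog(ranks, counts, label=f"{name} ({len(counts)} categories)")
    plt.xlabel("category rank")
    plt.ylabel("object frequency")
    plt.legend()
    plt.show()


# Toy usage with made-up annotations:
plot_loglog_frequencies({
    "original": ["tree", "trees", "watch", "wristwatch", "tree"],
    "clean": ["tree", "tree", "watch", "watch", "tree"],
})
```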

Fig. 3. Log-log plots of object frequencies for each category. The frequencies are calculated on the training set annotations. The distribution of the original categories is in blue, the new categories are in orange, and the random categories are in brown. The cleaning process did not generate high-frequency categories and at the same time removed many low-frequency categories for both the cleaner and the random label sets. (Color figure online)

Fig. 4. KDE plots of the probability values of the argmax category predicted by the model. The plots on the left consider all the categories, the plots in the center consider only the categories that we did not merge during the cleanup process (i.e. “Untouched”), and the plots on the right consider only the merged categories. Overall, the cleaned categories lead to higher confidence values than the original categories, while there is no difference between the original and random categories.

Appendix 2: Prediction Confidence

Figure 4 reports the KDE plots of the probability values of the argmax category predicted by the models trained with the original, clean, and random label sets.

We find that the BUA detector trained on the cleaned categories produces higher-confidence predictions than a detector trained on the original noisy categories. Closer inspection shows that this difference is due to higher confidence when predicting objects in the newly merged clean categories. However, this is not the case for BUA trained on the random categories, which shows the same confidence as the model trained on the original categories.
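
A sketch of how the argmax-confidence values visualized in Fig. 4 could be computed from detector class scores and summarized with a KDE. Treating the classification head output as plain logits, and the array shapes used here, are assumptions for illustration rather than the BUA interface.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde


def argmax_confidences(class_logits):
    """Probability of the argmax category for each detected box.

    `class_logits` has shape (num_boxes, num_classes); treating the detector
    head output as plain logits is an assumption for illustration.
    """
    logits = class_logits - class_logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)


def plot_confidence_kde(conf_by_model):
    """KDE of argmax-confidence values per model, in the spirit of Fig. 4."""
    xs = np.linspace(0.0, 1.0, 200)
    for name, confs in conf_by_model.items():
        plt.plot(xs, gaussian_kde(confs)(xs), label=name)
    plt.xlabel("argmax probability")
    plt.ylabel("density")
    plt.legend()
    plt.show()


# Toy usage with random logits standing in for detector outputs:
rng = np.random.default_rng(0)
plot_confidence_kde({
    "original": argmax_confidences(rng.normal(size=(500, 1600))),
    "clean": argmax_confidences(rng.normal(size=(500, 878))),
})
```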

Table 6. Proportion of K-nearest neighbors that share the same predicted category, comparing models trained using the original versus random categories (cf. Table 3). The random features present small improvements over the original features, suggesting that there is a small advantage in training with fewer labels; however, clean labels help more.

Appendix 3: Nearest Neighbors Analysis on Random Labels

In this section, we perform the nearest neighbors analysis on the random labels, focusing on the “Merged”, “Untouched”, and “All” categories. Table 6 reports the results of this analysis, considering features extracted with different threshold values (i.e. 0.05 and 0.2) and considering either all features or only features from different images (“Filtered Neighbors”); the latter setting removes features that might come from highly overlapping regions of the same image.

The random features give results very similar to those obtained with the original features, with a small improvement. In other words, there is a small advantage to training with fewer labels overall. However, the improvement given by the clean labels is much greater than that obtained with the random labels, reinforcing the importance of training BUA with clean categories.
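
A sketch of the kind of K-nearest-neighbor purity measurement reported in Tables 3 and 6, including the “Filtered Neighbors” variant that ignores neighbors from the same image. The per-box data layout and the use of scikit-learn here are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_category_purity(features, categories, image_ids, k=10, filter_same_image=False):
    """Proportion of the K nearest neighbors sharing the query's predicted category.

    `features`, `categories`, and `image_ids` are per-box arrays (one row per
    extracted region); this layout is an assumption for illustration. With
    `filter_same_image=True`, neighbors coming from the same image as the
    query are ignored (the "Filtered Neighbors" setting).
    """
    # Query extra neighbors so that k remain after filtering out the query
    # itself and, optionally, same-image boxes.
    extra = 50 if filter_same_image else 1
    nn = NearestNeighbors(n_neighbors=k + extra).fit(features)
    _, idx = nn.kneighbors(features)

    purities = []
    for q, neigh in enumerate(idx):
        neigh = neigh[neigh != q]  # drop the query itself
        if filter_same_image:
            neigh = neigh[image_ids[neigh] != image_ids[q]]
        neigh = neigh[:k]
        if len(neigh):
            purities.append(np.mean(categories[neigh] == categories[q]))
    return float(np.mean(purities))


# Toy usage with random features standing in for extracted box features:
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))
cats = rng.integers(0, 20, size=1000)
imgs = rng.integers(0, 100, size=1000)
print(knn_category_purity(feats, cats, imgs, k=10, filter_same_image=True))
```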

Appendix 4: Clean Labels

The cleaning process produces a new set of 878 categories from the original 1600 categories, which we report below.

(Figures a–j in the published version list the full set of 878 cleaned categories.)


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Rigoni, D., Elliott, D., Frank, S. (2023). Cleaner Categories Improve Object Detection and Visual-Textual Grounding. In: Gade, R., Felsberg, M., Kämäräinen, JK. (eds) Image Analysis. SCIA 2023. Lecture Notes in Computer Science, vol 13885. Springer, Cham. https://doi.org/10.1007/978-3-031-31435-3_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-31435-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31434-6

  • Online ISBN: 978-3-031-31435-3

  • eBook Packages: Computer Science (R0)
