
Cleaner Categories Improve Object Detection and Visual-Textual Grounding

  • Conference paper

Image Analysis (SCIA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13885)


Abstract

Object detectors are core components of multimodal models, enabling them to locate regions of interest in images, which are then used to solve many multimodal tasks. Among the many extant object detectors, the Bottom-Up Faster R-CNN [39] (BUA) object detector is the most commonly used by the multimodal language-and-vision community, usually as a black-box visual feature generator for solving downstream multimodal tasks. It is trained on the Visual Genome dataset [25] to detect 1600 different objects. However, those object categories are defined using automatically processed image region descriptions from the Visual Genome dataset. The automatic process introduces some unexpected near-duplicate categories (e.g. “watch” and “wristwatch”, “tree” and “trees”, and “motorcycle” and “motorbike”) that may result in a sub-optimal representational space and likely impair the ability of the model to classify objects correctly. In this paper, we manually merge near-duplicate labels to create a cleaner label set, which is used to retrain the object detector. We investigate the effect of using the cleaner label set in terms of: (i) performance on the original object detection task, (ii) the properties of the embedding space learned by the detector, and (iii) the utility of the features in a visual grounding task on the Flickr30K Entities dataset. We find that the BUA model trained with the cleaner categories learns a better-clustered embedding space than the model trained with the noisy categories. The new embedding space improves performance on the object detection task and also yields better bounding-box feature representations, which help to solve the visual grounding task.
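
The label-merging step described above can be illustrated with a minimal sketch. The merge map and annotation format below are hypothetical; the paper's actual 1600-to-878 mapping was built manually (see Appendix 4), and this only shows the kind of remapping involved.

```python
# Minimal sketch of the label-merging step, assuming a hand-built merge map.
# The entries below are illustrative examples from the abstract, not the
# authors' full mapping.

MERGE_MAP = {
    "wristwatch": "watch",
    "trees": "tree",
    "motorbike": "motorcycle",
}


def clean_label(label: str) -> str:
    """Map a noisy Visual Genome category to its canonical form."""
    return MERGE_MAP.get(label, label)


def clean_annotations(annotations):
    """Relabel object annotations with the cleaned categories.

    `annotations` is assumed to be a list of dicts with a "category" key,
    e.g. as loaded from the Visual Genome object annotations.
    """
    for ann in annotations:
        ann["category"] = clean_label(ann["category"])
    return annotations


if __name__ == "__main__":
    anns = [{"category": "wristwatch"}, {"category": "tree"}, {"category": "trees"}]
    print(clean_annotations(anns))
    # [{'category': 'watch'}, {'category': 'tree'}, {'category': 'tree'}]
```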


Notes

  1. https://github.com/MILVLG/bottom-up-attention.pytorch.

  2. https://github.com/peteanderson80/bottom-up-attention.

  3. https://github.com/facebookresearch/detectron2.

  4. https://github.com/drigoni/bottom-up-attention.pytorch.

  5. In V&L pretraining, it is common to use the 10-100 most confident regions [16] detected in each image (see the sketch after these notes).

  6. https://github.com/jnhwkim/ban-vqa.

  7. The extracted features used in the BAN paper are not made available by the authors. However, some “reproducibility” features (slightly different) were made available by third-party users (https://github.com/jnhwkim/ban-vqa/issues/44) who successfully reproduced the main paper results.
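
As mentioned in note 5, V&L pretraining pipelines commonly keep only the most confident detected regions per image. The sketch below shows one plausible way to implement such a selection; the array layout, threshold value, and function name are assumptions for illustration, not the exact BUA output interface.

```python
import numpy as np


def top_k_regions(features, scores, k_min=10, k_max=100, threshold=0.2):
    """Keep the most confident detected regions for one image.

    Sketch of the convention from note 5: keep the regions scoring above
    `threshold`, but never fewer than `k_min` nor more than `k_max`.
    `features` has shape (N, D) and `scores` has shape (N,); this layout is
    an assumption, not the exact BUA output format.
    """
    order = np.argsort(-scores)                       # most confident first
    keep = int((scores > threshold).sum())            # regions above threshold
    keep = min(max(keep, k_min), k_max, len(scores))  # clamp to [k_min, k_max]
    idx = order[:keep]
    return features[idx], scores[idx]


# Example with random arrays standing in for detector outputs:
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 2048)).astype(np.float32)
confs = rng.random(300).astype(np.float32)
kept_feats, kept_confs = top_k_regions(feats, confs)
print(kept_feats.shape)  # at most (100, 2048)
```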

References

  1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)


  2. Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)


  3. Beyer, L., Hénaff, O.J., Kolesnikov, A., Zhai, X., van den Oord, A.: Are we done with ImageNet? arXiv preprint arXiv:2006.07159 (2020). https://doi.org/10.48550/ARXIV.2006.07159

  4. Bugliarello, E., Cotterell, R., Okazaki, N., Elliott, D.: Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language BERTs. arXiv preprint arXiv:2011.15124 (2020)

  5. Cadene, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1989–1998 (2019)


  6. Cadene, R., Dancette, C., Cord, M., Parikh, D., et al.: RUBi: reducing unimodal biases for visual question answering. In: Advances in Neural Information Processing Systems 32 (2019)


  7. Chen, K., Gao, J., Nevatia, R.: Knowledge aided consistency for weakly supervised phrase grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4042–4050 (2018)


  8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  9. Dai, X., et al.: Dynamic head: unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7373–7382 (2021)


  10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)


  11. Dost, S., Serafini, L., Rospocher, M., Ballan, L., Sperduti, A.: Jointly linking visual and textual entity mentions with background knowledge. In: Métais, E., Meziane, F., Horacek, H., Cimiano, P. (eds.) NLDB 2020. LNCS, vol. 12089, pp. 264–276. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51310-8_24


  12. Dost, S., Serafini, L., Rospocher, M., Ballan, L., Sperduti, A.: On visual-textual-knowledge entity linking. In: ICSC, pp. 190–193. IEEE (2020)


  13. Dost, S., Serafini, L., Rospocher, M., Ballan, L., Sperduti, A.: VTKEL: a resource for visual-textual-knowledge entity linking. In: ACM, pp. 2021–2028 (2020)


  14. Dou, Z.Y., et al.: Coarse-to-fine vision-language pre-training with fusion in the backbone. arXiv preprint arXiv:2206.07643 (2022)

  15. Frank, S., Bugliarello, E., Elliott, D.: Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021). https://doi.org/10.18653/v1/2021.emnlp-main.775

  16. Gan, Z., Li, L., Li, C., Wang, L., Liu, Z., Gao, J., et al.: Vision-language pre-training: basics, recent advances, and future trends. Found. Trends® Comput. Graph. Vis. 14(3–4), 163–352 (2022)


  17. Gupta, T., Vahdat, A., Chechik, G., Yang, X., Kautz, J., Hoiem, D.: Contrastive learning for weakly supervised phrase grounding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 752–768. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_44


  18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2016). https://doi.org/10.1109/cvpr.2016.90

  19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016.90

  20. Jing, C., Jia, Y., Wu, Y., Liu, X., Wu, Q.: Maintaining reasoning consistency in compositional visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5099–5108 (2022)


  21. Kafle, K., Shrestha, R., Kanan, C.: Challenges and prospects in vision and language research. Front. Artif. Intell. 2, 28 (2019)


  22. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp. 1760–1770. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00180

  23. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems 31 (2018)


  24. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)

  25. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)


  26. Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10313–10322 (2019)


  27. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  28. Li, L.H., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975 (2022)


  29. Li, W.H., Yang, S., Wang, Y., Song, D., Li, X.Y.: Multi-level similarity learning for image-text retrieval. Inf. Process. Manage. 58(1), 102432 (2021)


  30. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8


  31. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


  32. Liu, L., et al.: Deep learning for generic object detection: a survey. Int. J. Comput. Vision 128(2), 261–318 (2020)


  33. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019 (December), pp. 8–14, 2019. Vancouver, BC, Canada, pp. 13–23 (2019). https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html

  34. Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.L.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015)


  35. Mogadala, A., Kalimuthu, M., Klakow, D.: Trends in integration of vision and language research: a survey of tasks, datasets, and methods. J. Artif. Intell. Res. 71, 1183–1317 (2021)


  36. Northcutt, C., Jiang, L., Chuang, I.: Confident learning: estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021)


  37. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)


  38. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017)


  39. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28 (2015)


  40. Rigoni, D., Serafini, L., Sperduti, A.: A better loss for visual-textual grounding. In: Hong, J., Bures, M., Park, J.W., Cerný, T. (eds.) SAC 2022: The 37th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, 25–29 April 2022, pp. 49–57. ACM (2022). https://doi.org/10.1145/3477314.3507047

  41. Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49


  42. Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: CVPR, pp. 4613–4621 (2016)


  43. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  44. Singh, A., et al.: Towards VQA models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326 (2019)


  45. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst., pp. 1–19 (2022). https://doi.org/10.1109/TNNLS.2022.3152527

  46. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)

  47. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100–5111 (2019)


  48. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)


  49. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022)

  50. Wang, H., Wang, H., Xu, K.: Categorizing concepts with basic level for vision-to-language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)


  51. Wang, L., Huang, J., Li, Y., Xu, K., Yang, Z., Yu, D.: Improving weakly supervised visual grounding by contrastive knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14090–14100 (2021)


  52. Wang, Q., Tan, H., Shen, S., Mahoney, M.W., Yao, Z.: MAF: multimodal alignment framework for weakly-supervised phrase grounding. arXiv preprint arXiv:2010.05379 (2020)

  53. Wang, R., Qian, Y., Feng, F., Wang, X., Jiang, H.: Co-VQA: answering by interactive sub question sequence. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 2396–2408 (2022)


  54. Wang, S., Wang, R., Yao, Z., Shan, S., Chen, X.: Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2020)


  55. Wang, X., Zhang, S., Yu, Z., Feng, L., Zhang, W.: Scale-equalizing pyramid convolution for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13359–13368 (2020)


  56. Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069 (2021)


  57. Yang, J., Li, C., Gao, J.: Focal modulation networks. arXiv preprint arXiv:2203.11926 (2022)

  58. Yang, Z., Liu, S., Hu, H., Wang, L., Lin, S.: RepPoints: point set representation for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9657–9666 (2019)


  59. Yao, Y., et al.: PEVL: position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169 (2022)

  60. Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6281–6290 (2019)


  61. Yu, Z., Yu, J., Xiang, C., Zhao, Z., Tian, Q., Tao, D.: Rethinking diversified and discriminative proposal generation for visual grounding. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1114–1120 (2018)


  62. Zhang, H., Niu, Y., Chang, S.F.: Grounding referring expressions in images by variational context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4158–4166 (2018)


  63. Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)

  64. Zhang, H., et al.: GLIPv2: unifying localization and vision-language understanding. arXiv preprint arXiv:2206.05836 (2022)

  65. Zhang, Q., Lei, Z., Zhang, Z., Li, S.Z.: Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)


  66. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768 (2020)


  67. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167 (2015)

  68. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13041–13049 (2020)



Acknowledgements

This work was supported in part by the Pioneer Centre for AI, DNRF grant number P1. Davide Rigoni was supported by a STSM Grant from Multi-task, Multilingual, Multi-modal Language Generation COST Action CA18231. We acknowledge EuroHPC Joint Undertaking for awarding us access to Vega at IZUM, Slovenia.

Author information

Correspondence to Davide Rigoni.

Appendices

Appendix 1: Frequencies by Categories

We introduced both the clean and the random category sets, each derived from the original one. The original label set contains 1600 categories, while the new clean set and the random set each contain 878 categories. Figure 3 shows the frequencies of objects appearing in the Visual Genome training split, where objects are labeled according to the original label set (in blue), the new cleaned label set (in orange), or the random label set (in brown). The new label sets mostly remove many low-frequency categories in the long tail, rather than creating new very frequent categories. Surprisingly, the random procedure that generated the random label set also removed the long tail of low-frequency categories.
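
A minimal sketch of how the per-category frequencies behind Fig. 3 could be computed and plotted on log-log axes. The input format (a flat list of object labels per label set) is an assumption for illustration, not the paper's exact data pipeline.

```python
from collections import Counter

import matplotlib.pyplot as plt


def plot_loglog_frequencies(labels_by_set):
    """Log-log plot of per-category object frequencies, as in Fig. 3.

    `labels_by_set` maps a label-set name (e.g. "original", "clean", "random")
    to the flat list of category labels of all annotated objects in the
    training split; this input format is an assumption for illustration.
    """
    for name, labels in labels_by_set.items():
        counts = sorted(Counter(labels).values(), reverse=True)
        ranks = range(1, len(counts) + 1)
        plt.loglog(ranks, counts, label=f"{name} ({len(counts)} categories)")
    plt.xlabel("category rank")
    plt.ylabel("object frequency")
    plt.legend()
    plt.show()


# Toy usage with made-up annotations:
plot_loglog_frequencies({
    "original": ["tree", "trees", "watch", "wristwatch", "tree"],
    "clean": ["tree", "tree", "watch", "watch", "tree"],
})
```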

Fig. 3. Log-log plots of object frequencies for each category. The frequencies are calculated on the training set annotations. The distribution of the original categories is in blue, the new categories are in orange, and the random categories are in brown. The cleaning process did not generate high-frequency categories and at the same time removed many low-frequency categories for both the cleaner and the random label sets. (Color figure online)

Fig. 4. KDE plots of the probability values of the argmax category predicted by the model. The plots on the left consider all the categories, the plots in the center consider only the categories that we did not merge during the cleanup process (i.e. “Untouched”), and the plots on the right consider only the merged categories. Overall, the cleaned categories lead to higher confidence values than the original categories, while there is no difference between the original and random categories.

Appendix 2: Prediction Confidence

Figure 4 reports the KDE plots of the probability values of the argmax category predicted by the models trained with the original, clean, and random label sets.

We find that the BUA detector trained on the cleaned categories produces higher-confidence predictions than a detector trained on the original noisy categories. Closer inspection shows that this difference is due to higher confidence when predicting objects in the newly merged clean categories. However, this is not the case for BUA trained on the random categories, which shows the same confidence as the model trained on the original categories.
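
A sketch of how the argmax-confidence values visualized in Fig. 4 could be computed from detector class scores and summarized with a KDE. Treating the classification head output as plain logits, and the array shapes used here, are assumptions for illustration rather than the BUA interface.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde


def argmax_confidences(class_logits):
    """Probability of the argmax category for each detected box.

    `class_logits` has shape (num_boxes, num_classes); treating the detector
    head output as plain logits is an assumption for illustration.
    """
    logits = class_logits - class_logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)


def plot_confidence_kde(conf_by_model):
    """KDE of argmax-confidence values per model, in the spirit of Fig. 4."""
    xs = np.linspace(0.0, 1.0, 200)
    for name, confs in conf_by_model.items():
        plt.plot(xs, gaussian_kde(confs)(xs), label=name)
    plt.xlabel("argmax probability")
    plt.ylabel("density")
    plt.legend()
    plt.show()


# Toy usage with random logits standing in for detector outputs:
rng = np.random.default_rng(0)
plot_confidence_kde({
    "original": argmax_confidences(rng.normal(size=(500, 1600))),
    "clean": argmax_confidences(rng.normal(size=(500, 878))),
})
```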

Table 6. Proportion of K-nearest neighbors that share the same predicted category, comparing models trained using the original versus random categories (cf. Table 3). The random features present small improvements over the original features, suggesting that there is a small advantage in training with fewer labels; however, clean labels help more.

Appendix 3: Nearest Neighbors Analysis on Random Labels

In this section, we perform the nearest neighbors analysis on the random labels, focusing on the “Merged”, “Untouched”, and “All” categories. Table 6 reports the results of this analysis, considering features extracted with different threshold values (i.e. 0.05 and 0.2) and considering either all features or only features from different images (“Filtered Neighbors”); the latter setting removes features that might come from highly overlapping regions of the same image.

The random features give results very similar to those obtained with the original features, with a small improvement. In other words, there is a small advantage to training with fewer labels overall. However, the improvement given by the clean labels is much greater than that obtained with the random labels, reinforcing the importance of training BUA with clean categories.
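
A sketch of the kind of K-nearest-neighbor purity measurement reported in Tables 3 and 6, including the “Filtered Neighbors” variant that ignores neighbors from the same image. The per-box data layout and the use of scikit-learn here are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_category_purity(features, categories, image_ids, k=10, filter_same_image=False):
    """Proportion of the K nearest neighbors sharing the query's predicted category.

    `features`, `categories`, and `image_ids` are per-box arrays (one row per
    extracted region); this layout is an assumption for illustration. With
    `filter_same_image=True`, neighbors coming from the same image as the
    query are ignored (the "Filtered Neighbors" setting).
    """
    # Query extra neighbors so that k remain after filtering out the query
    # itself and, optionally, same-image boxes.
    extra = 50 if filter_same_image else 1
    nn = NearestNeighbors(n_neighbors=k + extra).fit(features)
    _, idx = nn.kneighbors(features)

    purities = []
    for q, neigh in enumerate(idx):
        neigh = neigh[neigh != q]  # drop the query itself
        if filter_same_image:
            neigh = neigh[image_ids[neigh] != image_ids[q]]
        neigh = neigh[:k]
        if len(neigh):
            purities.append(np.mean(categories[neigh] == categories[q]))
    return float(np.mean(purities))


# Toy usage with random features standing in for extracted box features:
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))
cats = rng.integers(0, 20, size=1000)
imgs = rng.integers(0, 100, size=1000)
print(knn_category_purity(feats, cats, imgs, k=10, filter_same_image=True))
```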

Appendix 4: Clean Labels

The cleaning process produces a new set of 878 categories from the original 1600 categories, which we report below.

(Figures a–j in the published version list the full set of 878 cleaned categories.)


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Rigoni, D., Elliott, D., Frank, S. (2023). Cleaner Categories Improve Object Detection and Visual-Textual Grounding. In: Gade, R., Felsberg, M., Kämäräinen, JK. (eds) Image Analysis. SCIA 2023. Lecture Notes in Computer Science, vol 13885. Springer, Cham. https://doi.org/10.1007/978-3-031-31435-3_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-31435-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31434-6

  • Online ISBN: 978-3-031-31435-3

  • eBook Packages: Computer Science (R0)
