Abstract
Exploiting foundation models (e.g., CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either a text prompt (e.g., “the nose of a cat”) or a visual prompt (e.g., a support image with keypoint annotations) to detect the corresponding keypoints in the query image, thereby exhibiting either zero-shot or few-shot detection ability. However, research on multimodal prompting remains underexplored, and prompt diversity in semantics and language is far from being opened. For example, how should a model handle unseen text prompts for novel keypoint detection, or diverse text prompts such as “Can you detect the nose and ears of a cat?” In this work, we open the prompt diversity in three aspects: modality, semantics (seen vs. unseen), and language, to enable a more general zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint locations of unseen texts, we add auxiliary keypoints and texts, interpolated in the visual and textual domains, into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also find that a large language model (LLM) is a good parser, achieving over 96% accuracy when parsing keypoints from texts. With the LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways of dealing with unseen texts and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD.
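To make the abstract's two core ideas concrete, below is a minimal NumPy sketch of (a) correlating a query feature map with a multimodal prototype set built from visual and textual embeddings, and (b) interpolating auxiliary keypoints between annotated ones. All function names, tensor shapes, and the cosine-similarity matching are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def interpolate_auxiliary_keypoints(kp_a, kp_b, num_aux=3):
    """Linearly interpolate auxiliary keypoints between two annotated
    keypoints (e.g., between 'nose' and 'left ear') -- a simple stand-in
    for the paper's auxiliary-keypoint idea. kp_a, kp_b: (2,) (x, y)."""
    ts = np.linspace(0.0, 1.0, num_aux + 2)[1:-1]  # interior points only
    return np.stack([(1 - t) * kp_a + t * kp_b for t in ts])

def correlate_prototypes(query_feat, prototypes):
    """Cosine-similarity correlation of a query feature map with a
    multimodal prototype set, yielding one heatmap per prototype.
    query_feat: (C, H, W); prototypes: (K, C) visual or textual embeddings."""
    q = query_feat / (np.linalg.norm(query_feat, axis=0, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    return np.einsum('kc,chw->khw', p, q)  # (K, H, W) similarity heatmaps

# Toy usage with random features standing in for CLIP-style embeddings.
rng = np.random.default_rng(0)
query_feat = rng.standard_normal((512, 16, 16))
visual_protos = rng.standard_normal((2, 512))  # from support-image keypoints
text_protos = rng.standard_normal((2, 512))    # from prompts like "the nose of a cat"
heatmaps = correlate_prototypes(query_feat, np.vstack([visual_protos, text_protos]))
aux = interpolate_auxiliary_keypoints(np.array([5.0, 4.0]), np.array([11.0, 8.0]))
print(heatmaps.shape, aux.shape)  # (4, 16, 16) (3, 2)
```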
Acknowledgment
Changsheng Lu is supported by an Australian Government Research Training Program (AGRTP) International Scholarship. Piotr Koniusz is supported by CSIRO’s Science Digital.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lu, C., Liu, Z., Koniusz, P. (2025). OpenKD: Opening Prompt Diversity for Zero- and Few-Shot Keypoint Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15077. Springer, Cham. https://doi.org/10.1007/978-3-031-72655-2_9
DOI: https://doi.org/10.1007/978-3-031-72655-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72654-5
Online ISBN: 978-3-031-72655-2
eBook Packages: Computer Science, Computer Science (R0)