DOI: 10.1145/3653804.3654608
Research article, CVDL Conference Proceedings

DroneGPT: Zero-shot Video Question Answering For Drones

Published: 01 June 2024 Publication History

Abstract

With the continuing development and popularization of drone technology, drones are now used across many fields, particularly in drone video applications. We propose DroneGPT, a neural-symbolic method built on VISPROG that requires no task-specific training. It leverages the in-context learning ability of large language models to generate and execute modular programs that solve complex, compositional drone vision tasks from natural language instructions. The modules in a program can invoke off-the-shelf computer vision models (for example, for object detection) or run custom image processing routines, and their outputs are chained together to perform drone video question answering. We believe DroneGPT can broaden the range of video tasks drones can handle and further enrich the capabilities of contemporary drones.
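To make the neural-symbolic pipeline concrete, the sketch below shows a minimal VISPROG-style interpreter: an LLM would emit a short step-per-line program, and each step dispatches to a registered vision module whose result is stored in an environment. The program format and the DETECT/COUNT modules here are hypothetical stand-ins (a real system would call a detector such as Grounding DINO), not the paper's actual module set.

```python
import re


class ProgramInterpreter:
    """Executes a step-per-line program such as:

        BOXES = DETECT(FRAME, 'car')
        ANSWER = COUNT(BOXES)

    Each step calls a registered module and binds its result in an environment."""

    def __init__(self, modules):
        self.modules = modules  # module name -> callable

    def run(self, program, inputs):
        env = dict(inputs)
        step = re.compile(r"(\w+)\s*=\s*(\w+)\((.*)\)")
        for line in program.strip().splitlines():
            out, name, argstr = step.match(line.strip()).groups()
            args = [self._resolve(a.strip(), env) for a in argstr.split(",") if a.strip()]
            env[out] = self.modules[name](*args)  # execute one module call
        return env

    def _resolve(self, token, env):
        # Quoted tokens are string literals; bare tokens are environment lookups.
        if token.startswith(("'", '"')):
            return token.strip("'\"")
        return env[token]


# Stand-in modules: DETECT filters pre-computed detections by label;
# a real DETECT would run an open-set detector on the video frame.
def detect(frame, label):
    return [b for b in frame if b["label"] == label]


def count(boxes):
    return len(boxes)


interp = ProgramInterpreter({"DETECT": detect, "COUNT": count})
frame = [{"label": "car"}, {"label": "car"}, {"label": "person"}]
program = """
BOXES = DETECT(FRAME, 'car')
ANSWER = COUNT(BOXES)
"""
result = interp.run(program, {"FRAME": frame})
print(result["ANSWER"])  # number of detected cars in the frame
```

Because each step is an explicit module call, intermediate results (e.g. the detected boxes) remain inspectable, which is what makes this style of program both compositional and interpretable.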

References

[1] Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards General Purpose Vision Systems. arXiv:2104.00743, 2021.
[2] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499, 2023.
[3] Tanmay Gupta and Aniruddha Kembhavi. Visual Programming: Compositional Visual Reasoning Without Training. CVPR 2023, pp. 14953–14962.
[4] Gary Bradski. The OpenCV Library. Dr. Dobb's Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000.
[5] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. arXiv preprint, 2022.
[6] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. ECCV 2014.
[7] D. Du et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. ICCV Workshops 2019, pp. 213–226.
[8] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J. Hwang, K. Chang, and J. Gao. Grounded Language-Image Pre-training. CVPR 2022, pp. 10955–10965.
[9] K. Crowson, S. Biderman, D. Kornis, D. Stander, E. Hallahan, L. Castricato, and E. Raff. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. ECCV 2022.
[10] A. Yang, J. Pan, J. Lin, R. Men, Y. Zhang, J. Zhou, and C. Zhou. Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese. arXiv:2211.01335, 2022.
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
[12] C. Yin, J. Tang, Z. Xu, and Y. Wang. Memory Augmented Deep Recurrent Neural Network for Video Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3159–3167, 2020.
[13] Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. ICCV 2017, pp. 1839–1848.
[14] D. Huang, P. Chen, R. Zeng, Q. Du, M. Tan, and C. Gan. Location-Aware Graph Convolutional Networks for Video Question Answering. AAAI 2020.
[15] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and Explicit Visual Reasoning over Scene Graphs. CVPR 2019, pp. 8376–8384.
[16] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198, 2022.
[17] Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165, 2020.
[18] L. Yan. Video Captioning Using Global-Local Representation. IEEE Transactions on Circuits and Systems for Video Technology, 32(10):6642–6656, 2022.
[19] L. Yan, Q. Wang, S. Ma, J. Wang, and C. Yu. Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework With Spatio-Temporal Collaboration. IEEE Transactions on Circuits and Systems for Video Technology, 33(1):393–406, 2023.

Published In

CVDL '24: Proceedings of the International Conference on Computer Vision and Deep Learning
January 2024
506 pages
ISBN:9798400718199
DOI:10.1145/3653804
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Grounding DINO
  2. computer vision
  3. drone
  4. question answering
  5. video analysis
  6. visual programming

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CVDL 2024

Article Metrics

0 total citations; 81 total downloads (81 in the last 12 months, 17 in the last 6 weeks), reflecting downloads up to 25 Feb 2025.
