DOI: 10.1145/3653804.3654608
Research article, CVDL Conference Proceedings

DroneGPT: Zero-shot Video Question Answering For Drones

Published: 01 June 2024 Publication History

Abstract

With the continuing development and popularization of drone technology, drones are now used across many fields, particularly in drone video applications. We propose DroneGPT, a neural-symbolic method built on VISPROG that requires no task-specific training. It leverages the in-context learning ability of large language models to generate and execute modular programs that solve complex, compositional drone vision tasks from natural language instructions. The modules in a program can invoke off-the-shelf computer vision models (for example, for object detection) or run custom image processing routines, and their outputs are chained together to perform drone video question answering. We believe DroneGPT can broaden the range of video tasks drones can handle and further enrich the capabilities of contemporary drones.
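To make the neural-symbolic pipeline concrete, the sketch below shows a minimal VISPROG-style interpreter: an LLM would emit a short step-per-line program, and each step dispatches to a registered vision module whose result is stored in an environment. The program format and the DETECT/COUNT modules here are hypothetical stand-ins (a real system would call a detector such as Grounding DINO), not the paper's actual module set.

```python
import re


class ProgramInterpreter:
    """Executes a step-per-line program such as:

        BOXES = DETECT(FRAME, 'car')
        ANSWER = COUNT(BOXES)

    Each step calls a registered module and binds its result in an environment."""

    def __init__(self, modules):
        self.modules = modules  # module name -> callable

    def run(self, program, inputs):
        env = dict(inputs)
        step = re.compile(r"(\w+)\s*=\s*(\w+)\((.*)\)")
        for line in program.strip().splitlines():
            out, name, argstr = step.match(line.strip()).groups()
            args = [self._resolve(a.strip(), env) for a in argstr.split(",") if a.strip()]
            env[out] = self.modules[name](*args)  # execute one module call
        return env

    def _resolve(self, token, env):
        # Quoted tokens are string literals; bare tokens are environment lookups.
        if token.startswith(("'", '"')):
            return token.strip("'\"")
        return env[token]


# Stand-in modules: DETECT filters pre-computed detections by label;
# a real DETECT would run an open-set detector on the video frame.
def detect(frame, label):
    return [b for b in frame if b["label"] == label]


def count(boxes):
    return len(boxes)


interp = ProgramInterpreter({"DETECT": detect, "COUNT": count})
frame = [{"label": "car"}, {"label": "car"}, {"label": "person"}]
program = """
BOXES = DETECT(FRAME, 'car')
ANSWER = COUNT(BOXES)
"""
result = interp.run(program, {"FRAME": frame})
print(result["ANSWER"])  # number of detected cars in the frame
```

Because each step is an explicit module call, intermediate results (e.g. the detected boxes) remain inspectable, which is what makes this style of program both compositional and interpretable.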

References

[1] Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards General Purpose Vision Systems. arXiv:2104.00743, 2021.
[2] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499, 2023.
[3] Tanmay Gupta and Aniruddha Kembhavi. Visual Programming: Compositional Visual Reasoning Without Training. CVPR 2023, pp. 14953–14962.
[4] Gary Bradski. The OpenCV Library. Dr. Dobb's Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000.
[5] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection. arXiv preprint, 2022.
[6] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. ECCV 2014.
[7] D. Du et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. ICCV Workshops 2019, pp. 213–226.
[8] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J. Hwang, K. Chang, and J. Gao. Grounded Language-Image Pre-training. CVPR 2022, pp. 10955–10965.
[9] K. Crowson, S. Biderman, D. Kornis, D. Stander, E. Hallahan, L. Castricato, and E. Raff. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. ECCV 2022.
[10] A. Yang, J. Pan, J. Lin, R. Men, Y. Zhang, J. Zhou, and C. Zhou. Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese. arXiv:2211.01335, 2022.
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
[12] C. Yin, J. Tang, Z. Xu, and Y. Wang. Memory Augmented Deep Recurrent Neural Network for Video Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3159–3167, 2020.
[13] Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. ICCV 2017, pp. 1839–1848.
[14] D. Huang, P. Chen, R. Zeng, Q. Du, M. Tan, and C. Gan. Location-Aware Graph Convolutional Networks for Video Question Answering. AAAI 2020.
[15] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and Explicit Visual Reasoning over Scene Graphs. CVPR 2019, pp. 8376–8384.
[16] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198, 2022.
[17] Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165, 2020.
[18] L. Yan. Video Captioning Using Global-Local Representation. IEEE Transactions on Circuits and Systems for Video Technology, 32(10):6642–6656, 2022.
[19] L. Yan, Q. Wang, S. Ma, J. Wang, and C. Yu. Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework With Spatio-Temporal Collaboration. IEEE Transactions on Circuits and Systems for Video Technology, 33(1):393–406, 2023.

Published In

CVDL '24: Proceedings of the International Conference on Computer Vision and Deep Learning
January 2024
506 pages
ISBN:9798400718199
DOI:10.1145/3653804
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Grounding DINO
  2. computer vision
  3. drone
  4. question answering
  5. video analysis
  6. visual programming

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CVDL 2024

Article Metrics

0 total citations; 81 total downloads (81 in the last 12 months, 17 in the last 6 weeks), reflecting downloads up to 25 Feb 2025.
