abstract

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation

Authors:

Mike Zheng ShouAuthors Info & Claims

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 11255 - 11257

https://doi.org/10.1145/3664647.3684998

Published: 28 October 2024 Publication History

Abstract

Graphical User Interface (GUI) Automation has shown significant potential recently. Previous works built GUI Agent systems to handle short-procedure tasks such as element grounding or functional assistance. In this paper, we propose a novel PC-Copilot, AssistEditor, that focuses on automating the video editing workflow. Unlike previous approaches, our system does not require users to input specific commands to control the computer. Instead, users simply describe their requirements, such as the content and style of the video, and upload the necessary materials. The system then autonomously translates these requirements into detailed actions for controlling video understanding models and professional video editing software, e.g., Premiere Pro to produce the final video. This functionality is enabled by a collaborative AI agent framework of multiple GUI agents, each capable of dialogue, knowledge retrieval, and software usage. These agents have distinct roles, including interacting with users to gather requirements, generating storyboards, and performing editing tasks. This approach significantly streamlines the video editing process, making advanced editing accessible to users with varying levels of expertise.

References

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

[2]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2024. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, Vol. 36 (2024).

[3]

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. 2023. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108 (2023).

[4]

Cognition Labs. 2024. Devin, AI software engineer. https://www.cognition-labs.com/introducing-devin Retrieved April 12, 2024 from

[5]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, Vol. 33 (2020), 9459--9474.

[6]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, Vol. 36 (2023), 51991--52008.

[7]

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. 2023. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2794--2804.

[8]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems, Vol. 36 (2024).

[9]

Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/ Accessed: 2024-04--18.

[10]

Grégoire Mialon, Roberto Dess`i, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842 (2023).

[11]

OpenAI. 2023. GPT-4 Technical Report. arxiv: 2303.08774 [cs.CL]

[12]

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088 (2023).

[13]

Toran Bruce Richards. 2023. Auto-GPT: An Autonomous GPT-4 Experiment.

[14]

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina N Toutanova. 2024. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. Advances in Neural Information Processing Systems, Vol. 36 (2024).

[15]

Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. 2023. Openagents: An open platform for language agents in the wild. arXiv preprint arXiv:2310.10634 (2023).

[16]

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. arXiv preprint arXiv:2403.02713 (2024).

[17]

Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. 2024. AgentStudio: A Toolkit for Building General Virtual Agents. arXiv preprint arXiv:2403.17918 (2024).

[18]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023).

Index Terms

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation
1. Computing methodologies
  1. Artificial intelligence

Recommendations

Multi-Agent-Based Film Editing Collaboration System
In order to realize the effective cooperation between editor agents in the film and television editing collaboration system, it is analyzed that the state change of the film and television editing and production process is affected by the cross influence ...
PodReels: Human-AI Co-Creation of Video Podcast Teasers
DIS '24: Proceedings of the 2024 ACM Designing Interactive Systems Conference

Video podcast teasers are short videos that can be shared on social media platforms to capture interest in full episodes of a video podcast. These teasers enable long-form podcasters to reach new audiences and gain more followers. However, creating a ...
Multi-clip video editing from a single viewpoint
CVMP '14: Proceedings of the 11th European Conference on Visual Media Production

We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming important actors and objects can be localized using computer ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN:9798400706868

DOI:10.1145/3664647

General Chairs:
Jianfei Cai
Monash University, Australia
,
Mohan Kankanhalli
NUS, Singapore
,
Balakrishnan Prabhakaran
UT Dallas, USA
,
Susanne Boll
University of Oldenburg, Germany
,
Program Chairs:
Ramanathan Subramanian
University of Canberra & IIT Ropar, Australia
,
Liang Zheng
Australian National University, Australia
,
Vivek K. Singh
Rutgers University, USA
,
Pablo Cesar
Centrum Wiskunde & Informatica, Netherlands
,
Lexing Xie
Australian National University, Australia
,
Dong Xu
University of Hong Kong, Hong Kong

Copyright © 2024 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 October 2024

Check for updates

Author Tags

Qualifiers

Abstract

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
226
Total Downloads

Downloads (Last 12 months)226
Downloads (Last 6 weeks)103

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten