skip to main content
10.1145/3664647.3684998acmconferencesArticle/Chapter ViewAbstractPublication PagesmmConference Proceedingsconference-collections
abstract

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation

Published: 28 October 2024 Publication History

Abstract

Graphical User Interface (GUI) Automation has shown significant potential recently. Previous works built GUI Agent systems to handle short-procedure tasks such as element grounding or functional assistance. In this paper, we propose a novel PC-Copilot, AssistEditor, that focuses on automating the video editing workflow. Unlike previous approaches, our system does not require users to input specific commands to control the computer. Instead, users simply describe their requirements, such as the content and style of the video, and upload the necessary materials. The system then autonomously translates these requirements into detailed actions for controlling video understanding models and professional video editing software, e.g., Premiere Pro to produce the final video. This functionality is enabled by a collaborative AI agent framework of multiple GUI agents, each capable of dialogue, knowledge retrieval, and software usage. These agents have distinct roles, including interacting with users to gather requirements, generating storyboards, and performing editing tasks. This approach significantly streamlines the video editing process, making advanced editing accessible to users with varying levels of expertise.

References

[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[2]
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2024. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, Vol. 36 (2024).
[3]
Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. 2023. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108 (2023).
[4]
Cognition Labs. 2024. Devin, AI software engineer. https://www.cognition-labs.com/introducing-devin Retrieved April 12, 2024 from
[5]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, Vol. 33 (2020), 9459--9474.
[6]
Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, Vol. 36 (2023), 51991--52008.
[7]
Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. 2023. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2794--2804.
[8]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems, Vol. 36 (2024).
[9]
Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/ Accessed: 2024-04--18.
[10]
Grégoire Mialon, Roberto Dess`i, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842 (2023).
[11]
OpenAI. 2023. GPT-4 Technical Report. arxiv: 2303.08774 [cs.CL]
[12]
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088 (2023).
[13]
Toran Bruce Richards. 2023. Auto-GPT: An Autonomous GPT-4 Experiment.
[14]
Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina N Toutanova. 2024. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. Advances in Neural Information Processing Systems, Vol. 36 (2024).
[15]
Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. 2023. Openagents: An open platform for language agents in the wild. arXiv preprint arXiv:2310.10634 (2023).
[16]
Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. arXiv preprint arXiv:2403.02713 (2024).
[17]
Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. 2024. AgentStudio: A Toolkit for Building General Virtual Agents. arXiv preprint arXiv:2403.17918 (2024).
[18]
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023).

Index Terms

  1. AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2024

    Check for updates

    Author Tags

    1. gui automation
    2. multi-agent collaboration
    3. video editing

    Qualifiers

    • Abstract

    Conference

    MM '24
    Sponsor:
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 226
      Total Downloads
    • Downloads (Last 12 months)226
    • Downloads (Last 6 weeks)103
    Reflects downloads up to 03 Mar 2025

    Other Metrics

    Citations

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media