OpenStoryline: An Agentic Framework for Autonomous, Human-Aligned Video Creation
Abstract. Existing "intelligent editing" solutions often remain confined by rigid templates, heavy parameterization, or complex engineering pipelines, failing to bridge the gap between amateur workflows and professional-grade storytelling. In this work, we present FireRed-OpenStoryline, an agentic video creation framework that synergizes Large Language Models (LLMs) with autonomous planning and precise tool execution, fundamentally shifting the paradigm from manual editing to intention-driven directing. Given abstract user intent and raw footage, the system orchestrates a comprehensive pipeline: it performs semantic media retrieval and shot-level content understanding to select high-quality clips; generates rhythm-aware narratives where subtitle pacing aligns strictly with visual flow; and ensures audiovisual aesthetic alignment through mood-matched background music, beat-synced cutting, and expressive voiceovers. Architecturally, OpenStoryline is modularized via a Model Context Protocol (MCP) server for granular editing primitives and a state-aware middleware for context management. Crucially, unlike "black-box" one-shot generators, our system supports a transparent, human-in-the-loop workflow where users can intervene via natural language at any stage and encapsulate their preferences into reusable "Style Skills", allowing for the efficient replication of distinct editing aesthetics across future projects.
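The abstract's staged pipeline (semantic retrieval, shot-level filtering, rhythm-aware narration, audiovisual alignment) can be pictured as a simple orchestration loop. The sketch below is a minimal illustration under assumed names; `Project`, `retrieve_clips`, `filter_shots`, and `write_narrative` are hypothetical and do not reflect OpenStoryline's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the staged pipeline described in the abstract;
# none of these names come from the OpenStoryline codebase.

@dataclass
class Project:
    intent: str                                    # abstract user intent
    media: list[str]                               # raw footage paths
    clips: list[str] = field(default_factory=list)
    narrative: list[str] = field(default_factory=list)

def retrieve_clips(media: list[str], query: str) -> list[str]:
    """Semantic media retrieval: keep footage relevant to the intent (stub)."""
    return media

def filter_shots(clips: list[str]) -> list[str]:
    """Shot-level content understanding: drop low-quality shots (stub)."""
    return clips

def write_narrative(clips: list[str], intent: str) -> list[str]:
    """Rhythm-aware narration: one subtitle line paced per clip (stub)."""
    return [f"[{intent}] subtitle for {clip}" for clip in clips]

def create_video(project: Project) -> Project:
    project.clips = retrieve_clips(project.media, query=project.intent)
    project.clips = filter_shots(project.clips)
    project.narrative = write_narrative(project.clips, project.intent)
    # Audiovisual alignment (music, beat-synced cuts, voiceover) would follow.
    return project

print(create_video(Project("upbeat travel vlog", ["beach.mp4", "market.mp4"])).narrative)
```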
Contents
Features
Architecture
Figure. The FireRed-OpenStoryline system architecture for agentic video creation. External inputs (natural-language prompts, multimodal media, and runtime configurations) are processed by an Agent Client, where an LLM/VLM performs intent routing and produces either direct textual responses or structured tool calls. A Storyline Middleware layer enforces robustness by maintaining context, handling dependencies, and summarizing tool outputs; intermediate states are persisted in Agent Memory to enable iterative refinement. Execution is delegated to an MCP (Model Context Protocol) server that exposes modular atomic tool nodes. A resource layer supplies reusable assets and editable “skills” that encode creator preferences, while optional external APIs extend capabilities. This separation of decision making, resilient orchestration, and standardized tool execution supports scalable extensibility beyond rigid, tool-driven editing pipelines.
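To illustrate how an atomic editing primitive might be exposed as an MCP tool node, here is a minimal sketch using the FastMCP server from the official `mcp` Python SDK. The `trim_clip` tool, its parameters, and its behavior are assumptions for illustration, not OpenStoryline's actual tool surface:

```python
from mcp.server.fastmcp import FastMCP

# Minimal sketch of an MCP server exposing one atomic editing primitive.
# The tool name, parameters, and behavior are hypothetical; OpenStoryline's
# real tool nodes are defined by the project itself.

mcp = FastMCP("storyline-editing")

@mcp.tool()
def trim_clip(path: str, start: float, end: float) -> str:
    """Trim a clip to [start, end] seconds and return the output path."""
    out_path = f"{path}.trimmed.mp4"
    # A real implementation would invoke an editing backend (e.g., ffmpeg) here.
    return out_path

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an agent client can call the tool
```

Keeping each tool this small is what lets the agent client compose arbitrary edits from structured tool calls while the middleware tracks dependencies between them.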
Demo
Zhongcao Style (种草视频, product-recommendation video)
Humorous Style (幽默风格)
Product Picks (好物分享)
Artistic Style (文艺风格)
Unboxing (开箱视频)
Talking Pet (萌宠说话)
Travel Vlog (旅行Vlog)
Year-in-Review (年终总结)
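Each demo style above is the kind of aesthetic the abstract says a user can encapsulate as a reusable "Style Skill". A minimal sketch of what such an encoding might look like follows; the schema and field names are illustrative assumptions, not the project's actual skill format:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical encoding of a reusable "Style Skill"; the field names and
# schema are illustrative assumptions, not OpenStoryline's actual format.

@dataclass
class StyleSkill:
    name: str
    pacing: str          # e.g. "fast beat-synced cuts" vs. "long takes"
    music_mood: str      # mood tag used when matching background music
    voiceover_tone: str  # tone hint for the expressive voiceover
    subtitle_style: str  # caption look-and-feel

travel_vlog = StyleSkill(
    name="Travel Vlog",
    pacing="fast beat-synced cuts",
    music_mood="upbeat",
    voiceover_tone="enthusiastic first person",
    subtitle_style="bold captions keyed to scene changes",
)

# Persist the skill so the same aesthetic can be replayed on future projects.
print(json.dumps(asdict(travel_vlog), indent=2, ensure_ascii=False))
```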