How to create long-form video with Claude Code

How to create long-form video with Claude Code from scratch — the agent, the renderer, the image, motion, and voice services — and where it gets hard.

By Hayden · Cofounder, Framesail

You can already get Claude Code to write a script and call an image model. Turning that into a 10-minute narrated video — one with the same character in shot 2 and shot 150, captions that land on the right word, and a single MP4 at the end — is a different kind of problem. The script is the easy part. Everything downstream is wiring and bookkeeping. This is how to create long-form video with Claude Code from scratch: which services you connect, how the agent drives them, and the three things that quietly break once the video gets long.

The same build works with Codex or the Gemini CLI — anything that speaks MCP and can call tools in a loop. Claude Code is just the one we'll use for the examples.

Figure: Claude Code orchestrating image, motion, and voice services to build a long-form video, with the consistency and timing problems called out.

The from-scratch stack

A finished video is more than generated clips, so you assemble a few pieces. At minimum:

  • A renderer to assemble the final cut. Remotion lets you build videos in React; HyperFrames is a code-driven renderer in the same spirit. Both are excellent for short, templated, data-driven pieces — a product clip, a stat card, a 30-second promo.
  • An image service for the stills behind each scene — OpenAI's image model or Gemini's.
  • A motion service to animate those stills into shots — fal fronts the current video models (Veo, Kling, and friends) behind one API.
  • A voice service for narration — ElevenLabs for the voiceover and the word-level timings.

Claude Code is the conductor. It plans the script, decides what each scene needs, and calls each service in turn through their APIs or MCP servers — the Model Context Protocol is what lets one agent talk to all of them with a consistent tool surface.

Figure: The from-scratch stack. Claude Code orchestrates OpenAI or Gemini, fal, ElevenLabs, and Remotion, with character refs, caption timing, and project state left for you to manage.

Wiring it up with Claude Code

For a short clip, the loop is clean and you can have it working in an afternoon:

  1. Generate the script, then break it into scenes.
  2. For each scene, generate a still with the image service.
  3. Animate the still into a shot with fal.
  4. Generate the narration and timings with ElevenLabs.
  5. Drop the shots, audio, and captions onto a Remotion timeline and render.
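Steps 2 and 3 of the loop above can be sketched as a plain function that takes the service calls as parameters. This is a minimal sketch, not a real integration: `Scene`, `Shot`, `build_shots`, `fake_still`, and `fake_clip` are hypothetical names, and the two stand-in functions only return file paths so the example runs without API keys. In a real build they would wrap whichever image and motion services you chose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scene:
    index: int
    narration: str      # text spoken over this scene
    image_prompt: str   # prompt for the still behind this scene


@dataclass
class Shot:
    scene: Scene
    still_path: str
    clip_path: str


def build_shots(
    scenes: list[Scene],
    generate_still: Callable[[Scene], str],
    animate_still: Callable[[str], str],
) -> list[Shot]:
    """Steps 2 and 3: one still and one animated clip per scene, in order."""
    shots = []
    for scene in scenes:
        still_path = generate_still(scene)
        clip_path = animate_still(still_path)
        shots.append(Shot(scene, still_path, clip_path))
    return shots


# Stand-ins (assumptions) so the sketch runs offline.
def fake_still(scene: Scene) -> str:
    return f"stills/{scene.index:03d}.png"


def fake_clip(still_path: str) -> str:
    return still_path.replace("stills/", "clips/").replace(".png", ".mp4")


scenes = [
    Scene(1, "The emperor enters.", "emperor walking into a throne room"),
    Scene(2, "He takes the throne.", "emperor seated on a throne"),
]
shots = build_shots(scenes, fake_still, fake_clip)
print([shot.clip_path for shot in shots])  # ['clips/001.mp4', 'clips/002.mp4']
```

Passing the service calls in as arguments keeps the loop itself independent of any vendor, which makes it easy to swap a model or test the flow without spending credits.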

For a 30-second video this is genuinely fun. The agent fans out across services, you stitch the results, and a renderer hands you an MP4. If that's the length you're making, you may not need anything else — Remotion or HyperFrames plus a few API calls is the right tool.

Where it breaks past 90 seconds

Stretch that same loop to a 10-minute video (call it 150 scenes, or about four seconds per scene) and three things stop being afterthoughts.

Consistency. The image model has no memory between calls, so recurring subjects drift from shot to shot: the emperor's face shifts, his armor changes color, the throne room is a different room. You fix it by managing reference images yourself: generate a clean reference for every recurring character and environment, store them, and pass the right ones alongside the prompt for every single shot. That bookkeeping is the job now, and it scales with the video.
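One way to keep that bookkeeping honest is a small lookup that maps each recurring character or environment to its stored reference image. The sketch below is an assumption about how you might structure it; `ReferenceLibrary`, `register`, and `for_shot` are hypothetical names, and the paths are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceLibrary:
    # Maps an entity name ("emperor", "throne_room") to its reference image path.
    refs: dict[str, str] = field(default_factory=dict)

    def register(self, name: str, path: str) -> None:
        self.refs[name] = path

    def for_shot(self, entities: list[str]) -> list[str]:
        """Return reference paths for every entity in a shot, in order."""
        missing = [name for name in entities if name not in self.refs]
        if missing:
            # Fail loudly rather than generate a shot with no reference.
            raise KeyError(f"no reference image for: {', '.join(missing)}")
        return [self.refs[name] for name in entities]


library = ReferenceLibrary()
library.register("emperor", "refs/emperor.png")
library.register("throne_room", "refs/throne_room.png")

shot_refs = library.for_shot(["emperor", "throne_room"])
print(shot_refs)  # ['refs/emperor.png', 'refs/throne_room.png']
```

Raising on a missing reference is a deliberate choice here: an unreferenced shot still renders, so the drift would otherwise only show up when you watch the final cut.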

Timing. Captions have to sit on the exact word being spoken. ElevenLabs will give you word-level timings, but you're the one converting every timing to frames and mapping each caption into the right scene's frame range in the renderer. One off-by-one error in a scene's frame offset and every subtitle after it is out of sync.
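As a rough illustration of the conversion, the sketch below turns word timings into caption frame ranges. It assumes the timings have already been reduced to a list of words with start and end times in seconds; `Word`, `to_frame`, and `caption_frames` are hypothetical names, and the `from` and `duration` keys are only loosely modeled on the `from` and `durationInFrames` props of Remotion's `Sequence`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the narration audio
    end: float


def to_frame(seconds: float, fps: int) -> int:
    """Convert an absolute timestamp to the nearest frame index."""
    return round(seconds * fps)


def caption_frames(words: list[Word], fps: int = 30, max_words: int = 4) -> list[dict]:
    """Group words into captions and express each one as a frame range."""
    captions = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = to_frame(group[0].start, fps)
        # Guarantee at least one frame so a very short caption still shows.
        end = max(to_frame(group[-1].end, fps), start + 1)
        captions.append({
            "text": " ".join(word.text for word in group),
            "from": start,
            "duration": end - start,
        })
    return captions


words = [
    Word("The", 0.0, 0.2),
    Word("emperor", 0.2, 0.7),
    Word("rises", 0.7, 1.1),
    Word("slowly", 1.1, 1.6),
    Word("today", 1.6, 2.0),
]
captions = caption_frames(words, fps=30)
for caption in captions:
    print(caption)
```

Every frame index is computed from the absolute timestamp rather than by adding up rounded per-scene lengths. The difference matters at length: truncating a 4.01-second scene at 30 fps drops 0.3 of a frame, and across 150 such scenes that adds up to 45 frames, or 1.5 seconds of drift.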

State. A 150-scene project doesn't fit in a context window. Which scenes are rendered? Which failed? Which references go where? Once the agent forgets, it re-runs work or skips it, and you're babysitting a spreadsheet of shot statuses instead of making a video.
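If you build this yourself, the usual answer is to keep scene status on disk rather than in the agent's context. Here is a minimal sketch, assuming a single JSON file and three statuses; the file name, the status values, and the helpers `load_manifest`, `mark`, and `todo` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

STATUSES = ("pending", "rendered", "failed")


def load_manifest(path: Path, scene_count: int) -> dict[str, str]:
    """Load saved scene statuses, or start every scene as pending."""
    if path.exists():
        return json.loads(path.read_text())
    return {str(i): "pending" for i in range(1, scene_count + 1)}


def mark(manifest: dict[str, str], path: Path, scene: int, status: str) -> None:
    """Record a scene's status and write the manifest to disk immediately."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    manifest[str(scene)] = status
    path.write_text(json.dumps(manifest, indent=2))


def todo(manifest: dict[str, str]) -> list[int]:
    """Scenes that still need work: anything not yet rendered."""
    return sorted(int(k) for k, v in manifest.items() if v != "rendered")


with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "scenes.json"
    manifest = load_manifest(manifest_path, scene_count=3)
    mark(manifest, manifest_path, 1, "rendered")
    mark(manifest, manifest_path, 2, "failed")

    # A fresh session knows nothing except what is on disk.
    resumed = load_manifest(manifest_path, scene_count=3)
    remaining = todo(resumed)

print(remaining)  # [2, 3]
```

Writing after every status change costs a little I/O but means a crashed or restarted session can pick up exactly where the last one stopped.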

None of this is hard in isolation. It's that the glue — refs, timings, scene state, retrying the shot that failed — becomes the whole project, and the creative work gets buried under plumbing.
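Retrying a failed shot is a typical piece of that glue. The helper below is a sketch under simple assumptions: every exception is treated as retryable, the delay doubles after each failure, and `with_retries` and `flaky_render` are hypothetical names. The sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff if it raises."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Stand-in for a motion-service call that times out twice, then succeeds.
calls = {"count": 0}


def flaky_render() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("motion service timed out")
    return "clips/007.mp4"


delays = []
result = with_retries(flaky_render, sleep=delays.append)
print(result, delays)  # clips/007.mp4 [1.0, 2.0]
```

In practice you would likely narrow the `except` clause to the errors your services actually raise for transient failures, so that a bad prompt or an invalid key fails fast instead of burning three attempts.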

How framesail removes the tedious part

framesail is the context hub that owns exactly that glue. It's not the agent and it's not a model — it's the workflow and state layer that's been built and tested specifically for long-form. You connect it to Claude Code (or Codex, or Gemini) with one MCP connection instead of wiring four services yourself.

Figure: framesail as the single context hub between the agent and a finished long-form MP4, managing references, timing, scene state, and model routing.

It generates a reference for each character and environment once and pulls the right ones into every shot, so your characters hold across all 150. It runs the voiceover through word-level timing and sets the captions for you. It tracks every scene's state server-side, so the agent can stop mid-project and resume in a fresh session without losing the thread. The image, motion, and voice models are routed for you, so you don't have to manage a separate API key for each vendor.

So the work that's left is the work worth doing: the concept, the script, the style, the calls on what each scene should feel like. The agent drives; framesail keeps minute 7 consistent with minute 1.

FAQ

Do I have to use Claude Code specifically?

No. Any agent that can call tools in a loop works — Claude Code, Codex, or the Gemini CLI. They all speak MCP, so the connection and the workflow are the same; only the agent differs.

Can't Remotion or HyperFrames do the whole thing?

They render the final video, and they're great at it — especially for short, data-driven pieces. What they don't do is manage character consistency, voiceover timing, or project state across dozens of scenes. That orchestration is the part you'd otherwise build by hand.

What actually breaks first on a long video?

Consistency. Within a handful of shots, a character or environment has drifted, because the image model doesn't remember the previous shot. Reference-image management is the fix, and it's the first thing that turns a fun afternoon into bookkeeping.

Does framesail replace my agent?

No — you bring the agent. framesail is the context hub it talks to: it manages references, timing, scene state, and model routing so the agent orchestrates a video instead of a pile of disconnected API calls.

Build it yourself, or skip the plumbing

If you're making short clips, the from-scratch stack is the right call — wire up a renderer and a couple of services and go. If you're making long-form, the glue is the work, and that's what we removed.

Connecting your agent is one command. Create an API key on your account page, then:

claude mcp add --transport http framesail https://api.framesail.com/mcp \
  --header "Authorization: Bearer YOUR_FRAMESAIL_KEY"

The account page also hands you a ready-made version with your key already filled in — plus a paste-to-agent prompt for Cursor, Codex, or anything else that speaks MCP. From there, tell the agent what video you want; the workflow docs cover the rest.

To try it, start a project.
