How to create consistent AI characters in long-form videos
Why characters drift across shots in long-form AI video, how reference images solve most of it, and where the current models (Nano Banana 2, GPT Image-2) stand.
May 13, 2026
In long-form AI video, characters drift. The face in shot 7 isn't quite the face from shot 2: the jaw is slightly wider, the eyes are a different color, the jacket has lost a button. Each shot looks fine on its own, but cut together, the audience knows something is off. It's a frustrating way to work, and you end up burning extra tokens regenerating shots to chase the original face.
The fix is reference images. Instead of describing the character in text on every shot, generate or pick a single image of the character and pass it into every generation as a reference. The text describes what's changing in the shot; the image carries the identity. With a locked reference, the drift you used to see across 40 shots is largely gone — outfits still drift a bit more than faces, and extreme angles can strain the reference, but identity holds.
Where the models are right now
Nano Banana 2 (Gemini's image-conditioning model) was the first model where a reference image plus a prompt like "same character, profile view, walking" actually held together. Face locked, outfit mostly held, pose changed cleanly.
GPT Image-2 pushed it further. Multi-reference handling is tighter — pass in a character image plus a target pose reference and the output respects both. Outfits hold across pose changes. It still strains on extreme angles, but doesn't break identity when it does.
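To make the multi-reference idea concrete, here's a minimal sketch. The client object, model name, and `reference_images` parameter are stand-ins for illustration, not GPT Image-2's actual SDK; the point is the shape of the call: one image fixes who the character is, a second fixes the pose, and the prompt carries only what changes.

```python
# Hypothetical call shape: `client`, the model name, and `reference_images`
# are placeholders, not a real SDK. One reference locks identity, a second
# locks the target pose, and the prompt describes only the shot.

def render_shot(client, identity_ref: str, pose_ref: str, prompt: str) -> bytes:
    result = client.images.generate(
        model="your-reference-conditioned-model",
        reference_images=[
            {"path": identity_ref, "role": "identity"},  # who this is
            {"path": pose_ref, "role": "pose"},          # how they're posed
        ],
        prompt=prompt,  # only what changes in this shot
    )
    return result.image_bytes

# e.g. render_shot(client, "refs/sarah.png", "refs/pose_profile_walk.png",
#                  "profile view, walking, rainy street at night")
```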
Each generation of these models gets better at separating what stays (the character) from what changes (the shot).
The workflow
- Get one clean reference image of the character.
- Pass it into every shot. Let the text describe what's changing; let the reference carry the identity.
- Re-render anything that drifts.
One prompt tip: describe what's changing in the shot, not the character. Instead of "Sarah, brown hair, denim jacket, sitting at a desk, morning light" on every shot, write "sitting at a desk, morning light." The less you re-describe the character in text, the less the model has to reinterpret.
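Put together, the loop is short. This is a sketch, not a specific SDK: `render_with_reference` stands in for whatever reference-conditioned model you're calling, and the file paths are made up. The character reference is fixed once; only the prompt changes from shot to shot.

```python
# Sketch of the single-character workflow. `render_with_reference` is a
# stand-in for whatever reference-conditioned model you call; paths are
# illustrative. The reference never changes; only the prompt does.

CHARACTER_REF = "refs/sarah.png"  # generated once, reused for every shot

shot_prompts = [
    "sitting at a desk, morning light",
    "standing at the window, phone in hand",
    "walking out the front door, overcast afternoon",
]

for i, prompt in enumerate(shot_prompts, start=1):
    image_bytes = render_with_reference(
        reference=CHARACTER_REF,  # the image carries identity
        prompt=prompt,            # the text carries only the change
    )
    with open(f"out/shot_{i:02d}.png", "wb") as f:
        f.write(image_bytes)
```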
Where it gets tedious
That's the simple case — one character, one environment. Long-form video usually isn't that simple.
A typical scene has multiple characters in the same shot, and each one needs its own reference passed in. Those characters move through multiple environments, and each environment has to stay consistent across shots too — the kitchen in scene 3 has to be the same kitchen in scene 11. Then there are props that recur: a phone, a car, a specific painting on the wall. Each one is another reference to track.
Multiply by 80 shots and it becomes a folder of reference images and a mental map of which ones go where. Miss one and you get a drift bug in that shot, usually caught only when you cut the video together.
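One way to take that mental map out of your head is a small manifest: each shot lists the characters, environment, and props that appear in it, and a lookup resolves those names to reference files before anything is generated, so a missing reference fails up front instead of showing up as drift in the final cut. Everything below is illustrative: names, paths, and structure are made up.

```python
# Sketch of reference bookkeeping. Each shot declares what appears in it;
# resolve_refs() turns those names into reference image paths and fails
# loudly on anything missing. All names and paths are illustrative.

REFS = {
    "characters":   {"sarah": "refs/characters/sarah.png",
                     "marco": "refs/characters/marco.png"},
    "environments": {"kitchen": "refs/environments/kitchen.png",
                     "office":  "refs/environments/office.png"},
    "props":        {"phone": "refs/props/phone.png"},
}

shots = [
    {"id": 31, "characters": ["sarah", "marco"], "environment": "kitchen",
     "props": ["phone"], "prompt": "arguing across the kitchen island, evening light"},
    {"id": 32, "characters": ["sarah"], "environment": "office",
     "props": [], "prompt": "alone at her desk, reading a message"},
]

def resolve_refs(shot: dict) -> list[str]:
    """Collect every reference image this shot needs."""
    wanted = [("characters", name) for name in shot["characters"]]
    wanted.append(("environments", shot["environment"]))
    wanted += [("props", name) for name in shot["props"]]

    paths = []
    for kind, name in wanted:
        path = REFS[kind].get(name)
        if path is None:  # in practice, also check the file exists on disk
            raise ValueError(f"shot {shot['id']}: no reference for {kind[:-1]} '{name}'")
        paths.append(path)
    return paths

for shot in shots:
    refs = resolve_refs(shot)
    # pass `refs` plus shot["prompt"] to the reference-conditioned model here
    print(shot["id"], refs)
```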
How framesail handles it
framesail is built for this. Create each character, environment, and prop once, and we generate and manage their reference images. We automatically pull the right references into every shot — the right character refs for who's in the scene, the right environment ref for where it's set, the right props if they appear. Consistency holds across the whole video without tracking the file structure by hand.
To try it, start a project.