Veo 3.1 Prompting: The 6 Rules That Decide Quality
Six mechanical rules separate broadcast-quality Veo 3.1 output from amateur output. Most prompting guides miss Rule 1, the mode choice that silently kills audio.
- Dialogue audio only works reliably in text-to-video mode. Image-to-video silently kills audio.
- No quote marks around dialogue. Place dialogue in the MIDDLE of the prompt. Establish style FIRST.
- Lock characters with a 6-trait string (age/gender/hair/build/feature/attire) repeated verbatim every reference. End every prompt with 'No music. No subtitles.'
Google Veo 3.1 is the most under-used video generation model on the market in 2026. Cost-per-clip is roughly 45 cents for an 8-second 1080p generation on Vertex AI — significantly below Sora and competitive with Kling 3.0 — and prompt adherence has measurably improved with the 3.1 update. The problem isn't the model. It's that almost everyone is prompting it wrong, and most of the prompting advice circulating online is from the Veo 2 era when the rules were different.
Six prompting rules decide whether your Veo output looks cinematic or amateur. Get all six right and the model produces broadcast-grade output. Get one wrong — particularly the audio rule — and the output is unusable even though the model is doing its job correctly. This article walks through the six in order of importance.
Rule 1: Dialogue Only Works in Text-to-Video
The most common Veo failure is silent dialogue. The creator writes a prompt with a character saying a line, generates the clip, and the audio is missing or replaced with ambient noise. The reason is almost always that the creator used image-to-video mode.
In Veo 3.1, audio (including dialogue) is only generated reliably in text-to-video mode. Image-to-video mode focuses on motion and visual consistency from the starting frame — it can produce dialogue in some cases, but the failure rate is high, and the model often silently strips audio entirely.
If your prompt includes spoken lines, you must use text-to-video. Period. If you need a specific character or product to appear, describe the character in detail using rule 6 (the character lock) instead of starting from an image.
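If you queue generations from a script, Rule 1 can be encoded as a guard so nobody picks the wrong mode by accident. The sketch below is a hypothetical helper, not part of any Veo SDK: `choose_mode` and the mode labels are illustrative names, and the returned strings are not real API parameter values.

```python
def choose_mode(has_dialogue: bool, has_reference_image: bool) -> str:
    """Pick a generation mode following Rule 1 (hypothetical helper)."""
    if has_dialogue:
        # Spoken lines take priority: describe the subject in text
        # rather than starting from an image.
        return "text-to-video"
    if has_reference_image:
        # No dialogue, so starting from a frame is acceptable.
        return "image-to-video"
    return "text-to-video"


if __name__ == "__main__":
    print(choose_mode(has_dialogue=True, has_reference_image=True))
```

The point of the design is that dialogue always wins: even when a reference image exists, the helper falls back to text-to-video.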
Rule 2: No Quote Marks Around Dialogue
In Veo 2, dialogue was wrapped in quotes for clarity. In Veo 3.1, quotation marks around dialogue degrade audio generation quality. The model interprets them as part of the prompt syntax and sometimes treats them as instructions to suppress or modify audio.
Write dialogue inline without quotes. Use a colon, an em-dash, or just a natural sentence structure:
Wrong:
The barista says: "We're out of oat milk."
Right:
The barista, a late-twenties woman with tired eyes, says — we're out of oat milk — while gesturing apologetically at the empty container.
The shift feels awkward when you're used to standard screenplay conventions. The output quality difference is real. Test it once and the rule becomes obvious.
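If you are converting older prompts, a small pre-flight pass can catch leftover quote marks. This is a minimal sketch, assuming dialogue is wrapped in straight or curly double quotes. `quoted_spans` and `strip_dialogue_quotes` are hypothetical names, and apostrophes inside contractions are deliberately left alone.

```python
import re

# Straight (") and curly double quotes; single quotes/apostrophes are ignored.
QUOTED = re.compile('["\u201c]([^"\u201c\u201d]*)["\u201d]')


def quoted_spans(prompt: str) -> list[str]:
    """Return every double-quoted span found in the prompt."""
    return QUOTED.findall(prompt)


def strip_dialogue_quotes(prompt: str) -> str:
    """Drop the quote characters but keep the spoken words inline."""
    return QUOTED.sub(lambda m: m.group(1), prompt)


if __name__ == "__main__":
    old = 'The barista says: "We\'re out of oat milk."'
    print(quoted_spans(old))
    print(strip_dialogue_quotes(old))
```

The cleaned prompt keeps the colon and loses only the quote characters, which matches the colon form of the rule. Rewriting the line into a fuller sentence is still a manual step.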
Rule 3: Dialogue Lives in the Middle of the Prompt
The placement of dialogue within the prompt matters. Veo 3.1 generates more reliable audio when dialogue sits in the middle of the prompt, between the establishing scene description and the closing style notes.
Wrong (dialogue at start):
The character says — I never thought I'd come back here. He walks slowly into a dimly lit hallway with shafts of warm afternoon light.
Right (dialogue in middle):
A man in his mid-thirties walks slowly into a dimly lit hallway with shafts of warm afternoon light. He pauses, voice low and weighted — I never thought I'd come back here. The camera follows him at chest height as he continues down the corridor. Cinematic, shallow depth of field, warm grade.
The reason is structural — Veo's audio pipeline picks up dialogue cues most reliably when they're framed by visual context on both sides. Front-loaded dialogue gets parsed before the model has visual anchoring. End-loaded dialogue often gets clipped.
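The placement rule can be checked mechanically before you spend a generation. The sketch below is a rough heuristic, not anything Veo exposes. It assumes you know the spoken line, treats any complete sentence before it as scene context, and ignores the closing trailer when looking for context after it. `dialogue_placement` is a hypothetical name.

```python
import re

TRAILER = "No music. No subtitles."


def dialogue_placement(prompt: str, line: str) -> str:
    """Classify where a spoken line sits: start, middle, end, or missing."""
    idx = prompt.find(line)
    if idx == -1:
        return "missing"
    before = prompt[:idx]
    # The trailer is not visual context, so drop it before checking.
    after = prompt[idx + len(line):].replace(TRAILER, "")
    # A sentence terminator followed by whitespace means a full sentence precedes the line.
    has_scene_before = re.search(r"[.!?]\s", before) is not None
    # Any remaining letters mean something follows the line.
    has_context_after = re.search(r"[A-Za-z]", after) is not None
    if not has_scene_before:
        return "start"
    if not has_context_after:
        return "end"
    return "middle"


if __name__ == "__main__":
    line = "I never thought I'd come back here"
    wrong = "The character says \u2014 " + line + ". He walks slowly into a dimly lit hallway."
    print(dialogue_placement(wrong, line))
```

Anything other than `middle` is a cue to reorder the prompt before generating.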
Rule 4: "No Music. No Subtitles." Trailer
End every prompt with the explicit trailer 'No music. No subtitles.' unless you specifically want music or subtitles.
By default, Veo 3.1 frequently adds an ambient music bed and occasionally inserts burned-in subtitles. Neither is usually wanted. With the trailer present, the model reliably leaves both out; without it, you can't count on either being absent.
This single rule fixes more bad outputs than any other. If you generate a clip with unexpected music or text overlays, check whether your prompt had the trailer. The answer is almost always no.
Use the exact phrasing: two short sentences, each ending in a period, placed together at the very end of the prompt:
[...rest of prompt...]. Cinematic, shallow depth of field, warm grade. No music. No subtitles.
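Because the trailer is a fixed string, it is easy to append automatically. Here is a minimal sketch, assuming prompts are assembled as plain strings. `ensure_trailer` is a hypothetical helper name.

```python
TRAILER = "No music. No subtitles."


def ensure_trailer(prompt: str) -> str:
    """Append the trailer unless the prompt already ends with it."""
    body = prompt.strip()
    if body.endswith(TRAILER):
        return body
    # Close the last sentence so the trailer stands on its own.
    if body and body[-1] not in ".!?":
        body += "."
    return f"{body} {TRAILER}".strip()


if __name__ == "__main__":
    print(ensure_trailer("Cinematic, shallow depth of field, warm grade"))
```

The function is idempotent, so running it over a prompt that already carries the trailer changes nothing.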
Rule 5: Establish Style Early
Veo 3.1 weights early prompt tokens more heavily than late ones for visual style decisions. If you want a specific look (cinematic, documentary, animated, lo-fi, neon-noir), establish it in the first sentence of the prompt, not at the end.
Wrong:
A woman walks through a market. Stalls of fresh fruit line the path. A vendor hands her a peach. She smiles and continues walking. Shot in cinematic style with warm color grading and shallow depth of field.
Right:
Cinematic short film with warm color grading and shallow depth of field. A woman walks through a vibrant outdoor market. Stalls of fresh fruit line the path. A vendor hands her a peach. She smiles and continues walking.
The fix is mechanical — move the style descriptors to the front. The output difference is significant. Style descriptors at the end often get treated as optional notes; at the start they shape every frame.
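Since the fix is a reorder, it can be scripted. The sketch below is a simple heuristic under stated assumptions: sentences are split on terminal punctuation, and a sentence counts as a style note if it contains one of a few illustrative keywords. `STYLE_TERMS` and `move_style_first` are hypothetical names, and keyword matching will misfire on sentences that use these words in another sense.

```python
import re

# Illustrative keywords only; extend for your own style vocabulary.
STYLE_TERMS = ("cinematic", "documentary", "animated", "neon-noir")


def _is_style(sentence: str) -> bool:
    lowered = sentence.lower()
    return any(term in lowered for term in STYLE_TERMS)


def move_style_first(prompt: str) -> str:
    """Move sentences that mention a style keyword to the front, preserving order."""
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    style = [s for s in sentences if _is_style(s)]
    rest = [s for s in sentences if not _is_style(s)]
    return " ".join(style + rest)


if __name__ == "__main__":
    print(move_style_first(
        "A woman walks through a market. A vendor hands her a peach. "
        "Shot in cinematic style with warm color grading."
    ))
```

A prompt that already leads with its style sentence passes through unchanged.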
Rule 6: Lock Your Character With a Six-Trait String
If your generation needs character consistency — across multiple clips, or even within a single clip with multiple shots — lock the character with a six-trait descriptor and repeat it verbatim every time the character is referenced.
The six traits are: age, gender, hair, build, distinctive feature, attire. Example:
Mid-thirties woman, long dark hair pulled into a low ponytail, athletic build, faint freckles across the bridge of her nose, wearing a charcoal wool coat over a cream turtleneck.
This six-trait string becomes the character's identity for Veo. Every time you reference the character in a multi-clip narrative, paste the identical string. Drift in even one trait — "dark brown hair" vs "long dark hair" — produces a different-looking character, breaking continuity.
For a multi-shot single clip, the six-trait lock is what holds the character's face and silhouette consistent through camera moves. For a multi-clip narrative, it's what makes the character in clip 2 match the one in clip 1.
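Retyping the string by hand invites drift, so it helps to store the six traits once and render them the same way every time. This is a minimal sketch: `CharacterLock` and `prompts_missing_lock` are hypothetical names, and the comma-separated layout simply mirrors the example string above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterLock:
    """The six traits, stored once so the rendered string never drifts."""
    age: str
    gender: str
    hair: str
    build: str
    feature: str
    attire: str

    def render(self) -> str:
        return (f"{self.age} {self.gender}, {self.hair}, {self.build}, "
                f"{self.feature}, wearing {self.attire}")


def prompts_missing_lock(prompts: list[str], lock: CharacterLock) -> list[int]:
    """Return indexes of prompts that do not contain the lock string verbatim."""
    text = lock.render()
    return [i for i, prompt in enumerate(prompts) if text not in prompt]


if __name__ == "__main__":
    lock = CharacterLock(
        age="Mid-thirties", gender="woman",
        hair="long dark hair pulled into a low ponytail",
        build="athletic build",
        feature="faint freckles across the bridge of her nose",
        attire="a charcoal wool coat over a cream turtleneck",
    )
    print(lock.render())
```

Running `prompts_missing_lock` over a whole storyboard flags any clip where a trait was paraphrased instead of pasted.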
How the Rules Compound
The six rules work together. Each one in isolation produces a noticeable improvement. Stacked, they produce broadcast-quality output that's hard to distinguish from filmed footage.
A correctly-prompted Veo 3.1 generation in 2026 starts with style descriptors, establishes the scene with concrete visuals, places dialogue in the middle with no quote marks and a tight character-lock, and ends with the no-music-no-subtitles trailer. That single prompt structure, applied consistently, is the difference between Veo-as-toy and Veo-as-production-tool.
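That structure can be captured in a small builder so every prompt comes out in the same order. The sketch below is one possible arrangement, not an official template: `build_prompt` and its four arguments are hypothetical, and the character-lock string is assumed to be written into `scene` or `dialogue` by the caller.

```python
TRAILER = "No music. No subtitles."
QUOTE_CHARS = '"\u201c\u201d'


def _sentence(text: str) -> str:
    """Trim a fragment and make sure it ends with terminal punctuation."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


def build_prompt(style: str, scene: str, dialogue: str, follow_up: str) -> str:
    """Assemble style, scene, dialogue, follow-up action, then the trailer."""
    if any(q in dialogue for q in QUOTE_CHARS):
        raise ValueError("dialogue must not contain quote marks")
    parts = [style, scene, dialogue, follow_up]
    body = " ".join(_sentence(p) for p in parts if p.strip())
    return f"{body} {TRAILER}"


if __name__ == "__main__":
    print(build_prompt(
        style="Cinematic, shallow depth of field, warm grade",
        scene="A man in his mid-thirties walks slowly into a dimly lit hallway",
        dialogue="He pauses, voice low, and says I never thought I'd come back here",
        follow_up="The camera follows him at chest height",
    ))
```

Keeping `follow_up` as its own argument is what guarantees the dialogue has visual context on both sides.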
The cost math reinforces the value. At 45 cents per 8-second clip, a roughly 60-second video assembled from eight clips costs $3.60. That's broadcast-quality video for under $4. The economics only work if your hit rate is high, and the six rules are what move the hit rate from 20% (untrained) to 70%+ (trained).
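To see why hit rate matters so much, it helps to work the numbers. The sketch below uses the article's 45-cent figure and assumes each generation succeeds independently with a fixed probability, so the expected number of attempts per usable clip is 1 divided by the hit rate. That independence assumption is mine, not a measured property of Veo.

```python
COST_PER_CLIP = 0.45  # USD per 8-second clip, the figure quoted above


def video_cost(clips: int) -> float:
    """Cost if every clip is usable on the first try."""
    return clips * COST_PER_CLIP


def cost_per_usable_clip(hit_rate: float) -> float:
    """Expected spend per keeper, assuming independent attempts."""
    if not 0 < hit_rate <= 1:
        raise ValueError("hit_rate must be in (0, 1]")
    return COST_PER_CLIP / hit_rate


def expected_video_cost(clips: int, hit_rate: float) -> float:
    """Expected spend for a video that needs `clips` usable clips."""
    return clips * cost_per_usable_clip(hit_rate)


if __name__ == "__main__":
    for rate in (0.2, 0.7):
        print(f"hit rate {rate:.0%}: "
              f"${cost_per_usable_clip(rate):.2f} per usable clip, "
              f"${expected_video_cost(8, rate):.2f} for eight clips")
```

Under these assumptions a usable clip costs about $2.25 at a 20% hit rate and about $0.64 at 70%, so the same eight-clip video drops from roughly $18 to roughly $5.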
The complete Veo 3.1 prompt playbook — including the JSON-formatted prompt template that ChatGPT and Claude can fill in automatically, 12 paste-and-ship recipes for common shot types (product reveal, character intro, atmospheric establishing, dialogue-heavy scene), and the failure-mode debugging chart — is inside Veo for Creators ($6.99).
Common Veo 3.1 Failure Modes (And Which Rule Fixes Each)
- Silent dialogue → Rule 1 (use text-to-video, not image-to-video)
- Garbled or robotic audio → Rule 2 (remove quote marks) + Rule 3 (move dialogue to middle)
- Unwanted music bed → Rule 4 (add the trailer)
- Burned-in subtitles you didn't ask for → Rule 4 (add the trailer)
- Generic-looking output → Rule 5 (move style descriptors to the start)
- Character changes appearance across cuts → Rule 6 (lock the six-trait string)
If you're debugging a bad Veo output, walk through this list in order. The fix is almost always one of the six.
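If you review clips in bulk, the same mapping works as a lookup table in a QA script. This is a minimal sketch with hypothetical names (`FIXES`, `RULES`, `suggest_fixes`). The symptom keys are shorthand for the failure modes above and have to be assigned by a human reviewer.

```python
# Rule number -> the fix it prescribes.
RULES = {
    1: "use text-to-video, not image-to-video",
    2: "remove quote marks",
    3: "move dialogue to middle",
    4: "add the trailer",
    5: "move style descriptors to the start",
    6: "lock the six-trait string",
}

# Symptom -> rules to check, in order.
FIXES = {
    "silent dialogue": [1],
    "garbled audio": [2, 3],
    "unwanted music": [4],
    "burned-in subtitles": [4],
    "generic look": [5],
    "character drift": [6],
}


def suggest_fixes(symptom: str) -> list[str]:
    """Return the fixes for a symptom, or an empty list if it is unknown."""
    return [f"Rule {n}: {RULES[n]}" for n in FIXES.get(symptom, [])]


if __name__ == "__main__":
    print(suggest_fixes("garbled audio"))
```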
When to Use Veo vs Sora vs Kling
Three video models compete for the operator dollar in 2026. Veo 3.1 is best for cinematic, character-driven, dialogue-heavy content. Sora is best for surreal, complex-motion, and longer-form generation. Kling 3.0 is strong on photoreal humans and physical-motion accuracy. The cost-per-clip ranking puts Veo first, Kling second, Sora third.
For most operators producing marketing video, YouTube Shorts content, ad creative, or product B-roll, Veo 3.1 is the right default in 2026. The six rules close the prompting-skill gap that previously made Veo feel inferior. With the rules applied, Veo produces output that's competitive with everything else on the market at roughly half the cost.
Frequently Asked Questions
Why does my Veo clip have no audio?
Almost certainly Rule 1 — you generated with image-to-video instead of text-to-video. Audio is reliable only in text-to-video mode.
Can Veo 3.1 generate clips longer than 8 seconds?
Yes. Veo supports extend operations via the `predictLongRunning` endpoint, adding 7-second extensions. Chain length is undocumented, but one or two extensions on an 8-second base (roughly 15 to 22 seconds total) are reliable.
What's the difference between Veo 3 and Veo 3.1?
3.1 improved prompt adherence, audio generation reliability, and character consistency. Most "Veo prompting tips" online are still from the 3.0 era, so be skeptical of older guides.
Can I use generated Veo clips commercially?
Yes — Google Vertex AI terms allow commercial use of Veo output. Standard policy applies; check current terms before high-stakes commercial deployment.
What's the actual per-clip cost in 2026?
Roughly $0.45 per 8-second 1080p clip on Vertex AI. Extensions cost the same per 7-second segment added.
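Length and cost for an extended clip follow directly from those figures. The sketch below only does the arithmetic, using the 8-second base, 7-second extension, and $0.45-per-segment numbers quoted in this article. `extended_clip` is a hypothetical helper and makes no API calls.

```python
BASE_SECONDS = 8          # length of the initial generation
EXTENSION_SECONDS = 7     # length added by each extension
COST_PER_SEGMENT = 0.45   # USD, same price for base clip and each extension


def extended_clip(extensions: int) -> tuple[int, float]:
    """Return (total seconds, total cost in USD) for a clip with N extensions."""
    if extensions < 0:
        raise ValueError("extensions must be zero or more")
    seconds = BASE_SECONDS + EXTENSION_SECONDS * extensions
    cost = round(COST_PER_SEGMENT * (1 + extensions), 2)
    return seconds, cost


if __name__ == "__main__":
    for n in range(3):
        seconds, cost = extended_clip(n)
        print(f"{n} extension(s): {seconds}s for ${cost:.2f}")
```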
Why do my Veo characters look different in each cut?
Rule 6 — you need to lock the six-trait descriptor (age, gender, hair, build, distinctive feature, attire) and repeat it verbatim every time the character is referenced.
Is Veo better than Sora?
Depends on use case. Veo wins on dialogue, cinematic look, and cost. Sora wins on surreal/complex motion and longer single-clip generations.