The Veo 3.1 Audio Bug That Silently Kills Your Dialogue (And the 30-Second Fix)

You wrote dialogue. You hit generate. The video plays — silent. This is the Veo 3.1 audio bug nobody mentions in the docs, and the one-line fix.

By Cameron Jo'van · 8 min read

TL;DR
  • Image-to-video on Veo 3.1 silently strips audio in most operator setups. Text-to-video is the path that reliably generates dialogue audio.
  • If you need audio + a specific visual starting point, use text-to-video with a detailed scene description, not image-to-video with an uploaded reference.
  • Other audio killers: quote marks around dialogue, dialogue at the start or end of the prompt, and a missing 'No music. No subtitles.' trailer.

The most common Veo 3.1 frustration in 2026: an operator writes a perfect prompt with dialogue, hits generate, and gets back a beautifully rendered silent video. No audio. No error message. No indication of what went wrong.

The cause is almost always the same: the operator used image-to-video with an uploaded reference image. That mode silently strips audio in most setups. The workaround is text-to-video, where audio synthesis reliably triggers.

This article is the full breakdown of the audio bug, why it happens, and the four other rules that determine whether your dialogue makes it into the final render.

The Bug

In Veo 3.1, two generation paths exist:

Text-to-video: prompt describes the scene; model generates from scratch. Audio works reliably.

Image-to-video: prompt + an uploaded reference image; model generates motion starting from the image. Audio is silently stripped most of the time.

The bug is undocumented. Google's official docs don't flag it. Most prompting guides written by people who haven't actually shipped Veo content don't mention it. Operators discover it by burning ~$1 in failed generations before realizing the pattern.

Why It Happens

Educated guess (Google hasn't published the reason): the image-conditioning pipeline routes through a different generation graph than the text-only path. The audio synthesis subsystem isn't wired into the image-conditioning path the same way it's wired into the text-only path. The result is that even when you include dialogue in the prompt, audio generation fails silently.

This may get fixed in a future Veo release. Don't plan around the fix. Plan around the current behavior.

The Fix

For any generation requiring dialogue audio: use text-to-video, not image-to-video.

If you need a specific visual starting point, describe it in extreme detail in the prompt instead of uploading a reference image. Veo's text-to-video pathway responds well to detailed visual descriptions:

"Cinematic short film with warm color grading and shallow depth of field. A woman in her mid-thirties, long dark hair pulled into a low ponytail, athletic build, faint freckles, wearing a charcoal wool coat over a cream turtleneck, stands in a dimly lit hallway with shafts of warm afternoon light falling across her face. She speaks quietly, voice low and weighted — I never thought I'd come back here. The camera holds at chest height for two beats, then follows her slowly down the corridor. No music. No subtitles."

Run through text-to-video, that prompt produces dialogue audio reliably. Run through image-to-video, even with a matching reference image, it produces audio inconsistently at best.

The Five Audio Rules (Including The Bug)

Rule 1 — Use text-to-video for dialogue. The single biggest determinant of whether audio generates.

Rule 2 — No quote marks around dialogue. Veo's parser treats quoted text as a token that degrades audio synthesis. Write dialogue inline, set off by em-dashes or commas:

  • ❌ Bad: She says, "I never thought I'd come back here."
  • ✅ Good: She speaks quietly, voice low and weighted — I never thought I'd come back here.

Rule 3 — Dialogue lives in the middle of the prompt. Front-loaded or trailing dialogue parses worse than dialogue embedded after the scene/character setup but before the camera notes.

  • Order: Style → Character → Action setup → Dialogue → Camera direction → "No music. No subtitles."

Rule 4 — End every prompt with "No music. No subtitles." Veo defaults to adding both. The trailer suppresses them. Without it, you'll get unwanted background music and burned-in subtitle text — both of which compete with or obscure your dialogue.

Rule 5 — Establish style early. Style tokens at the start of the prompt weight more heavily than the same tokens at the end. "Cinematic short film with warm color grading and shallow depth of field" at the start works; the same phrase at the end is mostly ignored.

These five rules together lift dialogue-generation hit rate from ~25% (untrained operator) to ~70%+ (trained operator).
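
The ordering from Rules 3 to 5 can be sketched as a small prompt builder. This is a minimal illustration in Python, not part of any Veo SDK: `build_prompt` and its parameter names are hypothetical, and the only thing it enforces is the section order, the inline dialogue style from Rule 2, and the trailer from Rule 4.

```python
TRAILER = "No music. No subtitles."


def build_prompt(style: str, character: str, action: str,
                 delivery: str, dialogue: str, camera: str) -> str:
    """Assemble a prompt in the order: style, character, action setup,
    dialogue, camera direction, trailer."""
    # Rule 2: reject straight or curly double quotes in the spoken line.
    if any(q in dialogue for q in '"\u201c\u201d'):
        raise ValueError("Rule 2: remove quote marks from the dialogue")

    def sentence(text: str) -> str:
        # Make sure each section ends as a complete sentence.
        text = text.strip()
        return text if text.endswith((".", "!", "?")) else text + "."

    # Dialogue is written inline, set off from the delivery note
    # by an em-dash (\u2014), as in the article's example prompt.
    spoken = f"{delivery.strip()} \u2014 {dialogue.strip()}"

    # Rule 5 puts style first; Rule 3 puts dialogue before camera notes.
    parts = [style, character, action, spoken, camera]
    return " ".join(sentence(p) for p in parts) + " " + TRAILER
```

Called with the style, character, and camera text from the example prompt earlier, this returns a single string with the dialogue sitting between the scene setup and the camera direction, and the trailer as the final two sentences.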

The Cost of Skipping These Rules

A typical untrained Veo session: 4-5 generations to land one usable dialogue clip. At $0.45 per clip, that's $1.80-2.25 per usable clip in API spend, plus 20-30 minutes of frustration.

A trained operator session: 1-2 generations to land one usable clip. That's $0.45-0.90 per usable clip plus 5-10 minutes of work.

Across 50 clips for a typical YouTube Short or marketing video project, the difference is $45+ in API spend and 5+ hours of time. Per project.
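
The arithmetic behind these figures can be checked in a few lines of Python. The inputs are this article's own numbers ($0.45 per clip, 4-5 versus 1-2 attempts, 50 clips); the function and variable names are only illustrative.

```python
PRICE_PER_CLIP = 0.45  # USD per generation, the figure used in this article


def cost_per_usable_clip(attempts: int, price: float = PRICE_PER_CLIP) -> float:
    """API spend to land one usable clip after `attempts` generations."""
    return round(attempts * price, 2)


untrained = (cost_per_usable_clip(4), cost_per_usable_clip(5))  # (1.8, 2.25)
trained = (cost_per_usable_clip(1), cost_per_usable_clip(2))    # (0.45, 0.9)

clips = 50
# Narrowest gap: best untrained case against worst trained case.
min_saving = round(clips * (untrained[0] - trained[1]), 2)  # 45.0
# Widest gap: worst untrained case against best trained case.
max_saving = round(clips * (untrained[1] - trained[0]), 2)  # 90.0
```

So the "$45+" figure is the conservative end of the range: under the same assumptions, the gap can run as high as $90 per 50-clip project.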

What Image-to-Video IS Good For

Image-to-video isn't useless — it's just not the right path for dialogue. Where it works well:

  • Motion-only generations (no audio needed): animating a still image, adding camera movement to a photograph, bringing a painting to life
  • B-roll without dialogue: ambient footage, product reveals, landscape pans
  • Character consistency starting points: establishing a character via a reference image, then continuing the scene via subsequent text-to-video generations with the 6-trait character lock

For these uses, image-to-video is the right tool. For anything requiring synthesized dialogue, switch to text-to-video.

The Hybrid Workflow

For projects that need both a specific visual and dialogue, the hybrid that works is:

  1. Generate the visual base with image-to-video (no dialogue in prompt) — gets you the specific starting image animated
  2. Strip the silent audio track
  3. Generate dialogue audio separately with ElevenLabs Professional Voice Clone (your own cloned voice) or OpenAI Voice
  4. Overlay in post in your editor (DaVinci, Premiere, Final Cut)

Total cost per clip: ~$0.45 (Veo) + ~$0.10 (ElevenLabs) = ~$0.55. Adds 5-10 minutes of post work but gives you both the specific visual AND clean dialogue.

For projects where the specific starting image matters more than maximum production speed, this hybrid is the right path.
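
Steps 2 and 4 of the hybrid can also be done from the command line instead of an editor. The sketch below only builds ffmpeg argument lists and assumes ffmpeg is installed; the helper names and file names are placeholders, and it does nothing about lip sync or timing, which still need a manual check.

```python
import subprocess


def strip_audio_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg arguments that copy the video stream and drop all audio."""
    return ["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", dst]


def overlay_voice_cmd(video: str, voice: str, dst: str) -> list[str]:
    """ffmpeg arguments that mux a separate voice track onto the video."""
    return [
        "ffmpeg", "-y",
        "-i", video,       # input 0: the silent Veo render
        "-i", voice,       # input 1: dialogue audio from the voice tool
        "-map", "0:v:0",   # take the video stream from input 0
        "-map", "1:a:0",   # take the audio stream from input 1
        "-c:v", "copy",    # keep the video as-is, no re-encode
        "-c:a", "aac",     # encode the audio for an MP4 container
        "-shortest",       # stop at the shorter of the two streams
        dst,
    ]


def run(cmd: list[str]) -> None:
    """Execute a command and raise if ffmpeg exits with an error."""
    subprocess.run(cmd, check=True)
```

A typical sequence would be `run(strip_audio_cmd("veo_clip.mp4", "silent.mp4"))` followed by `run(overlay_voice_cmd("silent.mp4", "dialogue.mp3", "final.mp4"))`. Because the second command maps audio only from the voice file, the strip step is optional if you go straight to the overlay.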

The Character Lock Bonus

Even within text-to-video, character consistency across multiple clips matters for any narrative content. The 6-trait character lock pattern keeps the same character across separate generations.

Lock these six traits per character and repeat them verbatim in every prompt that features that character:

  1. Age (specific, e.g., "mid-thirties")
  2. Gender
  3. Hair (color, length, style)
  4. Build (athletic, slim, stocky, etc.)
  5. Distinctive feature (freckles, scar, specific accessory)
  6. Attire (the outfit they wear in this scene)

When the six traits are repeated verbatim across 8-12 clips, the same character appears recognizably in each, which allows real narrative continuity.
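
One way to guarantee the "verbatim" part is to store the six traits once and render them from a single place. The sketch below is a hypothetical helper (the `CharacterLock` class and its field names are not from any Veo tooling); the trait values are taken from the example prompt earlier in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: traits cannot drift between clips
class CharacterLock:
    age: str
    gender: str
    hair: str
    build: str
    distinctive_feature: str
    attire: str

    def describe(self) -> str:
        """Render the six traits as one fixed phrase for reuse in every prompt."""
        return (f"A {self.gender} in {self.age}, {self.hair}, "
                f"{self.build}, {self.distinctive_feature}, "
                f"wearing {self.attire}")


lead = CharacterLock(
    age="her mid-thirties",
    gender="woman",
    hair="long dark hair pulled into a low ponytail",
    build="athletic build",
    distinctive_feature="faint freckles",
    attire="a charcoal wool coat over a cream turtleneck",
)
```

Every prompt in the project then starts its character section with `lead.describe()`, so a typo or a reworded trait cannot creep in between clip 3 and clip 9.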

The Trained-Operator Veo Workflow

Putting it all together, the trained-operator Veo workflow for dialogue-heavy content is:

  1. Write the prompt using the 5 audio rules (text-to-video, no quotes, middle dialogue, "No music. No subtitles." trailer, style first)
  2. Lock the character with the 6-trait pattern
  3. Run a 4-second test generation first ($0.30) to verify audio
  4. If audio works, run the full 8-second generation
  5. If audio fails, debug (usually one of the 5 rules was violated)
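
The "verify audio" check in step 3 can be automated rather than done by ear. The sketch below assumes ffmpeg is installed and uses its `volumedetect` filter, which prints a `max_volume` line to stderr. The -60 dB cutoff and the helper names are assumptions for illustration, so tune the threshold against your own clips.

```python
import re
import subprocess
from typing import Optional

SILENCE_THRESHOLD_DB = -60.0  # assumed cutoff, not an official value


def parse_max_volume(ffmpeg_stderr: str) -> Optional[float]:
    """Extract max_volume (dB) from volumedetect output, or None if absent."""
    match = re.search(r"max_volume:\s*(-?inf|-?\d+(?:\.\d+)?)\s*dB",
                      ffmpeg_stderr)
    if match is None:
        return None  # no audio stream was analysed
    return float(match.group(1))


def has_audible_audio(ffmpeg_stderr: str,
                      threshold_db: float = SILENCE_THRESHOLD_DB) -> bool:
    """True only if a peak level was reported and it is above the cutoff."""
    peak = parse_max_volume(ffmpeg_stderr)
    return peak is not None and peak > threshold_db


def probe(path: str) -> str:
    """Run ffmpeg's volumedetect filter on a file and return its stderr report."""
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-i", path, "-vn",
         "-af", "volumedetect", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    return result.stderr
```

In use, `has_audible_audio(probe("test_4s.mp4"))` returning `False` on the 4-second test means the clip has either no audio stream or an effectively silent one, so the full 8-second run can be skipped.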

The full Veo prompting playbook including the 6 rules, 12 paste-and-ship shot recipes, the failure-mode debugging chart, and the character-lock template is in Veo for Creators ($6.99). Most operators recoup the cost on the first project where the rules save 5+ failed renders.

The actionable next step: if you have an active Veo generation that came back silent, check the prompt against the 5 rules. The fix is usually one rule away. Apply it and re-run in text-to-video; at the ~70%+ hit rate above, usable audio typically arrives within one or two attempts.
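
Checking a prompt against the five rules can be partly mechanized. The linter below is a rough sketch: `lint_prompt` is a hypothetical helper, and its Rule 3 test is only a heuristic (at least one full sentence before the dialogue, and some text between the dialogue and the trailer). It catches the obvious violations, not every one.

```python
TRAILER = "No music. No subtitles."
QUOTE_CHARS = '"\u201c\u201d'  # straight and curly double quotes


def lint_prompt(prompt: str, dialogue: str, style: str = "",
                mode: str = "text-to-video") -> list[str]:
    """Return the rules a prompt appears to violate (empty list = none found)."""
    problems = []
    text = prompt.strip()

    if mode != "text-to-video":
        problems.append("Rule 1: use text-to-video for dialogue")

    if any(ch in text for ch in QUOTE_CHARS):
        problems.append("Rule 2: remove quote marks around dialogue")

    start = text.find(dialogue)
    if start == -1:
        problems.append("Rule 3: dialogue text not found in prompt")
    else:
        before = text[:start]
        after = text[start + len(dialogue):].strip()
        if after.endswith(TRAILER):
            after = after[: -len(TRAILER)].strip()
        # Heuristic: setup sentence(s) before, camera notes after.
        if "." not in before or not after:
            problems.append("Rule 3: move dialogue between setup and camera notes")

    if not text.endswith(TRAILER):
        problems.append("Rule 4: end the prompt with the trailer")

    if style and not text.startswith(style):
        problems.append("Rule 5: put the style phrase at the start")

    return problems
```

An empty list does not guarantee audio; it only means none of the five known causes was detected in the prompt text.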

Frequently Asked Questions

Why does image-to-video kill audio?

It's an undocumented quirk of how Veo 3.1 routes image-conditioned generations. The image-conditioning pipeline doesn't reliably trigger the audio synthesis path. Google hasn't publicly explained why; the workaround is to use text-to-video for any generation needing dialogue.

Will Google fix this?

Possibly, but don't wait. The text-to-video workaround has been stable for months. Plan your workflow around the current behavior rather than expecting a near-term fix.

What if I really need a specific starting image AND audio?

Two options: (1) describe the image in extreme detail in a text-to-video prompt — usually gets you 80% there; (2) generate the video without audio via image-to-video, then add voiceover via ElevenLabs in post-production.

Does this apply to all Veo versions?

The bug has been present in Veo 3 and Veo 3.1. Earlier versions (Veo 2 and before) did not generate audio natively, so the question doesn't arise there. The text-to-video workaround applies to all current versions.

What about other dialogue formatting rules?

Three more rules matter: (1) no quote marks around dialogue; (2) dialogue should sit in the middle of the prompt, not the start or end; (3) end every prompt with 'No music. No subtitles.' to prevent unwanted defaults.

Is there a way to test if audio will generate before paying for the render?

Not in the API directly. The cheap test is to run a 4-second test generation first ($0.30 in API spend) with your prompt. If audio works at 4 seconds, it'll work at 8.

Why does dialogue position in the prompt matter?

The prompt parser weighs early tokens for setting/style and late tokens for cinematography. Dialogue parses best when it lands in the middle of the prompt where the model is focused on action/character. Edge placement degrades audio quality.