How to make AI voices so real, people will swear you hired a voice actor. The complete guide to accents, texture, personality, and the production pipeline that only exists in Luno Studio.
There's a moment every AI filmmaker knows. The visuals are stunning — cinematic lighting, photorealistic skin, perfect composition. Then the character opens their mouth. And the illusion shatters.
The voice is flat. Perfectly enunciated. Rhythmically dead. It sounds like a GPS navigation system reading Shakespeare. This masterclass fixes that.
It's not the technology. The technology is extraordinary. The problem is that humans don't actually speak the way AI defaults to speaking.
Real speech is a beautiful mess. We swallow consonants, speed up when excited, breathe mid-sentence, hesitate, trail off, start over. We have accents specific down to a few city blocks.
AI's default is clean speech. Every word pronounced correctly. Every pause mathematically timed. No breath, no rasp, no imperfection. Your brain — calibrated against a lifetime of messy human speech — flags it as wrong immediately.
The default prompt that produces it: "Male voice. American accent. Happy tone. Read the following script clearly and naturally."
The fix is not waiting for better AI. The fix is learning to direct a voice the way a film director directs an actor. You don't tell Daniel Day-Lewis “say the line happily.” You tell him where his character grew up.
Specifying a real accent is the single most impactful thing you can do. Accent is not decoration. Accent is the physics engine of speech.
When you specify an accent, you're changing the rhythm of sentences, the placement of stress, the shape of vowels, the speed of delivery. “British accent” is not an accent. It's a continent of accents. Be geographically specific.
A sample accent profile: gritty, fast, swallowed T's, compressed sentences. Energy: urban, streetwise, percussive.
The layering technique: Don't just name the accent. Add origin context: “grew up in Lagos but lived in East London for a decade — predominantly London with Yoruba inflections that surface when she gets animated.” This gives the AI a character, not just a sound.
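The layering technique is easy to template. Here's a minimal Python sketch that expands a bare accent label into an origin-context character note — the function name and parameters are illustrative, not part of any Luno Studio API:

```python
def layered_accent(origin: str, residence: str, inflection: str) -> str:
    """Expand a bare accent label into an origin-context character note."""
    return (
        f"grew up in {origin} but lived in {residence} for a decade — "
        f"predominantly {residence} with {inflection} that surface "
        f"when the character gets animated"
    )

# Reproduces the Lagos / East London example from the text.
prompt = layered_accent("Lagos", "East London", "Yoruba inflections")
```

The point of templating this is consistency: every character in a project gets an origin, a residence, and a tell, so no one ships a voice described only as "British accent."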
Accent sets the rhythm. Texture sets the feeling.
Voice texture is the physical quality of sound — how air moves through throat, past teeth, into a microphone. Every real recording has texture that AI defaults strip out. A 45-year-old bartender who smokes doesn't sound like a 22-year-old yoga instructor.
Texture tokens
Leave them out, and every generation defaults to generic AI voice.
"close-mic, studio" vs. "room tone, slight distance, recorded in a living room"
Changes everything. Close-mic = intimacy (breaths, lip sounds). Room tone = space, ambience.
"slight rasp, warm low register, vocal fry on endings"
The physical signature of a specific person. Without this, AI defaults to "broadcast voice."
"no harsh sibilance, natural lip smacks, soft plosives"
Micro-details that separate "recording of a person" from "synthesized audio."
"voice carries a smile, slight tension underneath calm delivery"
Real people don't have neutral voices. Even "neutral" carries emotional residue.
“Confident voice” is like telling an actor “be good.” Technically correct. Practically useless.
Use the Primary + Secondary system. Pick a primary style, then add a modifier. The secondary trait is what turns a setting into a character.
For example: "an introvert who finds this quietly funny."
You can also script an arc: "Starts formal and measured — clearly rehearsed. By the middle, she's looser, more natural, words coming faster. By the end, genuine emotion breaks through the professionalism."
This gives the AI something to perform, not just execute.
Non-verbal cues are the cheat code. Humans don't speak in perfect sentences. They breathe, pause, hesitate, laugh mid-word, sigh before responding, start a sentence and abandon it.
Use 1-3 non-verbal cues per 15 seconds of speech. More becomes distracting; fewer sounds rehearsed.
(exhale) Look, I wasn't honest with you.
She was just... (pause, voice softens) ...she was everything.
Wait, (short laugh) are you serious right now?
(throat clear) Anyway, that's not important.
The breath before the line: Add (inhale) before the first word. It creates physical presence, anticipation, and signals spontaneity. The single most powerful micro-technique in voice direction.
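The cue budget is easy to sanity-check before you generate. Here's a small Python sketch — it assumes cues are written in parentheses like (inhale), and it estimates duration at a rough conversational 150 words per minute; both are assumptions to tune, not tool behavior:

```python
import re

WORDS_PER_MIN = 150  # rough conversational pace; adjust per voice


def cue_density(script: str) -> float:
    """Return non-verbal cues per 15 seconds of estimated speech."""
    cues = re.findall(r"\(([^)]+)\)", script)          # e.g. (inhale), (pause)
    words = len(re.sub(r"\([^)]*\)", " ", script).split())  # count spoken words only
    seconds = words / WORDS_PER_MIN * 60
    return len(cues) / max(seconds / 15, 1e-9)


def in_budget(script: str) -> bool:
    """True if the script lands in the 1-3 cues per 15 s sweet spot."""
    return 1 <= cue_density(script) <= 3
```

Run it over a draft before generating: if `in_budget` comes back False, add or trim cues rather than discovering the problem in the audio.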
Most people generate the full audio, hate it, and start over. Always test with a 5-second clip first.
Pick the line with the most emotional range. Generate 5 seconds. Listen for these failure modes. Fix one thing. Regenerate once. Lock it in.
Too clean, too perfect — add one breath cue and one pause per 15 seconds.
Too flat, no personality — increase personality intensity; add a secondary trait.
Too chaotic, overacted — remove extra cues, shorten pauses, simplify.
Wrong rhythm entirely — go more specific on accent; name the neighborhood.
Robotic on certain words — add a hesitation before the problem word.
The golden rule: change one thing at a time. If you adjust accent AND personality AND texture AND cues at once, you have no idea which change helped.
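One way to enforce the golden rule is to keep each attempt's settings in a dict and diff consecutive tests — a hypothetical sketch, with field names of my choosing:

```python
def changed_fields(prev: dict, new: dict) -> list[str]:
    """List the voice settings that differ between two test generations."""
    return sorted(k for k in prev.keys() | new.keys() if prev.get(k) != new.get(k))


def follows_golden_rule(prev: dict, new: dict) -> bool:
    """True when exactly one variable changed between regenerations."""
    return len(changed_fields(prev, new)) == 1
```

Changing only the accent between attempts passes; changing accent and cue count together fails, which is exactly the situation where you can no longer tell which change helped.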
The moment two characters talk, the most common failure appears: both sound like the same person doing slightly different impressions.
For every character, define three things that are completely different: accent origin, pace, and register.
Character one — Accent: South London, fast and direct. Pace: rapid, punchy, clipped sentences. Register: warm, raspy mid-range. Personality: confident, bordering on cocky.
Character two — Accent: soft Edinburgh lilt, measured. Pace: deliberate, weighted pauses. Register: clear high register. Personality: quiet authority, never raises voice.
With 3+ characters, add a fourth differentiator: a verbal tic or speech pattern. One character always trails off. Another speaks in short, clipped bursts. Another asks questions instead of making statements.
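These differentiators can be checked mechanically. Here's a sketch that flags any pair of characters sharing an accent, pace, or register — and, for casts of three or more, a verbal tic. The field names are illustrative, not a Luno Studio schema:

```python
def distinct_cast(cast: dict) -> list[str]:
    """Flag character pairs that share a differentiator.

    Each character is a dict with 'accent', 'pace', and 'register';
    casts of three or more also need a 'tic'.
    """
    required = ["accent", "pace", "register"] + (["tic"] if len(cast) >= 3 else [])
    clashes = []
    names = sorted(cast)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = [f for f in required if cast[a].get(f) == cast[b].get(f)]
            if shared:
                clashes.append(f"{a}/{b}: same {', '.join(shared)}")
    return clashes
```

An empty list means every pair is differentiated; anything else tells you exactly which two characters are drifting into the same voice, and on which axis.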
Everything above works across different tools. But there's one workflow that only exists in Luno Studio. And it changes everything.
Even with perfect prompting, AI voice has a signature — a consistency in breath timing, a too-even frequency response. Luno's pipeline eliminates it entirely.
The AI model handles lip sync and timing. Voice separation extracts a clean vocal. The voice changer replaces the AI signature with a natural human voice — keeping every breath, pause, and emotional beat you scripted. Recombination produces a finished asset that sounds like a human actor recorded in studio.
No other platform gives you AI voice separation inside the creative workspace. Every other tool forces you to export, strip vocals in a separate app, convert, and re-import. What takes 45 minutes and three applications takes 2 minutes in Luno.
Battle-tested voice prompts. Copy them, adapt them, make them yours.
Male, mid-50s. Midwest American — Iowa or Minnesota, flat vowels, honest cadence. Warm baritone, close-mic studio quality. Speaks with quiet authority — not dramatic, not urgent. Like someone telling you a story over coffee who happens to know more about the subject than anyone alive. Measured pace. (Inhale) before paragraphs. Occasional (short pause) mid-sentence at important points. No vocal fry. No upward inflection. Steady, trustworthy warmth.
Why it works: Midwest accent gives Ken Burns quality without being generic. "Story over coffee" avoids the History Channel voice trap.
Female, early 20s. Light Dublin accent — just enough to bend the vowels. Soft, slightly breathy, recorded close to mic. Telling someone something she's never said out loud. Starts quiet and careful. Pace picks up when emotion surfaces. (Deep breath) before starting. "So I never told anyone this, but... (pause) ...when I was sixteen, I—" (voice catches slightly) "—I made a decision that I thought was brave but was actually just... (exhale) ...running."
Why it works: The arc (quiet → building) gives the AI a performance curve. Non-verbal cues — especially the voice catching — create involuntary emotional response.
Male, late 30s. Brooklyn, fast-talking but articulate. Not salesman-smooth — more like a guy at a dinner party who genuinely discovered something incredible and can't shut up about it. High energy, slightly loud, speaks with his whole body. Occasional (short laugh) when he realizes he's getting too excited. "Look, I know how this sounds, but just— (inhale) —hear me out for thirty seconds."
Why it works: "Dinner party" framing avoids corporate voice. Self-awareness ("I know how this sounds") adds authenticity. Brooklyn accent provides natural pace.
Female, 40s. Received Pronunciation — precise, clipped, aristocratic. Speaks slowly, savoring each word. Low register, warm but threatening. Like honey poured over a knife. Long pauses between sentences. (Exhale) turned into a soft, amused sigh. Never raises her voice. The quieter she gets, the more dangerous she sounds. Close-mic, intimate, speaking directly into your ear.
Why it works: "Honey poured over a knife" gives the AI an emotional vector that "menacing" alone never could. Quiet delivery subverts the expected villain voice.
Female, mid-20s. Light French accent — mostly English but soft R's, occasionally dropped H's, certain words tip into French rhythm. Casual, half-awake, talking to herself more than to camera. Room tone, not studio — slight natural reverb. Vocal fry on low notes. Long stretches between phrases. "(Yawn) Okay... (pause) ...coffee first. Everything else can wait."
Why it works: Room tone places her in physical space. "Talking to herself" produces different quality than "narrating to audience." French accent adds character without demanding attention.
Pin this checklist to your wall. Run through it before every voice generation.
You can have the most photorealistic footage ever generated. A flat AI voice will undo all of it in the first two seconds. Direct the voice. Script the imperfections. Use Luno's pipeline to turn good AI audio into great human audio.
Start Creating in Luno Studio — Free to start. No credit card required.