Here's the thing about pacing: it's the difference between content that feels alive and content that feels like a robot reciting a grocery list. Too fast and people zone out because they can't keep up. Too slow? They're already three podcasts ahead. But get it right, and your listeners don't even notice they're hanging on every word.
The right pacing actually varies depending on what you're making. A true crime podcast doesn't need the same tempo as a meditation guide. And neither of those works for YouTube, where you're competing with fifty other things in someone's browser. Let's talk about how to nail this.
Words Per Minute—The Baseline
When producers talk about pacing, they're really measuring words per minute, or WPM. Think of it as the speed dial for your voice. Different speeds just feel different, and that feeling matters.
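If you want to put a number on that speed dial, the arithmetic is simple: words divided by minutes. Here's a minimal sketch, assuming a whitespace split is a good enough word count for a script. The function names are made up for illustration and aren't part of any tool.

```python
def words_per_minute(script: str, duration_seconds: float) -> float:
    """Average speaking rate for a script that took duration_seconds to read."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    word_count = len(script.split())  # rough count: whitespace-separated tokens
    return word_count / (duration_seconds / 60.0)


def estimated_minutes(script: str, wpm: float) -> float:
    """How long a script should run if it is read at the given rate."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return len(script.split()) / wpm
```

A 1,500-word script at 150 WPM comes out to about ten minutes, which is a handy sanity check before you generate anything.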
At 120 WPM and below? You're in slow-motion territory. There's something almost theatrical about it—great if you're narrating a meditation or emphasizing something important. But sustained for an entire episode? People's minds wander. Their eyes glaze over. You've lost them.
Then there's 130-150 WPM. This is the measured, thoughtful pace. Complex information, dense material, anything that requires genuine attention benefits from this speed. It's why audiobooks live here—people are often multitasking (driving, exercising, folding laundry), so you need to give their brains room to process without rewinding constantly.
Jump to 150-170 WPM and suddenly everything feels conversational. Natural. Like someone actually talking to you instead of reading at you. This is the sweet spot for most podcasts because it sounds human without feeling rushed.
Get up to 170-190 WPM and now you're energetic. Almost brisk. It works for entertainment content, for audiences under 35, for anything where you want to convey momentum and excitement. Push past that, to 200 WPM and beyond, and you're communicating urgency. It's supposed to feel rushed at that speed, so use it strategically, not as your baseline.
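The bands above can be turned into a quick lookup. This is only a sketch, and it makes a few assumptions the text doesn't settle: the 120-130 gap is folded into "slow", the 190-200 gap into "energetic", and a rate sitting exactly on a shared edge (150 or 170) goes to the faster band. The labels are shorthand, not official terms.

```python
import bisect

# Exclusive upper bounds for each band. Gaps in the prose are folded into
# the neighboring band (an assumption, see the note above).
_UPPER_BOUNDS = [130, 150, 170, 200]
_LABELS = ["slow", "measured", "conversational", "energetic", "urgent"]


def pacing_band(wpm: float) -> str:
    """Return the rough feel of a speaking rate, following the bands above."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return _LABELS[bisect.bisect_right(_UPPER_BOUNDS, wpm)]
```

So `pacing_band(160)` lands in "conversational" and `pacing_band(210)` in "urgent".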
What Your Content Actually Needs
This is where it gets interesting. Not all 150 WPM is created equal. Your content type matters, your audience matters, even the subject matter matters.
Audiobooks want to breathe. You're asking someone to sit with your voice for hours. They need time to absorb what you're saying. Aim for 130-150 WPM as your main pace. Dense technical content? Go toward the slower end, maybe 130-140 WPM. Fiction narrative? 140-150 works great because the story itself carries momentum. You can bump action sequences to 160 WPM temporarily—that adrenaline works—but then settle back down.
Podcasts are different. They're more forgiving because they tend to be dialogue-heavy, interview-based, or designed for shorter bursts of attention. Educational podcasts should sit around 140-160 WPM because you're explaining things. Conversational podcasts? 160-180 WPM because you're riffing, connecting, not lecturing. News briefings need that same snappier tempo. But storytelling podcasts, the narrative kind, can hang out at 150-170 WPM, where they feel natural but still engaging.
YouTube expects efficiency. Your audience is watching and listening, so the visuals are doing some of the cognitive work for you. That means you can actually talk faster—160-180 WPM feels right for explainers and product reviews. Tutorials need to be slightly slower (150-170) because people are trying to follow along. Entertainment content? Lean into 170-190 WPM.
Documentary narration lives between worlds. It's got visuals like YouTube, but it's asking for the kind of contemplative attention you'd give an audiobook. Most documentaries work at 140-160 WPM. Nature documentaries, especially—those are meant to feel contemplative, so 130-150 WPM gives you space to breathe. Fast-paced true crime? You can push toward 160-180 WPM.
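If you'd rather not keep all of those ranges in your head, they fit in a small table. The sketch below copies the numbers from the sections above; the key names ("general", "true_crime", and so on) are invented for illustration, and the temporary 160 WPM bump for audiobook action scenes is left out because it isn't a baseline.

```python
# (content type, style) -> (low WPM, high WPM), taken from the ranges above.
TARGET_WPM = {
    ("audiobook", "general"): (130, 150),
    ("audiobook", "technical"): (130, 140),
    ("audiobook", "fiction"): (140, 150),
    ("podcast", "educational"): (140, 160),
    ("podcast", "conversational"): (160, 180),
    ("podcast", "news"): (160, 180),
    ("podcast", "storytelling"): (150, 170),
    ("youtube", "explainer"): (160, 180),
    ("youtube", "review"): (160, 180),
    ("youtube", "tutorial"): (150, 170),
    ("youtube", "entertainment"): (170, 190),
    ("documentary", "general"): (140, 160),
    ("documentary", "nature"): (130, 150),
    ("documentary", "true_crime"): (160, 180),
}


def check_pace(content_type: str, style: str, measured_wpm: float) -> str:
    """Compare a measured rate against the suggested range for this content."""
    low, high = TARGET_WPM[(content_type, style)]
    if measured_wpm < low:
        return "too slow"
    if measured_wpm > high:
        return "too fast"
    return "on target"
```

Treat the result as a starting point for listening tests, not a verdict.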
The Secret Weapon: Pacing Variation
Here's what separates an amateur-sounding AI voice from something genuinely listenable: variety. Deliver everything at the same speed and it sounds robotic. Sounds... well, like AI.
Humans naturally talk faster when they're sharing familiar information. "So I went to the coffee shop, right?" Boom, moving right along. But then something specific or important happens: "And the barista was actually my high school teacher." You slow down. You emphasize. You give that moment room.
That's what you want in your content.
Slow down for the moments that matter. New concepts, important points, emotional beats: give them space. When you're moving through familiar territory, you can pick up the pace. Transitions between sections get the same treatment. Speed them up slightly to maintain momentum, because you're bridging ideas, not dwelling on them.
The easiest way to build variation is to work in sections. That introduction where you're setting tone? Keep it at 1.0x speed—normal baseline. Your first complex point or key takeaway? Drop it to 0.95x, give people time to actually absorb the information. Transitions between major ideas? Bump to 1.1x to keep energy flowing. Concluding your main points, where you want to emphasize what matters? Pull back to 0.9x, let them land with impact.
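Here's that section plan written down as data. It's a sketch under one big assumption: that a speed multiplier scales the spoken rate linearly, so 1.1x on a 160 WPM voice lands near 176 WPM. Real voices may not behave that neatly, and the role names are just labels chosen for this example.

```python
# Speed multiplier per section role, using the values suggested above.
SECTION_SPEEDS = {
    "introduction": 1.0,
    "key_point": 0.95,
    "transition": 1.1,
    "conclusion": 0.9,
}


def effective_wpm(base_wpm: float, role: str) -> float:
    """Approximate spoken rate for a section, assuming speed scales WPM linearly."""
    return round(base_wpm * SECTION_SPEEDS[role], 1)
```

With a 160 WPM baseline, that works out to roughly 152 WPM on key points and 144 WPM on the conclusion, which is enough of a shift to hear without sounding like a different narrator.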
You can also let your script do the work for you. Shorter sentences naturally read faster. Longer sentences, the ones that unfold and build the way this one is doing, naturally slow things down. Paragraph breaks create pauses. Ellipses—like this…—add a moment of reflection. Em-dashes—like this one—create little interruptions, brief moments of emphasis.
Getting Technical (But Not Really)
In Vois, you've got control from 0.5x all the way to 2.0x. But realistically? Most of your work lives between 0.9x and 1.2x. That's where things feel natural.
0.9x is slightly slower, perfect for emphasizing key points or complex sections. 1.0x is your baseline, your neutral. 1.1x is brisk—moves things along without sounding rushed. 1.2x starts to feel more energetic. Anything beyond that, you're using for specific effects, specific moments, not sustained listening.
The segment-based approach gives you flexibility and control without overcomplicating things. Break your content into logical sections: introduction, main ideas, transitions, conclusion. Assign speeds based on what each section needs. Generate them separately. Assemble them together. Suddenly you've got content with rhythm and life, not just a uniform drone.
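Before generating anything, it can help to lay the segments out and estimate how long the assembled piece will run. The sketch below does only the planning math and doesn't call Vois or any other tool. `Segment`, `check_speed`, and `plan_seconds` are hypothetical names, and the 150 WPM baseline is an assumed figure you'd replace with your own voice's measured rate.

```python
from dataclasses import dataclass

MIN_SPEED, MAX_SPEED = 0.5, 2.0      # full control range mentioned above
NATURAL_MIN, NATURAL_MAX = 0.9, 1.2  # range described above as sounding natural


@dataclass
class Segment:
    name: str
    text: str
    speed: float = 1.0

    def seconds(self, base_wpm: float) -> float:
        """Estimated duration, assuming speed scales the base rate linearly."""
        words = len(self.text.split())
        return words / (base_wpm * self.speed) * 60.0


def check_speed(speed: float) -> bool:
    """Reject out-of-range speeds; return True when the speed is in the natural range."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed {speed} is outside {MIN_SPEED}-{MAX_SPEED}")
    return NATURAL_MIN <= speed <= NATURAL_MAX


def plan_seconds(segments: list[Segment], base_wpm: float = 150.0) -> float:
    """Total estimated duration of all segments played back to back."""
    total = 0.0
    for segment in segments:
        check_speed(segment.speed)  # raises on impossible values
        total += segment.seconds(base_wpm)
    return total
```

If the total is far from your target length, that's a cue to revisit the script before you spend time generating audio.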
Actually Testing This Stuff
You can't just guess. You've got to listen.
Grab someone who's never heard your content before. Have them listen to a section. Did they follow the main points? Were there moments where it felt too fast and they lost the thread? Did they zone out during slower sections, or did those moments feel like they needed the space?
Compare your target duration to what you actually generated. If you wanted a 10-minute podcast segment and generated 15, your script is longer than you thought. Closing that gap with speed alone would take 1.5x, well past the range where narration sounds natural, so trim the script first and save small speed bumps for small gaps. YouTube voiceovers need to match your video length exactly. That part is non-negotiable.
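The duration check is easy to work out by hand, but here it is as a sketch. It assumes duration shrinks in proportion to the speed multiplier and grows in proportion to word count, and it treats 0.9x to 1.2x as the natural range. The function name and return shape are made up for this example.

```python
NATURAL_MIN, NATURAL_MAX = 0.9, 1.2  # range treated as natural-sounding


def fit_to_target(generated_minutes: float, target_minutes: float) -> tuple[float, float]:
    """Return (speed, script_cut) needed to hit the target duration.

    speed is clamped to the natural range. script_cut is the fraction of the
    script to remove at that speed (negative means there is room to add).
    """
    if generated_minutes <= 0 or target_minutes <= 0:
        raise ValueError("durations must be positive")
    ideal_speed = generated_minutes / target_minutes
    speed = min(max(ideal_speed, NATURAL_MIN), NATURAL_MAX)
    # At the clamped speed, the script may only be this fraction of its current length.
    allowed_fraction = target_minutes * speed / generated_minutes
    return speed, 1.0 - allowed_fraction
```

For a 15-minute result against a 10-minute target, the ideal speed is 1.5x, so the sketch caps it at 1.2x and reports that about a fifth of the script still has to go.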
Generate the same content at different speeds and listen to each version objectively. Which one actually sounds like how people talk? Which one lets information land? That's your answer. It might be 0.95x. It might be 1.1x. Every piece of content is slightly different.
When Things Go Wrong
Uniform pacing is usually the first mistake. Everything the same speed sounds mechanical. Solution? Introduce variation through sections and script formatting. Non-negotiable.
Too fast is the second problem. Listeners rewind to catch what they missed. They're confused. They're exhausted from the cognitive effort. Slow it down. Give them space at key points. Tighten your script so you're only saying things that actually matter.
Too slow and people get impatient. They start skipping ahead, playing at 1.5x on their end, just trying to move through it. Either you need to increase your baseline speed, or your content is bloated. Cut the fluff, pick up the tempo.
Unnatural rhythm—the weird pauses, the awkward flow—usually comes from not paying attention to punctuation and sentence structure. Review what you've written. Longer sentences? They'll slow down. Shorter ones? They'll move. Does the rhythm feel natural, or does it feel like something's off?
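One rough way to review sentence structure is to count words per sentence and see how much the lengths vary. The sketch below is a heuristic, not a measure of how the audio will sound: it splits on sentence-ending punctuation followed by whitespace, so abbreviations like "Dr. Smith" will be miscounted. The function names are illustrative.

```python
import re
import statistics


def sentence_lengths(script: str) -> list[int]:
    """Word count of each sentence, splitting after ., !, ? or an ellipsis character."""
    sentences = re.split(r"(?<=[.!?…])\s+", script.strip())
    return [len(sentence.split()) for sentence in sentences if sentence]


def rhythm_report(script: str) -> dict[str, float]:
    """Summarize sentence-length variety; a spread near zero hints at a monotone script."""
    lengths = sentence_lengths(script)
    if not lengths:
        raise ValueError("script has no sentences")
    return {
        "sentences": len(lengths),
        "mean": statistics.mean(lengths),
        "spread": statistics.pstdev(lengths),
        "longest": max(lengths),
    }
```

A script where every sentence is the same length will show a spread of zero, which is usually worth a second look even if each sentence reads fine on its own.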
Good pacing is invisible. Your listener doesn't think about how fast you're talking. They just absorb the content, stay engaged, and never have to rewind or speed up to feel comfortable. That's the goal. Bad pacing creates friction everywhere. Once you get it right, though, and understand how speed and rhythm work together, you can create voiceover content that actually holds attention.