Spotlight: Japanese and Chinese Voice Library

Vois Team
December 6, 2025
8 min read

TL;DR: Vois includes 5 Japanese and 8 Mandarin Chinese voices with native-quality prosody. Use native voices for authentic content rather than translated English voices.

Vois includes five Japanese voices and eight Mandarin Chinese voices. And here's the important part: these aren't just a checkbox feature. They're native-quality voices built for genuine content creation in these languages, not translated-from-English workarounds.

Japanese Voice Library

The Five Voices

Let's talk about what you're actually getting here.

jf_alpha is your clear, professional female voice. Think of her as the person who reads the morning news—confident, polished, perfect for business content or formal narration. She handles Japanese honorifics naturally because the voice engine understands context, not just phonemes.

jf_nezumi leans warmer. She's got storytelling in her DNA, which makes her great if you're doing narrative work or anything where you want the audience to feel personally connected. She doesn't sound stiff.

jm_kumo is the professional male option. Clear articulation, solid for educational content or corporate videos. The voice strikes that balance between authoritative and not intimidating.

There are two additional Japanese voices in the library too, giving you options for multi-voice projects where you need clear character distinction or host-guest dynamics.

What Actually Matters for Japanese

Here's where most people stumble. English text-to-speech fails at Japanese because the language has completely different rules.

Honorific levels (敬語, 丁寧語, 普通体) change how respectful the speech sounds. If you're creating business content, your script needs to use the right level from the start. The voice engine respects what you write—if you use casual language, it'll sound casual, even with a "professional" voice.

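Pitch patterns. English relies on stress: we punch certain syllables louder. Japanese doesn't work that way. It uses pitch accent, where the rise and fall of pitch within a word carries meaning. In standard Tokyo Japanese, for example, はし means chopsticks (箸) when the pitch starts high and drops, and bridge (橋) when it starts low and rises. A quality voice handles this automatically, which is why these native voices matter.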

Length matters. Long vowels (おう, ああ) and double consonants (っ) need actual duration. Sloppy TTS will compress them. These voices get it right.
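
To see why duration matters, it helps to count morae, the timing units of Japanese. Here is a minimal Python sketch, assuming the input is written entirely in kana. The function name `count_morae` is made up for illustration and is not part of Vois.

```python
# Small kana that merge with the preceding kana into a single mora (きょ, ファ).
SMALL_COMBINING = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a kana-only string (assumption: no kanji or spaces).

    Every kana is one mora, including っ, ん, and the long-vowel mark ー.
    Small combining kana attach to the previous kana and add nothing.
    """
    return sum(1 for ch in kana if ch not in SMALL_COMBINING)


print(count_morae("がっこう"))    # 4: が, っ, こ, う
print(count_morae("きょう"))      # 2: きょ, う
print(count_morae("おばさん"))    # 4 (aunt)
print(count_morae("おばあさん"))  # 5 (grandmother)
```

The last two words differ only by one long vowel, which is exactly the kind of contrast a voice loses if it compresses durations.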

Particles (は, が, を, etc.) aren't decorative—they change meaning. The voice should reflect their grammatical weight, not treat them as throwaway syllables.

Mandarin Chinese Voice Library

Eight Voices, Two Genders

You get four female and four male options here. That's enough variety to create conversations without people wondering if it's the same voice repeating.

The female voices start with zf_xiaobei, who delivers clean, neutral Mandarin. She's perfect if you need professional credibility—educational content, corporate videos, serious narration. Her accent leans toward standard Beijing Mandarin, which is what most Chinese speakers expect.

Then there's zf_xiaoni, who brings warmth and conversational ease. She sounds like someone actually talking to you rather than reading at you. Great for storytelling or anything where tone matters more than formality.

Two more female voices round things out, each with their own character. You can mix and match based on what your project needs.

On the male side, zm_yunxi is your formal choice—professional delivery, crystal clear for business content and tutorials. zm_yunyang softens things up, bringing approachability without sacrificing clarity. The additional male voices give you even more flexibility.

The Mandarin Challenges (And Why Native Matters)

Mandarin breaks TTS engines. It really does. There are four tones plus a neutral tone, and each one changes the meaning entirely. The syllable "ma" can be 妈 (mā, mother), 麻 (má, hemp), 马 (mǎ, horse), 骂 (mà, scold), or 吗 (ma, a question marker). Same syllable. Completely different tones.

A cheap voice engine will mangle this. It'll pronounce all versions the same way, which sounds ridiculous to native speakers. These voices understand tone in context. They know that a fourth-tone character carries different weight than a first-tone one.

Tone sandhi is the next gotcha. When certain tones sit next to each other in speech, they change. A 不 (bù) before a fourth-tone character becomes second-tone. Most TTS systems don't handle this. Quality systems do. These voices do.
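
The 不 rule is simple enough to sketch in a few lines of Python. This is only an illustration of the rule itself, assuming the script has already been split into (character, tone-numbered pinyin) pairs. It is not how the Vois engine works internally, and `apply_bu_sandhi` is a hypothetical name.

```python
def apply_bu_sandhi(syllables: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return a copy where 不 (bu4) becomes bu2 before a fourth-tone syllable.

    Each item is (character, pinyin with a trailing tone number), e.g. ("是", "shi4").
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        char, pinyin = syllables[i]
        next_pinyin = syllables[i + 1][1]
        if char == "不" and pinyin == "bu4" and next_pinyin.endswith("4"):
            result[i] = (char, "bu2")
    return result


print(apply_bu_sandhi([("不", "bu4"), ("是", "shi4")]))
# [('不', 'bu2'), ('是', 'shi4')]
print(apply_bu_sandhi([("不", "bu4"), ("好", "hao3")]))
# [('不', 'bu4'), ('好', 'hao3')]
```

Checking the character as well as the pinyin keeps other bu4 characters, such as 步 or 部, from being changed by mistake.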

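Erhua (儿化) is that Beijing "r" coloring added to the end of a syllable, as in 一点儿 (yìdiǎnr, a little) or 哪儿 (nǎr, where). It's not the same thing as the full-syllable 儿 in 儿子 (érzi, son) or 女儿 (nǚ'ér, daughter). It's a regional feature, and it matters if you're creating authentic content for Beijing or northern Chinese audiences.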

Full-width punctuation. Chinese text uses full-width marks (。,!?) instead of their ASCII counterparts (. , ! ?). Your script should respect that. The voice engine pays attention to what you actually write.
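
Scripts typed on a Western keyboard often end up with ASCII punctuation mixed into Chinese text. A small pre-processing pass can tidy that up before the script reaches any voice engine. The sketch below is one possible approach, not a Vois feature. It assumes that only punctuation directly following a Chinese character should be converted, and `to_full_width` is a hypothetical helper name.

```python
import re

# ASCII punctuation and the full-width mark that replaces it.
FULL_WIDTH = {",": ",", ".": "。", "!": "!", "?": "?"}

# Match ASCII punctuation only when it directly follows a CJK ideograph,
# so decimals like 3.5 and embedded English sentences are left alone.
_PATTERN = re.compile(r"(?<=[\u4e00-\u9fff])[,.!?]")


def to_full_width(text: str) -> str:
    """Replace ASCII , . ! ? with full-width marks after Chinese characters."""
    return _PATTERN.sub(lambda m: FULL_WIDTH[m.group(0)], text)


print(to_full_width("你好,世界!今天是3.5度."))
# 你好,世界!今天是3.5度。
```

The lookbehind is deliberately conservative. It will miss punctuation that follows a digit or a Latin letter inside a Chinese sentence, so a human pass over the script is still worthwhile.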

Numbers and dates follow their own patterns in Chinese. You don't say "12" as "one two"—you say 十二 (shí'èr—ten two). The context matters. These voices get it.
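
As a rough illustration of the "ten two" pattern, here is a small Python sketch that spells out whole numbers from 0 to 99 in Chinese numerals. It is deliberately limited: larger numbers, dates, and digit-by-digit readings such as phone numbers follow other rules, and `to_chinese` is a name invented for this example.

```python
DIGITS = "零一二三四五六七八九"


def to_chinese(n: int) -> str:
    """Spell out an integer from 0 to 99 in Chinese numerals."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    # 10-19 drop the leading 一: 十二, not 一十二.
    text = ("" if tens == 1 else DIGITS[tens]) + "十"
    if ones:
        text += DIGITS[ones]
    return text


print(to_chinese(12))  # 十二
print(to_chinese(20))  # 二十
print(to_chinese(45))  # 四十五
```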

Real-World Use Cases

When You're Creating Original Content

Let's say you're a podcaster who wants to reach Japanese audiences. You don't want a subtitle or a translation—you want to create actually-in-Japanese episodes. These voices are for that. Same with Mandarin podcasts for Chinese listeners. You write the script (or have a native speaker help), pick your voices, and you're done.

YouTube creators do this constantly. The algorithm rewards native-language content. If you're making tutorials, product reviews, or educational videos for Asian markets, starting with the right language voice matters far more than trying to localize English content later.

Audiobook narrators absolutely rely on this kind of thing. A 15-hour audiobook in Japanese needs a voice that doesn't fatigue listeners. It needs to handle complex sentence structures and emotional beats correctly. English voices reading Japanese translations? That's a quick way to get bad reviews.

Corporate videos and e-learning modules are the same story. If you're building training content for Japanese or Chinese employees, they deserve to learn in their actual language, not a translation.

When You're Localizing Existing Work

Here's where people get tripped up. You've got English content that's working great, and you want to reach Japanese or Chinese audiences. The temptation is to translate it, have an English voice read the translation, and cross your fingers.

Don't. That sounds terrible.

The real approach: Hire a native speaker to adapt your script. Not just translate—adapt. Cultural references that land in English might completely miss in Japan or China. Humor doesn't travel as-is. The structure of how ideas get presented changes.

Once you've got an adapted script in the target language, use a native voice. The effort pays for itself in audience reaction. People sense when content respects their language.

Multi-Language Projects (Expanding Your Reach)

If you're building something that needs to work in multiple languages, consistency becomes important. You want the same energy and professionalism across all versions, whether your audience is listening to Japanese, Mandarin, or English.

Pick voice styles that map across languages—professional in all versions, warm in all versions, authoritative in all versions. Don't use a formal voice in English but a casual voice for the Japanese version. People notice these inconsistencies more than you'd think.

Export settings matter too. Make sure your loudness levels, file formats, and delivery specs are identical across languages. It's boring technical work, but it prevents disaster when everything goes live.
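
If you track delivery specs in a script or spreadsheet, a quick consistency check can catch mismatches before launch. The Python sketch below is purely illustrative. The `ExportSpec` fields and the sample values are assumptions made for this example, not actual Vois export settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportSpec:
    """Hypothetical delivery spec for one language version."""
    loudness_lufs: float
    file_format: str
    sample_rate_hz: int


def mismatched_languages(specs: dict[str, ExportSpec], reference: str = "en") -> list[str]:
    """Return the languages whose spec differs from the reference language."""
    ref = specs[reference]
    return sorted(lang for lang, spec in specs.items() if spec != ref)


specs = {
    "en": ExportSpec(-16.0, "wav", 48000),
    "ja": ExportSpec(-16.0, "wav", 48000),
    "zh": ExportSpec(-14.0, "wav", 48000),  # loudness differs from the others
}
print(mismatched_languages(specs))  # ['zh']
```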

Picking the Right Voice for Your Project

For Japanese Projects

If you're doing formal, business-oriented work (corporate training, serious news-style narration, official communications), reach for jf_alpha or jm_kumo. They both nail that professional delivery without sounding robotic. Alpha (female) is the clearer of the two; Kumo (male) leans slightly warmer but stays professional.

jf_nezumi is your storytelling voice. Audiobooks, narratives, anything where you want the listener to lean in emotionally. She's not stiff. She sounds like someone who actually cares about the story being told.

For multi-speaker projects—podcasts with hosts and guests, audiobooks with dialogue, anything with multiple characters—mix voices by gender. It's immediately clear who's talking. You could do a male-female pair for hosts, then bring in another voice for guest appearances.

For Mandarin Projects

zf_xiaobei and zm_yunxi are your professional anchors. Clear delivery, standard Beijing accent, nothing distracting. Educational content, corporate videos, professional narration—they handle it all without fatigue.

zf_xiaoni and zm_yunyang bring warmth and conversational ease. Use these when you want audience engagement, for storytelling, or anything where the voice matters as much as the words.

Mix and match across genders for dialogue and multi-speaker content. The contrast makes it obvious who's talking without you having to add text labels.

Practical Technical Stuff

You can adjust speed on both libraries, anywhere from 0.5x (half speed, which helps with comprehension) to 2.0x (double speed, useful for trailers or rapid delivery). Speed changes are also handy for emphasizing certain passages or giving characters distinct pacing from the same base voice.
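
If you set speeds from a script, it is worth clamping values to the supported range and estimating how the change affects running time. A minimal Python sketch follows, assuming the 0.5x to 2.0x range quoted above and a simple duration-divided-by-speed model. Both function names are hypothetical.

```python
MIN_SPEED, MAX_SPEED = 0.5, 2.0


def clamp_speed(requested: float) -> float:
    """Keep a requested speed multiplier inside the 0.5x-2.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, requested))


def scaled_duration(seconds: float, speed: float) -> float:
    """Estimate playback length after a speed change (duration / speed)."""
    return seconds / clamp_speed(speed)


print(clamp_speed(3.0))            # 2.0
print(clamp_speed(0.25))           # 0.5
print(scaled_duration(60.0, 2.0))  # 30.0
print(scaled_duration(60.0, 0.5))  # 120.0
```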

Voice blending works here too if you want to mix voices within the same language. You could blend two female voices for a subtle character shift, or go bold with a completely different voice for contrast. Both Japanese and Chinese voices play nicely with all the standard Vois features—timeline integration, export presets, multi-speaker workflows.

Full-width punctuation and Unicode handling are built in. Type your Chinese with full-width marks (。,!?) or your Japanese with full-width punctuation (。、!?), and the voice engine respects it. The same goes for mixed scripts (throwing English words into a Japanese script, for example).

The export options—loudness standards, file formats, presets for different platforms—all work identically with Asian language voices. That consistency is especially valuable if you're managing projects across multiple languages.


Here's the bottom line: if you're creating content for Japanese or Mandarin audiences, don't settle for English voices reading translations. These native voices exist for a reason. Language matters. Authenticity matters. Your audience will feel the difference.

Frequently Asked Questions

Are the Japanese and Chinese voices suitable for native speakers?

Yes. These voices are designed for native-quality output with natural prosody, appropriate intonation patterns, and clear articulation for native speaker audiences.

Can I use these voices for dubbing English content?

Yes. The voices work well for localization, though optimal results come from scripts written or adapted for the target language rather than direct translation.
