OmniVoice Real-Time TTS for Polish
Benchmarking OmniVoice against Chatterbox on RTX 5090 — latency, streaming, voice cloning, and audio quality compared head-to-head.
The Models
OmniVoice
by k2-fsa · Apache-2.0 · arXiv:2604.00688
State-of-the-art massively multilingual zero-shot TTS supporting 600+ languages — the broadest language coverage among zero-shot TTS models. Uses a diffusion language model architecture with an 8-layer hierarchical audio codebook for high-fidelity synthesis.
- Architecture: Qwen3-0.6B backbone (~600M params), masked diffusion iterative decoding
- Voice cloning: Zero-shot from reference audio + transcript
- Voice design: Control gender, age, pitch, accent, dialect, whisper via text instructions
- Paralinguistics: `[laughter]`, `[breath]` tags and pronunciation correction via pinyin/phonemes
- Speed: RTF as low as 0.025 (40x real-time)
- Available on: HuggingFace, CLI (`omnivoice-infer`), Gradio demo
Chatterbox
by Resemble AI · MIT License
Family of three open-source TTS models (Chatterbox, Chatterbox-Multilingual, Chatterbox-Turbo) designed for natural speech generation with voice cloning. Built-in Perth watermarking for AI audio detection.
- Architecture: Autoregressive speech tokenizer + mel-spectral decoder, 350M–500M params
- Voice cloning: Zero-shot from reference audio
- Turbo variant: 350M params, distilled to 1-step decoding for low-latency voice agents
- Paralinguistics: Native `[laugh]`, `[cough]`, `[chuckle]` tags
- Languages: 23+ (English, Spanish, French, German, Polish, Chinese, Japanese, Arabic, Hindi, etc.)
- Watermarking: Perth watermarks survive MP3 compression with ~100% detection accuracy
- Available on: `pip install chatterbox-tts`, HuggingFace
Architecture Overview
| | Chatterbox | OmniVoice |
|---|---|---|
| Model type | Autoregressive (token-by-token) | Masked diffusion (iterative unmasking) |
| Backbone | Custom T3 decoder | Qwen3-0.6B LLM (~600M params) |
| Audio codec | Single codebook stream | 8-layer hierarchical (8×1025 tokens) |
| Streaming | True per-token streaming | No native streaming — full sequence per call |
| Voice cloning | Embedding conditioning | Reference audio tokenization + prefix |
| Languages | Polish + basic multilingual | 600+ languages native |
Key Architectural Difference
Chatterbox is real-time because it streams token-by-token: the autoregressive decoder emits tokens sequentially, each decoded to audio and sent immediately. TTFA (time to first audio) is just the time to generate the first few tokens.
OmniVoice uses masked diffusion – all audio tokens start as [MASK] and are iteratively unmasked over N steps. Each step runs a full forward pass over the entire sequence. Tokens are revealed by confidence score, not position. Partial audio cannot be decoded mid-generation.
However, OmniVoice’s raw inference speed is so fast (RTF 0.01–0.04) that text-level chunking achieves comparable or better TTFA than Chatterbox’s token streaming.
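The chunking strategy is straightforward to sketch. A minimal version of the sentence-level splitter is below; the regex boundary rule and the `max_chars` merge policy are illustrative assumptions, not OmniVoice's actual preprocessing:

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries, merging adjacent short
    sentences up to max_chars so each chunk amortizes the fixed
    per-call overhead over more audio."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then passed to the model as an independent generation call, with the first chunk's completion time serving as the effective TTFA.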
Test 1: Baseline Latency
Full generation, voice cloned, no chunking. Total wall time measured from the `generate()` call to the returned tensor.
| Text | Chars | Audio | 32 steps | 16 steps | 8 steps |
|---|---|---|---|---|---|
| tiny | 12 | 1.2–1.5s | 269ms (RTF 0.222) | 136ms (RTF 0.094) | 72ms (RTF 0.054) |
| short | 66 | 4.3s | 298ms (RTF 0.069) | 151ms (RTF 0.035) | 80ms (RTF 0.019) |
| medium | 196 | 12.6s | 503ms (RTF 0.040) | 255ms (RTF 0.020) | 137ms (RTF 0.011) |
| long | 488 | 28.5–28.9s | 849ms (RTF 0.030) | 446ms (RTF 0.015) | 255ms (RTF 0.009) |
- RTF improves with longer text (fixed overhead amortized over more audio).
- 16 steps is the sweet spot: ~2x faster than 32, quality degradation is minimal.
- At 8 steps, RTF drops to 0.009–0.054 – 18–110x faster than real-time.
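The RTF figures above follow from a simple timing harness. This sketch assumes `generate` is any callable returning a 1-D waveform at a known sample rate (24 kHz is an assumption, not a confirmed OmniVoice output rate); RTF below 1.0 means faster than real time:

```python
import time

def measure_rtf(generate, text: str, sample_rate: int = 24000) -> dict:
    """Wall-clock a single generation call and derive the
    real-time factor: wall time divided by audio duration."""
    t0 = time.perf_counter()
    wav = generate(text)
    wall_s = time.perf_counter() - t0
    audio_s = len(wav) / sample_rate
    return {"wall_s": wall_s, "audio_s": audio_s, "rtf": wall_s / audio_s}
```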
Test 2: Chunked Streaming (TTFA Measurement)
Simulates streaming: split text at sentence boundaries, generate each chunk independently, measure time to first completed chunk.
Medium text (196 chars, 2 chunks)
| Steps | TTFA (chunk 0) | Chunk 0 audio | Chunk 1 gen | Chunk 1 audio | Total |
|---|---|---|---|---|---|
| 32 | 335ms | 7.00s | 327ms | 5.80s | 662ms |
| 16 | 169ms | 7.00s | 165ms | 5.80s | 333ms |
| 8 | 88ms | 7.00s | 86ms | 5.80s | 173ms |
Long text (488 chars, 5 chunks)
| Steps | TTFA (chunk 0) | Chunks | Total gen | Total audio | RTF |
|---|---|---|---|---|---|
| 32 | 331ms | 5 | 1605ms | 29.6s | 0.054 |
| 16 | 168ms | 5 | 813ms | 29.6s | 0.027 |
| 8 | 87ms | 5 | 423ms | 29.6s | 0.014 |
Per-chunk breakdown (long text, 16 steps):
| Chunk | Gen time | Audio | Content |
|---|---|---|---|
| 0 | 168ms | 7.00s | "Powazny blad w obiegu dokumentow..." |
| 1 | 165ms | 5.80s | "Przez pomylke dokumentacja..." |
| 2 | 150ms | 4.08s | "Incydent zostal zgloszony..." |
| 3 | 171ms | 7.44s | "Linia lotnicza przeprosila..." |
| 4 | 159ms | 5.32s | "Zwiazki zawodowe domagaja sie..." |
- At 16 steps, TTFA is 169ms – 32% faster than Chatterbox’s typical 250ms.
- Each chunk generates ~4–7s of audio in ~150–170ms. The playback buffer is enormous.
- Chunking overhead is minimal: total gen time ~30% higher than single-shot, but the streaming benefit far outweighs it.
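The measurement loop behind these tables can be sketched as follows; sequential generation is an assumption matching the simulation described above (chunks generated one after another, playback starting as soon as chunk 0 exists):

```python
import time

def generate_chunked(generate, chunks):
    """Generate sentence chunks sequentially. TTFA is the wall time
    until the first chunk's audio exists; later chunks are generated
    while earlier ones would already be playing."""
    t0 = time.perf_counter()
    waveforms, ttfa = [], None
    for i, chunk in enumerate(chunks):
        waveforms.append(generate(chunk))
        if i == 0:
            ttfa = time.perf_counter() - t0
    return ttfa, waveforms
```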
Test 3: Voice Clone Prompt Caching
`create_voice_clone_prompt()` pre-encodes reference audio into reusable tokens.
| Mode | Generation time |
|---|---|
| Raw ref_audio path (re-encodes each call) | 264ms |
| Pre-cached VoiceClonePrompt | 255ms |
| Prompt creation cost | 37ms (one-time) |
| Per-call savings | 9ms (3%) |
Prompt encoding is already fast (37ms). Caching is still worthwhile for a server to avoid redundant re-encoding.
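A server-side cache for clone prompts is a few lines. Here `create_prompt` stands in for `create_voice_clone_prompt()`; the exact call signature is an assumption:

```python
def get_clone_prompt(cache: dict, ref_audio_path: str, create_prompt):
    """Memoize pre-encoded voice-clone prompts per reference file,
    so a server pays the one-time encoding cost (~37ms measured
    above) once per voice rather than once per request."""
    if ref_audio_path not in cache:
        cache[ref_audio_path] = create_prompt(ref_audio_path)
    return cache[ref_audio_path]
```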
Test 4: Concurrent Inference
Same text, 3 requests
| Mode | Wall time | Per-request | Speedup |
|---|---|---|---|
| Sequential | 767ms | 255ms each | 1.0x |
| 3x concurrent (thread pool + CUDA streams) | 462ms | 436–460ms each | 1.66x |
Mixed text lengths, 3 concurrent
| Request | Text | Latency | Audio |
|---|---|---|---|
| 0 | tiny (12 chars) | 408ms | 1.51s |
| 1 | medium (196 chars) | 502ms | 12.58s |
| 2 | long (488 chars) | 555ms | 28.61s |
| Wall time (all 3) | | 558ms | |
- GIL + shared model weights limit true parallelism to ~1.66x speedup for 3 threads.
- Individual request latency increases ~1.8x under contention.
- Chatterbox's `InferenceSlot` + `SlotPool` pattern would improve this further.
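The concurrent setup can be sketched with a thread pool over one shared model. In the actual benchmark each worker also enters its own `torch.cuda.Stream` context so kernels from different requests can overlap on the GPU; that detail is omitted here to keep the sketch self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def serve_concurrent(generate, texts, max_workers=3):
    """Fan several requests out over a thread pool that shares one
    model instance. Results come back in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, texts))
```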
Test 5: Pipeline Breakdown
Detailed instrumentation of every pipeline stage, averaged over 5 runs per configuration. RTX 5090, float16, voice-cloned.
Where the milliseconds go
The OmniVoice pipeline has 7 stages. Stages 1–4 (preprocessing) are negligible. Stage 5 (iterative decode) dominates at 87–97% of total time.
Short = 66 chars, 4.4s audio; medium = 196 chars, 12.6s audio.
| Stage | Short, 32 steps | Short, 16 steps | Short, 8 steps | Medium, 32 steps | Medium, 16 steps | Medium, 8 steps |
|---|---|---|---|---|---|---|
| 1. Preprocess/resolve | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms |
| 2. Duration estimation | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms |
| 3. Text tokenize + inputs | 0.4ms | 0.4ms | 0.4ms | 0.5ms | 0.6ms | 0.5ms |
| 4. Batch construct (CFG) | 0.2ms | 0.2ms | 0.2ms | 0.2ms | 0.2ms | 0.2ms |
| 5. Iterative decode | 357ms | 182ms | 88ms | 521ms | 264ms | 130ms |
| 5a. LLM forward passes | 330ms | 166ms | 82ms | 491ms | 246ms | 122ms |
| 5b. CFG + token selection | 27ms | 16ms | 7ms | 30ms | 17ms | 7ms |
| 6. Audio decode (vocoder) | 5.5ms | 3.7ms | 3.6ms | 12.3ms | 8.6ms | 8.3ms |
| 7. Post-process | 3.3ms | 3.2ms | 3.1ms | 9.9ms | 9.8ms | 9.7ms |
| TOTAL | 367ms | 190ms | 96ms | 544ms | 283ms | 148ms |
Long text (488 chars, 28.5s audio, 950 tokens)
| Stage | 32 steps | % of total |
|---|---|---|
| 5a. LLM forward passes | 818ms | 90.5% |
| 5b. CFG + token selection | 40ms | 4.5% |
| 6. Audio decode (vocoder) | 21ms | 2.3% |
| 7. Post-process | 24ms | 2.6% |
| TOTAL | 905ms | 100% |
Per-step cost
Each diffusion step runs a full Qwen3-0.6B forward pass over the entire sequence. Cost scales with sequence length:
| Text | Sequence length | Cost per step |
|---|---|---|
| Short (66 chars) | 220 tokens | ~11ms |
| Medium (196 chars) | 465 tokens | ~16ms |
| Long (488 chars) | 950 tokens | ~27ms |
All steps are uniform in cost – no step is significantly more expensive than others. Halving the step count halves the LLM time almost exactly.
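That linearity reduces to a one-line cost model. Numbers come from the tables above; this is a back-of-envelope check, not an exact predictor:

```python
def decode_ms(steps: int, per_step_ms: float) -> float:
    """Masked diffusion runs one full forward pass per step, so
    decode time is simply steps x per-step cost for a given
    sequence length."""
    return steps * per_step_ms

# Medium text: ~16ms/step at 465 tokens. 16 steps predicts ~256ms,
# close to the measured 264ms stage-5 time.
```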
Key findings
- LLM forward passes = 87–91% of total time. Everything else is rounding error.
- Text tokenization: <1ms. Negligible even for long text.
- Duration estimation: <0.1ms. Rule-based, no neural network.
- CFG + token selection: 5–8%. Classifier-free guidance math, Gumbel sampling, top-k.
- Audio decode (vocoder): 4–21ms. HiggsAudioV2 on GPU. Scales with audio duration.
- Post-process: 3–24ms. Silence removal via pydub (CPU). Linear with audio length.
- Optimization target: Reducing per-step LLM cost (quantization, flash attention, KV caching) would yield near-linear total speedup.
Test 6: First-Chunk-Optimized Streaming
Best strategy: split at first sentence, generate short first chunk for minimum TTFA, generate rest while first chunk plays.
| Steps | TTFA | First chunk plays | Rest gen time | Gap? |
|---|---|---|---|---|
| 32 | 335ms | 7.00s | 700ms | No — rest ready 6.3s early |
| 16 | 169ms | 7.00s | 357ms | No — rest ready 6.6s early |
| 8 | 87ms | 7.00s | 185ms | No — rest ready 6.8s early |
Even for the longest text (488 chars, 29s audio), there is zero playback gap at any step count. The first chunk produces 7s of audio, providing a massive buffer window. Margin is 6–7 seconds – enough to absorb network jitter, encoding, and client buffering.
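The "no gap" condition is simple arithmetic: playback stays continuous whenever the remaining chunks finish generating before the first chunk finishes playing. A sketch of that check:

```python
def playback_gap_s(first_chunk_audio_s: float, rest_gen_s: float) -> float:
    """Zero gap when the rest of the audio is ready before the first
    chunk's playback ends; otherwise the listener waits the difference."""
    return max(0.0, rest_gen_s - first_chunk_audio_s)
```

With the 16-step numbers above (7.00s first chunk, 357ms rest), the margin is about 6.6 seconds.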
Test 7: GPU Memory Profile
| Scenario | Peak VRAM |
|---|---|
| Model loaded (idle) | 5.41 GB |
| 1 inference | 5.61 GB |
| 3 concurrent inferences | 5.87 GB |
| Headroom on 32 GB | 26.1 GB free |
| Estimated max concurrent | ~5 |
Incremental cost per concurrent inference is ~150 MB. Substantial headroom for additional model instances or concurrent requests.
OmniVoice vs Chatterbox
| Metric | Chatterbox | OmniVoice (16 steps) | Winner |
|---|---|---|---|
| TTFA | ~250ms | 169ms | OmniVoice |
| RTF (medium text) | 0.05–0.10 | 0.020 | OmniVoice |
| Streaming type | True token-level | Chunk-level | Chatterbox |
| Playback gaps | None | None (7s buffer) | Tie |
| Voice quality | Good | Excellent | OmniVoice |
| Voice cloning | Embedding conditioning | Ref audio + text | OmniVoice |
| Languages | Polish + limited | 600+ | OmniVoice |
| VRAM usage | 4–6 GB | 5.6 GB | Tie |
| Concurrent users | 3–4 with slot pool | 3–5 on single GPU | Tie |
Chatterbox vs OmniVoice – Side-by-Side (weronika voice)
Same text, same voice (weronika), same GPU. Chatterbox generated via production server. OmniVoice at 16 steps (recommended) and 32 steps (best quality).
Tiny – “Dzien dobry.”
Chatterbox
OmniVoice 16 steps
Short – News sentence (66 chars)
Chatterbox
OmniVoice 16 steps
Medium – PLL LOT incident (196 chars)
Chatterbox
OmniVoice 16 steps
Chatterbox (same)
OmniVoice 32 steps
Long – Full news story (488 chars)
Chatterbox
OmniVoice 16 steps
Chatterbox (same)
OmniVoice 32 steps
OmniVoice Internal Comparisons
Step count and chunking tradeoffs within OmniVoice.
32 vs 16 steps (medium text)
32 steps (reference)
16 steps (recommended)
Baseline vs Chunked (medium text, 16 steps)
Baseline (single generation)
Chunked (2 chunks, streaming sim)
8-step quality floor (medium text)
32 steps (best quality)
8 steps (fastest, 87ms TTFA)
Optimized streaming vs Baseline (long text)
Baseline 16 steps (single shot)
Optimized 16 steps (chunked stream)
Paralinguistics: Non-Verbal Sound Tags
OmniVoice supports expressive non-verbal tags embedded directly in text. All samples: Polish, voice-cloned (weronika), 16 steps.
Supported tags: [laughter], [sigh], [confirmation-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn]
| Tag | Text | Gen | Audio |
|---|---|---|---|
| [laughter] | "...to sie stalo [laughter] naprawde nie moge." | 156ms | 3.91s |
| [sigh] | "No coz [sigh] trzeba bylo to przewidziec." | 150ms | 2.41s |
| [confirmation-en] | "[confirmation-en] tak, dokladnie o to mi chodzilo." | 152ms | 2.77s |
| [question-ah] | "Naprawde tak uwazasz [question-ah] bo ja mam watpliwosci." | 153ms | 3.68s |
| [question-oh] | "[question-oh] a to ciekawe, kiedy to sie stalo?" | 151ms | 2.74s |
| [question-ei] | "Mowisz powaznie [question-ei] nie zartujesz?" | 147ms | 2.71s |
| [surprise-ah] | "[surprise-ah] nie spodziewalam sie tego!" | 146ms | 2.61s |
| [surprise-oh] | "[surprise-oh] to niesamowite co sie wydarzylo." | 151ms | 2.76s |
| [surprise-wa] | "[surprise-wa] ale rewelacja, nie do wiary!" | 150ms | 2.30s |
| [surprise-yo] | "Wygralismy konkurs [surprise-yo] fantastycznie!" | 148ms | 3.08s |
| [dissatisfaction-hnn] | "[dissatisfaction-hnn] no nie wiem, to mnie nie przekonuje." | 155ms | 3.38s |
| Mixed (4 tags) | "...co sie stalo [question-ah] ...wygrali [surprise-oh] ...stracili [sigh] ...[dissatisfaction-hnn] trzeba bylo..." | 247ms | 10.44s |
| No tags (control) | "Nie moge uwierzyc, ze to sie stalo, naprawde nie moge." | 153ms | 3.40s |
All tags produce distinct non-verbal sounds at the marked positions. The mixed sample (4 tags, 10.44s) demonstrates natural flow between speech and emotions. Generation time stays consistent (~150ms) regardless of tag count; the mixed sample takes longer (247ms) only because the text itself is longer.
Conclusion
OmniVoice is viable for real-time streaming TTS and outperforms Chatterbox on raw speed metrics. The masked-diffusion architecture prevents true token-level streaming, but sentence-level chunking achieves 169ms TTFA at 16 steps with zero playback gaps. Combined with excellent voice quality, 600+ language support, and low VRAM footprint, OmniVoice is a strong candidate for production TTS deployment.
For consulting on real-time TTS integration, streaming architecture, or AI/ML engineering, contact Folx.