OmniVoice Real-Time TTS for Polish
Benchmarking OmniVoice against Chatterbox on RTX 5090 — latency, streaming, voice cloning, and audio quality compared head-to-head.
The Models
OmniVoice
by k2-fsa · Apache-2.0 · arXiv:2604.00688
State-of-the-art massively multilingual zero-shot TTS supporting 600+ languages — the broadest language coverage among zero-shot TTS models. Uses a diffusion language model architecture with an 8-layer hierarchical audio codebook for high-fidelity synthesis.
- Architecture: Qwen3-0.6B backbone (~600M params), masked diffusion iterative decoding
- Voice cloning: Zero-shot from reference audio + transcript
- Voice design: Control gender, age, pitch, accent, dialect, whisper via text instructions
- Paralinguistics: `[laughter]`, `[breath]` tags and pronunciation correction via pinyin/phonemes
- Speed: RTF as low as 0.025 (40x real-time)
- Available on: HuggingFace, CLI (`omnivoice-infer`), Gradio demo
Chatterbox
by Resemble AI · MIT License
Family of three open-source TTS models (Chatterbox, Chatterbox-Multilingual, Chatterbox-Turbo) designed for natural speech generation with voice cloning. Built-in Perth watermarking for AI audio detection.
- Architecture: Autoregressive speech tokenizer + mel-spectral decoder, 350M–500M params
- Voice cloning: Zero-shot from reference audio
- Turbo variant: 350M params, distilled to 1-step decoding for low-latency voice agents
- Paralinguistics: Native `[laugh]`, `[cough]`, `[chuckle]` tags
- Languages: 23+ (English, Spanish, French, German, Polish, Chinese, Japanese, Arabic, Hindi, etc.)
- Watermarking: Perth watermarks survive MP3 compression with ~100% detection accuracy
- Available on: `pip install chatterbox-tts`, HuggingFace
Architecture Overview
| | Chatterbox | OmniVoice |
|---|---|---|
| Model type | Autoregressive (token-by-token) | Masked diffusion (iterative unmasking) |
| Backbone | Custom T3 decoder | Qwen3-0.6B LLM (~600M params) |
| Audio codec | Single codebook stream | 8-layer hierarchical (8×1025 tokens) |
| Streaming | True per-token streaming | No native streaming — full sequence per call |
| Voice cloning | Embedding conditioning | Reference audio tokenization + prefix |
| Languages | Polish + basic multilingual | 600+ languages native |
Key Architectural Difference
Chatterbox is real-time because it streams token-by-token: the autoregressive decoder emits tokens sequentially, each decoded to audio and sent immediately. TTFA (time to first audio) is just the time to generate the first few tokens.
OmniVoice uses masked diffusion – all audio tokens start as [MASK] and are iteratively unmasked over N steps. Each step runs a full forward pass over the entire sequence. Tokens are revealed by confidence score, not position. Partial audio cannot be decoded mid-generation.
However, OmniVoice’s raw inference speed is so fast (RTF 0.01–0.04) that text-level chunking achieves comparable or better TTFA than Chatterbox’s token streaming.
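The chunking strategy is straightforward to sketch. A minimal version of the sentence-level splitter is below; the regex boundary rule and the `max_chars` merge policy are illustrative assumptions, not OmniVoice's actual preprocessing:

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries, merging adjacent short
    sentences up to max_chars so each chunk amortizes the fixed
    per-call overhead over more audio."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then passed to the model as an independent generation call, with the first chunk's completion time serving as the effective TTFA.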
Test 1: Baseline Latency
Full generation, voice cloned, no chunking. Total wall time measured from the `generate()` call to the returned tensor.
| Text | Chars | Audio | 32 steps | 16 steps | 8 steps |
|---|---|---|---|---|---|
| tiny | 12 | 1.2–1.5s | 269ms (RTF 0.222) | 136ms (RTF 0.094) | 72ms (RTF 0.054) |
| short | 66 | 4.3s | 298ms (RTF 0.069) | 151ms (RTF 0.035) | 80ms (RTF 0.019) |
| medium | 196 | 12.6s | 503ms (RTF 0.040) | 255ms (RTF 0.020) | 137ms (RTF 0.011) |
| long | 488 | 28.5–28.9s | 849ms (RTF 0.030) | 446ms (RTF 0.015) | 255ms (RTF 0.009) |
- RTF improves with longer text (fixed overhead amortized over more audio).
- 16 steps is the sweet spot: ~2x faster than 32, quality degradation is minimal.
- At 8 steps, RTF drops to 0.009–0.054 – 18–110x faster than real-time.
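The RTF figures above follow from a simple timing harness. This sketch assumes `generate` is any callable returning a 1-D waveform at a known sample rate (24 kHz is an assumption, not a confirmed OmniVoice output rate); RTF below 1.0 means faster than real time:

```python
import time

def measure_rtf(generate, text: str, sample_rate: int = 24000) -> dict:
    """Wall-clock a single generation call and derive the
    real-time factor: wall time divided by audio duration."""
    t0 = time.perf_counter()
    wav = generate(text)
    wall_s = time.perf_counter() - t0
    audio_s = len(wav) / sample_rate
    return {"wall_s": wall_s, "audio_s": audio_s, "rtf": wall_s / audio_s}
```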
Test 2: Chunked Streaming (TTFA Measurement)
Simulates streaming: split text at sentence boundaries, generate each chunk independently, measure time to first completed chunk.
Medium text (196 chars, 2 chunks)
| Steps | TTFA (chunk 0) | Chunk 0 audio | Chunk 1 gen | Chunk 1 audio | Total |
|---|---|---|---|---|---|
| 32 | 335ms | 7.00s | 327ms | 5.80s | 662ms |
| 16 | 169ms | 7.00s | 165ms | 5.80s | 333ms |
| 8 | 88ms | 7.00s | 86ms | 5.80s | 173ms |
Long text (488 chars, 5 chunks)
| Steps | TTFA (chunk 0) | Chunks | Total gen | Total audio | RTF |
|---|---|---|---|---|---|
| 32 | 331ms | 5 | 1605ms | 29.6s | 0.054 |
| 16 | 168ms | 5 | 813ms | 29.6s | 0.027 |
| 8 | 87ms | 5 | 423ms | 29.6s | 0.014 |
Per-chunk breakdown (long text, 16 steps):
| Chunk | Gen time | Audio | Content |
|---|---|---|---|
| 0 | 168ms | 7.00s | "Powazny blad w obiegu dokumentow..." |
| 1 | 165ms | 5.80s | "Przez pomylke dokumentacja..." |
| 2 | 150ms | 4.08s | "Incydent zostal zgloszony..." |
| 3 | 171ms | 7.44s | "Linia lotnicza przeprosila..." |
| 4 | 159ms | 5.32s | "Zwiazki zawodowe domagaja sie..." |
- At 16 steps, TTFA is 169ms – 32% faster than Chatterbox’s typical 250ms.
- Each chunk generates ~4–7s of audio in ~150–170ms. The playback buffer is enormous.
- Chunking overhead is minimal: total gen time ~30% higher than single-shot, but the streaming benefit far outweighs it.
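The measurement loop behind these tables can be sketched as follows; sequential generation is an assumption matching the simulation described above (chunks generated one after another, playback starting as soon as chunk 0 exists):

```python
import time

def generate_chunked(generate, chunks):
    """Generate sentence chunks sequentially. TTFA is the wall time
    until the first chunk's audio exists; later chunks are generated
    while earlier ones would already be playing."""
    t0 = time.perf_counter()
    waveforms, ttfa = [], None
    for i, chunk in enumerate(chunks):
        waveforms.append(generate(chunk))
        if i == 0:
            ttfa = time.perf_counter() - t0
    return ttfa, waveforms
```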
Test 3: Voice Clone Prompt Caching
`create_voice_clone_prompt()` pre-encodes reference audio into reusable tokens.
| Mode | Generation time |
|---|---|
| Raw ref_audio path (re-encodes each call) | 264ms |
| Pre-cached VoiceClonePrompt | 255ms |
| Prompt creation cost | 37ms (one-time) |
| Per-call savings | 9ms (3%) |
Prompt encoding is already fast (37ms). Caching is still worthwhile for a server to avoid redundant re-encoding.
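A server-side cache for clone prompts is a few lines. Here `create_prompt` stands in for `create_voice_clone_prompt()`; the exact call signature is an assumption:

```python
def get_clone_prompt(cache: dict, ref_audio_path: str, create_prompt):
    """Memoize pre-encoded voice-clone prompts per reference file,
    so a server pays the one-time encoding cost (~37ms measured
    above) once per voice rather than once per request."""
    if ref_audio_path not in cache:
        cache[ref_audio_path] = create_prompt(ref_audio_path)
    return cache[ref_audio_path]
```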
Test 4: Concurrent Inference
Same text, 3 requests
| Mode | Wall time | Per-request | Speedup |
|---|---|---|---|
| Sequential | 767ms | 255ms each | 1.0x |
| 3x concurrent (thread pool + CUDA streams) | 462ms | 436–460ms each | 1.66x |
Mixed text lengths, 3 concurrent
| Request | Text | Latency | Audio |
|---|---|---|---|
| 0 | tiny (12 chars) | 408ms | 1.51s |
| 1 | medium (196 chars) | 502ms | 12.58s |
| 2 | long (488 chars) | 555ms | 28.61s |
| Wall time (all 3) | | 558ms | |
- GIL + shared model weights limit true parallelism to ~1.66x speedup for 3 threads.
- Individual request latency increases ~1.8x under contention.
- Chatterbox's `InferenceSlot` + `SlotPool` pattern would improve this further.
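The concurrent setup can be sketched with a thread pool over one shared model. In the actual benchmark each worker also enters its own `torch.cuda.Stream` context so kernels from different requests can overlap on the GPU; that detail is omitted here to keep the sketch self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def serve_concurrent(generate, texts, max_workers=3):
    """Fan several requests out over a thread pool that shares one
    model instance. Results come back in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, texts))
```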
Test 5: Pipeline Breakdown
Detailed instrumentation of every pipeline stage, averaged over 5 runs per configuration. RTX 5090, float16, voice-cloned.
Where the milliseconds go
The OmniVoice pipeline has 7 stages. Stages 1–4 (preprocessing) are negligible. Stage 5 (iterative decode) dominates at 87–97% of total time.
Short = 66 chars, 4.4s audio; medium = 196 chars, 12.6s audio.
| Stage | Short, 32 steps | Short, 16 steps | Short, 8 steps | Medium, 32 steps | Medium, 16 steps | Medium, 8 steps |
|---|---|---|---|---|---|---|
| 1. Preprocess/resolve | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms |
| 2. Duration estimation | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms | 0.0ms |
| 3. Text tokenize + inputs | 0.4ms | 0.4ms | 0.4ms | 0.5ms | 0.6ms | 0.5ms |
| 4. Batch construct (CFG) | 0.2ms | 0.2ms | 0.2ms | 0.2ms | 0.2ms | 0.2ms |
| 5. Iterative decode | 357ms | 182ms | 88ms | 521ms | 264ms | 130ms |
| 5a. LLM forward passes | 330ms | 166ms | 82ms | 491ms | 246ms | 122ms |
| 5b. CFG + token selection | 27ms | 16ms | 7ms | 30ms | 17ms | 7ms |
| 6. Audio decode (vocoder) | 5.5ms | 3.7ms | 3.6ms | 12.3ms | 8.6ms | 8.3ms |
| 7. Post-process | 3.3ms | 3.2ms | 3.1ms | 9.9ms | 9.8ms | 9.7ms |
| TOTAL | 367ms | 190ms | 96ms | 544ms | 283ms | 148ms |
Long text (488 chars, 28.5s audio, 950 tokens)
| Stage | 32 steps | % of total |
|---|---|---|
| 5a. LLM forward passes | 818ms | 90.5% |
| 5b. CFG + token selection | 40ms | 4.5% |
| 6. Audio decode (vocoder) | 21ms | 2.3% |
| 7. Post-process | 24ms | 2.6% |
| TOTAL | 905ms | 100% |
Per-step cost
Each diffusion step runs a full Qwen3-0.6B forward pass over the entire sequence. Cost scales with sequence length:
| Text | Sequence length | Cost per step |
|---|---|---|
| Short (66 chars) | 220 tokens | ~11ms |
| Medium (196 chars) | 465 tokens | ~16ms |
| Long (488 chars) | 950 tokens | ~27ms |
All steps are uniform in cost – no step is significantly more expensive than others. Halving the step count halves the LLM time almost exactly.
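That linearity reduces to a one-line cost model. Numbers come from the tables above; this is a back-of-envelope check, not an exact predictor:

```python
def decode_ms(steps: int, per_step_ms: float) -> float:
    """Masked diffusion runs one full forward pass per step, so
    decode time is simply steps x per-step cost for a given
    sequence length."""
    return steps * per_step_ms

# Medium text: ~16ms/step at 465 tokens. 16 steps predicts ~256ms,
# close to the measured 264ms stage-5 time.
```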
Key findings
- LLM forward passes = 87–91% of total time. Everything else is rounding error.
- Text tokenization: <1ms. Negligible even for long text.
- Duration estimation: <0.1ms. Rule-based, no neural network.
- CFG + token selection: 5–8%. Classifier-free guidance math, Gumbel sampling, top-k.
- Audio decode (vocoder): 4–21ms. HiggsAudioV2 on GPU. Scales with audio duration.
- Post-process: 3–24ms. Silence removal via pydub (CPU). Linear with audio length.
- Optimization target: Reducing per-step LLM cost (quantization, flash attention, KV caching) would yield near-linear total speedup.
Test 6: First-Chunk-Optimized Streaming
Best strategy: split at first sentence, generate short first chunk for minimum TTFA, generate rest while first chunk plays.
| Steps | TTFA | First chunk plays | Rest gen time | Gap? |
|---|---|---|---|---|
| 32 | 335ms | 7.00s | 700ms | No — rest ready 6.3s early |
| 16 | 169ms | 7.00s | 357ms | No — rest ready 6.6s early |
| 8 | 87ms | 7.00s | 185ms | No — rest ready 6.8s early |
Even for the longest text (488 chars, 29s audio), there is zero playback gap at any step count. The first chunk produces 7s of audio, providing a massive buffer window. Margin is 6–7 seconds – enough to absorb network jitter, encoding, and client buffering.
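The "no gap" condition is simple arithmetic: playback stays continuous whenever the remaining chunks finish generating before the first chunk finishes playing. A sketch of that check:

```python
def playback_gap_s(first_chunk_audio_s: float, rest_gen_s: float) -> float:
    """Zero gap when the rest of the audio is ready before the first
    chunk's playback ends; otherwise the listener waits the difference."""
    return max(0.0, rest_gen_s - first_chunk_audio_s)
```

With the 16-step numbers above (7.00s first chunk, 357ms rest), the margin is about 6.6 seconds.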
Test 7: GPU Memory Profile
| Scenario | Peak VRAM |
|---|---|
| Model loaded (idle) | 5.41 GB |
| 1 inference | 5.61 GB |
| 3 concurrent inferences | 5.87 GB |
| Headroom on 32 GB | 26.1 GB free |
| Estimated max concurrent | ~5 |
Incremental cost per concurrent inference is ~150 MB. Substantial headroom for additional model instances or concurrent requests.
OmniVoice vs Chatterbox
| Metric | Chatterbox | OmniVoice (16 steps) | Winner |
|---|---|---|---|
| TTFA | ~250ms | 169ms | OmniVoice |
| RTF (medium text) | 0.05–0.10 | 0.020 | OmniVoice |
| Streaming type | True token-level | Chunk-level | Chatterbox |
| Playback gaps | None | None (7s buffer) | Tie |
| Voice quality | Good | Excellent | OmniVoice |
| Voice cloning | Embedding conditioning | Ref audio + text | OmniVoice |
| Languages | Polish + limited | 600+ | OmniVoice |
| VRAM usage | 4–6 GB | 5.6 GB | Tie |
| Concurrent users | 3–4 with slot pool | 3–5 on single GPU | Tie |
Chatterbox vs OmniVoice – Side-by-Side (weronika voice)
Same text, same voice (weronika), same GPU. Chatterbox generated via production server. OmniVoice at 16 steps (recommended) and 32 steps (best quality).
Tiny – “Dzien dobry.”
Chatterbox
OmniVoice 16 steps
Short – News sentence (66 chars)
Chatterbox
OmniVoice 16 steps
Medium – PLL LOT incident (196 chars)
Chatterbox
OmniVoice 16 steps
Chatterbox (same)
OmniVoice 32 steps
Long – Full news story (488 chars)
Chatterbox
OmniVoice 16 steps
Chatterbox (same)
OmniVoice 32 steps
OmniVoice Internal Comparisons
Step count and chunking tradeoffs within OmniVoice.
32 vs 16 steps (medium text)
32 steps (reference)
16 steps (recommended)
Baseline vs Chunked (medium text, 16 steps)
Baseline (single generation)
Chunked (2 chunks, streaming sim)
8-step quality floor (medium text)
32 steps (best quality)
8 steps (fastest, 87ms TTFA)
Optimized streaming vs Baseline (long text)
Baseline 16 steps (single shot)
Optimized 16 steps (chunked stream)
Paralinguistics: Non-Verbal Sound Tags
OmniVoice supports expressive non-verbal tags embedded directly in text. All samples: Polish, voice-cloned (weronika), 16 steps.
Supported tags: [laughter], [sigh], [confirmation-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn]
| Tag | Text | Gen | Audio |
|---|---|---|---|
| [laughter] | "...to sie stalo [laughter] naprawde nie moge." | 156ms | 3.91s |
| [sigh] | "No coz [sigh] trzeba bylo to przewidziec." | 150ms | 2.41s |
| [confirmation-en] | "[confirmation-en] tak, dokladnie o to mi chodzilo." | 152ms | 2.77s |
| [question-ah] | "Naprawde tak uwazasz [question-ah] bo ja mam watpliwosci." | 153ms | 3.68s |
| [question-oh] | "[question-oh] a to ciekawe, kiedy to sie stalo?" | 151ms | 2.74s |
| [question-ei] | "Mowisz powaznie [question-ei] nie zartujesz?" | 147ms | 2.71s |
| [surprise-ah] | "[surprise-ah] nie spodziewalam sie tego!" | 146ms | 2.61s |
| [surprise-oh] | "[surprise-oh] to niesamowite co sie wydarzylo." | 151ms | 2.76s |
| [surprise-wa] | "[surprise-wa] ale rewelacja, nie do wiary!" | 150ms | 2.30s |
| [surprise-yo] | "Wygralismy konkurs [surprise-yo] fantastycznie!" | 148ms | 3.08s |
| [dissatisfaction-hnn] | "[dissatisfaction-hnn] no nie wiem, to mnie nie przekonuje." | 155ms | 3.38s |
| Mixed (4 tags) | "...co sie stalo [question-ah] ...wygrali [surprise-oh] ...stracili [sigh] ...[dissatisfaction-hnn] trzeba bylo..." | 247ms | 10.44s |
| No tags (control) | "Nie moge uwierzyc, ze to sie stalo, naprawde nie moge." | 153ms | 3.40s |
All tags produce distinct non-verbal sounds at the marked positions. The mixed sample (4 tags, 10.44s) demonstrates natural flow between speech and emotions. Generation time stays consistent (~150ms) regardless of tag count; the mixed sample takes longer (247ms) only because the text itself is longer.
Conclusion
OmniVoice is viable for real-time streaming TTS and outperforms Chatterbox on raw speed metrics. The masked-diffusion architecture prevents true token-level streaming, but sentence-level chunking achieves 169ms TTFA at 16 steps with zero playback gaps. Combined with excellent voice quality, 600+ language support, and low VRAM footprint, OmniVoice is a strong candidate for production TTS deployment.
For consulting on real-time TTS integration, streaming architecture, or AI/ML engineering, contact Folx.