Fish Audio just ran a ten-day blind A/B test pitting its S2 Pro against ElevenLabs V3, MiniMax, Inworld, and its own older S1 model. Over five thousand cross-provider pairs, evaluated by real users making actual download choices — not subjective ratings on a Likert scale. S2 Pro walked away with a 65.7% win rate. ElevenLabs V3 landed at 40.6%.

The Numbers

The test ran March 26 through April 5. Fish Audio spent over $2,000 on third-party API fees alone to generate competitor samples, then surfaced them anonymously to users alongside its own. The Bradley-Terry scores tell the story:

| Model | BT Score | Win Rate |
|---|---|---|
| Fish Audio S2 Pro | 3.07 | 65.7% |
| Fish Audio S1 | 1.86 | 41.0% |
| ElevenLabs V3 | 1.80 | 40.6% |
| ElevenLabs Multilingual V2 | 1.35 | 36.2% |
| ElevenLabs 2.5 Flash | 1.00 | 29.8% |
| Inworld TTS 1.5 Max | 0.59 | 20.1% |
| MiniMax Speech 2.8 HD | 0.12 | 5.0% |
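Bradley-Terry scores convert directly into predicted head-to-head win probabilities: p(A beats B) = s_A / (s_A + s_B). A quick sanity check against the published scores, under the assumption that they are reported as strengths rather than log-strengths:

```python
def bt_win_prob(s_a: float, s_b: float) -> float:
    """Bradley-Terry win probability, treating scores as strengths."""
    return s_a / (s_a + s_b)

# S2 Pro (3.07) vs ElevenLabs V3 (1.80)
p = bt_win_prob(3.07, 1.80)
print(f"{p:.1%}")  # ~63%, consistent with the reported 60-40 split
```

That the model-implied probability lands near the observed 581-pair split is a useful consistency check on the fit.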

Head-to-head against ElevenLabs V3 specifically, the preference split 60-40 across 581 paired comparisons. Not a blowout in English — but a clear, statistically significant lead. Separately, on EmergentTTS-Eval against GPT-4o-mini-tts as baseline, S2 Pro posted an 81.88% overall win rate, with its strongest showing in paralinguistics at 91.61%.

Dual-AR: The Architecture That Got It There

S2 Pro runs a Dual-Autoregressive setup on a decoder-only transformer with an RVQ-based audio codec — 10 codebooks at roughly 21 Hz frame rate. The "Slow AR" (4B parameters) handles the primary semantic codebook along the time axis: it decides what to say and when. The "Fast AR" (400M parameters) fills in 9 residual codebooks at each timestep for acoustic texture — the breaths, the vocal fry, the micro-pauses that make speech sound human.
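The split of labor is easiest to see in pseudocode. This is a minimal sketch of a dual-AR decode loop over RVQ codebooks; the model calls are random stand-ins, since S2 Pro's internals are not public:

```python
import random

N_RESIDUAL = 9       # fast-AR codebooks per frame
FRAME_RATE_HZ = 21   # approximate codec frame rate
VOCAB = 1024         # illustrative codebook size

def slow_ar_step(history):
    """Stand-in for the 4B slow AR: one semantic token per frame."""
    return random.randrange(VOCAB)

def fast_ar_fill(semantic_token):
    """Stand-in for the 400M fast AR: residual acoustic tokens for one frame."""
    return [random.randrange(VOCAB) for _ in range(N_RESIDUAL)]

def decode(seconds: float):
    frames = []
    for _ in range(int(seconds * FRAME_RATE_HZ)):
        sem = slow_ar_step(frames)                 # time axis: what and when
        frames.append([sem] + fast_ar_fill(sem))   # depth axis: acoustic texture
    return frames

audio_codes = decode(2.0)  # 2 s of audio -> 42 frames of 10 codes each
```

The design point: the expensive model runs once per frame along the time axis, while the cheap model runs across the 9 residual codebooks, so depth doesn't multiply the cost of the large transformer.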

The training corpus behind this: over 10 million hours of audio across approximately 50 languages. But the architectural trick worth highlighting is the reward model pipeline. Fish Audio reused the same models that filter and annotate training data as reward models during GRPO reinforcement learning. This eliminates the distribution mismatch that usually plagues post-training — the model gets optimized against the same quality signal that curated its pre-training data in the first place.
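GRPO's central move is group-relative advantage: sample several utterances per prompt, score each with the reward model, and normalize rewards within the group. A sketch of that normalization step (the reward values here are made up for illustration):

```python
def group_advantages(rewards):
    """GRPO-style advantages: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled utterances scored by the data-filtering reward model
advs = group_advantages([0.9, 0.7, 0.4, 0.2])
```

Because the same scorer both filtered the pre-training corpus and produces these rewards, the policy is pushed toward the exact quality distribution it was trained on, which is the mismatch-elimination point made above.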

Emotion control works through inline natural language tags. Not fixed categories like [happy] or [sad], but free-form descriptions: [whisper in small voice], [professional broadcast tone], [pitch up]. You embed these directly in the input text and the model adjusts prosody at the word level. It's more expressive than a dropdown menu of six emotions, and it actually works — though results get unpredictable when you stack three or four tags in a single sentence.
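Since the tags are just bracketed free-form spans embedded in the input string, they're trivial to construct or strip programmatically. A small illustration; the specific tag strings are examples, not a fixed vocabulary:

```python
import re

TAG_RE = re.compile(r"\[([^\[\]]+)\]")

text = "[whisper in small voice] Don't wake him. [pitch up] He's finally asleep."

tags = TAG_RE.findall(text)           # ['whisper in small voice', 'pitch up']
plain = TAG_RE.sub("", text).strip()  # the same text with tags removed
```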

Where Fish Wins Big — and Where ElevenLabs Still Has an Edge

The language breakdown is where the real gap opens up. In Chinese, S2 Pro scored 8.11 to ElevenLabs V3's 2.36, a nearly 3.5x advantage. Japanese also went clearly to S2 Pro, 3.12 versus 1.88. If you're building anything multilingual or targeting East Asian markets, this isn't a marginal improvement. It's a different league.

For Latin-script languages, particularly English, the competitive distance shrinks. ElevenLabs V3 still holds its own for long-form narration where emotional depth and sustained prosody consistency matter — audiobooks, character voices, documentary work. The 60-40 preference split in the blind test is real, but it's not the kind of delta that forces studios already deep in an ElevenLabs workflow to rip out their integration tomorrow.

The Price Difference

Fish Audio's API runs $15 per million characters. ElevenLabs charges roughly $80 per million for comparable tiers. Voice cloning, streaming, multilingual support — all included in the same API call, no feature gating by plan. Latency on an H200: ~100ms time-to-first-audio, 3,000+ acoustic tokens per second, real-time factor of 0.195.
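The cost gap is easy to quantify from those rates, and the real-time factor translates directly into compute time per second of output. A back-of-envelope sketch using only the figures quoted above:

```python
FISH_PER_M_CHARS = 15.00    # USD per million characters
ELEVEN_PER_M_CHARS = 80.00  # USD per million characters, comparable tier

def cost(chars: int, rate_per_m: float) -> float:
    """Dollar cost of synthesizing `chars` characters at a per-million rate."""
    return chars / 1_000_000 * rate_per_m

chars = 5_000_000  # e.g. a season's worth of audiobook scripts
fish = cost(chars, FISH_PER_M_CHARS)      # 75.0
eleven = cost(chars, ELEVEN_PER_M_CHARS)  # 400.0

# Real-time factor = synthesis time / audio duration (lower is better).
# RTF 0.195 means ~19.5 s of compute per 100 s of generated audio.
rtf = 0.195
compute_seconds = rtf * 100
```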

Meanwhile, On Your Laptop

While Fish Audio dominates the cloud benchmark race, Neuphonic quietly shipped NeuTTS Air — a 748M-parameter model built on Qwen2 that runs entirely on-device. It ships in GGUF quantizations (Q4 and Q8), inference through llama.cpp, zero cloud dependency. Apache-2.0 licensed.

The claim: 3 seconds of reference audio to clone a voice, running on a CPU. Phones, laptops, Raspberry Pis. NeuCodec, their audio codec, operates at 50 Hz with a single codebook — aggressively efficient compared to S2 Pro's 10-codebook RVQ setup. The quality ceiling is obviously lower than a 4.4B cloud model trained on 10 million hours. But for voice assistants, accessibility tools, or anything that needs to function offline, having a usable voice clone running locally in real-time is a capability that didn't exist twelve months ago.
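The codec efficiency difference is concrete: the token rate a decoder must sustain is frame rate times codebook count. Comparing the two designs under the figures quoted above:

```python
def tokens_per_second(frame_rate_hz: float, codebooks: int) -> float:
    """Audio tokens the decoder must produce per second of speech."""
    return frame_rate_hz * codebooks

s2_pro = tokens_per_second(21, 10)   # 210 tokens/s (10-codebook RVQ)
neucodec = tokens_per_second(50, 1)  # 50 tokens/s (single codebook)

ratio = s2_pro / neucodec  # ~4.2x fewer tokens per second of speech
```

That roughly 4x reduction in token throughput is a large part of what makes real-time decoding on a phone or Raspberry Pi CPU plausible.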

Two Open Models, Two Bets

Fish Audio S2 Pro bets that the TTS crown belongs to whoever builds the best cloud inference stack. Neuphonic bets that the future of voice runs on the device in your pocket. Both published their weights under permissive licenses.

ElevenLabs remains the most polished product for English-first content studios. But it's a closed platform now competing against two open alternatives that are, respectively, five times cheaper and entirely free. That's a competitive position that tends to erode — ask anyone who sold proprietary image generation APIs in 2024.