Releases: fishaudio/fish-speech
Fish Audio S2 Beta
Fish Audio S2 - Pre-Release
The best text-to-speech system among both open-source and closed-source models.
Trained on 10M+ hours of audio across ~50 languages, S2 combines a Dual-AR architecture (Qwen3 backbone) with GRPO reinforcement learning alignment to produce natural, emotionally rich speech with fine-grained inline control.
Technical Report · Blog · Model · Playground
Model
| Variant | Params | Codec | Output |
|---|---|---|---|
| S2-Pro | 4B (slow) + 400M (fast) | ModifiedDAC, 10 codebooks, ~21 Hz | 44.1 kHz |
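The codec figures in the table imply a few useful rates. The following is a minimal sketch of that arithmetic, assuming the frame rate is exactly 21 Hz (the table only says "~21 Hz") and that every frame carries one code per codebook; the function names are illustrative and not part of fish-speech.

```python
import math

# Figures taken from the table above; the frame rate is approximate ("~21 Hz").
SAMPLE_RATE_HZ = 44_100
FRAME_RATE_HZ = 21
NUM_CODEBOOKS = 10


def tokens_per_second() -> int:
    """Codec codes needed per second of audio (all codebooks combined)."""
    return FRAME_RATE_HZ * NUM_CODEBOOKS


def samples_per_frame() -> float:
    """Output audio samples represented by one codec frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_HZ


def frames_for(seconds: float) -> int:
    """Number of codec frames needed to cover a clip, rounded up."""
    return math.ceil(seconds * FRAME_RATE_HZ)


if __name__ == "__main__":
    print(tokens_per_second())  # codes per second of audio
    print(samples_per_frame())  # samples covered by one frame
    print(frames_for(10))       # frames for a 10-second clip
```

Under these assumptions, one second of 44.1 kHz audio corresponds to 21 frames of 10 codes each, so each frame stands for 2100 output samples.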
Highlights
- Dual-AR: Slow AR (4B) predicts semantic codebook along time axis; Fast AR (400M) fills 9 residual codebooks per step
- Inline Control: Free-form tags like `[laugh]`, `[whispers]`, `[super happy]` at word level
- RL Alignment: GRPO with unified data-reward pipeline, using the same model for data filtering and RL reward
- SGLang Streaming: RTF 0.195, TTFA ~100ms, 3000+ tokens/s on single H200
- 50+ Languages, multi-speaker (`<|speaker:i|>`), multi-turn, rapid voice cloning (10-30s reference)
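To make the inline-control syntax concrete, here is a small sketch that locates bracketed tags in an input string. It is only an illustration of the tag format shown above, not the model's actual tokenizer or text front end; `find_inline_tags` and `strip_inline_tags` are hypothetical helper names.

```python
import re

# Matches a bracketed, non-nested tag such as [laugh] or [super happy].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def find_inline_tags(text: str) -> list[tuple[str, int]]:
    """Return (tag text, character offset) for each bracketed tag."""
    return [(m.group(1), m.start()) for m in TAG_RE.finditer(text)]


def strip_inline_tags(text: str) -> str:
    """Remove the tags and collapse the whitespace they leave behind."""
    without_tags = TAG_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


if __name__ == "__main__":
    line = "I can't believe it [laugh] that was [super happy] amazing"
    print([tag for tag, _ in find_inline_tags(line)])
    print(strip_inline_tags(line))
```

A helper like this can be handy for checking prompts before synthesis, for example to list which tags a script uses or to recover the plain transcript.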
What's Changed
Model & Inference
- New Dual-AR architecture with Qwen3 backbone, replacing Fish-Speech v1.5
- New `ModifiedDAC` audio codec (replaces Firefly/VQ-GAN)
- Support `fish_qwen3_omni` checkpoint format (sharded safetensors) with backward compatibility
- Fixed: torch.compile bugs, GPU memory leak, audio quality issues
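The Dual-AR split listed under Highlights can be pictured as two nested loops: an outer loop over time steps (slow AR) and an inner loop over the residual codebooks (fast AR). The toy sketch below uses random stand-ins instead of real models and only shows the shape of the output. The codebook size, the function names, and the assumption that residual codes are emitted one at a time are all hypothetical and not taken from the fish-speech code.

```python
import random

NUM_RESIDUAL = 9      # from the release notes: 9 residual codebooks per step
CODEBOOK_SIZE = 1024  # hypothetical value, for illustration only


def slow_step(frames: list[list[int]], rng: random.Random) -> int:
    """Stand-in for the slow AR: one semantic code for the next time step."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_step(semantic: int, residuals: list[int], rng: random.Random) -> int:
    """Stand-in for the fast AR: the next residual code within one step."""
    return rng.randrange(CODEBOOK_SIZE)


def generate(num_steps: int, seed: int = 0) -> list[list[int]]:
    """Produce num_steps frames of 1 semantic + 9 residual codes each."""
    rng = random.Random(seed)
    frames: list[list[int]] = []
    for _ in range(num_steps):                 # outer loop: time axis
        semantic = slow_step(frames, rng)
        residuals: list[int] = []
        for _ in range(NUM_RESIDUAL):          # inner loop: codebook axis
            residuals.append(fast_step(semantic, residuals, rng))
        frames.append([semantic] + residuals)
    return frames


if __name__ == "__main__":
    frames = generate(num_steps=3)
    print(len(frames), len(frames[0]))
```

Each generated frame holds 10 codes, matching the 10 codebooks of the codec; in the real system the two stand-in functions would be the 4B and 400M networks conditioned on the input text.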
Docker & Deployment
- Docker overhaul: multi-target builds, compose support, health checks, non-root user
- SGLang server integration
API & Server
- Reference voice management API (CRUD), multipart upload support
- Various server bug fixes, `/health` endpoint
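A client can poll the `/health` endpoint before sending synthesis requests. The sketch below is a minimal probe using only the Python standard library. It assumes the endpoint answers a plain GET with a 2xx status when the server is ready; the release notes do not describe the response body, and the host and port in the example are placeholders.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if GET <base_url>/health answers with a 2xx status.

    `opener` defaults to urllib.request.urlopen and can be swapped out
    in tests; any connection error or HTTP error counts as unhealthy.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Placeholder address; use wherever the API server actually listens.
    print(is_healthy("http://localhost:8080"))
```

The same check fits naturally into a container health check or a startup wait loop.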
Finetune
- Full finetune pipeline for S1/S2 (datasets, training, LoRA merge)
Docs & Infra
- README & MkDocs rewritten for S2 across 6 languages
- License updated to Fish Audio Research License
- Removed legacy code (Firefly VQ-GAN, SenseVoice, Fish Agent, old batch files)
V1.5.1
V1.5.0
V1.4.3
V1.4.2
What's Changed
- Add Audio Select to WebUI by @PoTaTo-Mika in #556
- Fix cache max_seq_len by @AnyaCoder in #568
- docs: Docker icon is missing in zh-cn README & ja README displays that it is in English & properer expression “简体中文” by @Octopus058 in #569
- docs: Corrected the wrong expressions of supported languages in README by @Octopus058 in #574
- Api json format by @AnyaCoder in #588
- Update v1.4 readmes & samples by @AnyaCoder in #592
- [chore] add docs for macos by @Tps-F in #544
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #599
- chore: typo fix on post_api by @bjwswang in #605
- feat: enable more workers in `api.py` by @AnyaCoder in #621
- Fix broken `remove_parameterization` in firefly by @med1844 in #620
- Fix dockerfile by @AnyaCoder in #622
- Fix dockerfile for `pyaudio` by @AnyaCoder in #623
- Update docs by @AnyaCoder in #626
- Fix backend by @AnyaCoder in #627
- Update docs by @AnyaCoder in #638
New Contributors
- @Octopus058 made their first contribution in #569
- @bjwswang made their first contribution in #605
- @med1844 made their first contribution in #620
Full Changelog: v1.4.1...v1.4.2
V1.4.1
Fish Speech V1.4 Release
Fish Speech V1.4 is a leading TTS model trained on 700k hours of audio data in multiple languages.
Supported languages:
- English (en) ~300k hours
- Chinese (zh) ~300k hours
- German (de) ~20k hours
- Japanese (ja) ~20k hours
- French (fr) ~20k hours
- Spanish (es) ~20k hours
- Korean (ko) ~20k hours
- Arabic (ar) ~20k hours
Have fun :)
V1.2.1
Fish Speech V1.2 Release
In this release, we roll out both the 1.2 pretrained and SFT models, and also add support for auto-reranking for more stable generation.