Releases: fishaudio/fish-speech

Fish Audio S2 Beta

10 Mar 15:29
3578e4e

Fish Audio S2: Pre-Release

The best text-to-speech system among both open-source and closed-source models.

Trained on 10M+ hours of audio across ~50 languages, S2 combines a Dual-AR architecture (Qwen3 backbone) with GRPO reinforcement learning alignment to produce natural, emotionally rich speech with fine-grained inline control.

Technical Report · Blog · Model · Playground

Model

Variant | Params                   | Codec                            | Output
S2-Pro  | 4B (slow) + 400M (fast)  | ModifiedDAC, 10 codebooks, ~21 Hz | 44.1 kHz
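
As a rough back-of-the-envelope reading of the codec figures above, the sketch below estimates how many codec frames and discrete codes correspond to a stretch of audio. It assumes one frame per tick of the ~21 Hz rate with all 10 codebooks emitted per frame; the value 21.0 and the function name `codec_budget` are illustrative and do not come from the release.

```python
# Illustrative arithmetic only. 21.0 approximates the "~21 Hz" frame rate
# quoted in the table; the exact rate is not stated in the release notes.
FRAME_RATE_HZ = 21.0
NUM_CODEBOOKS = 10
SAMPLE_RATE_HZ = 44_100


def codec_budget(seconds: float) -> dict:
    """Estimate frames, codes, and output samples for `seconds` of audio."""
    frames = seconds * FRAME_RATE_HZ
    return {
        "frames": frames,
        "codes": frames * NUM_CODEBOOKS,
        "samples": seconds * SAMPLE_RATE_HZ,
        "samples_per_frame": SAMPLE_RATE_HZ / FRAME_RATE_HZ,
    }


print(codec_budget(10.0))
```

Under these assumptions, ten seconds of audio is about 210 frames, or about 2,100 codes, with each frame covering roughly 2,100 output samples.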

Highlights

  • Dual-AR: Slow AR (4B) predicts semantic codebook along time axis; Fast AR (400M) fills 9 residual codebooks per step
  • Inline Control: Free-form tags like [laugh], [whispers], [super happy] at word level
  • RL Alignment: GRPO with unified data-reward pipeline (same model for data filtering and RL reward)
  • SGLang Streaming: real-time factor (RTF) 0.195, time to first audio (TTFA) ~100 ms, 3000+ tokens/s on a single H200
  • 50+ Languages, multi-speaker (<|speaker:i|>), multi-turn, rapid voice cloning (10-30s reference)
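
To make the inline markup mentioned above concrete, here is a minimal sketch that splits a prompt into speaker tokens, bracketed control tags, and plain text. It is purely illustrative: the model consumes the raw text directly, `split_markup` is a hypothetical helper, and treating the `i` in `<|speaker:i|>` as a non-negative integer is an assumption.

```python
import re

# Hypothetical helper for illustration; not part of fish-speech.
# Assumes speaker indices are integers and tags never nest.
_MARKUP = re.compile(r"(<\|speaker:\d+\|>)|\[([^\[\]]+)\]")


def split_markup(text: str) -> list[tuple[str, str]]:
    """Return (kind, value) pairs where kind is 'speaker', 'tag', or 'text'."""
    parts = []
    pos = 0
    for match in _MARKUP.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            parts.append(("text", plain))
        if match.group(1):
            parts.append(("speaker", match.group(1)))
        else:
            parts.append(("tag", match.group(2)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(("text", tail))
    return parts


example = "<|speaker:0|>Hello there [laugh] that was [whispers] a secret."
for kind, value in split_markup(example):
    print(kind, value)
```

Because the tags are free-form, the sketch accepts any bracketed phrase (such as `[super happy]`) rather than checking against a fixed list.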

What's Changed

Model & Inference

  • New Dual-AR architecture with Qwen3 backbone, replacing Fish-Speech v1.5
  • New ModifiedDAC audio codec (replaces Firefly/VQ-GAN)
  • Support fish_qwen3_omni checkpoint format (sharded safetensors) with backward compatibility
  • Fixed: torch.compile bugs, GPU memory leak, audio quality issues
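
The control flow of a Dual-AR decoder, as described in the highlights, can be sketched with toy stand-ins: a slow step yields one semantic code per time step, and a fast step then fills the 9 residual codes for that frame. The functions below return random integers rather than running any model, and `CODEBOOK_SIZE`, `slow_ar_step`, `fast_ar_step`, and `generate` are all hypothetical names chosen for illustration.

```python
import random

NUM_RESIDUAL = 9        # residual codebooks per frame (from the release notes)
CODEBOOK_SIZE = 1024    # assumed codebook size, for illustration only


def slow_ar_step(history, rng):
    # Stand-in for the 4B slow AR: next semantic code given past frames.
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_code, partial_residuals, rng):
    # Stand-in for the 400M fast AR: next residual code within this frame.
    return rng.randrange(CODEBOOK_SIZE)


def generate(num_frames, seed=0):
    """Produce `num_frames` frames of 1 semantic + 9 residual codes each."""
    rng = random.Random(seed)
    frames = []
    for _step in range(num_frames):
        semantic = slow_ar_step(frames, rng)          # advance along time
        residuals = []
        for _book in range(NUM_RESIDUAL):             # fill codebook depth
            residuals.append(fast_ar_step(semantic, residuals, rng))
        frames.append([semantic] + residuals)
    return frames


frames = generate(num_frames=21)
print(len(frames), len(frames[0]))  # 21 10
```

The point of the split is that the large model runs once per time step while the small model handles the remaining codes within each frame.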

Docker & Deployment

  • Docker overhaul: multi-target builds, compose support, health checks, non-root user
  • SGLang server integration

API & Server

  • Reference voice management API (CRUD), multipart upload support
  • Various server bug fixes, /health endpoint
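
A deployment script might poll the `/health` endpoint before sending traffic. The sketch below only checks for an HTTP 200 status, since the release notes do not describe the response body; `is_healthy`, the `opener` parameter, and the example base URL are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with HTTP 200.

    `opener` is injectable so the check can be exercised without a server.
    """
    url = f"{base_url.rstrip('/')}/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP error.
        return False


# Example (host and port are assumptions, not documented defaults):
# is_healthy("http://localhost:8080")
```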

Finetune

  • Full finetune pipeline for S1/S2 (datasets, training, LoRA merge)

Docs & Infra

  • README & MkDocs rewritten for S2 across 6 languages
  • License updated to Fish Audio Research License
  • Removed legacy code (Firefly VQ-GAN, SenseVoice, Fish Agent, old batch files)

V1.5.1

31 May 12:15
58046ea

The last stable branch before the next model release.

V1.5.0

25 Dec 02:53
7902e40

Fish Speech 1.5 release: both inference and finetuning are complete.

V1.4.3

29 Nov 06:36
1359896

Last stable release before 1.5

V1.4.2

25 Oct 07:15
f8a57fb

Full Changelog: v1.4.1...v1.4.2

V1.4.1

15 Sep 09:49

This release includes bug fixes and container optimizations.

Fish Speech V1.4 Release

12 Sep 14:38
a817507

Fish Speech V1.4 is a leading TTS model trained on 700k hours of audio data in multiple languages.

Supported languages:

  • English (en) ~300k hours
  • Chinese (zh) ~300k hours
  • German (de) ~20k hours
  • Japanese (ja) ~20k hours
  • French (fr) ~20k hours
  • Spanish (es) ~20k hours
  • Korean (ko) ~20k hours
  • Arabic (ar) ~20k hours

Have fun :)

V1.2.1

10 Sep 00:24
237f4fd

This is the final stable release before the 1.4 release on Sep 10.

Fish Speech V1.2 Release

18 Jul 16:41
dc250ab

In this release, we roll out both the 1.2 pretrained and SFT models, and add support for auto-reranking for stable generation.

V1.1.2

02 Jul 04:55
97e8e3c

This is the final stable release before 1.2