StyleTTS2 (LibriTTS) — CoreML

Apple-Silicon-optimized CoreML conversion of the yl4579/StyleTTS2 LibriTTS multi-speaker checkpoint (yl4579/StyleTTS2-LibriTTS → Models/LibriTTS/epochs_2nd_00020.pth).

Four-stage pipeline; per-stage compute-unit placement; fp16 on the ANE-resident stages (selective int8 PTQ on the text-and-prosody predictor was evaluated and dropped); fp32 decoder.

These weights carry use restrictions beyond MIT. Read the License section before downloading. They are not a drop-in permissively-licensed TTS model. If you need permissive terms, use Kokoro instead.

License & use restrictions

The upstream repository code is MIT, but the pre-trained LibriTTS weights carry two non-negotiable restrictions declared in yl4579/StyleTTS2's README:

  1. Synthetic-origin disclosure. Any deployment that produces audio from these weights must clearly disclose to listeners that the audio is synthetic. No undisclosed synthetic-speech publishing.
  2. Speaker consent for voice cloning. Cloning a real person's voice requires their consent. No unauthorized celebrity / public-figure / non-consenting third-party voice cloning.

These restrictions ride with the weights through every redistribution, fine-tune, and downstream derivative. Anyone downloading this repo inherits them and must propagate them in turn.

If you cannot or will not honor these terms, do not download these weights.

License-of-record: github.com/yl4579/StyleTTS2 upstream README at the time of conversion (see Conversion provenance below for the pinned commit).

What's in this repo

| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| styletts2_text_predictor_{32,64,128,256,512}.mlpackage | ANE | fp16 | 5 token-length | 1× per utterance |
| styletts2_diffusion_step_512.mlpackage | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| styletts2_f0n_energy.mlpackage | ANE | fp16 | dynamic | 1× per utterance |
| styletts2_decoder_{256,512,1024,2048,4096}.mlpackage | CPU+GPU | fp32 | 5 mel-length | 1× per utterance |
| constants/text_cleaner_vocab.json | — | — | — | phoneme→id table |
| config.json | — | — | — | bundle runtime contract (audio/sampler/buckets) |

Total on-disk size: ~1.4 GB per format.

Both source .mlpackage (uncompiled, portable across Xcode versions) and pre-compiled .mlmodelc (Apple Silicon, ready for MLModel(contentsOf:)) are shipped. The .mlmodelc artifacts are under compiled/. Pick one:

  • *.mlpackage — load via MLModel(contentsOf:); the OS compiles on first load (~5–20 s cold start the first time, cached afterward).
  • compiled/*.mlmodelc — already compiled; the same loader path skips the on-device compile. Useful for shipping inside an app bundle.

The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your host application (Swift / Python). Per-step inference is in CoreML; control flow is not.
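
The host-side alignment construction can be sketched in a few lines of NumPy (a minimal illustration with toy values, not the reference Swift code; variable names are invented):

```python
import numpy as np

def build_alignment(durations):
    """One-hot alignment from per-token frame durations.

    durations: (T_tok,) int array, mel frames per phoneme token.
    Returns a (T_tok, T_mel) 0/1 matrix: row t is hot on the frames
    token t spans (the cumsum-of-durations -> one-hot construction).
    """
    ends = np.cumsum(durations)        # exclusive end frame per token
    starts = ends - durations          # start frame per token
    frames = np.arange(int(ends[-1]))
    # frame j belongs to token t iff starts[t] <= j < ends[t]
    return ((frames >= starts[:, None]) & (frames < ends[:, None])).astype(np.float32)

durations = np.array([2, 1, 3])
align = build_alignment(durations)     # shape (3, 6)

# Expanding token-rate features to frame rate is then a matmul:
d_en = np.arange(6, dtype=np.float32).reshape(3, 2)  # (T_tok, hidden), toy
frame_feats = align.T @ d_en                          # (T_mel, hidden)
```

Each mel frame lands on exactly one token, so every column of the alignment matrix sums to 1.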

Why the precision split looks like this

  • text_predictor is fp16. Selective int8 PTQ was tried and dropped: on the Apple Silicon ANE the int8 path saves only ~3 MB of weight bandwidth per bucket, exposes no int8 GEMM, and dequantizes back to fp16 on load. The savings did not justify the parity risk on small projections.
  • diffusion_step stays fp16. It runs 5 times per utterance through an ODE-style sampler; quantization noise compounds through iterations. Same lesson as PocketTTS issue #7.
  • f0n_energy stays fp16. ~6 MB. No bandwidth payoff; quantizing small projections injects audible pitch noise.
  • decoder is fp32, not fp16. SineGen's harmonic source accumulates phase via cumsum × 2π × hop=300, reaching magnitudes of ~4000 mid-frame. The fp16 spacing at that magnitude (ULP = 4) is far larger than the per-sample increment (~0.05 rad), which scrambles the sine output and produces audibly robotic synthesis. fp32 is required end-to-end.
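
The fp16 argument is easy to verify numerically: the gap between adjacent fp16 values at magnitude ~4096 exceeds the per-sample phase increment, so the increment is lost entirely (a NumPy check, not part of the shipped pipeline):

```python
import numpy as np

# fp16 has a 10-bit mantissa, so the gap (ULP) between adjacent
# representable values grows with magnitude.
ulp_fp16 = np.spacing(np.float16(4096.0))   # 4.0
ulp_fp32 = np.spacing(np.float32(4096.0))   # ~0.0005

# A cumulative phase near 4096 rad cannot absorb a ~0.05 rad
# per-sample increment in fp16: the addition rounds back to 4096.
phase = np.float16(4096.0)
assert phase + np.float16(0.05) == phase
```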

Why only one diffusion bucket

Empirically, every observed bert_dur fits in B=512. The 32/64/128/256 buckets were dead weight (~192 MB) given the non-linear cost ladder (B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most ~430 ms per utterance in the worst short-utterance case.
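
The worst-case overhead falls directly out of the per-step numbers (back-of-envelope arithmetic, assuming the 5-step sampler):

```python
steps = 5
ms_per_step_b32 = 66    # smallest dropped bucket
ms_per_step_b512 = 152  # the one bucket kept

# A short utterance that would have fit B=32 now pays the B=512 rate
# on every sampler step:
worst_case_overhead_ms = steps * (ms_per_step_b512 - ms_per_step_b32)
# -> 430 ms of extra latency, vs ~192 MB of bucket weights saved
```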

Performance

  • RTFx: 4.32× warm on M-series Mac (5-step ADPM2 sampler, all buckets pre-warmed).
  • Log-mel cosine vs PyTorch fp32: 0.9687.
  • ECAPA-TDNN cosine to reference clip: 0.18, at the model's architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the same metric. Voice-clone fidelity is bounded by StyleTTS2's architecture, not by this conversion.

How to use

Phonemizer

espeak-ng IPA + stress. The 178-token vocabulary in constants/text_cleaner_vocab.json mirrors text_utils.TextCleaner from the upstream repo: [pad] + punctuation + ASCII letters + IPA letters.

Pad token is $ at id 0.
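
Token-id lookup against the vocabulary file can be sketched as follows (the tiny inline dict stands in for the real 178-entry constants/text_cleaner_vocab.json; only "$" at id 0 is taken from this card, the other entries are invented):

```python
# Stand-in for the real 178-entry constants/text_cleaner_vocab.json;
# only "$" at id 0 is from the model card, the rest is illustrative.
vocab = {"$": 0, " ": 1, "h": 2, "ə": 3, "l": 4, "o": 5}

def phonemes_to_ids(phonemes, table):
    # The upstream TextCleaner skips unknown symbols; mapping them to
    # the pad id here is a simplification for the sketch.
    return [table.get(ch, table["$"]) for ch in phonemes]

ids = phonemes_to_ids("hələ", vocab)   # [2, 3, 4, 3]
```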

Inference shape

text → phonemes → token ids
                     │
                     ▼
text_predictor (ANE, fp16)
   │   ├─ d_en (1, T_dur, hidden)
   │   ├─ s_pred (1, 256)             (sampler init via diffusion)
   │   └─ duration logits → duration → one-hot alignment matrix (host)
   │
   ▼
diffusion_step × 5  (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
   │
   ▼
[blend(s, ref_s) + alignment]
   │
   ▼
f0n_energy (ANE, fp16) → F0_curve, N
   │
   ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform

The Swift host owns the sampler loop, alignment construction, and bucket routing. A reference Swift integration is in FluidInference/FluidAudio.
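
The host-side sampler scaffolding can be sketched as follows (a hedged illustration: the sigma_min/sigma_max/rho constants and function names are placeholders, and the ADPM2 update rule itself is omitted):

```python
def karras_sigmas(n, sigma_min=0.0001, sigma_max=3.0, rho=9.0):
    """Karras-style noise schedule: interpolate in sigma**(1/rho) space.
    The constants here are illustrative, not the shipped values."""
    ramp = [i / (n - 1) for i in range(n)]
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + t * (lo - hi)) ** rho for t in ramp]

def cfg_blend(cond_out, uncond_out, scale):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond_out, uncond_out)]

sigmas = karras_sigmas(5)   # one sigma per diffusion_step call
# Schedule is strictly decreasing from sigma_max toward sigma_min:
assert all(a > b for a, b in zip(sigmas, sigmas[1:]))
```

Each of the 5 sampler iterations would call the CoreML diffusion_step twice (conditional and unconditional) and combine the outputs with cfg_blend before the ADPM2 update.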

Bucket routing

Round each variable-length input up to the next bucket. Pad with zeros.

| Input | Axis | Buckets |
|---|---|---|
| text_predictor tokens | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step embedding | T_bert | 512 only (pad) |
| decoder asr | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

f0n_energy is shape-flexible.
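
The routing rule above can be sketched in a few lines (bucket lists are taken from the table; the helper names are made up):

```python
def route_to_bucket(length, buckets):
    """Smallest bucket that fits `length`; raises if none does."""
    for b in buckets:
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")

def pad_to_bucket(seq, buckets, pad=0):
    """Zero-pad a sequence up to its bucket length."""
    b = route_to_bucket(len(seq), buckets)
    return seq + [pad] * (b - len(seq))

TOKEN_BUCKETS = [32, 64, 128, 256, 512]      # text_predictor T_tok
MEL_BUCKETS = [256, 512, 1024, 2048, 4096]   # decoder T_mel

padded = pad_to_bucket(list(range(70)), TOKEN_BUCKETS)  # 70 -> bucket 128
```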

Conversion provenance

  • Upstream code: yl4579/StyleTTS2
  • Upstream weights: yl4579/StyleTTS2-LibriTTS, file Models/LibriTTS/epochs_2nd_00020.pth
  • Conversion scripts: FluidInference/mobius PR #46 (models/tts/styletts2/scripts/)
  • Quantization (evaluated during conversion, not shipped; see the precision notes above): coremltools.optimize.coreml.linear_quantize_weights, mode=linear_symmetric, dtype=int8, granularity=per_channel, weight_threshold=200_000
  • Target: coremltools ≥ 8.0, minimum_deployment_target=iOS17 (macOS 14+ / iOS 17+)

Known limitations

  • English (LibriTTS) only. No multilingual support in this checkpoint.
  • HiFi-GAN decoder, not iSTFTNet. LibriTTS upstream uses HiFi-GAN, so no torch.stft / complex tensors in the conversion path.
  • Decoder is fp32, not fp16. Documented above. The mlpackage size reflects this (≈210 MB per bucket).
  • Voice-clone fidelity ceiling is architectural. ECAPA-TDNN cosine to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32. The same-speaker threshold is ~0.30. This isn't a quantization or conversion artifact; see PR #46 TRIALS.md Phase 5.
  • No streaming. Whole utterance only. Add chunked streaming on the host side if you need it.

Citation & acknowledgments

  • Yinghao Aaron Li et al. — StyleTTS2 architecture and LibriTTS checkpoint.
  • LibriTTS authors (CC-BY-4.0 training data).
  • espeak-ng — phonemization frontend.
```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year      = {2023}
}
```