StyleTTS2 (LibriTTS) — CoreML

Apple-Silicon-optimized CoreML conversion of the yl4579/StyleTTS2 LibriTTS multi-speaker checkpoint (yl4579/StyleTTS2-LibriTTS → Models/LibriTTS/epochs_2nd_00020.pth).

Four-stage pipeline; per-stage compute-unit placement; fp16 on the ANE-resident stages (selective int8 PTQ on the text-and-prosody predictor was evaluated and dropped); fp32 decoder.

These weights carry use restrictions beyond MIT. Read the License section before downloading. They are not a drop-in permissively-licensed TTS model. If you need permissive terms, use Kokoro instead.

License & use restrictions

The upstream repository code is MIT, but the pre-trained LibriTTS weights carry two non-negotiable restrictions declared in yl4579/StyleTTS2's README:

  1. Synthetic-origin disclosure. Any deployment that produces audio from these weights must clearly disclose to listeners that the audio is synthetic. No undisclosed synthetic-speech publishing.
  2. Speaker consent for voice cloning. Cloning a real person's voice requires their consent. No unauthorized celebrity / public-figure / non-consenting third-party voice cloning.

These restrictions ride with the weights through every redistribution, fine-tune, and downstream derivative. Anyone downloading this repo inherits them and must propagate them in turn.

If you cannot or will not honor these terms, do not download these weights.

License-of-record: github.com/yl4579/StyleTTS2 upstream README at the time of conversion (see Conversion provenance below for the pinned commit).

What's in this repo

| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| styletts2_text_predictor_{32,64,128,256,512}.mlpackage | ANE | fp16 | 5 token-length | 1× per utterance |
| styletts2_diffusion_step_512.mlpackage | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| styletts2_f0n_energy.mlpackage | ANE | fp16 | dynamic | 1× per utterance |
| styletts2_decoder_{256,512,1024,2048,4096}.mlpackage | CPU+GPU | fp32 | 5 mel-length | 1× per utterance |
| constants/text_cleaner_vocab.json | — | — | — | phoneme→id table |
| config.json | — | — | — | bundle runtime contract (audio/sampler/buckets) |

Total on-disk size: ~1.4 GB per format.

Both source .mlpackage (uncompiled, portable across Xcode versions) and pre-compiled .mlmodelc (Apple Silicon, ready for MLModel(contentsOf:)) are shipped. The .mlmodelc artifacts are under compiled/. Pick one:

  • *.mlpackage — load via MLModel(contentsOf:); the OS compiles on first load (~5–20 s cold start the first time, cached afterward).
  • compiled/*.mlmodelc — already compiled; the same loader path skips the on-device compile. Useful for shipping inside an app bundle.

The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the hard-alignment matrix (cumsum-of-durations → one-hot → matmul) live in your host application (Swift / Python). Per-step inference is in CoreML; control flow is not.
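
The host-side alignment construction can be sketched in a few lines of NumPy (a minimal illustration with toy values, not the reference Swift code; variable names are invented):

```python
import numpy as np

def build_alignment(durations):
    """One-hot alignment from per-token frame durations.

    durations: (T_tok,) int array, mel frames per phoneme token.
    Returns a (T_tok, T_mel) 0/1 matrix: row t is hot on the frames
    token t spans (the cumsum-of-durations -> one-hot construction).
    """
    ends = np.cumsum(durations)        # exclusive end frame per token
    starts = ends - durations          # start frame per token
    frames = np.arange(int(ends[-1]))
    # frame j belongs to token t iff starts[t] <= j < ends[t]
    return ((frames >= starts[:, None]) & (frames < ends[:, None])).astype(np.float32)

durations = np.array([2, 1, 3])
align = build_alignment(durations)     # shape (3, 6)

# Expanding token-rate features to frame rate is then a matmul:
d_en = np.arange(6, dtype=np.float32).reshape(3, 2)  # (T_tok, hidden), toy
frame_feats = align.T @ d_en                          # (T_mel, hidden)
```

Each mel frame lands on exactly one token, so every column of the alignment matrix sums to 1.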

Why the precision split looks like this

  • text_predictor is fp16. Selective int8 PTQ was tried and dropped: on the Apple Silicon ANE the int8 path saves only ~3 MB of weight bandwidth per bucket, exposes no int8 GEMM, and dequantizes back to fp16 on load. The savings did not justify the parity risk on small projections.
  • diffusion_step stays fp16. It runs 5 times per utterance through an ODE-style sampler; quantization noise compounds through iterations. Same lesson as PocketTTS issue #7.
  • f0n_energy stays fp16. ~6 MB. No bandwidth payoff; quantizing small projections injects audible pitch noise.
  • decoder is fp32, not fp16. SineGen's harmonic source accumulates phase via cumsum × 2π × hop=300, reaching magnitudes of ~4000 mid-frame. The fp16 spacing at that magnitude (ULP = 4) is far larger than the per-sample increment (~0.05 rad), which scrambles the sine output and produces audibly robotic synthesis. fp32 is required end-to-end.
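
The fp16 argument is easy to verify numerically: the gap between adjacent fp16 values at magnitude ~4096 exceeds the per-sample phase increment, so the increment is lost entirely (a NumPy check, not part of the shipped pipeline):

```python
import numpy as np

# fp16 has a 10-bit mantissa, so the gap (ULP) between adjacent
# representable values grows with magnitude.
ulp_fp16 = np.spacing(np.float16(4096.0))   # 4.0
ulp_fp32 = np.spacing(np.float32(4096.0))   # ~0.0005

# A cumulative phase near 4096 rad cannot absorb a ~0.05 rad
# per-sample increment in fp16: the addition rounds back to 4096.
phase = np.float16(4096.0)
assert phase + np.float16(0.05) == phase
```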

Why only one diffusion bucket

Empirically, every observed bert_dur fits in B=512. The 32/64/128/256 buckets were dead weight (~192 MB) given the non-linear cost ladder (B=32 ≈ 66 ms/step, B=512 ≈ 152 ms/step). Dropping them adds at most ~430 ms per utterance in the worst short-utterance case.
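
The worst-case overhead falls directly out of the per-step numbers (back-of-envelope arithmetic, assuming the 5-step sampler):

```python
steps = 5
ms_per_step_b32 = 66    # smallest dropped bucket
ms_per_step_b512 = 152  # the one bucket kept

# A short utterance that would have fit B=32 now pays the B=512 rate
# on every sampler step:
worst_case_overhead_ms = steps * (ms_per_step_b512 - ms_per_step_b32)
# -> 430 ms of extra latency, vs ~192 MB of bucket weights saved
```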

Performance

  • RTFx: 4.32× warm on M-series Mac (5-step ADPM2 sampler, all buckets pre-warmed).
  • Log-mel cosine vs PyTorch fp32: 0.9687.
  • ECAPA-TDNN cosine to reference clip: 0.18, at the model's architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the same metric. Voice-clone fidelity is bounded by StyleTTS2's architecture, not by this conversion.

How to use

Phonemizer

espeak-ng IPA + stress. The 178-token vocabulary in constants/text_cleaner_vocab.json mirrors text_utils.TextCleaner from the upstream repo: [pad] + punctuation + ASCII letters + IPA letters.

Pad token is $ at id 0.
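
Token-id lookup against the vocabulary file can be sketched as follows (the tiny inline dict stands in for the real 178-entry constants/text_cleaner_vocab.json; only "$" at id 0 is taken from this card, the other entries are invented):

```python
# Stand-in for the real 178-entry constants/text_cleaner_vocab.json;
# only "$" at id 0 is from the model card, the rest is illustrative.
vocab = {"$": 0, " ": 1, "h": 2, "ə": 3, "l": 4, "o": 5}

def phonemes_to_ids(phonemes, table):
    # The upstream TextCleaner skips unknown symbols; mapping them to
    # the pad id here is a simplification for the sketch.
    return [table.get(ch, table["$"]) for ch in phonemes]

ids = phonemes_to_ids("hələ", vocab)   # [2, 3, 4, 3]
```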

Inference shape

text → phonemes → token ids
                     │
                     ▼
text_predictor (ANE, fp16)
   │   ├─ d_en (1, T_dur, hidden)
   │   ├─ s_pred (1, 256)             (sampler init via diffusion)
   │   └─ duration logits → duration → one-hot alignment matrix (host)
   │
   ▼
diffusion_step × 5  (CPU+GPU, fp16)   (ADPM2 + Karras schedule + CFG)
   │
   ▼
[blend(s, ref_s) + alignment]
   │
   ▼
f0n_energy (ANE, fp16) → F0_curve, N
   │
   ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform

The Swift host owns the sampler loop, alignment construction, and bucket routing. A reference Swift integration is in FluidInference/FluidAudio.
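
The host-side sampler scaffolding can be sketched as follows (a hedged illustration: the sigma_min/sigma_max/rho constants and function names are placeholders, and the ADPM2 update rule itself is omitted):

```python
def karras_sigmas(n, sigma_min=0.0001, sigma_max=3.0, rho=9.0):
    """Karras-style noise schedule: interpolate in sigma**(1/rho) space.
    The constants here are illustrative, not the shipped values."""
    ramp = [i / (n - 1) for i in range(n)]
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + t * (lo - hi)) ** rho for t in ramp]

def cfg_blend(cond_out, uncond_out, scale):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond_out, uncond_out)]

sigmas = karras_sigmas(5)   # one sigma per diffusion_step call
# Schedule is strictly decreasing from sigma_max toward sigma_min:
assert all(a > b for a, b in zip(sigmas, sigmas[1:]))
```

Each of the 5 sampler iterations would call the CoreML diffusion_step twice (conditional and unconditional) and combine the outputs with cfg_blend before the ADPM2 update.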

Bucket routing

Round each variable-length input up to the next bucket. Pad with zeros.

| Input | Axis | Buckets |
|---|---|---|
| text_predictor tokens | T_tok | 32 / 64 / 128 / 256 / 512 |
| diffusion_step embedding | T_bert | 512 only (pad) |
| decoder asr | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

f0n_energy is shape-flexible.
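
The routing rule above can be sketched in a few lines (bucket lists are taken from the table; the helper names are made up):

```python
def route_to_bucket(length, buckets):
    """Smallest bucket that fits `length`; raises if none does."""
    for b in buckets:
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")

def pad_to_bucket(seq, buckets, pad=0):
    """Zero-pad a sequence up to its bucket length."""
    b = route_to_bucket(len(seq), buckets)
    return seq + [pad] * (b - len(seq))

TOKEN_BUCKETS = [32, 64, 128, 256, 512]      # text_predictor T_tok
MEL_BUCKETS = [256, 512, 1024, 2048, 4096]   # decoder T_mel

padded = pad_to_bucket(list(range(70)), TOKEN_BUCKETS)  # 70 -> bucket 128
```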

Conversion provenance

  • Upstream code: yl4579/StyleTTS2
  • Upstream weights: yl4579/StyleTTS2-LibriTTS, file Models/LibriTTS/epochs_2nd_00020.pth
  • Conversion scripts: FluidInference/mobius PR #46 (models/tts/styletts2/scripts/)
  • Quantization (evaluated during conversion, not shipped; see the precision notes above): coremltools.optimize.coreml.linear_quantize_weights, mode=linear_symmetric, dtype=int8, granularity=per_channel, weight_threshold=200_000
  • Target: coremltools ≥ 8.0, minimum_deployment_target=iOS17 (macOS 14+ / iOS 17+)

Known limitations

  • English (LibriTTS) only. No multilingual support in this checkpoint.
  • HiFi-GAN decoder, not iSTFTNet. LibriTTS upstream uses HiFi-GAN, so no torch.stft / complex tensors in the conversion path.
  • Decoder is fp32, not fp16. Documented above. The mlpackage size reflects this (≈210 MB per bucket).
  • Voice-clone fidelity ceiling is architectural. ECAPA-TDNN cosine to reference clip ≈ 0.18 here, ≈ 0.29 in PyTorch fp32. The same-speaker threshold is ~0.30. This isn't a quantization or conversion artifact; see PR #46 TRIALS.md Phase 5.
  • No streaming. Whole utterance only. Add chunked streaming on the host side if you need it.

Citation & acknowledgments

  • Yinghao Aaron Li et al. — StyleTTS2 architecture and LibriTTS checkpoint.
  • LibriTTS authors (CC-BY-4.0 training data).
  • espeak-ng — phonemization frontend.
```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year      = {2023}
}
```