# StyleTTS2 (LibriTTS) → CoreML
Apple-Silicon-optimized CoreML conversion of the yl4579/StyleTTS2 LibriTTS multi-speaker checkpoint (yl4579/StyleTTS2-LibriTTS → `Models/LibriTTS/epochs_2nd_00020.pth`).
Four-stage pipeline; per-stage compute-unit placement; fp16 text-and-prosody predictor (selective int8 PTQ was evaluated and dropped, see below); fp32 decoder.
These weights carry use restrictions beyond MIT. Read the License section before downloading. They are not a drop-in permissively-licensed TTS model. If you need permissive terms, use Kokoro instead.
## License & use restrictions
The upstream repository code is MIT, but the pre-trained LibriTTS weights carry two non-negotiable restrictions declared in yl4579/StyleTTS2's README:
- Synthetic-origin disclosure. Any deployment that produces audio from these weights must clearly disclose to listeners that the audio is synthetic. No undisclosed synthetic-speech publishing.
- Speaker consent for voice cloning. Cloning a real person's voice requires their consent. No unauthorized celebrity / public-figure / non-consenting third-party voice cloning.
These restrictions ride with the weights through every redistribution, fine-tune, and downstream derivative. Anyone downloading this repo inherits them and must propagate them in turn.
If you cannot or will not honor these terms, do not download these weights.
License-of-record: github.com/yl4579/StyleTTS2 upstream README at the time of conversion (see Conversion provenance below for the pinned commit).
## What's in this repo
| Package | Compute unit | Precision | Buckets | Called |
|---|---|---|---|---|
| `styletts2_text_predictor_{32,64,128,256,512}.mlpackage` | ANE | fp16 | 5 (token length) | 1× per utterance |
| `styletts2_diffusion_step_512.mlpackage` | CPU+GPU | fp16 | 1 (B=512 only) | ~5× per utterance |
| `styletts2_f0n_energy.mlpackage` | ANE | fp16 | dynamic | 1× per utterance |
| `styletts2_decoder_{256,512,1024,2048,4096}.mlpackage` | CPU+GPU | fp32 | 5 (mel length) | 1× per utterance |
| `constants/text_cleaner_vocab.json` | – | – | – | phoneme→id table |
| `config.json` | – | – | – | bundle runtime contract (audio/sampler/buckets) |
Total on-disk size: ~1.4 GB per format.
Both source `.mlpackage` (uncompiled, portable across Xcode versions) and pre-compiled `.mlmodelc` (Apple Silicon, ready for `MLModel(contentsOf:)`) are shipped. The `.mlmodelc` artifacts are under `compiled/`. Pick one:

- `*.mlpackage`: load via `MLModel(contentsOf:)`; the OS compiles on first load (~5–20 s cold start the first time, cached afterward).
- `compiled/*.mlmodelc`: already compiled; the same loader path skips the on-device compile. Useful for shipping inside an app bundle.
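For a quick host-side smoke test in Python (the Swift path is analogous), a minimal sketch assuming `coremltools` is installed and the packages sit in the working directory:

```python
import coremltools as ct

# Compute units mirror the per-stage placement in the table above.
predictor = ct.models.MLModel(
    "styletts2_text_predictor_512.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,   # ANE-targeted stage
)
decoder = ct.models.MLModel(
    "styletts2_decoder_4096.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_GPU,  # fp32 decoder stays off the ANE
)
# predictor.predict({...}) / decoder.predict({...}) take numpy arrays keyed
# by the input names baked into each package.
```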
The diffusion sampler loop (ADPM2 + Karras schedule + CFG) and the hard-alignment matrix (cumsum of durations → one-hot → matmul) live in your host application (Swift / Python); the alignment construction is sketched below. Per-step inference is in CoreML; control flow is not.
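A minimal numpy sketch of that alignment construction, assuming `durations` holds the per-token frame counts rounded from the duration logits (names here are illustrative, not the packages' I/O names):

```python
import numpy as np

def alignment_matrix(durations: np.ndarray) -> np.ndarray:
    """(T_tok,) integer durations -> (T_tok, T_mel) one-hot alignment."""
    ends = np.cumsum(durations)            # exclusive end frame of each token
    starts = ends - durations              # inclusive start frame of each token
    align = np.zeros((len(durations), int(ends[-1])), dtype=np.float32)
    for i, (s, e) in enumerate(zip(starts, ends)):
        align[i, s:e] = 1.0                # token i owns frames [s, e)
    return align

# Expand token features to frame rate with a matmul, e.g. for d_en of
# shape (1, T_tok, hidden):
# asr = np.einsum("bth,tm->bhm", d_en, alignment_matrix(durations))
```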
## Why the precision split looks like this
- text_predictor is fp16. Selective int8 PTQ was tried and dropped: on Apple Silicon ANE the int8 path saves only ~3 MB per bucket of weight bandwidth, has no exposed int8 GEMM, and dequantizes back to fp16 on load. The savings did not justify the parity risk on small projections.
- diffusion_step stays fp16. It runs 5 times per utterance through an ODE-style sampler; quantization noise compounds through iterations. Same lesson as PocketTTS issue #7.
- f0n_energy stays fp16. ~6 MB. No bandwidth payoff; quantizing small projections injects audible pitch noise.
- decoder is fp32, not fp16. SineGen's harmonic source accumulates phase via cumsum × 2π × hop=300, reaching magnitudes of ~4000 mid-frame. The fp16 quantization step at that magnitude (~4) is much larger than the per-sample increment (~0.05 rad), which scrambles the sine output and produces audibly robotic synthesis; fp32 is required end-to-end (checked numerically below).
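The spacing claim is easy to verify; a standalone numeric check, not part of the pipeline:

```python
import numpy as np

phase = np.float16(4096.0)                # accumulated phase mid-frame
print(np.spacing(phase))                  # 4.0: the fp16 step at this magnitude
print(phase + np.float16(0.05) == phase)  # True: a 0.05 rad increment rounds away
```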
## Why only one diffusion bucket
Empirically, every observed `bert_dur` fits in B=512. The 32/64/128/256 buckets were dead weight (~192 MB) given the non-linear cost ladder (B=32 → 66 ms/step, B=512 → 152 ms/step). Dropping them adds at most ~430 ms per utterance in the worst short case (5 steps × (152 − 66) ms/step ≈ 430 ms).
## Performance
- RTFx: 4.32× warm on an M-series Mac (5-step ADPM2 sampler, all buckets pre-warmed).
- Log-mel cosine vs PyTorch fp32: 0.9687.
- ECAPA-TDNN cosine to reference clip: 0.18, at the model's architectural ceiling. PyTorch fp32 itself only reaches 0.29 on the same metric. Voice-clone fidelity is bounded by StyleTTS2's architecture, not by this conversion.
## How to use

### Phonemizer
espeak-ng IPA + stress. The 178-token vocabulary in `constants/text_cleaner_vocab.json` mirrors `text_utils.TextCleaner` from the upstream repo: `[pad]` + punctuation + ASCII letters + IPA letters. The pad token is `$` at id 0.
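A sketch of the frontend under those conventions, assuming the `phonemizer` package drives espeak-ng and that the vocab JSON maps symbol → id (check the file's actual shape before relying on this):

```python
import json
from phonemizer import phonemize

with open("constants/text_cleaner_vocab.json") as f:
    vocab = json.load(f)                 # 178 symbols; "$" (pad) maps to 0

ipa = phonemize("The quick brown fox.", language="en-us",
                backend="espeak", with_stress=True)
ids = [vocab[ch] for ch in ipa.strip() if ch in vocab]  # skip unmapped chars
```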
### Inference shape
```
text → phonemes → token ids
        │
        ▼
text_predictor (ANE, fp16)
        ├─ d_en (1, T_dur, hidden)
        ├─ s_pred (1, 256)                  (sampler init via diffusion)
        └─ duration logits → duration → one-hot alignment matrix (host)
        │
        ▼
diffusion_step × 5 (CPU+GPU, fp16)          (ADPM2 + Karras schedule + CFG)
        │
        ▼
[blend(s, ref_s) + alignment]
        │
        ▼
f0n_energy (ANE, fp16) → F0_curve, N
        │
        ▼
decoder (CPU+GPU, fp32) → 24 kHz waveform
```
The Swift host owns the sampler loop, alignment construction, and bucket routing. A reference Swift integration is in FluidInference/FluidAudio.
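For the schedule and guidance pieces of that host-owned loop, a hedged sketch (the ADPM2 update itself follows the upstream sampler implementation; `rho`, the sigma range, and the guidance scale below are placeholder assumptions, not values read from `config.json`):

```python
import numpy as np

def karras_sigmas(n_steps=5, sigma_min=1e-4, sigma_max=3.0, rho=7.0):
    """Karras (2022) schedule: linear in sigma**(1/rho), then re-powered."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return (hi + ramp * (lo - hi)) ** rho        # descending noise levels

def cfg(cond_out, uncond_out, scale=3.0):
    """Classifier-free guidance: push the conditional output past the
    unconditional one; both come from separate diffusion_step calls."""
    return uncond_out + scale * (cond_out - uncond_out)
```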
### Bucket routing
Round each variable-length input up to the next bucket. Pad with zeros.
| Input | Axis | Buckets |
|---|---|---|
| `text_predictor` tokens | T_tok | 32 / 64 / 128 / 256 / 512 |
| `diffusion_step` embedding | T_bert | 512 only (pad) |
| `decoder` asr | T_mel | 256 / 512 / 1024 / 2048 / 4096 |

`f0n_energy` is shape-flexible.
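A sketch of that routing, with the bucket tables transcribed from above (`route` is an illustrative helper, not part of the bundle):

```python
import numpy as np

BUCKETS = {
    "text_predictor": [32, 64, 128, 256, 512],   # T_tok axis
    "diffusion_step": [512],                     # T_bert axis
    "decoder": [256, 512, 1024, 2048, 4096],     # T_mel axis
}

def route(x: np.ndarray, model: str, axis: int) -> np.ndarray:
    """Round the variable axis up to the next bucket and zero-pad."""
    bucket = next(b for b in BUCKETS[model] if b >= x.shape[axis])  # raises if too long
    pad = [(0, 0)] * x.ndim
    pad[axis] = (0, bucket - x.shape[axis])
    return np.pad(x, pad)
```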
## Conversion provenance
- Upstream code: yl4579/StyleTTS2
- Upstream weights: yl4579/StyleTTS2-LibriTTS, file `Models/LibriTTS/epochs_2nd_00020.pth`
- Conversion scripts: FluidInference/mobius PR #46 (`models/tts/styletts2/scripts/`)
- Quantization recipe (evaluated; the shipped text predictor is fp16, see above): `coremltools.optimize.coreml.linear_quantize_weights`, `mode=linear_symmetric`, `dtype=int8`, `granularity=per_channel`, `weight_threshold=200_000`
- Target: `coremltools` ≥ 8.0, `minimum_deployment_target=iOS17` (macOS 14+ / iOS 17+)
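That recipe, as it would be invoked in coremltools (a reconstruction from the flags above, not the original script):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

model = ct.models.MLModel("styletts2_text_predictor_512.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype="int8",
        granularity="per_channel",
        weight_threshold=200_000,   # leave smaller weight tensors untouched
    )
)
quantized = cto.linear_quantize_weights(model, config=config)
```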
## Known limitations
- English (LibriTTS) only. No multilingual support in this checkpoint.
- HiFi-GAN decoder, not iSTFTNet. LibriTTS upstream uses HiFi-GAN, so no `torch.stft` / complex tensors in the conversion path.
- Decoder is fp32, not fp16. Documented above. The mlpackage size reflects this (≈210 MB per bucket).
- Voice-clone fidelity ceiling is architectural. ECAPA-TDNN cosine to the reference clip is ≈0.18 here and ≈0.29 in PyTorch fp32; the same-speaker threshold is ~0.30. This isn't a quantization or conversion artifact; see PR #46 TRIALS.md Phase 5.
- No streaming. Whole utterance only. Add chunked streaming on the host side if you need it.
## Citation & acknowledgments
- Yinghao Aaron Li et al. – StyleTTS2 architecture and LibriTTS checkpoint.
- LibriTTS authors (CC-BY-4.0 training data).
- espeak-ng – phonemization frontend.
```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  booktitle = {NeurIPS},
  year      = {2023}
}
```