	---
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	license: mit
	title: VoiceAPI
	tags:
	- tts
	- text-to-speech
	- indian-languages
	- vits
	- multilingual
	- speech-synthesis
	language:
	- hi
	- bn
	- mr
	- te
	- kn
	- en
	- bho
	- mai
	- mag
	- hne
	- gu
	---

	# 🎙️ VoiceAPI - Multi-lingual Indian Language TTS

	A multi-speaker, multilingual, VITS-based text-to-speech (TTS) synthesizer supporting 11 Indian languages with 21 voice options.


	## 🌟 Features

	- 11 Indian Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
	- 21 Voice Options: Male and female voices for every language except Gujarati (female only)
	- High-Quality Audio: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
	- REST API: Simple GET/POST endpoints for easy integration
	- Real-time Synthesis: Fast inference on CPU/GPU

	## 🗣️ Supported Languages

	\| Language \| Code \| Female \| Male \| Script \|
	\|----------\|------\|--------\|------\|--------\|
	\| Hindi \| hi \| ✅ \| ✅ \| देवनागरी \|
	\| Bengali \| bn \| ✅ \| ✅ \| বাংলা \|
	\| Marathi \| mr \| ✅ \| ✅ \| देवनागरी \|
	\| Telugu \| te \| ✅ \| ✅ \| తెలుగు \|
	\| Kannada \| kn \| ✅ \| ✅ \| ಕನ್ನಡ \|
	\| Gujarati \| gu \| ✅ (MMS) \| - \| ગુજરાતી \|
	\| Bhojpuri \| bho \| ✅ \| ✅ \| देवनागरी \|
	\| Chhattisgarhi \| hne \| ✅ \| ✅ \| देवनागरी \|
	\| Maithili \| mai \| ✅ \| ✅ \| देवनागरी \|
	\| Magahi \| mag \| ✅ \| ✅ \| देवनागरी \|
	\| English \| en \| ✅ \| ✅ \| Latin \|
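
The table above can be mirrored as a small lookup, which makes it easy to enumerate the available voices and confirm the count of 21. This is only an illustrative sketch: the `LANGUAGES` structure and voice IDs such as `hi_female` are hypothetical and are not taken from `src/config.py`.

```python
# Illustrative lookup built from the Supported Languages table.
# Assumption: voice IDs like "hi_female" are invented for this sketch.
LANGUAGES = {
    "hi": ("Hindi", ("female", "male")),
    "bn": ("Bengali", ("female", "male")),
    "mr": ("Marathi", ("female", "male")),
    "te": ("Telugu", ("female", "male")),
    "kn": ("Kannada", ("female", "male")),
    "gu": ("Gujarati", ("female",)),  # female only, served by the MMS model
    "bho": ("Bhojpuri", ("female", "male")),
    "hne": ("Chhattisgarhi", ("female", "male")),
    "mai": ("Maithili", ("female", "male")),
    "mag": ("Magahi", ("female", "male")),
    "en": ("English", ("female", "male")),
}


def list_voices(languages=LANGUAGES):
    """Return one hypothetical voice ID per (language code, gender) pair."""
    return [
        f"{code}_{gender}"
        for code, (_name, genders) in languages.items()
        for gender in genders
    ]


if __name__ == "__main__":
    voices = list_voices()
    print(len(LANGUAGES), "languages,", len(voices), "voices")
```

Ten languages with two voices each plus one Gujarati voice gives the 21 voices listed under Features.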

	## 📡 API Usage

	### Endpoint

	\`\`\`
	https://harshil748-voiceapi.hf.space/
	\`\`\`

	### Parameters

	\| Parameter \| Type \| Required \| Description \|
	\|-----------\|------\|----------\|-------------\|
	\| \`text\` \| string \| Yes \| Text to synthesize (lowercase for English) \|
	\| \`lang\` \| string \| Yes \| Language name (hindi, bengali, etc.) \|
	\| \`speaker_wav\` \| file \| Yes \| Reference WAV file (for API compatibility) \|
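
A client may want to check the `text` and `lang` values before sending a request. The helper below is a minimal sketch of that idea, not part of the API itself. `build_params` and `SUPPORTED_LANGS` are hypothetical names, and only the spellings `hindi`, `bengali` and `english` appear in this README; the other language names are assumed to follow the same lowercase pattern.

```python
# Assumption: the API accepts lowercase English language names.
# Only "hindi", "bengali" and "english" are confirmed by this README.
SUPPORTED_LANGS = {
    "hindi", "bengali", "marathi", "telugu", "kannada", "gujarati",
    "bhojpuri", "chhattisgarhi", "maithili", "magahi", "english",
}


def build_params(text: str, lang: str) -> dict:
    """Validate inputs and return the query parameters for a request."""
    lang = lang.strip().lower()
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    if lang == "english":
        # The parameter table asks for lowercase English text.
        text = text.lower()
    return {"text": text, "lang": lang}


if __name__ == "__main__":
    print(build_params("Hello World", "english"))
```

The returned dictionary can be passed as `params` to an HTTP client; the `speaker_wav` file still has to be attached separately.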

	### Example (Python)

	\`\`\`python
	import requests

	# Inference endpoint of the hosted Space
	base_url = 'https://harshil748-voiceapi.hf.space/Get_Inference'
	wav_path = 'reference.wav'

	params = {
	    'text': 'नमस्ते, आप कैसे हैं?',
	    'lang': 'hindi',
	}

	# Send the reference WAV as a multipart file alongside the query parameters
	with open(wav_path, 'rb') as audio_file:
	    response = requests.get(
	        base_url,
	        params=params,
	        files={'speaker_wav': audio_file.read()},
	        timeout=120,
	    )

	if response.status_code == 200:
	    with open('output.wav', 'wb') as f:
	        f.write(response.content)
	    print("Audio saved as 'output.wav'")
	else:
	    print(f"Request failed with status {response.status_code}")
	\`\`\`

	### Example (cURL)

	\`\`\`bash
	curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \\
	-F "speaker_wav=@reference.wav" \\
	-o output.wav
	\`\`\`
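
Both examples write the response body straight to `output.wav`. A cautious client could first check that the body really is a WAV file, since an error page would otherwise be saved under a `.wav` name. The sketch below shows one way to do that with a RIFF/WAVE header check; `looks_like_wav` and `make_silence_wav` are hypothetical helpers, and the README does not describe what the server returns on errors.

```python
import io
import wave


def looks_like_wav(payload: bytes) -> bool:
    """Return True if the bytes start with a RIFF/WAVE container header."""
    return (
        len(payload) >= 12
        and payload[:4] == b"RIFF"
        and payload[8:12] == b"WAVE"
    )


def make_silence_wav(num_frames: int = 100, sample_rate: int = 22050) -> bytes:
    """Build a tiny mono 16-bit WAV in memory, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


if __name__ == "__main__":
    print(looks_like_wav(make_silence_wav()))
    print(looks_like_wav(b'{"detail": "error"}'))
```

In the Python example, the check would go before the `f.write(response.content)` call.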

	## 🏗️ Model Architecture

	- Base Model: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
	- Encoder: Transformer-based text encoder (6 layers, 192 hidden channels)
	- Decoder: HiFi-GAN neural vocoder
	- Duration Predictor: Stochastic duration predictor for natural prosody
	- Sample Rate: 22050 Hz (16000 Hz for Gujarati MMS)
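
Because Gujarati output uses a different sample rate from the other languages, downstream code should read the rate from the file instead of assuming 22050 Hz. The following sketch uses only the standard library; `expected_sample_rate` and `wav_info` are hypothetical helper names, and the `gujarati` spelling of the language name is an assumption.

```python
import io
import wave

# Sample rates stated in this README.
DEFAULT_SAMPLE_RATE = 22050
GUJARATI_MMS_SAMPLE_RATE = 16000


def expected_sample_rate(lang: str) -> int:
    """Sample rate the README states for a language name (assumed spelling)."""
    if lang.strip().lower() == "gujarati":
        return GUJARATI_MMS_SAMPLE_RATE
    return DEFAULT_SAMPLE_RATE


def wav_info(payload: bytes):
    """Return (sample_rate, duration_in_seconds) read from WAV bytes."""
    with wave.open(io.BytesIO(payload), "rb") as reader:
        rate = reader.getframerate()
        return rate, reader.getnframes() / rate
```

Comparing `wav_info(audio)[0]` with `expected_sample_rate(lang)` is a cheap sanity check before resampling or playback.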

	## 📊 Training

	### Datasets Used

	\| Dataset \| Languages \| Source \| License \|
	\|---------\|-----------\|--------\|---------\|
	\| OpenSLR-103 \| Hindi \| [OpenSLR](https://www.openslr.org/103/) \| CC BY 4.0 \|
	\| OpenSLR-37 \| Bengali \| [OpenSLR](https://www.openslr.org/37/) \| CC BY 4.0 \|
	\| OpenSLR-64 \| Marathi \| [OpenSLR](https://www.openslr.org/64/) \| CC BY 4.0 \|
	\| OpenSLR-66 \| Telugu \| [OpenSLR](https://www.openslr.org/66/) \| CC BY 4.0 \|
	\| OpenSLR-79 \| Kannada \| [OpenSLR](https://www.openslr.org/79/) \| CC BY 4.0 \|
	\| OpenSLR-78 \| Gujarati \| [OpenSLR](https://www.openslr.org/78/) \| CC BY 4.0 \|
	\| Common Voice \| Hindi, Bengali \| [Mozilla](https://commonvoice.mozilla.org/) \| CC0 \|
	\| IndicTTS \| Multiple \| [IIT Madras](https://www.iitm.ac.in/donlab/tts/) \| Research \|
	\| Indic-Voices \| Multiple \| [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) \| CC BY 4.0 \|

	### Training Configuration

	- Epochs: 1000
	- Batch Size: 32
	- Learning Rate: 2e-4
	- Optimizer: AdamW
	- FP16 Training: Enabled
	- Hardware: NVIDIA V100/A100 GPUs
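
The settings above can be gathered in one configuration object, which also makes it easy to estimate how many optimizer steps a run involves. This is a sketch only: `TrainingConfig` is a hypothetical class, the actual files under `training/configs/` may be organized differently, and the step estimate assumes a single process with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values copied from the Training Configuration list."""
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def total_steps(self, num_utterances: int) -> int:
        """Optimizer steps for a dataset of this size, keeping the last partial batch."""
        return math.ceil(num_utterances / self.batch_size) * self.epochs


if __name__ == "__main__":
    config = TrainingConfig()
    # 10,000 utterances is a made-up dataset size for illustration.
    print(config.total_steps(10_000))
```

With the hypothetical 10,000 utterances, each epoch takes 313 steps (312 full batches plus one partial batch), so 1000 epochs come to 313,000 steps.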

	See \`training/\` directory for full training scripts and configurations.

	## 🚀 Deployment

	This API is deployed on Hugging Face Spaces using Docker:

	\`\`\`dockerfile
	FROM python:3.10-slim
	# ... installs dependencies
	# Downloads models from Harshil748/VoiceAPI-Models
	# Runs FastAPI server on port 7860
	\`\`\`

	Models are hosted separately at [Harshil748/VoiceAPI-Models](https://huggingface.co/Harshil748/VoiceAPI-Models) (about 8 GB).
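
Since the models live in a separate repository, the container has to fetch them before the server starts. The sketch below shows one way such a step could look, using `snapshot_download` from the `huggingface_hub` package. It is not the project's `download_models.py`: the function names and the `models` target directory are assumptions.

```python
from pathlib import Path

MODEL_REPO = "Harshil748/VoiceAPI-Models"  # repository named in this README


def needs_download(model_dir: str) -> bool:
    """Return True if the directory is missing or contains no entries."""
    path = Path(model_dir)
    return not path.is_dir() or not any(path.iterdir())


def ensure_models(model_dir: str = "models") -> str:
    """Download the model repository into model_dir unless it is already populated."""
    if needs_download(model_dir):
        # Imported lazily so needs_download() works without the dependency.
        from huggingface_hub import snapshot_download

        snapshot_download(repo_id=MODEL_REPO, local_dir=model_dir)
    return model_dir
```

Skipping the download when the directory is already populated avoids fetching the roughly 8 GB of models again on every restart, provided the directory persists between runs.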

	## 📁 Project Structure

	\`\`\`

	VoiceAPI/
	├── app.py                  # HuggingFace Spaces entry point
	├── Dockerfile              # Docker configuration
	├── requirements.txt        # Python dependencies
	├── download_models.py      # Model downloader
	├── src/
	│   ├── api.py              # FastAPI REST server
	│   ├── engine.py           # TTS inference engine
	│   ├── config.py           # Voice configurations
	│   └── tokenizer.py        # Text tokenization
	└── training/
	    ├── train_vits.py       # VITS training script
	    ├── prepare_dataset.py  # Data preparation
	    ├── export_model.py     # Model export
	    ├── datasets.csv        # Dataset links
	    └── configs/            # Training configs

	\`\`\`

	## 📜 License

	- Code: MIT License
	- Models: CC BY 4.0 (following SYSPIN licensing)
	- Datasets: Individual licenses (see training/datasets.csv)

	## 🙏 Acknowledgments

	- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for pre-trained VITS models
	- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for Gujarati TTS
	- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
	- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources

	## 📧 Contact

	Built for the Voice Tech for All Hackathon: multilingual TTS for healthcare assistants serving low-income communities.