Try Orpheus TTS here
Convert audio to text with context and language options
Generate speech from text