Converting text to audio with GenAI

This blog compares the Coqui XTTS and Kokoro GenAI models for text-to-audio conversion, highlighting their unique expressive capabilities and advancements in voice synthesis technology.
 
Sachin Raj S P
Senior Software Engineer, CU-AIX-ST-PILOT, ERS

Introduction:

Over the past few years, Text-to-Audio (TTA) systems have experienced a significant evolution. Models can now read text aloud and imbue that speech with personality, emotion and identity, thanks to Generative AI (GenAI). Coqui XTTS and Kokoro are two of the most promising open-source models spearheading this change. Between them they offer emotional synthesis, voice cloning and fine-tuning capabilities. This article compares their performance, examines their advantages and explores how they facilitate the creation of expressive audio.

Flat robotic voices are no longer the only option available to modern TTS systems. These models can now produce human-like audio that captures subtle cues, such as pauses, sighs, pitch changes, laughter and even expressions of happiness or frustration, thanks to GenAI.

GenAI has opened up new possibilities, including:

  • Clearly narrated audiobooks and podcasts
  • Character voices in animation and video games
  • Dubbing in several languages while maintaining voice identity
  • Emotional alerts and natural-sounding screen readers
  • Voice assistants and human-sounding customer service bots

Comparison of models:

Coqui created an effective model known as XTTS (Cross-lingual Text-to-Speech), which is notable for its adaptability and expressiveness. It is suitable for both deeper custom voice creation and quick prototyping, as it supports both full fine-tuning and few-shot speaker cloning. Highly realistic results can be obtained by fine-tuning XTTS with as few as 100 to 200 emotion-rich samples. With remarkable accuracy, it captures emotional nuances, tone shifts and vocal characteristics.

Kokoro is a more recent, open-source, lightweight model designed for emotional speech synthesis. Unlike XTTS, it lacks few-shot cloning, so voice adaptation requires deep fine-tuning. Its architecture is, however, well suited to producing emotionally charged speech, especially in dialogue and narrative settings.

At present, Kokoro only supports inference with .pth checkpoints, which means no fine-tuning, speaker cloning or multi-speaker synthesis. It does not yet have CLI or API support, so users must load the models manually through Python scripts. Documentation is scarce and community support is still nascent, so most users find it hard to customize or train the model for audiobook-style narration.

Coqui XTTS v2 is highly capable for expressive, multilingual speech synthesis. It supports audiobook-style narration, emotional monologues and high-prosody dialogue using a fixed or cloned voice. The model offers few-shot speaker cloning and fine-tuning with small datasets. It also supports multilingual synthesis and learning emotion from data, backed by extensive documentation and active community support.

XTTSv2 is an upgrade of the original XTTS model, offering significant improvements in both performance and audio quality. It provides lower-latency inference and a more natural, human-like voice, making it better suited to real-time applications. The v2 model produces cleaner audio with fewer glitches or artifacts and a more accurate, realistic-sounding voice clone. XTTSv2 also supports a larger set of languages and provides more robust fine-tuning, particularly for small datasets.

Kokoro:

The Kokoro repository is hosted on Hugging Face at https://huggingface.co/hexgrad/Kokoro-82M; clone it to use it in a Colab notebook.
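A minimal sketch of that step in a Colab cell (this assumes git-lfs is available in the runtime, which it normally is on Colab; the .pth checkpoint is stored with Git LFS):

# Run in a Colab cell: pull the Kokoro-82M repo, including the LFS-tracked model weights
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M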

Either wrap the model using KModel('/content/Kokoro-82M/kokoro-v1_0.pth') or load the model directly from the checkpoint path, as shown in the snippets below. The voice embeddings shipped with the model cover a selection of languages and voices, such as English, Chinese, Japanese and a few more. Generate the audio using your custom text and the selected voice and language.
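For the wrapped route, here is a minimal sketch using the higher-level KPipeline helper from the kokoro pip package rather than constructing KModel directly (an assumption on my part: kokoro, soundfile and the espeak-ng system package are installed in the Colab runtime; af_bella is one of the bundled voices):

# Sketch of the pipeline route (assumes: pip install kokoro soundfile, plus espeak-ng)
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' selects American English

text = "Hello from Kokoro, a lightweight open-weight TTS model."

# The pipeline yields (graphemes, phonemes, audio) chunks; Kokoro outputs 24 kHz audio
for i, (gs, ps, audio) in enumerate(pipeline(text, voice='af_bella')):
    sf.write(f"kokoro_chunk_{i}.wav", audio, 24000)

The second route, loading the v0.19 checkpoint directly with the helper modules shipped in the cloned repo, looks like this: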

import torch
import IPython.display as ipd

# build_model and generate come from the helper modules in the cloned Kokoro-82M repo
# (run from inside the cloned directory so these modules are importable)
from models import build_model
from kokoro import generate

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the checkpoint and a voice embedding ("af_bella" is an American English female voice)
model_path = "kokoro-v0_19.pth"
model = build_model(model_path, device)
voice = torch.load("voices/af_bella.pt", weights_only=True).to(device)

text_input = "Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects."

# lang='a' selects American English to match the chosen voice
audio, sr = generate(model, text_input, voice, lang='a')

# Play the generated audio inline in the notebook
ipd.display(ipd.Audio(audio.cpu().numpy(), rate=sr))

The sample text is the default from the Kokoro model card. The generated output is in WAV format.
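The snippet above only plays the audio inline; to persist the result as a .wav file, one option is the soundfile package (a minimal sketch, assuming soundfile is installed and reusing the audio and sr variables from the code above):

import soundfile as sf

# Write the generated waveform to disk as a WAV file
sf.write("kokoro_output.wav", audio.cpu().numpy(), sr)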

Audio sample: Kokoro output

XTTS:

Coqui XTTS goes beyond simple text-to-audio conversion and clones a voice from the speaker recordings provided as input.

From the TTS package, import the TTS API class and load the xtts_v2 model:

import os
from TTS.api import TTS

# Load the multilingual XTTS v2 model (set gpu=True if a GPU is available)
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)

# Collect the reference recordings of the target speaker for few-shot cloning
speaker_wavs = [os.path.join("/content/clean_wavs", f) for f in os.listdir("/content/clean_wavs") if f.endswith(".wav")]

# Synthesize the text in the cloned voice and write the result to a WAV file
tts.tts_to_file(
    text="Coqui TTS with XTTS technology offers remarkable voice cloning capabilities, allowing users to replicate voices from just 3-second audio samples. This technology opens up a world of possibilities for personalized voice synthesis.",
    speaker_wav=speaker_wavs,
    language="en",
    file_path="/content/cloned_output.wav",
)

The sample text is the default from the Coqui website. The speaker input for few-shot cloning is my own voice (Indian male). The generated output is in WAV format.

This demonstrates the power of the Coqui XTTS model for text-to-audio conversion, followed by voice cloning through few-shot prompting.
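Because XTTS v2 is multilingual, the same cloned voice can also narrate text in another supported language. Here is a minimal sketch that reuses the tts object and speaker_wavs list from the snippet above (the French sample text and output path are purely illustrative):

# Cross-lingual synthesis: keep the cloned voice identity while switching the output language
tts.tts_to_file(
    text="La synthèse vocale multilingue permet de conserver l'identité de la voix clonée.",
    speaker_wav=speaker_wavs,
    language="fr",
    file_path="/content/cloned_output_fr.wav",
)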
