misaki extras. The endpoint returns 24 kHz mono audio as a base64-encoded WAV file.
By the end of this tutorial, you’ll be able to generate audio like this:
Set up imports
Kokoro exposes two classes: KModel (the weights and forward pass) and KPipeline (grapheme-to-phoneme, or G2P, conversion and voice management). By default, both download from Hugging Face on first use. This Truss instead uses the Baseten Delivery Network (BDN) to mirror the weights to a local mount, so cold starts skip the download, and the load function points KModel and KPipeline at that mount.
model/model.py
Define the Model class and load function
Load KModel from the BDN-mounted config.json and kokoro-v1_0.pth, then read every voicepack from /weights/kokoro/voices/ into memory. Each KPipeline reuses the shared model and inherits the preloaded voicepacks, so no request ever reaches Hugging Face.
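The preloading step described above can be sketched as a small helper that reads every voicepack in a directory into a dictionary keyed by voice name. This is a minimal sketch, not the Truss's actual code: `preload_voicepacks` and its `loader` parameter are hypothetical names, and the `.pt` suffix is an assumption about how the voicepack files are named. In a real load function the loader would most likely be a tensor deserializer such as `torch.load`.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def preload_voicepacks(
    voices_dir: str,
    loader: Callable[[Path], Any],
    suffix: str = ".pt",  # assumed voicepack file extension
) -> Dict[str, Any]:
    """Read every voicepack in voices_dir into memory, keyed by voice name."""
    voices: Dict[str, Any] = {}
    for path in sorted(Path(voices_dir).glob(f"*{suffix}")):
        # The file stem doubles as the voice name, e.g. "af_heart.pt" -> "af_heart".
        voices[path.stem] = loader(path)
    return voices
```

Because the dictionary is built once at load time, later requests can look up a voice by name without touching the filesystem or the network.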
The base kokoro package only ships English G2P. To use Japanese or Mandarin voices, add misaki[ja] or misaki[zh] to the requirements block in config.yaml. Spanish, French, Hindi, Italian, and Portuguese voices use the espeak-ng fallback, which is already installed below.
model/model.py
Define the predict function
KPipeline is a generator that yields one (graphemes, phonemes, audio) tuple per chunk. It splits English on phoneme boundaries (510-phoneme chunks) and non-English on sentence boundaries, so you don’t need to pre-chunk long input. Concatenate the per-chunk audio tensors and encode the result as a base64 WAV.
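The concatenate-and-encode step can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in for the real predict function: plain lists of floats in [-1, 1] take the place of the per-chunk audio tensors, `chunks_to_base64_wav` is a hypothetical name, and 16-bit PCM is an assumed sample format. Only the 24 kHz mono layout comes from the text above.

```python
import base64
import io
import struct
import wave
from typing import Iterable, List

SAMPLE_RATE = 24_000  # 24 kHz mono, as stated for this endpoint


def chunks_to_base64_wav(
    chunks: Iterable[List[float]], sample_rate: int = SAMPLE_RATE
) -> str:
    """Concatenate per-chunk samples and return a base64-encoded mono WAV."""
    # Flatten the chunks in the order the pipeline yielded them.
    samples = [s for chunk in chunks for s in chunk]
    # Clip to [-1, 1] and scale to signed 16-bit little-endian PCM (assumed format).
    pcm = struct.pack(
        f"<{len(samples)}h",
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

Clipping before scaling keeps out-of-range samples from overflowing the 16-bit range, which would otherwise raise an error when packing.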
The full set of voices is listed in the model’s VOICES.md. Voice names follow the pattern <lang><gender>_<name>, for example af_heart (American female), bm_lewis (British male), or ef_dora (Spanish female).
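To make the naming pattern concrete, here is a minimal sketch of a parser for voice names. `parse_voice`, `Voice`, and the lookup tables are hypothetical helpers, not part of the kokoro package, and the language table only covers the three prefixes named above; extend it from VOICES.md for other languages.

```python
from typing import NamedTuple

# Only the prefixes named in this tutorial; VOICES.md lists the rest.
LANGUAGES = {"a": "American English", "b": "British English", "e": "Spanish"}
GENDERS = {"f": "female", "m": "male"}


class Voice(NamedTuple):
    language: str
    gender: str
    name: str


def parse_voice(voice: str) -> Voice:
    """Split a name like "af_heart" into language, gender, and speaker name."""
    prefix, sep, name = voice.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"expected <lang><gender>_<name>, got {voice!r}")
    lang, gender = prefix  # two characters: language code, then gender code
    if lang not in LANGUAGES or gender not in GENDERS:
        raise ValueError(f"unknown language or gender code in {voice!r}")
    return Voice(LANGUAGES[lang], GENDERS[gender], name)
```

A check like this could reject a malformed voice name before it reaches the pipeline, so the caller gets a clear error instead of a missing-voicepack failure.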
model/model.py
Set up the config.yaml
The kokoro package pulls torch and transformers as transitive dependencies, so the requirements list stays short. Use the weights block to specify the Hugging Face source and a mount_location for the model files. This uses BDN, which mirrors the weights once and serves them from multi-tier caches on every cold start.
config.yaml
Configure resources for Kokoro
A T4 GPU runs Kokoro’s 82M parameters with room to spare.
config.yaml
System packages
Kokoro uses espeak-ng as a fallback grapheme-to-phoneme backend for out-of-dictionary words and non-English languages.
config.yaml
Deploy the model
Deploy the model like you would any other Truss:
Generate a WAV file
Call the deployed model and decode the base64 response to a .wav file.
infer.py
infer.py decodes the base64 response into output.wav in your working directory. Select the file in your file browser, then select play to hear Kokoro speak the text from your request.
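The decoding step can be sketched on its own, separate from the HTTP call. This is a minimal illustration rather than the contents of infer.py: `save_wav` is a hypothetical helper, and because the response field that carries the audio is not shown here, the function takes the base64 string directly. It also reads the WAV header, which is a quick way to confirm the 24 kHz mono format before playing the file.

```python
import base64
import io
import wave


def save_wav(b64_audio: str, path: str = "output.wav") -> dict:
    """Decode a base64 WAV payload, write it to path, and return its format."""
    raw = base64.b64decode(b64_audio)
    # Parse the header from memory so a malformed payload fails before writing.
    with wave.open(io.BytesIO(raw), "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "seconds": wav.getnframes() / wav.getframerate(),
        }
    with open(path, "wb") as f:
        f.write(raw)
    return info
```

For audio from this endpoint, the returned dictionary should report one channel and a sample rate of 24000.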
The first inference call after a cold start takes a few seconds while Kokoro compiles its CUDA kernels. Subsequent calls return audio in under a second.
Other TTS options
For higher-throughput or streaming use cases, see:
- Orpheus 3B WebSocket TTS: real-time streaming over WebSocket with TensorRT-LLM on an H100.
- Chatterbox TTS: voice cloning from a reference audio clip.
- Piper TTS: CPU-only TTS for low-latency, low-cost deployments.