KeiSeiKit-1.0/_primitives/_rust/kei-tts/README.md
Parfii-bot b5da1940e1 feat(kei-tts + kei-stt): TTS/STT abstractions with 4+3 backends
Two parallel atomars in the kei-buddy phase-1 plan. Mirror each other's
architecture: trait + feature-gated backend modules + env-driven dispatch
+ wiremock tests for HTTP backends + subprocess-error test for local.

## kei-tts (text-to-speech)
LOC: 959 across 15 files (largest src/lib.rs 121).
Trait `TtsBackend` + 4 backends behind feature flags:
  * elevenlabs — POST api.elevenlabs.io/v1/text-to-speech/{voice}/stream
  * openai     — POST api.openai.com/v1/audio/speech (tts-1, tts-1-hd)
  * google     — POST texttospeech.googleapis.com/v1/text:synthesize
                 (Wavenet voices, base64 audioContent)
  * piper      — local subprocess to piper-tts binary, raw PCM out
Default features: ["piper"]. all-backends feature gates the rest.
`from_env()` reads KEI_TTS_BACKEND (default piper). Returns Box<dyn TtsBackend>.
Tests: 9 passed (env routing + 3 wiremock backends + piper subprocess error).

## kei-stt (speech-to-text)
LOC: 935 across 13 files (largest whisper_local.rs 181).
Trait `SttBackend` + 3 backends:
  * whisper-local  — subprocess to `whisper` CLI / faster-whisper,
                     reads JSON output, parses segments
  * deepgram       — POST api.deepgram.com/v1/listen (Token auth header,
                     raw audio body, parses words → Segments)
  * openai-whisper — POST api.openai.com/v1/audio/transcriptions
                     (multipart file + model=whisper-1 +
                      response_format=verbose_json)
Default features: ["whisper-local"]. all-backends gates the rest.
`from_env()` reads KEI_STT_BACKEND (default whisper-local).
Tests: 10 passed + 1 doc-test (env routing + 5 wiremock + 2 JSON parsers
+ 1 subprocess error + 1 auth-header check).

## Common architecture decisions
  * `with_base_url(url)` constructor on each HTTP backend for wiremock
    testability — same pattern as kei-llm-router and kei-notify-telegram.
  * `tempfile` crate added to kei-stt for whisper-local audio scratch.
  * `base64 = { version = "0.22", optional = true }` in kei-tts for
    Google's base64-encoded audioContent.

## Verify-before-commit (RULE 0.13 §)
  * cargo check -p kei-tts (default + all-backends): PASS
  * cargo check -p kei-stt (default + all-backends): PASS
  * cargo test -p kei-tts --features all-backends --lib: 9/0
  * cargo test -p kei-stt --features all-backends --lib: 10/0
  * cargo check --workspace: PASS

STATUS-TRUTH from both agents: shipped=functional, stubs=0,
behaviour-verified=yes.

## Follow-up (deferred, non-blocking)
  * Real backend verification needs API keys for ElevenLabs / OpenAI /
    Google / Deepgram and piper-tts binary + .onnx model on PATH.
  * whisper-local language_detected always None — whisper CLI JSON
    schema differs across versions, parse heuristic to be added.
  * faster-whisper has different JSON schema from openai-whisper;
    current parser covers openai-whisper convention only.
2026-05-12 13:47:35 +08:00

# kei-tts
Text-to-speech abstraction crate with four backends, selected at runtime via
`KEI_TTS_BACKEND`. Only backends whose Cargo feature is enabled are compiled in.
The default backend is **piper** (local, free, no network round-trip).
## Backend matrix
| Backend      | Feature flag | Cost (USD)          | Latency    | Quality   | Language coverage  |
|--------------|--------------|---------------------|------------|-----------|--------------------|
| `piper`      | `piper`      | Free                | ~50–200 ms | Good      | 20+ language packs |
| `elevenlabs` | `elevenlabs` | ~$0.30 / 1k chars   | 300–600 ms | Excellent | 30+ languages      |
| `openai`     | `openai`     | ~$0.015 / 1k chars  | 200–500 ms | Very good | 50+ languages      |
| `google`     | `google`     | ~$4 / 1M chars      | 200–400 ms | Very good | 40+ languages      |
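
To compare costs for a given workload, the per-character rates in the matrix can be turned into a rough estimate. The sketch below is illustrative only: `usd_per_1k_chars` and `estimate_usd` are hypothetical helpers that are not part of kei-tts, and the rates are the approximate figures from the table, not live pricing.

```rust
/// Approximate USD per 1,000 characters, copied from the backend matrix.
/// Hypothetical helper, not part of the kei-tts API.
fn usd_per_1k_chars(backend: &str) -> Option<f64> {
    match backend {
        "piper" => Some(0.0),
        "elevenlabs" => Some(0.30),
        "openai" => Some(0.015),
        "google" => Some(0.004), // ~$4 per 1M characters
        _ => None,
    }
}

/// Rough cost of synthesising `chars` characters on the given backend.
fn estimate_usd(backend: &str, chars: u64) -> Option<f64> {
    usd_per_1k_chars(backend).map(|rate| rate * chars as f64 / 1000.0)
}

fn main() {
    let chars = 100_000;
    for b in ["piper", "elevenlabs", "openai", "google"] {
        if let Some(cost) = estimate_usd(b, chars) {
            println!("{b:<11} ${cost:.2}");
        }
    }
}
```

At these rates, 100,000 characters come to roughly $30 on ElevenLabs, $1.50 on OpenAI, and $0.40 on Google, while piper stays free.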
## Environment variables
| Variable               | Backend    | Required | Description                                            |
|------------------------|------------|----------|--------------------------------------------------------|
| `KEI_TTS_BACKEND`      | all        | No       | `piper` (default) / `elevenlabs` / `openai` / `google` |
| `ELEVENLABS_API_KEY`   | elevenlabs | Yes      | ElevenLabs API key                                     |
| `OPENAI_API_KEY`       | openai     | Yes      | OpenAI API key                                         |
| `KEI_TTS_OPENAI_MODEL` | openai     | No       | `tts-1` (default) or `tts-1-hd`                        |
| `GOOGLE_TTS_API_KEY`   | google     | Yes      | Google Cloud API key                                   |
| `KEI_TTS_PIPER_MODEL`  | piper      | Yes      | Path to the `.onnx` piper model file                   |
| `KEI_TTS_PIPER_BINARY` | piper      | No       | Path to the `piper-tts` binary (default: found on `PATH`) |
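
The table above implies a simple resolution order: read `KEI_TTS_BACKEND`, fall back to `piper`, then check the variable that backend requires. The following is a minimal sketch of that logic, not the crate's actual implementation. `BackendKind`, `ConfigError`, and `select_backend` are hypothetical names, and whether the real `from_env()` validates required variables this early is an assumption.

```rust
use std::fmt;

/// Hypothetical stand-in for the selected backend; the real
/// `kei_tts::from_env()` returns a `Box<dyn TtsBackend>` instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BackendKind {
    Piper,
    ElevenLabs,
    OpenAi,
    Google,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum ConfigError {
    UnknownBackend(String),
    MissingVar(&'static str),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::UnknownBackend(name) => write!(f, "unknown backend `{}`", name),
            ConfigError::MissingVar(var) => write!(f, "required variable `{}` is not set", var),
        }
    }
}

/// Resolve the backend through a lookup function, so the logic can be
/// exercised without mutating the process environment.
fn select_backend<F>(lookup: F) -> Result<BackendKind, ConfigError>
where
    F: Fn(&str) -> Option<String>,
{
    // KEI_TTS_BACKEND is optional and defaults to piper.
    let name = lookup("KEI_TTS_BACKEND").unwrap_or_else(|| "piper".to_string());
    let (kind, required) = match name.as_str() {
        "piper" => (BackendKind::Piper, "KEI_TTS_PIPER_MODEL"),
        "elevenlabs" => (BackendKind::ElevenLabs, "ELEVENLABS_API_KEY"),
        "openai" => (BackendKind::OpenAi, "OPENAI_API_KEY"),
        "google" => (BackendKind::Google, "GOOGLE_TTS_API_KEY"),
        other => return Err(ConfigError::UnknownBackend(other.to_string())),
    };
    // Each backend has exactly one required variable in the table.
    match lookup(required) {
        Some(value) if !value.is_empty() => Ok(kind),
        _ => Err(ConfigError::MissingVar(required)),
    }
}

fn main() {
    match select_backend(|key| std::env::var(key).ok()) {
        Ok(kind) => println!("selected backend: {:?}", kind),
        Err(err) => eprintln!("configuration error: {}", err),
    }
}
```

Passing the lookup as a closure keeps the sketch testable with a fake environment, in the same spirit as the `with_base_url` constructors used for wiremock tests.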
## Usage
```toml
[dependencies]
kei-tts = { path = "../kei-tts", features = ["piper"] }
```
```rust
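#[tokio::main]
async fn main() -> Result<(), kei_tts::TtsError> {
    // Pick the backend named by KEI_TTS_BACKEND (defaults to piper).
    let backend = kei_tts::from_env()?;
    let req = kei_tts::TtsRequest::new("Hello, world!");
    let resp = backend.synth(&req).await?;

    // The container depends on the backend (piper emits raw PCM), so pick
    // the file extension to match. `.ok()` ignores write errors.
    std::fs::write("out.mp3", &resp.audio_bytes).ok();
    println!("synthesised {} bytes via {}", resp.audio_bytes.len(), backend.name());
    Ok(())
}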
```
## Compile-time features
```toml
# All backends:
kei-tts = { path = "../kei-tts", features = ["all-backends"] }
# Cloud only, no piper (use instead of the line above):
kei-tts = { path = "../kei-tts", default-features = false, features = ["elevenlabs", "openai", "google"] }
```
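
Because backends are feature-gated, a backend named in `KEI_TTS_BACKEND` may be absent from the build. The sketch below shows one way such a check could look using the standard `cfg!` macro. `compiled_backends` and `ensure_compiled` are hypothetical helpers, and how kei-tts itself reports a backend that was not compiled in is not documented here.

```rust
/// Names of the backends compiled into this build.
/// `cfg!(feature = "...")` expands to a compile-time `true` or `false`.
/// The `allow` only matters when compiling outside a Cargo package that
/// declares these features.
#[allow(unexpected_cfgs)]
fn compiled_backends() -> Vec<&'static str> {
    let mut names = Vec::new();
    if cfg!(feature = "piper") {
        names.push("piper");
    }
    if cfg!(feature = "elevenlabs") {
        names.push("elevenlabs");
    }
    if cfg!(feature = "openai") {
        names.push("openai");
    }
    if cfg!(feature = "google") {
        names.push("google");
    }
    names
}

/// Hypothetical guard: reject a runtime choice that was not compiled in.
fn ensure_compiled(requested: &str) -> Result<(), String> {
    if compiled_backends().contains(&requested) {
        Ok(())
    } else {
        Err(format!(
            "backend `{requested}` is not compiled in; enable its Cargo feature"
        ))
    }
}

fn main() {
    println!("compiled backends: {:?}", compiled_backends());
    if let Err(err) = ensure_compiled("elevenlabs") {
        eprintln!("{err}");
    }
}
```

With `default-features = false` and only cloud features enabled, a guard like this would reject `piper` even though it is the runtime default, so `KEI_TTS_BACKEND` would need to be set explicitly in that configuration.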