Two parallel atomars in the kei-buddy phase-1 plan. They mirror each other's
architecture: trait + feature-gated backend modules + env-driven dispatch
+ wiremock tests for the HTTP backends + a subprocess-error test for the
local backend.
## kei-tts (text-to-speech)
LOC: 959 across 15 files (largest src/lib.rs 121).
Trait `TtsBackend` + 4 backends behind feature flags:
* elevenlabs — POST api.elevenlabs.io/v1/text-to-speech/{voice}/stream
* openai — POST api.openai.com/v1/audio/speech (tts-1, tts-1-hd)
* google — POST texttospeech.googleapis.com/v1/text:synthesize
  (Wavenet voices, base64 audioContent)
* piper — local subprocess to piper-tts binary, raw PCM out
Default features: ["piper"]. The `all-backends` feature enables the rest.
`from_env()` reads KEI_TTS_BACKEND (default piper). Returns Box<dyn TtsBackend>.
Tests: 9 passed (env routing + 3 wiremock backends + piper subprocess error).
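
The env-driven dispatch described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the crate's actual code: the `TtsError` variant, the `select_backend` helper, the two stand-in backend structs, and the choice to reject unknown names with an error are all assumptions made for the example. Only `KEI_TTS_BACKEND`, the `piper` default, and the `Box<dyn TtsBackend>` return shape come from the notes above.

```rust
/// Hypothetical error type for this sketch (not the crate's real one).
#[derive(Debug, PartialEq)]
enum TtsError {
    UnknownBackend(String),
}

/// Simplified, synchronous stand-in for the `TtsBackend` trait.
trait TtsBackend {
    fn name(&self) -> &'static str;
}

// Stand-in backends; the real ones would hold HTTP clients or binary paths.
struct Piper;
struct OpenAi;

impl TtsBackend for Piper {
    fn name(&self) -> &'static str {
        "piper"
    }
}

impl TtsBackend for OpenAi {
    fn name(&self) -> &'static str {
        "openai"
    }
}

/// Pure selection step: maps an optional backend name to a boxed backend.
/// A missing name falls back to "piper"; unknown names are an error
/// (an assumption of this sketch).
fn select_backend(name: Option<&str>) -> Result<Box<dyn TtsBackend>, TtsError> {
    match name.unwrap_or("piper") {
        "piper" => Ok(Box::new(Piper)),
        "openai" => Ok(Box::new(OpenAi)),
        other => Err(TtsError::UnknownBackend(other.to_string())),
    }
}

/// Thin wrapper that reads `KEI_TTS_BACKEND` and delegates to `select_backend`.
fn from_env() -> Result<Box<dyn TtsBackend>, TtsError> {
    let value = std::env::var("KEI_TTS_BACKEND").ok();
    select_backend(value.as_deref())
}

fn main() {
    match from_env() {
        Ok(backend) => println!("selected backend: {}", backend.name()),
        Err(err) => eprintln!("error: {err:?}"),
    }
}
```

Keeping the selection logic in a pure function lets routing tests avoid mutating process-wide environment variables.
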
## kei-stt (speech-to-text)
LOC: 935 across 13 files (largest whisper_local.rs 181).
Trait `SttBackend` + 3 backends:
* whisper-local — subprocess to `whisper` CLI / faster-whisper,
  reads JSON output, parses segments
* deepgram — POST api.deepgram.com/v1/listen (Token auth header,
  raw audio body, parses words → Segments)
* openai-whisper — POST api.openai.com/v1/audio/transcriptions
  (multipart file + model=whisper-1 +
  response_format=verbose_json)
Default features: ["whisper-local"]. The `all-backends` feature enables the rest.
`from_env()` reads KEI_STT_BACKEND (default whisper-local).
Tests: 10 passed + 1 doc-test (env routing + 5 wiremock + 2 JSON parsers
+ 1 subprocess error + 1 auth-header check).
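
For the subprocess-error path, a sketch along these lines shows how a missing binary can surface as a typed error. It is illustrative only: the `SttError` variants and the `run_whisper` helper are hypothetical names, and the real backend passes additional CLI arguments that are omitted here.

```rust
use std::io;
use std::process::Command;

/// Hypothetical error type for this sketch (not the crate's real `SttError`).
#[derive(Debug)]
enum SttError {
    /// The binary could not be started (for example, not found on PATH).
    Spawn(io::Error),
    /// The binary ran but exited with a non-zero status.
    NonZeroExit { code: Option<i32>, stderr: String },
}

/// Runs `binary` on `audio_path` and returns its stdout as text.
/// Real CLI flags (model, output format, ...) are intentionally left out.
fn run_whisper(binary: &str, audio_path: &str) -> Result<String, SttError> {
    let output = Command::new(binary)
        .arg(audio_path)
        .output()
        .map_err(SttError::Spawn)?;

    if !output.status.success() {
        return Err(SttError::NonZeroExit {
            code: output.status.code(),
            stderr: String::from_utf8_lossy(&output.stderr).into_owned(),
        });
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() {
    // A deliberately bogus binary name triggers the spawn-error branch.
    match run_whisper("kei-stt-no-such-binary", "speech.wav") {
        Ok(out) => println!("{out}"),
        Err(err) => eprintln!("transcription failed: {err:?}"),
    }
}
```
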
## Common architecture decisions
* `with_base_url(url)` constructor on each HTTP backend for wiremock
  testability — same pattern as kei-llm-router and kei-notify-telegram.
* `tempfile` crate added to kei-stt for whisper-local audio scratch.
* `base64 = { version = "0.22", optional = true }` in kei-tts for
  Google's base64-encoded audioContent.
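
The `with_base_url` pattern can be illustrated with a small sketch. The struct layout, the builder-style signature, and the helper method names below are assumptions for the example; only the `/v1/listen` path and the `Token` auth scheme come from the notes above. In a test, the mock server's address would be passed as the base URL.

```rust
/// Hypothetical stand-in for an HTTP backend; field names are illustrative.
struct DeepgramBackend {
    base_url: String,
    api_key: String,
}

impl DeepgramBackend {
    const DEFAULT_BASE_URL: &'static str = "https://api.deepgram.com";

    /// Production constructor: targets the real API host.
    fn new(api_key: impl Into<String>) -> Self {
        Self {
            base_url: Self::DEFAULT_BASE_URL.to_string(),
            api_key: api_key.into(),
        }
    }

    /// Test hook: redirects all requests to `url` (e.g. a mock server).
    fn with_base_url(mut self, url: impl Into<String>) -> Self {
        self.base_url = url.into();
        self
    }

    /// Full endpoint URL; a trailing slash on the base URL is tolerated.
    fn listen_url(&self) -> String {
        format!("{}/v1/listen", self.base_url.trim_end_matches('/'))
    }

    /// Value for the `Authorization` header.
    fn authorization_header(&self) -> String {
        format!("Token {}", self.api_key)
    }
}

fn main() {
    let prod = DeepgramBackend::new("secret");
    let test = DeepgramBackend::new("secret").with_base_url("http://127.0.0.1:8080/");
    println!("{}", prod.listen_url());
    println!("{}", test.listen_url());
    println!("Authorization: {}", test.authorization_header());
}
```
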
## Verify-before-commit (RULE 0.13 §)
* cargo check -p kei-tts (default + all-backends): PASS
* cargo check -p kei-stt (default + all-backends): PASS
* cargo test -p kei-tts --features all-backends --lib: 9 passed, 0 failed
* cargo test -p kei-stt --features all-backends --lib: 10 passed, 0 failed
* cargo check --workspace: PASS
STATUS-TRUTH from both agents: shipped=functional, stubs=0,
behaviour-verified=yes.
## Follow-up (deferred, non-blocking)
* Real backend verification needs API keys for ElevenLabs / OpenAI /
  Google / Deepgram and piper-tts binary + .onnx model on PATH.
* whisper-local language_detected always None — whisper CLI JSON
  schema differs across versions, parse heuristic to be added.
* faster-whisper has different JSON schema from openai-whisper;
  current parser covers openai-whisper convention only.

# kei-stt

Speech-to-text abstraction crate with 3 backends selected at runtime via
`KEI_STT_BACKEND`. Default backend is **whisper-local** (free, local, no API key).

## Backend matrix

| Backend          | Feature flag     | Cost         | Latency    | Quality   |
|------------------|------------------|--------------|------------|-----------|
| `whisper-local`  | `whisper-local`  | Free         | 1–10× RT   | Very good |
| `deepgram`       | `deepgram`       | ~$0.0043/min | 200–500 ms | Excellent |
| `openai-whisper` | `openai-whisper` | ~$0.006/min  | 300–800 ms | Excellent |

RT = real-time factor (depends on hardware / model size for whisper-local).

## Environment variables

| Variable                 | Backend        | Required | Description                                               |
|--------------------------|----------------|----------|-----------------------------------------------------------|
| `KEI_STT_BACKEND`        | all            | No       | `whisper-local` (default) / `deepgram` / `openai-whisper` |
| `KEI_STT_WHISPER_BINARY` | whisper-local  | No       | Path to `whisper` CLI (default: PATH)                     |
| `KEI_STT_WHISPER_MODEL`  | whisper-local  | No       | Model name (default: `base.en`)                           |
| `DEEPGRAM_API_KEY`       | deepgram       | Yes      | Deepgram API key                                          |
| `OPENAI_API_KEY`         | openai-whisper | Yes      | OpenAI API key                                            |

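As a rough illustration of how the optional whisper-local variables and their defaults might be resolved, consider the sketch below. The `WhisperLocalConfig` struct and the `resolve_config` function are hypothetical names, and treating an empty value as unset is an assumption of this example, not documented crate behaviour.

```rust
use std::path::PathBuf;

/// Hypothetical config struct for this sketch.
#[derive(Debug, PartialEq)]
struct WhisperLocalConfig {
    binary: PathBuf,
    model: String,
}

/// Resolves the whisper-local settings from a lookup function.
/// Taking a closure instead of calling `std::env::var` directly keeps
/// the function testable without touching the process environment.
fn resolve_config(lookup: impl Fn(&str) -> Option<String>) -> WhisperLocalConfig {
    // Assumption: an empty or whitespace-only value counts as "not set".
    let non_empty = |key: &str| lookup(key).filter(|v| !v.trim().is_empty());

    WhisperLocalConfig {
        // Default: bare `whisper`, resolved via PATH when spawned.
        binary: non_empty("KEI_STT_WHISPER_BINARY")
            .map(PathBuf::from)
            .unwrap_or_else(|| PathBuf::from("whisper")),
        // Default model name from the table above.
        model: non_empty("KEI_STT_WHISPER_MODEL").unwrap_or_else(|| "base.en".to_string()),
    }
}

fn main() {
    let cfg = resolve_config(|key| std::env::var(key).ok());
    println!("{cfg:?}");
}
```
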
## Usage

```toml
[dependencies]
kei-stt = { path = "../kei-stt", features = ["whisper-local"] }
# Needed for the `#[tokio::main]` attribute used in the example below.
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```

```rust
#[tokio::main]
async fn main() -> Result<(), kei_stt::SttError> {
    // Pick the backend named by KEI_STT_BACKEND (default: whisper-local).
    let backend = kei_stt::from_env()?;

    // Load the whole WAV file into memory.
    let audio = std::fs::read("speech.wav").expect("failed to read speech.wav");
    let req = kei_stt::SttRequest::new_wav(audio);

    let resp = backend.transcribe(&req).await?;
    println!("[{}] {}", backend.name(), resp.text);

    // Print each segment with its start/end timestamps.
    for seg in &resp.segments {
        println!("  {:>6}ms–{:>6}ms  {}", seg.start_ms, seg.end_ms, seg.text);
    }
    Ok(())
}
```

## Compile-time features

```toml
# All backends:
kei-stt = { path = "../kei-stt", features = ["all-backends"] }

# Cloud only, no local whisper:
kei-stt = { path = "../kei-stt", features = ["deepgram", "openai-whisper"], default-features = false }
```

## whisper-local prerequisites

Install the `openai-whisper` Python package:

```sh
pip install openai-whisper
```

This makes the `whisper` CLI available. Alternatively, point `KEI_STT_WHISPER_BINARY`
at a compatible binary (`faster-whisper`, etc.) that exposes the same CLI interface
and JSON output format as `openai-whisper`.