KeiSeiKit-1.0/_primitives/_rust/kei-tts/src/google.rs
Parfii-bot cb59b77ed2 feat(kei-tts + kei-stt): TTS/STT abstractions with 4+3 backends
Two parallel atomars in the kei-buddy phase-1 plan. They mirror each
other's architecture: trait + feature-gated backend modules + env-driven
dispatch + wiremock tests for HTTP backends + subprocess-error test for
the local backend.

## kei-tts (text-to-speech)
LOC: 959 across 15 files (largest src/lib.rs 121).
Trait `TtsBackend` + 4 backends behind feature flags:
  * elevenlabs — POST api.elevenlabs.io/v1/text-to-speech/{voice}/stream
  * openai     — POST api.openai.com/v1/audio/speech (tts-1, tts-1-hd)
  * google     — POST texttospeech.googleapis.com/v1/text:synthesize
                 (Wavenet voices, base64 audioContent)
  * piper      — local subprocess to piper-tts binary, raw PCM out
Default features: ["piper"]. all-backends feature gates the rest.
`from_env()` reads KEI_TTS_BACKEND (default piper). Returns Box<dyn TtsBackend>.
Tests: 9 passed (env routing + 3 wiremock backends + piper subprocess error).
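
The env-driven dispatch described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the crate's actual code: the stub `Piper` and `Google` types, the `DispatchError` enum, and the `select_backend` helper are hypothetical names introduced here, and the trait is reduced to `name()`. The real crate presumably also guards each arm with its feature flag.

```rust
/// Hypothetical stand-in for the crate's `TtsBackend` trait (name only).
pub trait TtsBackend {
    fn name(&self) -> &'static str;
}

// Stub backends for illustration; the real ones carry clients and config.
struct Piper;
struct Google;

impl TtsBackend for Piper {
    fn name(&self) -> &'static str { "piper" }
}

impl TtsBackend for Google {
    fn name(&self) -> &'static str { "google" }
}

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
pub enum DispatchError {
    UnknownBackend(String),
}

/// Pure selection step, kept separate from the environment read so it can
/// be tested without mutating process state. `None` means "variable unset"
/// and falls back to the documented default, piper.
pub fn select_backend(name: Option<&str>) -> Result<Box<dyn TtsBackend>, DispatchError> {
    match name.unwrap_or("piper") {
        "piper" => Ok(Box::new(Piper)),
        // In a feature-gated crate this arm would sit behind
        // `#[cfg(feature = "google")]` (assumption, not shown in the source).
        "google" => Ok(Box::new(Google)),
        other => Err(DispatchError::UnknownBackend(other.to_string())),
    }
}

/// Reads `KEI_TTS_BACKEND` and delegates to `select_backend`.
pub fn from_env() -> Result<Box<dyn TtsBackend>, DispatchError> {
    let value = std::env::var("KEI_TTS_BACKEND").ok();
    select_backend(value.as_deref())
}
```

Splitting the lookup from the selection is one way to make the "env routing" tests deterministic, since tests that set environment variables can interfere with each other when run in parallel.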

## kei-stt (speech-to-text)
LOC: 935 across 13 files (largest whisper_local.rs 181).
Trait `SttBackend` + 3 backends:
  * whisper-local  — subprocess to `whisper` CLI / faster-whisper,
                     reads JSON output, parses segments
  * deepgram       — POST api.deepgram.com/v1/listen (Token auth header,
                     raw audio body, parses words → Segments)
  * openai-whisper — POST api.openai.com/v1/audio/transcriptions
                     (multipart file + model=whisper-1 +
                      response_format=verbose_json)
Default features: ["whisper-local"]. all-backends gates the rest.
`from_env()` reads KEI_STT_BACKEND (default whisper-local).
Tests: 10 passed + 1 doc-test (env routing + 5 wiremock + 2 JSON parsers
+ 1 subprocess error + 1 auth-header check).
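
The two HTTP backends above use different `Authorization` schemes, which is likely what the auth-header check covers. A small sketch of that difference, assuming Deepgram's `Token` scheme as stated in the list and the `Bearer` scheme OpenAI's API uses; the `HttpStt` enum and `authorization_value` function are hypothetical names for illustration only:

```rust
/// Hypothetical enum naming the two HTTP STT backends listed above.
pub enum HttpStt {
    Deepgram,
    OpenAiWhisper,
}

/// Returns the value to send in the `Authorization` request header.
pub fn authorization_value(backend: &HttpStt, api_key: &str) -> String {
    match backend {
        // Deepgram expects `Authorization: Token <key>`.
        HttpStt::Deepgram => format!("Token {}", api_key),
        // OpenAI expects `Authorization: Bearer <key>`.
        HttpStt::OpenAiWhisper => format!("Bearer {}", api_key),
    }
}
```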

## Common architecture decisions
  * `with_base_url(url)` constructor on each HTTP backend for wiremock
    testability — same pattern as kei-llm-router and kei-notify-telegram.
  * `tempfile` crate added to kei-stt for whisper-local audio scratch.
  * `base64 = { version = "0.22", optional = true }` in kei-tts for
    Google's base64-encoded audioContent.
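
A std-only sketch of the `with_base_url` pattern, reduced to URL construction so it runs without `reqwest` or `wiremock`. `SketchBackend`, `new`, and `synth_url` are hypothetical names; the endpoint path and trailing-slash trimming follow the Google backend in this commit.

```rust
const DEFAULT_BASE_URL: &str = "https://texttospeech.googleapis.com";

/// Hypothetical backend holding only what URL construction needs.
pub struct SketchBackend {
    api_key: String,
    base_url: String,
}

impl SketchBackend {
    /// Test-facing constructor: a mock server's address goes in `base_url`.
    pub fn with_base_url(base_url: impl Into<String>, api_key: impl Into<String>) -> Self {
        Self {
            api_key: api_key.into(),
            // Trim trailing slashes so joining with "/v1/..." never doubles them.
            base_url: base_url.into().trim_end_matches('/').to_string(),
        }
    }

    /// Production-facing constructor: same code path, fixed base URL.
    pub fn new(api_key: impl Into<String>) -> Self {
        Self::with_base_url(DEFAULT_BASE_URL, api_key)
    }

    /// Full request URL for the synthesize endpoint.
    pub fn synth_url(&self) -> String {
        format!("{}/v1/text:synthesize?key={}", self.base_url, self.api_key)
    }
}
```

Because the production constructor delegates to `with_base_url`, a test that points the backend at a local mock server exercises the same request-building code as a real call.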

## Verify-before-commit (RULE 0.13 §)
  * cargo check -p kei-tts (default + all-backends): PASS
  * cargo check -p kei-stt (default + all-backends): PASS
  * cargo test -p kei-tts --features all-backends --lib: 9/0
  * cargo test -p kei-stt --features all-backends --lib: 10/0
  * cargo check --workspace: PASS

STATUS-TRUTH from both agents: shipped=functional, stubs=0,
behaviour-verified=yes.

## Follow-up (deferred, non-blocking)
  * Real backend verification needs API keys for ElevenLabs / OpenAI /
    Google / Deepgram and piper-tts binary + .onnx model on PATH.
  * whisper-local language_detected always None — whisper CLI JSON
    schema differs across versions, parse heuristic to be added.
  * faster-whisper has different JSON schema from openai-whisper;
    current parser covers openai-whisper convention only.
2026-05-12 13:47:35 +08:00

// SPDX-License-Identifier: Apache-2.0
// Copyright 2026 <author org>
//! Google Cloud TTS backend — calls `texttospeech.googleapis.com`.
//!
//! Endpoint: `POST /v1/text:synthesize?key={api_key}`
//! Response: JSON `{"audioContent": "<base64>"}`. Base64-decoded bytes
//! are returned as `TtsResponse.audio_bytes`.
//!
//! Constructor surface:
//! * [`GoogleBackend::from_env`] — reads `GOOGLE_TTS_API_KEY`.
//! * [`GoogleBackend::with_base_url`] — explicit URL + key (tests).
#![cfg(feature = "google")]
use base64::{engine::general_purpose::STANDARD as B64, Engine as _};
use crate::error::TtsError;
use crate::request::{AudioFormat, TtsRequest};
use crate::response::TtsResponse;
use crate::trait_def::TtsBackend;
const DEFAULT_BASE_URL: &str = "https://texttospeech.googleapis.com";
const DEFAULT_VOICE: &str = "en-US-Wavenet-D";
const DEFAULT_LANG: &str = "en-US";

pub struct GoogleBackend {
    api_key: String,
    client: reqwest::Client,
    base_url: String,
}

impl GoogleBackend {
    /// Build from explicit parameters (used in wiremock tests).
    pub fn with_base_url(
        base_url: impl Into<String>,
        api_key: impl Into<String>,
    ) -> Self {
        Self {
            api_key: api_key.into(),
            client: reqwest::Client::new(),
            // Drop trailing slashes so `synth` can append `/v1/...` directly.
            base_url: base_url.into().trim_end_matches('/').to_string(),
        }
    }

    /// Build from `GOOGLE_TTS_API_KEY` env var.
    pub fn from_env() -> Result<Self, TtsError> {
        let key = std::env::var("GOOGLE_TTS_API_KEY")
            .map_err(|_| TtsError::MissingEnv("GOOGLE_TTS_API_KEY".into()))?;
        Ok(Self::with_base_url(DEFAULT_BASE_URL, key))
    }

    /// Map the crate's `AudioFormat` to Google's `audioEncoding` value.
    fn encoding_str(fmt: AudioFormat) -> &'static str {
        match fmt {
            AudioFormat::Mp3 => "MP3",
            AudioFormat::Ogg => "OGG_OPUS",
            // Google documents LINEAR16 output as including a WAV header,
            // so `Raw` requests receive those header bytes as well.
            AudioFormat::Wav | AudioFormat::Raw => "LINEAR16",
        }
    }
}

/// Subset of the `text:synthesize` response body that this backend reads.
#[derive(serde::Deserialize)]
struct GoogleResponse {
    #[serde(rename = "audioContent")]
    audio_content: String,
}

#[async_trait::async_trait]
impl TtsBackend for GoogleBackend {
    fn name(&self) -> &'static str { "google" }

    async fn synth(&self, req: &TtsRequest) -> Result<TtsResponse, TtsError> {
        // The API key travels in the query string, per the endpoint above.
        let url = format!(
            "{}/v1/text:synthesize?key={}",
            self.base_url, self.api_key
        );
        let voice_name = req.voice_id.as_deref().unwrap_or(DEFAULT_VOICE);
        let lang = req.language.as_deref().unwrap_or(DEFAULT_LANG);
        let body = serde_json::json!({
            "input": { "text": req.text },
            "voice": { "languageCode": lang, "name": voice_name },
            "audioConfig": { "audioEncoding": Self::encoding_str(req.format) },
        });

        let resp = self.client
            .post(&url)
            .json(&body)
            .send()
            .await?;

        // Surface non-2xx responses with the status code and body text.
        if !resp.status().is_success() {
            let status = resp.status().as_u16();
            let text = resp.text().await.unwrap_or_default();
            return Err(TtsError::Http(format!("http {status}: {text}")));
        }

        let parsed: GoogleResponse = resp.json().await
            .map_err(|e| TtsError::InvalidResponse(e.to_string()))?;
        // `audioContent` is base64; decode it to the raw audio bytes.
        let bytes = B64.decode(&parsed.audio_content)
            .map_err(|e| TtsError::InvalidResponse(format!("base64: {e}")))?;
        Ok(TtsResponse::new(bytes, req.format.mime_type().to_string()))
    }
}

#[cfg(test)]
#[path = "google_test.rs"]
mod tests;