I was messing around with Qwen3.5-9B via llama.cpp recently. Its reasoning performance is exceptional for a model this small, and it’s natively multimodal. I mostly use cloud-based models like Gemini or Claude because I optimize for reasoning quality over speed. But watching this thing absolutely fly at 60 tokens per second on my machine felt pretty good. I had it open in a terminal pane, typing out quick questions to generate bash commands or formatted prose. It was great, but switching contexts to type into a terminal window defeats the purpose of a fast local model. I just need the answer where my cursor already is.
At the same time, I was doing some boring refactoring on Parakeet STT, my fully local, push-to-talk Linux dictation tool.
Then the obvious idea hit me: why am I typing? What if I could pipe my speech into the STT model, take that transcript, pipe it directly into the local LLM, and inject the AI’s response exactly where my cursor is?
A few hours later, I had it working. Hold Right Ctrl to dictate raw text. Hold Shift + Right Ctrl to query the LLM. It works in terminals, IDEs, browsers, any surface that accepts pasted text.
The architecture
Parakeet STT was already split into a Python daemon (running NVIDIA’s NeMo STT) and a Rust frontend (handling hotkeys, Wayland overlay, and uinput injection).
Instead of jamming LLM logic into the Python STT daemon, the Rust client acts as the orchestrator.
[ evdev Hotkey ] ---> [ Rust Client ] <---> [ Python STT Daemon ]
                             |
                             v
                     [ llama-server ]
Intent at the kernel level
The system needs to distinguish between a standard dictation (Right Ctrl) and an LLM query (Shift + Right Ctrl).
If you rely on checking modifier state after the speech is done, you introduce race conditions (what if the user releases Shift a millisecond before releasing Right Ctrl?). Instead, intent is captured at hotkey-down by introspecting the evdev event stream:
enum HotkeyIntent {
    Dictate,  // Right Ctrl alone → raw transcript
    LlmQuery, // Right Ctrl + Shift → LLM query
}
The modifier state is seeded directly from the kernel when the listener attaches, so even if Shift is already held down when the system starts, the very first utterance routes correctly.
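As a sketch, the routing decision reduces to a pure function of the modifier snapshot taken at key-down (`classify_intent` and its boolean argument are illustrative, not the actual code; in the real client the snapshot comes from the evdev state):

```rust
// Illustrative sketch: intent is decided from the modifier state observed
// at the moment the hotkey goes down, and never re-checked at key-up, so
// releasing Shift mid-utterance cannot reroute the request.
#[derive(Debug, PartialEq)]
enum HotkeyIntent {
    Dictate,  // Right Ctrl alone
    LlmQuery, // Right Ctrl + Shift
}

// `shift_held` is the tracked modifier state; on attach it would be seeded
// from the kernel so a Shift already held at startup is seen.
fn classify_intent(shift_held: bool) -> HotkeyIntent {
    if shift_held {
        HotkeyIntent::LlmQuery
    } else {
        HotkeyIntent::Dictate
    }
}

fn main() {
    assert_eq!(classify_intent(false), HotkeyIntent::Dictate);
    assert_eq!(classify_intent(true), HotkeyIntent::LlmQuery);
    println!("intent routing ok");
}
```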
Streaming SSE directly to the UI overlay
The LLM integration uses Server-Sent Events (SSE) via reqwest to stream the answer. I explicitly avoided reusing the STT WebSocket for this. The STT daemon shouldn’t know or care about the LLM.
async fn fetch_llm_streamed_answer(
    llm: &LlmRuntimeConfig,
    session_id: Uuid,
    transcript: &str,
    progress_tx: &mpsc::UnboundedSender<LlmProgress>,
) -> Result<String> {
    // Standard OpenAI-compatible POST
    let request_body = json!({
        "model": llm.model,
        "stream": true,
        "messages": [
            {"role": "system", "content": llm.system_prompt},
            {"role": "user", "content": transcript},
        ],
    });

    // ... SSE parsing loop ...
    if let Some(delta) = extract_delta_content(&parsed) {
        assembled.push_str(delta);
        if llm.overlay_stream {
            // Overlay updates are best-effort; a closed channel is not an error.
            let _ = progress_tx.send(LlmProgress::Delta {
                session_id,
                delta: delta.to_string(),
            });
        }
    }
}
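For reference, the elided loop is mostly SSE framing: each event arrives as a `data: <json>` line and the stream terminates with a `data: [DONE]` sentinel. A minimal, dependency-free sketch of just that framing (the real code additionally parses each JSON payload for the delta):

```rust
// Sketch of SSE framing only, JSON handling elided: collect every "data: "
// payload in a chunk, stopping at the OpenAI-style "[DONE]" sentinel.
fn sse_data_payloads(chunk: &str) -> Vec<&str> {
    chunk
        .lines()
        .filter_map(|line| line.strip_prefix("data: "))
        .take_while(|payload| *payload != "[DONE]")
        .collect()
}

fn main() {
    let chunk = "data: {\"a\":1}\n\ndata: {\"b\":2}\n\ndata: [DONE]\n";
    let payloads = sse_data_payloads(chunk);
    assert_eq!(payloads, vec!["{\"a\":1}", "{\"b\":2}"]);
    println!("parsed {} payloads", payloads.len());
}
```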
The streaming progress feeds directly into the exact same Wayland overlay routing infrastructure used for interim speech results. You get a unified UX: the overlay shows your transcribed speech rolling in, and then immediately transitions to the LLM’s answer streaming in, using the same sequence number tracking and staggered 16ms fade-in animations.
Non-blocking orchestration
The WebSocket event loop never blocks waiting for the LLM. When the Python daemon returns a FinalResult, the Rust loop marks llm_busy = true, spawns a detached Tokio task for the LLM HTTP request, and immediately goes back to listening for hotkeys.
If you try to trigger another dictation while the LLM is busy thinking, the system explicitly rejects it and updates the overlay:
if llm_busy {
    warn!("ignoring hotkey down while LLM response is in progress");
    overlay_router.route_interim_state(
        /* ... */
        "LLM busy; wait for current answer".to_string(),
    );
    continue;
}
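The gate itself can be sketched as a tiny state machine. This illustrative version uses an `AtomicBool` so a detached task can clear it from another thread; in the real event loop it is simply a flag owned by the loop:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Illustrative busy gate: the event loop flips it before spawning the
// detached LLM task; the task clears it when the answer (or error) lands.
struct LlmGate(AtomicBool);

impl LlmGate {
    fn new() -> Arc<Self> {
        Arc::new(LlmGate(AtomicBool::new(false)))
    }

    // Returns false (hotkey rejected) if a query is already in flight.
    fn try_begin(&self) -> bool {
        self.0
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    fn finish(&self) {
        self.0.store(false, Ordering::Release);
    }
}

fn main() {
    let gate = LlmGate::new();
    assert!(gate.try_begin());  // first dictation accepted
    assert!(!gate.try_begin()); // second rejected while the LLM is busy
    gate.finish();              // LLM task completes
    assert!(gate.try_begin());  // next dictation accepted again
    println!("gate ok");
}
```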
Minimalist design philosophies
No lost data
LLMs crash. Servers OOM. Context windows get exceeded.
If the LLM request fails for any reason, the system instantly falls back to injecting your raw speech transcript.
let response_text = match result {
    Ok(answer) => sanitize_model_answer(&answer),
    Err(err) => {
        warn!(%err, "llm query failed, falling back to raw transcript");
        fallback_transcript.clone()
    }
};
This is a strict reliability contract: your speech is never thrown away. The worst-case scenario is that you get exactly what you said, rather than the AI’s answer.
The text injection subsystem (which uses uinput for kernel-level keystroke spoofing) is completely origin-agnostic. It receives an InjectionJob struct. It doesn’t care if that text came from a raw speech transcript or an LLM response. The origin is strictly used for logging and observability (origin=llm_answer vs origin=raw_final_result), keeping the clipboard-and-chord logic completely isolated from the AI orchestration.
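A sketch of that contract (type and field names here are illustrative, not the actual structs): the injector consumes the job's text, and the origin is only ever logged.

```rust
// Illustrative origin-agnostic injection job: the injector reads `text`;
// `origin` exists purely for logging and observability.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TextOrigin {
    RawFinalResult,
    LlmAnswer,
}

struct InjectionJob {
    text: String,
    origin: TextOrigin, // never branches injection logic
}

fn inject(job: &InjectionJob) -> String {
    // The real subsystem drives uinput; here we just log the origin and
    // return the text that would be typed.
    eprintln!("injecting origin={:?}", job.origin);
    job.text.clone()
}

fn main() {
    let a = InjectionJob { text: "ls -la".into(), origin: TextOrigin::LlmAnswer };
    let b = InjectionJob { text: "ls -la".into(), origin: TextOrigin::RawFinalResult };
    // Same text in, same injection out, regardless of origin.
    assert_eq!(inject(&a), inject(&b));
    println!("origin-agnostic ok");
}
```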
No cloud dependencies
Everything runs locally. The default configuration points to http://127.0.0.1:8080.
To ensure the LLM doesn’t take down the speech recognition if it panics, llama-server is managed by a helper script that spawns it in a completely separate tmux session (parakeet-llm). If the LLM dies, Parakeet STT just logs the HTTP error, drops your raw transcript into your editor, and keeps listening.
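A launcher along these lines (model path, flags, and the `has-session` guard are hypothetical, not the actual helper script) only needs two tmux calls:

```shell
# Sketch: run llama-server in its own detached tmux session so an OOM or
# crash never touches the STT daemon. Paths and flags are illustrative.
SESSION="parakeet-llm"
LLM_CMD="llama-server -m ./model.gguf --host 127.0.0.1 --port 8080"

if command -v tmux >/dev/null 2>&1; then
  # Start only if the session is not already running.
  tmux has-session -t "$SESSION" 2>/dev/null || \
    tmux new-session -d -s "$SESSION" "$LLM_CMD"
fi
echo "managed session: $SESSION"
```

If the server process dies, the session dies with it; the Rust client just sees a failed HTTP request and falls back to the raw transcript.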
The lived experience
The real proof is in daily usage. Using voice to instantly query Qwen3.5-9B and have the answer materialize in my notes, terminal, or browser exactly where my cursor is feels like a refreshing change from switching tabs to a search engine or an AI chat. I still use those, but only when they are actually necessary.
Even with current “dumb” small models, the in-context availability and speed make this incredibly useful for boilerplate, shell commands, or recalling syntax. Given how fast local models have improved in just the last year, this kind of instant local pipeline is only going to get more powerful.