Adding a real-time overlay to local STT on Linux

I added a real-time overlay to Parakeet STT, my local push-to-talk transcription tool for Linux.

Parakeet STT is fully local. You hold Right Ctrl, speak, and release. The transcribed text is injected wherever your cursor is via uinput. There is no cloud and no subscription. The backend is a Python daemon running NVIDIA’s NeMo Parakeet TDT 0.6B v3 model, while the frontend is a Rust binary (parakeet-ptt) handling the hotkey and WebSocket connection.

The missing feedback loop

The previous version had no visual feedback. Working in a terminal meant guessing whether the microphone was hot or whether the model was still processing.

I built parakeet-overlay to solve this. It is a frosted-glass pill anchored to the bottom of the screen that cycles through four states: Hidden -> Listening -> Interim -> Finalizing -> Hidden.
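The state cycle above can be sketched as a small transition function. This is an illustration rather than the project's actual code: the four states come from the description, but the event names (HotkeyPressed, InterimText, HotkeyReleased, FinalText) are hypothetical, as is the choice to ignore events that do not apply to the current state.

```rust
/// The four overlay states named in the post.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlayState {
    Hidden,
    Listening,
    Interim,
    Finalizing,
}

/// Hypothetical events; the real frame names may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlayEvent {
    HotkeyPressed,
    InterimText,
    HotkeyReleased,
    FinalText,
}

impl OverlayState {
    /// Returns the next state. Events that do not apply leave the state unchanged.
    pub fn next(self, event: OverlayEvent) -> OverlayState {
        match (self, event) {
            (OverlayState::Hidden, OverlayEvent::HotkeyPressed) => OverlayState::Listening,
            (OverlayState::Listening, OverlayEvent::InterimText) => OverlayState::Interim,
            (OverlayState::Interim, OverlayEvent::InterimText) => OverlayState::Interim,
            (OverlayState::Listening, OverlayEvent::HotkeyReleased) => OverlayState::Finalizing,
            (OverlayState::Interim, OverlayEvent::HotkeyReleased) => OverlayState::Finalizing,
            (OverlayState::Finalizing, OverlayEvent::FinalText) => OverlayState::Hidden,
            (state, _) => state,
        }
    }
}
```

Keeping the transitions in one pure function means a stray or out-of-order frame cannot push the pill into a state the renderer does not expect.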

It renders a live waveform, rolling interim text with a staggered fade-in (each character starts 16 ms after the previous one), and a “breathing” animation (3 s cycle, 5% amplitude) to show that the system is alive.
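Both animations reduce to simple functions of elapsed time. The sketch below uses the numbers quoted above (16 ms stagger, 3 s cycle, 5% amplitude) and makes two assumptions that the description does not settle: the breathing is applied as a sine-shaped scale factor, and each character takes 120 ms to fade in (CHAR_FADE_MS is an invented value).

```rust
use std::f32::consts::TAU;

// Values quoted in the post.
const BREATH_PERIOD_S: f32 = 3.0;
const BREATH_AMPLITUDE: f32 = 0.05;
const CHAR_STAGGER_MS: f32 = 16.0;
// Assumption: duration of one character's fade. Not stated in the post.
const CHAR_FADE_MS: f32 = 120.0;

/// Scale factor that oscillates between 0.95 and 1.05 once every 3 seconds.
pub fn breathing_scale(elapsed_s: f32) -> f32 {
    1.0 + BREATH_AMPLITUDE * (TAU * elapsed_s / BREATH_PERIOD_S).sin()
}

/// Opacity in 0..=1 for the character at `index`, `elapsed_ms` after the
/// text first appeared. Character N starts fading N * 16 ms after the first.
pub fn char_alpha(index: usize, elapsed_ms: f32) -> f32 {
    let start_ms = index as f32 * CHAR_STAGGER_MS;
    ((elapsed_ms - start_ms) / CHAR_FADE_MS).clamp(0.0, 1.0)
}
```

Because both functions depend only on elapsed time, the renderer can call them on every frame without keeping any per-character animation state.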

Process isolation for reliability

I decided against threading the overlay into the main binary. Instead, parakeet-overlay is a separate process spawned by parakeet-ptt, which sends it NDJSON frames over a pipe to the child’s stdin.

This creates a strict reliability contract. Text delivery is critical; visual feedback is optional. If the overlay crashes or hangs, or if the compositor misbehaves, the failure cannot block the transcription or text injection pipeline. The injection logic (clipboard -> uinput -> ydotool) is completely decoupled from the rendering loop.
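One way to express that contract in code is a best-effort sink: writes to the overlay never return an error to the caller, and the first failure permanently disables the channel. The sketch below is an assumption about how this could look, not the project's code. OverlaySink and write_frame are hypothetical names, and the writer is generic so the idea can be shown without spawning a process.

```rust
use std::io::{self, Write};

/// Best-effort channel to the overlay. In practice `W` would be the pipe to
/// the overlay process; any writer works, which keeps the sketch testable.
pub struct OverlaySink<W: Write> {
    writer: Option<W>,
}

impl<W: Write> OverlaySink<W> {
    pub fn new(writer: W) -> Self {
        Self { writer: Some(writer) }
    }

    /// Sends one NDJSON frame. Never reports an error: on any failure the
    /// writer is dropped and every later call becomes a no-op.
    pub fn send(&mut self, json_line: &str) {
        let result = match self.writer.as_mut() {
            Some(w) => write_frame(w, json_line),
            None => return,
        };
        if result.is_err() {
            // The overlay is gone. Transcription and injection carry on.
            self.writer = None;
        }
    }

    pub fn is_alive(&self) -> bool {
        self.writer.is_some()
    }
}

/// Writes the JSON text followed by the newline that terminates an NDJSON frame.
fn write_frame<W: Write>(w: &mut W, json_line: &str) -> io::Result<()> {
    w.write_all(json_line.as_bytes())?;
    w.write_all(b"\n")?;
    w.flush()
}
```

This covers a crashed overlay, where the write fails with a broken pipe. It does not cover a hung overlay that stops reading, because a blocking write would stall once the pipe buffer fills. Handling that case would need a non-blocking pipe or a dedicated writer thread fed by a bounded queue.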

Protocol choice: layer shell vs xdg

The overlay is not a normal window. It uses the zwlr_layer_shell_v1 protocol, which allows surfaces to anchor to screen edges without window decorations, similar to notification toasts or status bars.

At startup, the binary probes the Wayland registry. If the compositor advertises zwlr_layer_shell_v1, the overlay uses it. If not, as on some non-wlroots compositors, it falls back to an xdg_toplevel surface, which behaves like a standard window.

I also exposed PARAKEET_OVERLAY_MODE to force behavior: auto, layer-shell, fallback-window, or disabled.
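The probe and the override combine into a small decision table. The sketch below is a guess at that logic rather than the real implementation: the four mode strings come from the list above, while the type names, the choice to treat an unset or unrecognized value as auto, and the choice to show nothing when layer-shell is forced but unavailable are all assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlayMode {
    Auto,
    LayerShell,
    FallbackWindow,
    Disabled,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    LayerShell,
    XdgToplevel,
}

/// Parses the value of PARAKEET_OVERLAY_MODE.
/// Assumption: unset or unrecognized values behave like "auto".
pub fn parse_mode(raw: Option<&str>) -> OverlayMode {
    match raw.map(str::trim) {
        Some("layer-shell") => OverlayMode::LayerShell,
        Some("fallback-window") => OverlayMode::FallbackWindow,
        Some("disabled") => OverlayMode::Disabled,
        _ => OverlayMode::Auto,
    }
}

/// Picks a backend from the requested mode and the result of the registry
/// probe. `None` means no overlay is shown.
pub fn choose_backend(mode: OverlayMode, has_layer_shell: bool) -> Option<Backend> {
    match (mode, has_layer_shell) {
        (OverlayMode::Disabled, _) => None,
        (OverlayMode::FallbackWindow, _) => Some(Backend::XdgToplevel),
        (OverlayMode::Auto, true) | (OverlayMode::LayerShell, true) => Some(Backend::LayerShell),
        (OverlayMode::Auto, false) => Some(Backend::XdgToplevel),
        // Assumption: a forced layer-shell mode does not silently fall back.
        (OverlayMode::LayerShell, false) => None,
    }
}
```

A caller would pass in std::env::var("PARAKEET_OVERLAY_MODE").ok().as_deref() along with whether zwlr_layer_shell_v1 appeared among the registry globals.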

Pure software rendering

I avoided OpenGL and wgpu entirely. The rendering pipeline is pure CPU-side pixel manipulation written directly into a Wayland shared-memory buffer.

Graphics: Premultiplied alpha compositing and rounded-rect coverage calculations for anti-aliasing.
Text: fontdb for system font discovery and fontdue for pure-Rust glyph rasterization.
Why: Initializing a GPU context can take 50–200 ms, while mapping a shared-memory buffer takes microseconds. The overlay therefore appears instantly, and graphics drivers drop out of the failure domain entirely.
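The two graphics techniques named above are compact enough to show in full. What follows is a generic sketch, not the overlay's actual routines: the source-over operator for premultiplied 8-bit pixels, and an approximate per-pixel coverage value derived from the signed distance to a rounded rectangle. The Premul type and both function names are hypothetical.

```rust
/// One premultiplied RGBA pixel, 8 bits per channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Premul {
    pub r: u8,
    pub g: u8,
    pub b: u8,
    pub a: u8,
}

/// Source-over for premultiplied pixels: out = src + dst * (1 - src.a).
pub fn over(src: Premul, dst: Premul) -> Premul {
    let inv = 255 - src.a as u32;
    // Rounded division by 255, saturated in case the input was not
    // correctly premultiplied.
    let ch = |s: u8, d: u8| -> u8 {
        let v = s as u32 + (d as u32 * inv + 127) / 255;
        v.min(255) as u8
    };
    Premul {
        r: ch(src.r, dst.r),
        g: ch(src.g, dst.g),
        b: ch(src.b, dst.b),
        a: ch(src.a, dst.a),
    }
}

/// Approximate coverage in 0..=1 of the sample point (px, py) by a rounded
/// rectangle of size w x h whose top-left corner sits at the origin.
/// The signed distance to the outline is mapped linearly across one pixel.
pub fn rounded_rect_coverage(px: f32, py: f32, w: f32, h: f32, radius: f32) -> f32 {
    // Offsets from the inner rectangle that remains after trimming the corners.
    let qx = (px - w / 2.0).abs() - (w / 2.0 - radius);
    let qy = (py - h / 2.0).abs() - (h / 2.0 - radius);
    let outside = (qx.max(0.0).powi(2) + qy.max(0.0).powi(2)).sqrt();
    let inside = qx.max(qy).min(0.0);
    let dist = outside + inside - radius; // negative inside the shape
    (0.5 - dist).clamp(0.0, 1.0)
}
```

Multiplying a fill color's premultiplied channels by the coverage value before calling over gives anti-aliased edges without any supersampling.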

The waveform path

The daemon broadcasts an audio_level message (dB value) via WebSocket. The PTT binary forwards this to the overlay as an NDJSON frame:

{"AudioLevel": {"session_id": "...", "level_db": -24.3}}

The overlay implements classic VU meter behavior. I use an attack coefficient of 0.85 and a release coefficient of 0.05 to smooth the needle motion, so the level rises quickly and decays slowly. The visual range covers -60 dB to -5 dB.
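A minimal sketch of that smoothing, using the coefficients and range quoted above. One interpretation here is an assumption: each coefficient is treated as the fraction of the remaining gap closed per update, with the attack value used when the level rises and the release value when it falls. The real overlay may define them differently, for example per second rather than per frame.

```rust
// Values quoted in the post.
const ATTACK: f32 = 0.85;
const RELEASE: f32 = 0.05;
const FLOOR_DB: f32 = -60.0;
const CEIL_DB: f32 = -5.0;

/// Maps a dB reading onto 0..=1 across the visual range, clamping outliers.
pub fn normalize_db(level_db: f32) -> f32 {
    ((level_db - FLOOR_DB) / (CEIL_DB - FLOOR_DB)).clamp(0.0, 1.0)
}

/// Smoothed level meter with a fast rise and a slow fall.
pub struct VuMeter {
    level: f32,
}

impl VuMeter {
    pub fn new() -> Self {
        Self { level: 0.0 }
    }

    /// Feeds one `level_db` reading and returns the smoothed level in 0..=1.
    pub fn update(&mut self, level_db: f32) -> f32 {
        let target = normalize_db(level_db);
        // Assumption: the coefficient is the share of the gap closed per update.
        let coeff = if target > self.level { ATTACK } else { RELEASE };
        self.level += coeff * (target - self.level);
        self.level
    }
}
```

Under this reading, a sudden loud syllable moves the meter 85% of the way to its target in a single update, while silence lets it fall by only 5% of the remaining distance per update.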

Stack

ASR: NeMo Parakeet TDT 0.6B v3
Daemon: Python, FastAPI, sounddevice
Overlay: Rust, wayland-client, wayland-protocols-wlr
IPC: NDJSON over a stdio pipe
Injection: uinput (kernel-level) with ydotool fallback

If you are building local STT tooling on Wayland, the biggest lesson here is decoupling the critical path (text injection) from the user interface.

View the code on GitHub