Voice Prompt Tool - Push-to-Talk Dictation for Windows with Groq Whisper
A lightweight Windows voice-to-text utility that records on a global hotkey, transcribes with Groq Whisper or a local faster-whisper fallback, and types the result straight into the active text field.




Quick Links
- GitHub Repository - View the full source code
Table of Contents
- Overview
- Key Features
- How It Works
- Tech Stack
- Architecture
- Audio Capture Pipeline
- Transcription Strategy
- Configuration Model
- Reliability and Polish
- Why This Project Stands Out
Overview
Voice Prompt Tool is a Windows desktop utility designed around a single, well-defined workflow: click into any text field, hold a global hotkey, speak, release, and watch the transcription appear at the cursor. It is built for writing prompts, notes, and messages hands-free, without context switches and without leaving the focused window.
Instead of being a generic dictation app, the project is shaped as a focused power-user tool:
- a floating pill overlay that always stays on top
- push-to-talk recording on `Ctrl + Shift`
- cloud transcription via Groq Whisper Large v3 Turbo
- on-device fallback through `faster-whisper`
- automatic keyboard injection into the active window
- clipboard sync for every transcription
- inline history of the last five transcriptions
- silent launch with no console window and a single-instance lock
The result is a small, predictable utility that behaves like a system-level feature rather than an application you have to open first.
Key Features
Push-to-Talk Capture
- Global `Ctrl + Shift` hotkey works from any focused window
- Live audio waveform rendered inside the overlay pill
- Recording is bounded by a configurable maximum duration
- Very short presses are ignored to avoid empty transcripts
Fast Cloud Transcription
- Groq API with `whisper-large-v3-turbo` as the primary backend
- Typical round-trip of `0.2–0.5 s` for normal-length utterances
- Explicit language hint (`ru` / `en`) for faster, more accurate decoding
- Resilient parsing of segments and detected language
Local Fallback
- `faster-whisper` runs on-device when no API key is configured
- Multiple model sizes (`tiny`, `base`, `small`, `medium`)
- Cached under `models/` on first use
- Same public interface as the cloud transcriber, fully swappable
Floating Overlay
- Always-on-top pill widget with no taskbar entry
- Color and opacity change with state (idle / recording / transcribing)
- Click to expand into a history panel with the last 5 transcriptions
- Per-entry copy buttons and a built-in close action
- Draggable, repositionable anywhere on screen
Auto-Injection
- Transcription is typed at the cursor via `pynput` keyboard emulation
- Same text is also written to the system clipboard
- Optional post-processing for capitalisation, punctuation, and custom replacements
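The optional post-processing step can be pictured as a small pure function. The sketch below is an illustration only: the function name `postprocess`, its parameters, and the whole-word matching rule are assumptions, not the project's actual `text_postprocess.py`.

```python
from __future__ import annotations

import re


def postprocess(text: str, replacements: dict[str, str] | None = None) -> str:
    """Tidy a raw transcript: replacements, capitalisation, final punctuation."""
    # Collapse runs of whitespace and trim both ends.
    cleaned = " ".join(text.split())

    # Hypothetical custom replacements: whole-word, case-insensitive.
    for needle, substitute in (replacements or {}).items():
        pattern = re.compile(rf"\b{re.escape(needle)}\b", re.IGNORECASE)
        # A callable avoids interpreting backslashes in the substitute text.
        cleaned = pattern.sub(lambda _match, s=substitute: s, cleaned)

    if not cleaned:
        return cleaned

    # Capitalise the first character and make sure the text ends a sentence.
    cleaned = cleaned[0].upper() + cleaned[1:]
    if cleaned[-1] not in ".!?":
        cleaned += "."
    return cleaned


if __name__ == "__main__":
    print(postprocess("open the git hub page", {"git hub": "GitHub"}))
```

Keeping this step free of side effects means it can be tested without a microphone, a hotkey, or a focused window.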
Quality-of-Life Details
- System tray icon with status, history, settings, and quit
- `run.bat` launcher with one-time virtual environment setup
- Silent startup through `pythonw.exe` (no console window)
- Single-instance lock file prevents duplicate processes
- Rotating log files under `logs/` for traceable diagnostics
How It Works
The workflow is intentionally minimal:
- Click into any text field: browser, IDE, chat, anywhere
- Hold `Ctrl + Shift` and speak
- Release the keys
On release, the captured audio is sent to the transcription backend, the resulting text is typed into the focused window, and the same text is copied to the clipboard. The overlay reflects every stage:
- Idle: semi-transparent neutral pill
- Recording: animated waveform, green tint, raised opacity
- Transcribing: amber tint while the backend is working
- Idle: back to neutral, ready for the next press
The internal state machine is explicit:
```text
IDLE → RECORDING → TRANSCRIBING → IDLE
```
This single, linear pipeline is what makes the tool feel predictable: at any point the overlay shows exactly the same state the application is actually in.
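A minimal sketch of such a state machine follows. The names `AppState` and `StateMachine`, the listener mechanism, and the extra `RECORDING → IDLE` edge (assumed here for presses that are discarded as too short) are illustrative assumptions, not a copy of the project's `state.py`.

```python
import threading
from enum import Enum, auto
from typing import Callable


class AppState(Enum):
    IDLE = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()


# Allowed transitions. RECORDING -> IDLE is an assumption covering ignored presses.
_ALLOWED = {
    AppState.IDLE: {AppState.RECORDING},
    AppState.RECORDING: {AppState.TRANSCRIBING, AppState.IDLE},
    AppState.TRANSCRIBING: {AppState.IDLE},
}


class StateMachine:
    """Single source of truth that UI components can subscribe to."""

    def __init__(self) -> None:
        self._state = AppState.IDLE
        self._lock = threading.Lock()
        self._listeners: list[Callable[[AppState], None]] = []

    @property
    def state(self) -> AppState:
        with self._lock:
            return self._state

    def subscribe(self, listener: Callable[[AppState], None]) -> None:
        self._listeners.append(listener)

    def transition(self, target: AppState) -> bool:
        """Move to `target` if allowed; return False and do nothing otherwise."""
        with self._lock:
            if target not in _ALLOWED[self._state]:
                return False
            self._state = target
        # Notify outside the lock so listeners may read `state` safely.
        for listener in list(self._listeners):
            listener(target)
        return True
```

Because every visual component would subscribe to the same object, an illegal jump such as `IDLE → TRANSCRIBING` is rejected in one place instead of being guarded separately in the overlay and the tray.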
Tech Stack
Core Runtime
- Python 3.10+
- Windows 10 / 11
- `pythonw.exe` for windowless launch
Audio and Transcription
- `sounddevice` for microphone capture
- `wave` for WAV serialisation
- Groq API (`whisper-large-v3-turbo`) for cloud transcription
- `faster-whisper` for local on-device transcription
System Integration
- `pynput` for the global hotkey and keyboard injection
- `Tkinter` for the floating overlay widget
- System tray icon with a contextual menu
- `.env` + `config.json` for runtime configuration
Tooling
- `run.bat` one-click launcher with auto-bootstrap
- Single-instance lock file
- Rotating log files for diagnostics
Architecture
The application is split into small, single-purpose modules inside app/, each owning one concern:
```text
Voice-Prompt-Tool/
├── app/
│   ├── audio_recorder.py     # Microphone capture, WAV writing, RMS callback
│   ├── config.py             # Strongly-typed configuration with validation
│   ├── groq_transcriber.py   # Groq Whisper cloud client
│   ├── transcriber.py        # Local faster-whisper backend
│   ├── hotkeys.py            # Global push-to-talk listener (pynput)
│   ├── text_injector.py      # Keyboard emulation into the focused window
│   ├── text_postprocess.py   # Capitalisation, punctuation, replacements
│   ├── overlay.py            # Floating pill widget + history panel
│   ├── tray.py               # System tray icon and context menu
│   ├── state.py              # IDLE → RECORDING → TRANSCRIBING state machine
│   ├── history_service.py    # Append-only JSON transcription history
│   ├── single_instance.py    # Lock file guard against duplicate processes
│   ├── notifications.py      # Tray balloon notifications
│   ├── logger.py             # Logging configuration
│   └── main.py               # Bootstrap and recording pipeline
├── data/history.json         # Auto-created
├── logs/                     # Auto-created
├── models/                   # Local Whisper cache (auto-created)
├── temp/                     # Temporary audio files (auto-created)
├── config.json               # Auto-created on first run
├── requirements.txt
├── run.bat
└── .env                      # GROQ_API_KEY (not committed)
```
Runtime Data Flow
At a high level, a single press of the hotkey moves data through the system like this:
```text
Hotkey listener → State machine → Audio recorder → Transcriber (Groq | local)
                                                          │
                                                          ▼
                                      Post-processing → Text injector
                                                          │
                                                          ▼
                                      Clipboard + History + Overlay
```
This separation is what allows the transcriber to be swapped (cloud or local), the overlay to be disabled, and the post-processing to be customised, all without touching the recording or injection layers.
Audio Capture Pipeline
The recorder is intentionally explicit and thread-safe. Each press of the hotkey starts a dedicated worker that writes to a unique WAV file under temp/, maintains an in-memory buffer for live previews, and emits normalised RMS levels to the overlay.
A few details that matter in practice:
- Bounded sessions: every recording has a `max_record_seconds` cap to prevent runaway captures
- Minimum duration guard: presses shorter than `min_duration_seconds` are silently discarded
- Stale file cleanup: leftover temporary WAV files are removed on each new session
- Overflow detection: `sd.RawInputStream` overflows are logged but do not crash the pipeline
- Live RMS callback: the overlay subscribes to a normalised level (0.0–1.0) for the waveform
A simplified view of the capture loop:
```python
# Excerpt from the recorder's worker thread; `sd` is the sounddevice module.
with sd.RawInputStream(
    samplerate=self._config.sample_rate,
    channels=self._config.channels,
    dtype="int16",
    blocksize=self._config.block_frames,
) as stream:
    ready_event.set()  # tell the caller that capture has actually started
    while not stop_event.is_set():
        remaining_frames = max_frames - session.frames_captured
        if remaining_frames <= 0:
            # Hard cap reached: stop rather than record indefinitely.
            session.max_duration_reached = True
            break
        data, overflowed = stream.read(
            min(self._config.block_frames, remaining_frames)
        )
        if overflowed:
            self._logger.warning("Audio input overflow detected while recording.")
        wav_handle.writeframes(data)
        with session.buffer_lock:
            session.audio_buffer.extend(data)
        session.frames_captured += len(data) // bytes_per_frame
```
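The normalised RMS level that feeds the waveform is not shown in the excerpt. One plausible way to derive it from a block of 16-bit samples is sketched below. The function name `normalised_rms` is hypothetical, and the sketch assumes native-endian `int16` data passed as `bytes` (for example `bytes(data)` from the capture loop).

```python
import array
import math

INT16_FULL_SCALE = 32768.0  # magnitude of the most negative int16 sample


def normalised_rms(block: bytes) -> float:
    """Return the RMS level of an int16 PCM block, scaled to 0.0..1.0."""
    samples = array.array("h")  # "h" = signed 16-bit, native byte order
    # Drop a trailing odd byte, if any, so the block is a whole number of samples.
    samples.frombytes(block[: len(block) - len(block) % 2])
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return min(1.0, math.sqrt(mean_square) / INT16_FULL_SCALE)
```

For multi-channel input this treats all channels as one stream, which is usually good enough for a visual level meter.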
The result of a session is a RecordingResult that either carries a RecordingArtifact (file path, duration, sample rate, frame count) or an explicit ignored reason. That makes the downstream pipeline trivial to reason about: there is always a structured outcome, never an implicit failure.
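As an illustration of that "structured outcome" idea, the two types could look roughly like this. Only the names `RecordingResult` and `RecordingArtifact` and the listed artifact contents come from the description above. The field names, the `ignored_reason` attribute, and the `finish_session` helper are assumptions made for the sketch.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class RecordingArtifact:
    path: Path
    duration_seconds: float
    sample_rate: int
    frame_count: int


@dataclass(frozen=True)
class RecordingResult:
    """Exactly one of `artifact` or `ignored_reason` is set."""

    artifact: Optional[RecordingArtifact] = None
    ignored_reason: Optional[str] = None

    def __post_init__(self) -> None:
        if (self.artifact is None) == (self.ignored_reason is None):
            raise ValueError("set exactly one of artifact or ignored_reason")


def finish_session(
    path: Path, frame_count: int, sample_rate: int, min_duration_seconds: float
) -> RecordingResult:
    """Hypothetical helper: apply the minimum-duration guard to a finished capture."""
    duration = frame_count / sample_rate
    if duration < min_duration_seconds:
        return RecordingResult(ignored_reason=f"too short ({duration:.2f} s)")
    return RecordingResult(
        artifact=RecordingArtifact(path, duration, sample_rate, frame_count)
    )
```

Downstream code then branches on `result.artifact is None` once, rather than catching exceptions or checking for missing files.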
Transcription Strategy
The transcription layer is built around two interchangeable backends with the same public surface.
Cloud Backend: Groq Whisper
When GROQ_API_KEY is available, the cloud transcriber is preferred. It uses whisper-large-v3-turbo with response_format="verbose_json" so segments and a detected language are returned alongside the raw text.
```python
with audio_path.open("rb") as f:
    response = self._client.audio.transcriptions.create(
        file=(audio_path.name, f.read()),
        model=self._model,
        language=self._language,
        response_format="verbose_json",
        prompt=self._initial_prompt or None,
        temperature=0.0,
    )
```
Key choices:
- `temperature=0.0` for deterministic decoding
- explicit `language` hint (`ru` / `en`) when configured, otherwise auto-detect
- a configurable `initial_prompt` to bias the model toward domain vocabulary
- a 30-second client-side timeout so the UI never hangs indefinitely
- a guard that flags suspicious transcripts (gibberish patterns) without silently dropping them
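To make "resilient parsing" of the `verbose_json` response concrete, here is a hedged sketch that works on a plain dictionary. It assumes the payload exposes `text`, `language`, and a `segments` list whose items carry a `text` field. The `TranscriptionResult` fields and the function `parse_verbose_json` are invented for illustration and may not match the project's real types.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TranscriptionResult:
    text: str
    language: str | None = None
    segments: tuple[str, ...] = ()


def parse_verbose_json(payload: Mapping[str, Any]) -> TranscriptionResult:
    """Extract text, language and segments, tolerating missing or odd fields."""
    raw_segments = payload.get("segments")
    if not isinstance(raw_segments, list):
        raw_segments = []

    # Keep only well-formed, non-empty segment texts.
    segments = tuple(
        str(seg.get("text", "")).strip()
        for seg in raw_segments
        if isinstance(seg, Mapping) and str(seg.get("text", "")).strip()
    )

    # Prefer the top-level text; fall back to joining the segments.
    text = str(payload.get("text") or "").strip() or " ".join(segments)

    language = payload.get("language")
    if not (isinstance(language, str) and language):
        language = None
    return TranscriptionResult(text=text, language=language, segments=segments)
```

A response object from an SDK would first need converting to a dictionary. How that is done depends on the client library and is left out here.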
Local Backend: faster-whisper
When no API key is configured, the application falls back to faster-whisper running on the CPU. Model size, beam width, VAD filter, and compute type are all configurable, and the same TranscriptionResult shape is returned, so the rest of the pipeline does not need to know which backend produced the text.
Why Both Exist
This two-backend strategy is the difference between a demo and a tool you would actually keep installed:
- Groq gives fast, accurate dictation when online
- Local keeps the tool usable offline, on flights, or in restricted networks
- Same interface means the rest of the app (overlay, hotkey, injector) does not care which backend handled the request
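The "same interface" point can be sketched with a structural type. Everything below is illustrative: the `Transcriber` protocol, the two stub classes standing in for the Groq and faster-whisper backends, and the helper names are assumptions, and the method returns a plain string for brevity rather than a full result object.

```python
from pathlib import Path
from typing import Callable, Protocol


class Transcriber(Protocol):
    """Anything that turns an audio file into text."""

    def transcribe(self, audio_path: Path) -> str: ...


class StubCloudTranscriber:
    """Stand-in for the cloud backend; a real one would call the Groq API."""

    def __init__(self, api_key: str) -> None:
        self._api_key = api_key

    def transcribe(self, audio_path: Path) -> str:
        return f"[cloud] {audio_path.name}"


class StubLocalTranscriber:
    """Stand-in for the on-device backend."""

    def transcribe(self, audio_path: Path) -> str:
        return f"[local] {audio_path.name}"


def choose_transcriber(api_key: str) -> Transcriber:
    """Prefer the cloud backend when a key is present, otherwise go local."""
    if api_key.strip():
        return StubCloudTranscriber(api_key)
    return StubLocalTranscriber()


def handle_recording(
    audio_path: Path, transcriber: Transcriber, inject: Callable[[str], None]
) -> None:
    """The pipeline only sees the protocol, never a concrete backend."""
    inject(transcriber.transcribe(audio_path))
```

With this shape, adding a third backend would mean writing one more class with a `transcribe` method. The hotkey, overlay, and injector code would stay untouched.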
Configuration Model
The application is configured through config.json, generated with sensible defaults on first launch and validated on every load.
Why a Strongly-Typed Config Layer
Configuration parsing in this project is not a thin json.load. It is a deliberate layer that:
- merges user values on top of typed defaults
- validates every field with a precise error message (`ConfigError`)
- normalises hotkey tokens (`ctrl+shift`, `lctrl`, `space`, …) into a canonical form
- supports legacy keys for backward compatibility
- resolves paths relative to the project root, including in PyInstaller builds
The result is an AppConfig dataclass with nested, frozen sections (AudioConfig, TranscriptionConfig, OverlayUiConfig, GroqConfig, and so on) that the rest of the code can rely on without defensive checks.
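A heavily reduced sketch of such a layer is shown below, covering only two sections. `ConfigError` and `AppConfig` are named in the text above. The section classes `HotkeyConfig` and `OverlayConfig`, the alias table, the canonical hotkey form, and the `load_config` function are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


class ConfigError(ValueError):
    """Raised with the offending key in the message."""


# Hypothetical alias table; the real canonical form may differ.
_ALIASES = {"lctrl": "ctrl", "rctrl": "ctrl", "control": "ctrl"}


@dataclass(frozen=True)
class HotkeyConfig:
    combination: str = "ctrl+shift"


@dataclass(frozen=True)
class OverlayConfig:
    enabled: bool = True
    margin: int = 12


@dataclass(frozen=True)
class AppConfig:
    hotkey: HotkeyConfig = field(default_factory=HotkeyConfig)
    overlay: OverlayConfig = field(default_factory=OverlayConfig)


def normalise_hotkey(raw: str) -> str:
    tokens = [token.strip().lower() for token in raw.split("+")]
    if any(not token for token in tokens):
        raise ConfigError(f"hotkey.combination: empty token in {raw!r}")
    canonical = [_ALIASES.get(token, token) for token in tokens]
    return "+".join(dict.fromkeys(canonical))  # de-duplicate, keep order


def load_config(user: Mapping[str, Any]) -> AppConfig:
    """Merge user values over typed defaults and validate each field."""
    defaults = AppConfig()
    hotkey = user.get("hotkey", {})
    overlay = user.get("overlay", {})

    margin = overlay.get("margin", defaults.overlay.margin)
    if isinstance(margin, bool) or not isinstance(margin, int) or margin < 0:
        raise ConfigError(f"overlay.margin: expected a non-negative integer, got {margin!r}")

    enabled = overlay.get("enabled", defaults.overlay.enabled)
    if not isinstance(enabled, bool):
        raise ConfigError(f"overlay.enabled: expected true or false, got {enabled!r}")

    combination = normalise_hotkey(str(hotkey.get("combination", defaults.hotkey.combination)))
    return AppConfig(HotkeyConfig(combination), OverlayConfig(enabled, margin))
```

Because the sections are frozen, a value that passed validation at load time cannot be mutated later by accident, which is what lets callers skip defensive checks.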
A Minimal config.json
```json
{
  "hotkey": {
    "combination": "ctrl+shift"
  },
  "transcription": {
    "model_size": "small",
    "language_mode": "ru"
  },
  "groq": {
    "enabled": true,
    "api_key": "",
    "model": "whisper-large-v3-turbo"
  },
  "overlay": {
    "enabled": true,
    "margin": 12
  }
}
```
Notable knobs:
| Key | Accepted values | Description |
| ----------------------------- | ------------------------------------ | ------------------------------------------------------- |
| hotkey.combination | e.g. ctrl+shift, ctrl+alt+space | Push-to-talk key combination |
| transcription.language_mode | ru · en · auto | Explicit language is faster and more accurate than auto |
| transcription.model_size | tiny · base · small · medium | Local model size (ignored when Groq is enabled) |
| groq.enabled | true · false | true = Groq cloud, false = local faster-whisper |
| overlay.enabled | true Β· false | Show or hide the floating pill widget |
| overlay.margin | integer (px) | Distance from the screen edge |
The GROQ_API_KEY environment variable from .env takes precedence over an empty groq.api_key in config.json.
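The precedence rule can be expressed in a few lines. This sketch assumes the `.env` file has already been loaded into the process environment, and it additionally assumes that a non-empty `groq.api_key` in `config.json` is used as-is, a case the text above does not spell out. The function name is hypothetical.

```python
import os
from typing import Mapping


def resolve_groq_api_key(
    config_value: str, environ: Mapping[str, str] = os.environ
) -> str:
    """Return the key to use, or an empty string when none is configured."""
    from_config = (config_value or "").strip()
    if from_config:
        # Assumption: an explicit value in config.json is respected.
        return from_config
    # An empty config value defers to GROQ_API_KEY from the environment.
    return environ.get("GROQ_API_KEY", "").strip()
```

An empty return value is the signal to use the local faster-whisper backend.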
Reliability and Polish
Small details add up to the feeling that this is a tool, not a script.
Silent, Single-Instance Launch
- `run.bat` bootstraps a `.venv/` on first run, installs dependencies, and starts the app silently via `pythonw.exe`
- Subsequent launches skip setup and start immediately
- A lock file blocks duplicate processes so the global hotkey never has two listeners
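A lock-file guard of this kind can be built on an exclusive file creation. The sketch below is one common approach, not necessarily the one in `single_instance.py`. The class and exception names are assumptions, and it deliberately ignores stale locks left behind by a crash.

```python
import os
from pathlib import Path
from typing import Optional


class AlreadyRunning(RuntimeError):
    """Another process holds the lock file."""


class SingleInstanceLock:
    def __init__(self, path: Path) -> None:
        self._path = path
        self._fd: Optional[int] = None

    def acquire(self) -> None:
        try:
            # O_EXCL makes creation fail if the file already exists.
            self._fd = os.open(self._path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError as exc:
            raise AlreadyRunning(f"lock file exists: {self._path}") from exc
        # Record the owner's PID to help with manual diagnostics.
        os.write(self._fd, str(os.getpid()).encode("ascii"))

    def release(self) -> None:
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None
            self._path.unlink(missing_ok=True)

    def __enter__(self) -> "SingleInstanceLock":
        self.acquire()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.release()
```

A production version would also need a policy for stale locks, for example checking whether the recorded PID is still alive before refusing to start.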
Predictable State
- A single state machine (`IDLE → RECORDING → TRANSCRIBING → IDLE`) drives the entire app
- The overlay, tray icon, and notifications all read from the same state, so visual feedback never lies about what the app is doing
Defensive Boundaries
- The audio recorder enforces startup and stop timeouts (5 s each) and never blocks the UI forever
- Groq calls have an explicit 30 s timeout and fall back gracefully on failure
- Temporary WAV files are cleaned on a configurable schedule
- Logs are rotated under `logs/` so diagnostics survive crashes
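Log rotation of this kind is available in the standard library. The following sketch shows one way to set it up. The file name `app.log`, the size limit, the backup count, and the function name `build_logger` are assumptions rather than the project's actual settings.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(
    log_dir: Path,
    name: str = "voice_prompt_tool",
    max_bytes: int = 1_000_000,
    backups: int = 3,
) -> logging.Logger:
    """Return a logger that writes to a size-rotated file under `log_dir`."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Avoid stacking duplicate handlers if this is called more than once.
    if not any(isinstance(h, RotatingFileHandler) for h in logger.handlers):
        handler = RotatingFileHandler(
            log_dir / "app.log",
            maxBytes=max_bytes,
            backupCount=backups,
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

When `app.log` reaches the size limit it is renamed to `app.log.1` and a fresh file is started, so disk usage stays bounded while recent history is kept.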
Built-in Diagnostics
When something goes wrong (a missing API key, a busy hotkey combination, a refused microphone), the failure is logged with enough context (path, model, language, timeout) to reproduce it. That makes the tool genuinely supportable rather than a black box.
Why This Project Stands Out
Voice Prompt Tool is small on purpose, but it combines several layers that are rarely shaped into one coherent desktop utility:
- a global push-to-talk hotkey that works from any focused window
- a thread-safe audio capture pipeline with bounded sessions and live RMS feedback
- two interchangeable transcription backends (Groq cloud + local Whisper) behind a single interface
- an always-on-top overlay that mirrors a real state machine
- automatic keyboard injection and clipboard sync into the active application
- a strongly-typed configuration layer with validation and legacy support
- a silent, single-instance Windows launcher with one-time auto-bootstrap
It is not just a wrapper around the Whisper API. It is a focused Windows tool where each layer (audio, transcription, injection, overlay, configuration) is deliberately designed to behave like a small piece of system-level infrastructure.