NVIDIA Jetson Voice Agent

A voice-activated AI companion designed to run locally on the NVIDIA Jetson Nano. The project uses a Dual-Prompt architecture to address tool-calling hallucinations in LLMs, enabling reliable home automation control and multilingual conversation through a Speech-to-Text (STT) and Text-to-Speech (TTS) pipeline.

Features

Dual-Prompt Intent Classification: Uses a dedicated Tool Selector system prompt to suppress tool-use hallucinations and restrict the model to the functions that are actually available.
Voice-First Interface: Hands-free interaction using a speaker-microphone setup.
Multilingual Support: Automatic language detection for English and Mandarin Chinese, with localized voice synthesis.
Local Inference: All processing (LLM, STT, TTS) happens on-device, so no audio or text leaves the Jetson and no network round trip is needed.
Knowledge Base Injection: Dynamically appends real-time sensor/device data into the chat context.
Contextual Memory: Maintains a rolling conversation history for natural, chatty interactions.

System Architecture

The system operates via a continuous orchestration loop:

Capture: PyAudio records audio from the USB speaker-microphone.
Transcription: Whisper.cpp converts audio to text.
Phase 1 - Intent Selection: A strict system prompt determines if a tool (e.g., temperature, lights) is required.
Tool Execution: If required, Python functions fetch real-world data and wrap it in a KNOWLEDGE BASE tag.
Phase 2 - Response Generation: A friendly Companion prompt processes the input plus retrieved data to form a natural response.
Synthesis: Piper TTS generates audio in the detected language.
Playback: aplay outputs the response through the speaker.
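
The two LLM phases in the loop above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the prompt texts, the JSON reply format, the tool names `get_temperature` and `get_lights`, and the `llm(system_prompt, user_text)` callable are all hypothetical. In practice the callable would wrap a request to the local Ollama model.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical prompts; the real wording is not given in this README.
TOOL_SELECTOR_PROMPT = (
    "You are a tool selector. Reply with JSON only, in the form "
    '{"tool": "<name>"}, where <name> is get_temperature, get_lights, or none.'
)
COMPANION_PROMPT = (
    "You are a friendly companion. If a KNOWLEDGE BASE section is present, "
    "answer using only the facts it contains."
)


def get_temperature() -> str:
    return "living room temperature: 22.5 C"  # placeholder reading


def get_lights() -> str:
    return "living room lights: off"  # placeholder state


# Registry of the only tools the agent may run.
TOOLS: Dict[str, Callable[[], str]] = {
    "get_temperature": get_temperature,
    "get_lights": get_lights,
}

LLM = Callable[[str, str], str]  # (system_prompt, user_text) -> reply


def select_tool(llm: LLM, user_text: str) -> Optional[str]:
    """Phase 1: ask the strict prompt for a tool and validate the answer."""
    raw = llm(TOOL_SELECTOR_PROMPT, user_text)
    try:
        name = json.loads(raw).get("tool", "none")
    except (json.JSONDecodeError, AttributeError):
        return None  # malformed reply: treat as "no tool"
    # Any name outside the registry is rejected, so an invented tool never runs.
    return name if name in TOOLS else None


def run_turn(llm: LLM, user_text: str) -> str:
    """Run one full turn: intent selection, tool execution, response."""
    tool = select_tool(llm, user_text)
    prompt = user_text
    if tool is not None:
        prompt = f"KNOWLEDGE BASE: {TOOLS[tool]()}\n\n{user_text}"
    # Phase 2: the companion prompt sees the user text plus any tool output.
    return llm(COMPANION_PROMPT, prompt)
```

Validating the Phase 1 reply against a fixed registry is what keeps a hallucinated tool name from reaching the execution step in this sketch; the conversational model in Phase 2 never decides which function to call.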

Hardware Requirements

Compute: NVIDIA Jetson Nano (4GB / Developer Kit recommended)
Audio: USB speaker-microphone combo
Storage: High-speed microSD card (64GB+) or external SSD

Software Prerequisites

Ensure the following tools are installed and paths are correctly mapped:

Ollama for local model hosting
Whisper.cpp for transcription
Piper TTS for neural voice synthesis
FFmpeg for audio resampling
Python dependencies for orchestrating audio, prompts, and tool execution
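
A quick way to confirm that the command-line tools are reachable is to look them up on the PATH before starting the agent. The sketch below assumes typical executable names; in particular `whisper-cli` and `piper` depend on how Whisper.cpp and Piper were built or installed, so adjust the mapping to match the local setup. The `find_missing_tools` helper is hypothetical.

```python
import shutil

# Label -> executable name. These names are assumptions for illustration;
# older Whisper.cpp builds, for example, name their binary differently.
REQUIRED_TOOLS = {
    "Ollama": "ollama",
    "Whisper.cpp": "whisper-cli",
    "Piper TTS": "piper",
    "FFmpeg": "ffmpeg",
    "aplay": "aplay",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the labels of tools whose executable is not found on PATH."""
    return [label for label, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```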

Why This Project

I built this voice agent to explore how local AI companion systems can stay reliable and private without cloud dependencies. The dual-prompt design separates intent classification from response generation, which removes the common tool-hallucination failure mode and lets the assistant stay friendly while using only verified tool outputs.

Future Roadmap

Wake-word detection with Porcupine or Snowboy
Home Assistant integration for real IoT device control
Sliding window memory trimming to prevent context overflow
Expanded support for additional local models and languages
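
As a rough idea of what the sliding window trimming item could look like, the sketch below keeps the most recent messages that fit within a size budget. It is only one possible approach: the `trim_history` function is hypothetical, and it counts characters as a crude stand-in for tokens, which a real implementation would likely replace with the model's tokenizer.

```python
def trim_history(messages, max_chars=4000):
    """Keep the newest messages whose combined length fits within max_chars.

    The most recent message is always kept, even if it alone exceeds the
    budget, so the current user turn is never dropped.
    """
    kept = []
    used = 0
    # Walk from newest to oldest and stop once the budget would be exceeded.
    for msg in reversed(messages):
        cost = len(msg["content"])
        if kept and used + cost > max_chars:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```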