
Audio Module (audio)

The audio crate is the Driven Adapter responsible for all audio capture and playback plumbing. It implements the Core’s AudioInterface and exposes ports (traits) for microphones, speakers, and acoustic processing. It contains no machine learning; Wake Word, STT, and other audio ML live in inference.

| Adapter | Implements Port(s) | Capability Feature | Technology / Purpose |
| --- | --- | --- | --- |
| MicrophoneAdapter | AudioSource | audio_cpal | CPAL → cross-platform local audio capture |
| SpeakerAdapter | Speaker | audio_cpal | CPAL → cross-platform local audio playback |
| WebRtcAdapter | AudioProcessor | audio_webrtc | WebRTC audio processing → acoustic echo cancellation (AEC) and noise suppression before Inference |
| MockAdapter | AudioSource, Speaker | audio_mock | Simulated audio capture/playback (and pipeline) for CI testing |

The audio crate is driven by the core orchestrator. It receives commands to start or stop audio streams and provides clean PCM data that the core can then hand over to inference for Speech-to-Text or Wake Word detection.

Like other domain crates, audio keeps its domain logic (buffering, stream orchestration) clearly separated from the hardware-specific adapter implementations.

crates/audio/
├── src/
│ ├── domain/ # AudioController, AudioProfile, RingBuffer
│ │ └── ports.rs # Internal Traits: `AudioCapturePort`, `DspEnginePort`
│ ├── adapters/ # Hardware and OS implementations
│ │ ├── alsa.rs # Linux ALSA capture
│ │ ├── cpal.rs # Cross-platform audio (fallback/desktop)
│ │ └── dsp.rs # Beamforming, AEC, Noise Suppression
│ └── lib.rs # Implements Core's `AudioInterface`
└── Cargo.toml
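
The port traits listed above might look roughly like the following. This is an illustrative sketch only: the names come from this document, but the exact signatures in `crates/audio` may differ.

```rust
// Hypothetical sketches of the audio ports; names follow the docs,
// signatures are illustrative assumptions, not the real crate API.

/// Supplies PCM frames from a capture device (audio_in).
pub trait AudioSource {
    fn start(&mut self) -> Result<(), String>;
    fn read_frame(&mut self, buf: &mut [i16]) -> Result<usize, String>;
    fn stop(&mut self);
}

/// Pushes PCM frames (e.g. TTS output) to an output device (audio_out).
pub trait Speaker {
    fn play(&mut self, pcm: &[i16]) -> Result<(), String>;
    fn stop(&mut self);
}

/// Conditions captured audio (AEC, noise suppression) before Inference.
pub trait AudioProcessor {
    fn process(&mut self, frame: &[i16]) -> Vec<i16>;
}

/// Trivial passthrough processor, a stand-in for the WebRTC adapter.
pub struct Passthrough;

impl AudioProcessor for Passthrough {
    fn process(&mut self, frame: &[i16]) -> Vec<i16> {
        frame.to_vec()
    }
}

fn main() {
    let mut p = Passthrough;
    assert_eq!(p.process(&[1, 2, 3]), vec![1, 2, 3]);
    println!("passthrough processor works");
}
```

Keeping the ports as plain traits is what lets CPAL, ALSA, WebRTC, and the mocks be swapped behind the same interface.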

Crucial distinction: audio contains absolutely zero machine learning. Tasks such as Wake Word detection or Speech-to-Text are in inference. The Audio domain only handles:

  • Capture: Pulling raw or processed PCM from the hardware (microphones)
  • Playback: Pushing PCM (e.g., TTS output) to speakers or line-out
  • Buffering: Lock-free ring buffer between capture and consumers (e.g., pre-roll for Wake Word); bounded to avoid OOM
  • Signal conditioning: AEC and noise suppression before the stream is handed to Inference
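
The bounded-buffer policy in the Buffering bullet can be illustrated with a minimal sketch. The real RingBuffer is lock-free; this toy version (a `VecDeque`, names hypothetical) only demonstrates the drop-oldest-on-overflow behaviour that prevents OOM.

```rust
use std::collections::VecDeque;

/// Minimal illustration of the bounded-buffer policy: when full, the
/// oldest samples are dropped instead of the buffer growing without
/// limit. Not lock-free; for the drop policy only.
struct BoundedPcmBuffer {
    samples: VecDeque<i16>,
    capacity: usize,
}

impl BoundedPcmBuffer {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    /// Capture side: push a frame, evicting the oldest samples on overflow.
    fn push(&mut self, frame: &[i16]) {
        for &s in frame {
            if self.samples.len() == self.capacity {
                self.samples.pop_front(); // drop oldest, keep pre-roll fresh
            }
            self.samples.push_back(s);
        }
    }

    /// Consumer side: drain up to `n` samples for Inference.
    fn pop(&mut self, n: usize) -> Vec<i16> {
        let k = n.min(self.samples.len());
        self.samples.drain(..k).collect()
    }

    fn len(&self) -> usize {
        self.samples.len()
    }
}

fn main() {
    let mut buf = BoundedPcmBuffer::new(4);
    buf.push(&[1, 2, 3, 4, 5, 6]); // overflow: samples 1 and 2 are dropped
    assert_eq!(buf.len(), 4);
    assert_eq!(buf.pop(2), vec![3, 4]);
}
```

Dropping the oldest samples (rather than the newest) is what preserves the most recent audio as pre-roll for wake-word detection.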

Why separate audio plumbing from ML:

  • Testability: Audio pipelines can be tested with mock hardware and inference
  • Portability: Same plumbing (CPAL, WebRTC AEC) runs on x86, ARM, or mock
  • Resource boundaries: Audio streaming is real-time; ML is batch-oriented
  • Single responsibility: audio = “get clean PCM in and out”; inference = “run models on that PCM”

Sensors (microphones) and actuators (speakers) sit on the Driven side. The Core initiates audio streams and pulls data; hardware does not dictate control flow.

These match the architecture diagram: AudioManager contains AudioConfig, RingBuffer, and the ports AudioSource, Speaker, AudioProcessor, WakeWordDetector. The WakeWordDetector port is implemented in inference (see Wake Word and the Interface in this section).

| Component | Responsibility |
| --- | --- |
| AudioManager | Central facade implementing AudioInterface. Orchestrates hardware adapters and exposes a single entry point for the SessionManager and voice flows. |
| AudioConfig | Hardware and pipeline constraints: sample rates (16 kHz for ML), channel count, buffer sizes. Ensures PCM format matches what Inference expects. |
| RingBuffer | Lock-free, cross-thread-safe buffer for continuous PCM. Capture writes, Core/Inference reads; bounded capacity so overflow drops old samples instead of causing OOM. Provides pre-roll for Wake Word. |
| AudioSource | Port: supplies raw or processed PCM from the microphone(s) (audio_in). Implemented by e.g. the CPAL/ALSA adapters. |
| Speaker | Port: outputs PCM (audio_out, e.g. TTS from Inference) to hardware (speakers / line-out). Implemented by e.g. the CPAL adapter. |
| AudioProcessor | Port: AEC and noise suppression (e.g. WebRTC). Consumes the capture buffer and outputs clean processed_audio for Inference. |
| WakeWordDetector | Port: wake-word detection on PCM. Implemented in inference (e.g. Sherpa-ONNX); audio only holds the port and passes the stream to the Core for Inference. |
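
The AudioConfig constraints from the table can be sketched as a small struct. The field names and the `for_inference` constructor are assumptions for illustration; only the 16 kHz mono requirement comes from the docs.

```rust
/// Hypothetical AudioConfig mirroring the constraints in the table:
/// 16 kHz mono PCM for the ML pipeline. Field names are assumptions.
#[derive(Clone, Copy, Debug)]
struct AudioConfig {
    sample_rate_hz: u32,
    channels: u16,
    buffer_frames: usize,
}

impl AudioConfig {
    /// Format that Inference expects: 16 kHz, mono.
    fn for_inference() -> Self {
        Self { sample_rate_hz: 16_000, channels: 1, buffer_frames: 512 }
    }

    /// Number of samples needed to hold `ms` milliseconds of audio.
    fn samples_for_ms(&self, ms: u32) -> usize {
        (self.sample_rate_hz as usize / 1000) * ms as usize * self.channels as usize
    }
}

fn main() {
    let cfg = AudioConfig::for_inference();
    // 500 ms of wake-word pre-roll at 16 kHz mono = 8000 samples
    assert_eq!(cfg.samples_for_ms(500), 8000);
}
```

A helper like `samples_for_ms` is the kind of arithmetic that sizes the RingBuffer's pre-roll capacity.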

Wake Word Detection (WWD) is implemented in inference, not in audio. The transcript and architecture clarify the flow:

  1. Audio captures and conditions PCM (AEC, noise suppression, buffering) and hands clean PCM to the Core.
  2. The Core (SessionManager) passes that stream to inference, where the WakeWordDetector port (e.g. Sherpa-ONNX) runs.
  3. When a wake word is detected, Inference emits a wake-word event (e.g. WWD_event).
  4. That event is delivered to the Interface / Core (e.g. via EventBus or an inference→core callback). The SessionManager then transitions session state (e.g. to Listening or Recording) and drives the rest of the voice pipeline (STT → LLM → TTS).

So audio never runs ML; it only provides the conditioned stream. The Core coordinates Audio and Inference and reacts to wake-word events from Inference.
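
The four steps above can be sketched as an event handler in the Core. The event and state names here (`InferenceEvent`, `SessionState`) are illustrative placeholders, not the real Core types.

```rust
/// Hedged sketch of the wake-word event flow (steps 1-4 above).
/// Type and variant names are illustrative assumptions.
#[derive(Debug, PartialEq)]
enum SessionState {
    Idle,
    Listening,
}

enum InferenceEvent {
    /// Emitted by Inference when the wake-word model fires (step 3).
    WakeWordDetected { keyword: String },
}

struct SessionManager {
    state: SessionState,
}

impl SessionManager {
    /// Core reacts to events delivered from Inference (step 4) and
    /// transitions session state to drive the voice pipeline.
    fn on_inference_event(&mut self, event: InferenceEvent) {
        match event {
            InferenceEvent::WakeWordDetected { keyword } => {
                println!("wake word '{}' detected -> Listening", keyword);
                self.state = SessionState::Listening;
            }
        }
    }
}

fn main() {
    let mut sm = SessionManager { state: SessionState::Idle };
    sm.on_inference_event(InferenceEvent::WakeWordDetected {
        keyword: String::from("hey"),
    });
    assert_eq!(sm.state, SessionState::Listening);
}
```

Note that audio appears nowhere in this handler: it only feeds the conditioned stream that made the detection possible.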

Cancellation-Safe Interface (Saga Rollback Support)


audio is a participant in Core’s Saga rollback. When the SessionManager cancels an in-progress voice flow (e.g., the user interrupts while the system is responding), it issues compensating commands to Audio in a specific order. AudioManager must support clean cancellation at any point:

  • stop_capture(): halts microphone input, flushes or discards the RingBuffer, releases the CPAL/ALSA device handle. Must be safe to call even if capture was never started.
  • stop_playback(): halts TTS output mid-sentence via the Speaker port, does not block or wait for drain.

Failure to support these cleanly will leave the microphone open or the speaker playing after the Core has moved back to Idle, causing a Zombie State. See Core: User-Interruption & State Rollback for the full rollback sequence.
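
The idempotency requirement can be sketched as follows. The internals (boolean flags) are assumptions standing in for real device handles; the point is that both stop methods are safe no-ops when nothing is running.

```rust
/// Sketch of idempotent cancellation: stop_capture() must be safe even
/// if capture was never started. Internal fields are illustrative.
struct AudioManager {
    capturing: bool,
    playing: bool,
}

impl AudioManager {
    fn new() -> Self {
        Self { capturing: false, playing: false }
    }

    /// Compensating command: halt capture, discard buffered PCM,
    /// release the device handle. A no-op when capture is not running.
    fn stop_capture(&mut self) {
        if self.capturing {
            self.capturing = false;
            // flush the RingBuffer and release the CPAL/ALSA handle here
        }
    }

    /// Compensating command: cut playback immediately; never block
    /// waiting for the output buffer to drain.
    fn stop_playback(&mut self) {
        if self.playing {
            self.playing = false;
        }
    }
}

fn main() {
    let mut audio = AudioManager::new();
    audio.stop_capture();  // never started: must not panic
    audio.stop_capture();  // double stop: still safe
    audio.stop_playback();
    assert!(!audio.capturing && !audio.playing);
}
```

Guarding every stop with an "is it actually running?" check is what prevents the Zombie State described above when the Saga replays compensations.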

AudioSource (MicrophoneAdapter)

  • Responsibility: Supplies raw or processed PCM from the microphone(s)
  • Implemented by: MicrophoneAdapter, gated by audio_cpal (real hardware) or audio_mock (tests)
  • Technology: Uses the cpal crate to pull raw audio from the hardware mic. On desktop and rockchip builds, audio_cpal selects the concrete adapter; in CI the audio_mock feature provides a mock microphone implementation wired into the same port.
Speaker (SpeakerAdapter)

  • Responsibility: Outputs PCM (e.g., TTS) to the speaker or Bluetooth line-out
  • Implemented by: SpeakerAdapter, gated by audio_cpal or audio_mock
  • Technology: Uses cpal to push audio to the default output device. As with capture, audio_cpal selects the concrete adapter on desktop/rockchip and audio_mock swaps in a mock speaker for tests.
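
The capability-feature gating described for both adapters can be sketched with `cfg` attributes. The feature names come from this document; the function here is hypothetical, and in the real crate the gating selects concrete implementations behind the ports.

```rust
// Illustrative sketch of capability-feature gating. In Cargo.toml the
// features (audio_cpal, audio_mock) would be declared under [features];
// compiled without them, the mock path below is chosen.

#[cfg(feature = "audio_cpal")]
fn backend_name() -> &'static str {
    "cpal" // real hardware capture/playback via the cpal crate
}

#[cfg(not(feature = "audio_cpal"))]
fn backend_name() -> &'static str {
    "mock" // CI/test profile: audio_mock stands in for hardware
}

fn main() {
    println!("selected audio backend: {}", backend_name());
}
```

Because both variants satisfy the same signature, callers compile unchanged whichever feature set is active.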
AudioProcessor (WebRtcAdapter)

  • Responsibility: Echo cancellation and noise suppression, vital for a voice assistant. When the device is speaking (TTS), the microphone picks up the speaker’s audio; without AEC, the STT/Wake Word models would hear themselves and trigger false positives. Performs hardware-efficient loopback cancellation and noise suppression before handing clean PCM to Inference.
  • Implemented by: WebRtcAdapter, gated by audio_webrtc (real AEC pipeline) and complemented by audio_mock (mock processing) in CI.
  • Technology: Uses webrtc-audio-processing (and related bindings such as sonora on rockchip) for AEC and optional noise suppression.
MockAdapter

  • Responsibility: Provide a fully simulated audio stack so the engine can exercise capture, processing, and playback flows in CI without real hardware.
  • Implements ports: AudioSource, Speaker, and a mocked AudioProcessor pipeline, all behind the audio_mock feature flag.
  • Technology / Purpose: In the test profile, replaces CPAL and WebRTC bindings with deterministic, fast mocks so audio behaviour can be tested in isolation from devices.
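
A deterministic mock capture source might look like the sketch below. This exact code is an illustration of the idea, not the crate's actual mock: it emits a reproducible ramp instead of real PCM so pipeline tests behave identically on every CI run.

```rust
/// Sketch of a deterministic mock capture source in the spirit of the
/// audio_mock feature. Names and behaviour are illustrative only.
struct MockAudioSource {
    t: usize, // running sample counter, so frames are reproducible
}

impl MockAudioSource {
    fn new() -> Self {
        Self { t: 0 }
    }

    /// Produce a deterministic frame (a simple ramp) instead of real
    /// microphone PCM, so tests need no hardware and never flake.
    fn read_frame(&mut self, len: usize) -> Vec<i16> {
        let frame: Vec<i16> =
            (0..len).map(|i| ((self.t + i) % 100) as i16).collect();
        self.t += len;
        frame
    }
}

fn main() {
    let mut mic = MockAudioSource::new();
    assert_eq!(mic.read_frame(4), vec![0, 1, 2, 3]);
    assert_eq!(mic.read_frame(4), vec![4, 5, 6, 7]);
}
```

Because the output depends only on the sample counter, a test can assert on exact buffer contents after any sequence of reads.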
```mermaid
flowchart TD
  A[Microphone Hardware] --> B[AudioSource Port cpal]
  B --> C[RingBuffer]
  C --> D[AudioProcessor AEC/WebRTC]
  D --> E[Clean PCM to Inference<br/>Wake Word, STT, VAD]
  F[TTS Response from Inference] --> G[Speaker Port cpal]
  G --> H[Hardware Speaker]
```