
Inference Module (inference)

The inference crate acts as the “Brain” (Driven Adapter) of the system, handling all machine learning tasks. It implements the InferenceInterface defined by the Core, providing a unified abstraction for heterogeneous AI backends while maintaining strict separation between “thinking” (inference) and “acting” (tool execution).

| Adapter | Implements Port(s) | Capability Feature | Technology / Supported Models |
| --- | --- | --- | --- |
| VisionModelAdapter | ObjectDetector, VisualQA, ImageCaptioner | infer_rknn | RKNN SDK → Rockchip NPU Vision/CNN models |
| LanguageModelAdapter | LLM | infer_rkllm | RKLLM SDK → Rockchip NPU Language models (Llama, Qwen) |
| | LLM, EmbeddingModel | infer_llamacpp_cpu | llama-cpp-2 → CPU-based fallback (ARM NEON) |
| | LLM, EmbeddingModel | infer_llamacpp_vulkan | llama-cpp-2 → GPU-accelerated (Vulkan/CUDA) |
| AudioMLAdapter | WakeWordDetector, SpeechToText, TextToSpeech, VoiceActivityDetector | infer_sherpa | Sherpa-ONNX → CPU-optimized Audio ML |
| ToolAdapter | ToolExecution | infer_mcp_client | mcp-rs → External action routing via JSON-RPC |
| MockAdapter | All Ports (Mock) | infer_mock | Simulated ML backend for CI/testing |

As a driven adapter, the inference crate sits on the right side of the Hexagon: it is invoked only by core to perform heavy ML tasks (STT, TTS, LLM generation, wake-word detection). Notably, inference also implements an MCP Client adapter, meaning it can reach further out to external MCP Servers when the LLM decides to use a tool.

crates/inference/
├── src/
│   ├── domain/           # InferenceController, ResourceManager
│   │   └── ports.rs      # Traits: `LLM`, `ObjectDetector`, `WakeWordDetector`, etc.
│   ├── adapters/         # Concrete ML backend implementations
│   │   ├── rknn.rs       # Rockchip NPU Vision/CNN models
│   │   ├── rkllm.rs      # Rockchip NPU LLMs
│   │   ├── llamacpp.rs   # CPU/GPU fallback for LLMs
│   │   ├── sherpa.rs     # CPU-optimized Audio ML (Wake word, STT)
│   │   └── mcp_client.rs # External Tool Execution via MCP
│   └── lib.rs            # Implements Core's `InferenceInterface`
└── Cargo.toml

The central facade implementing InferenceInterface:

pub struct InferenceController {
    resource_manager: ResourceManager,
    adapters: InferenceAdapterRegistry,
}

impl InferenceInterface for InferenceController {
    fn execute_llm(&self, prompt: &str, context: &InferenceContext) -> Result<LLMResponse> { /* ... */ }
    fn detect_wake_word(&self, audio: &[f32]) -> Result<WakeWordResult> { /* ... */ }
    fn transcribe(&self, audio: &[f32]) -> Result<String> { /* ... */ }
    // ... other inference methods
}

The domain defines capability traits; adapters implement one or more:

| Category | Ports |
| --- | --- |
| Audio | WakeWordDetector, VoiceActivityDetector, SpeakerDiarizer, SpeechToText, TextToSpeech |
| Vision | ObjectDetector, ImageCaptioner (multimodal), VisualQA, SemanticChangeDetector |
| Language | LLM (text generation), EmbeddingModel (RAG), FunctionCaller (tool use / grammars) |
| Actions | ToolExecution (MCP client for external tool calls) |
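As an illustration, two of the audio ports and a multi-port adapter might look like this. This is a minimal sketch only: the actual trait signatures live in `domain/ports.rs` and may differ, and the error type here is simplified to `String`.

```rust
// Hypothetical sketch of two capability ports; real signatures may differ.
pub trait SpeechToText {
    /// Transcribe mono PCM samples into text.
    fn transcribe(&self, audio: &[f32]) -> Result<String, String>;
}

pub trait VoiceActivityDetector {
    /// Return true if the samples appear to contain speech.
    fn has_speech(&self, audio: &[f32]) -> bool;
}

// One adapter may implement several ports (as the Sherpa adapter does):
pub struct DummyAudioAdapter;

impl SpeechToText for DummyAudioAdapter {
    fn transcribe(&self, audio: &[f32]) -> Result<String, String> {
        if audio.is_empty() {
            return Err("empty audio buffer".into());
        }
        Ok(format!("<transcript of {} samples>", audio.len()))
    }
}

impl VoiceActivityDetector for DummyAudioAdapter {
    fn has_speech(&self, audio: &[f32]) -> bool {
        audio.iter().any(|s| s.abs() > 0.01) // naive energy check
    }
}

fn main() {
    let adapter = DummyAudioAdapter;
    let audio = vec![0.2_f32; 160];
    if adapter.has_speech(&audio) {
        println!("{}", adapter.transcribe(&audio).unwrap());
    }
}
```

Because the Core only sees the traits, any adapter that implements them can be swapped in behind a feature flag.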

We strictly split Rockchip NPU acceleration into two adapters:

| Feature | Adapter | Purpose | Models |
| --- | --- | --- | --- |
| infer_rknn | RknnAdapter | Vision and CNNs | Object Detection, VisualQA, Image Classification |
| infer_rkllm | RkllmAdapter | Large Language Models | Llama, Mistral, Qwen (via rk-llama.cpp fork) |

Why separate? The two SDKs expose different C APIs and runtimes, have different load patterns and resource profiles, and keeping them behind separate features prevents monolithic builds.
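In build terms, the split can be sketched as each runtime living behind its own Cargo feature, so a build that needs only vision never compiles or links the LLM runtime. The feature names come from the tables above; the module contents are placeholders.

```rust
// Build-configuration sketch: each NPU runtime sits behind its own Cargo
// feature, so enabling one never pulls in the other.
#[cfg(feature = "infer_rknn")]
pub mod rknn {
    // Would link against librknnrt (Vision/CNN models only).
    pub const ADAPTER: &str = "RknnAdapter";
}

#[cfg(feature = "infer_rkllm")]
pub mod rkllm {
    // Would link against librkllmrt (LLMs only).
    pub const ADAPTER: &str = "RkllmAdapter";
}

/// Lists whichever NPU adapters this build was compiled with.
pub fn enabled_adapters() -> Vec<&'static str> {
    let mut adapters = Vec::new();
    #[cfg(feature = "infer_rknn")]
    adapters.push(rknn::ADAPTER);
    #[cfg(feature = "infer_rkllm")]
    adapters.push(rkllm::ADAPTER);
    adapters
}

fn main() {
    println!("NPU adapters in this build: {:?}", enabled_adapters());
}
```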

| Feature | Adapter | Purpose |
| --- | --- | --- |
| infer_sherpa | SherpaAdapter | Wake Word, STT, TTS, VAD (ONNX-based, CPU-optimized) |

Architectural decision: CPU vs. NPU for Audio:

  • Battery: Wake Word runs 24/7; keeping NPU powered on drains battery. Lightweight ONNX on “LITTLE” CPU cores saves power.
  • Compatibility: Audio models are difficult to quantize for RKNN without accuracy loss. CPU-based ONNX guarantees high accuracy.
  • Resource reservation: Running audio on the CPU reserves 100% of the NPU for LLMs and Vision.

| Features | Adapter | Ports | Purpose |
| --- | --- | --- | --- |
| infer_llamacpp_cpu, infer_llamacpp_vulkan | LlamaCppAdapter | LLM, EmbeddingModel | Host development + fallback |

Essential for development on host/desktop where no NPU exists. Supports GGUF models with grammar/function-calling.

The InferenceController selects adapters based on:

  • Model format (RKNN → NPU, GGUF → LlamaCpp, ONNX → Sherpa)
  • Hardware availability (NPU cores, GPU memory, CPU load)
  • Feature flags (only enabled adapters are considered)
  • Selection modes: Auto (best available), Manual (user-specified), Fallback (graceful degradation: NPU → GPU → CPU)
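A minimal sketch of this routing logic follows; the enum, helper name, and signature are illustrative, not the InferenceController's actual API.

```rust
// Format-based routing with graceful degradation (illustrative only).
#[derive(Debug, PartialEq)]
enum Backend {
    RknnNpu,     // .rknn vision models
    RkllmNpu,    // .rkllm LLMs
    SherpaCpu,   // .onnx audio models
    LlamaCppGpu, // GGUF on Vulkan/CUDA
    LlamaCppCpu, // GGUF fallback
}

fn select_backend(model_path: &str, npu_free: bool, gpu_free: bool) -> Backend {
    match model_path.rsplit('.').next() {
        Some("rknn") => Backend::RknnNpu,
        Some("rkllm") if npu_free => Backend::RkllmNpu,
        Some("onnx") => Backend::SherpaCpu,
        // GGUF models (and LLM requests while the NPU is busy) degrade
        // gracefully: GPU if available, otherwise CPU.
        _ if gpu_free => Backend::LlamaCppGpu,
        _ => Backend::LlamaCppCpu,
    }
}

fn main() {
    println!("{:?}", select_backend("qwen.rkllm", true, true));
    println!("{:?}", select_backend("qwen.gguf", false, false));
}
```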

Problem: Rockchip SoCs have unified but limited RAM (8–16 GB total). Loading LLM, Vision, and Audio models simultaneously can cause OOM crashes.

Solution: The ResourceManager acts as the memory traffic controller:

  • Monitors RAM/VRAM usage
  • Dynamically unloads/swaps models using strategies: LRU, Priority-Based, or Size-Based
  • Ensures only needed models are loaded at any time

Example scenario: User requests object detection (2GB Vision model loaded), then requests LLM query (needs 4GB). ResourceManager unloads Vision, loads LLM, executes. When Vision is needed again, it swaps back.
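The swap behavior in that scenario can be modeled as a toy LRU cache with a fixed RAM budget. All sizes, names, and the `require` method below are illustrative, not the crate's actual API.

```rust
use std::collections::VecDeque;

// Toy LRU model cache: least-recently-used models are evicted first.
struct ResourceManager {
    budget_mb: u32,
    loaded: VecDeque<(String, u32)>, // front = least recently used
}

impl ResourceManager {
    fn new(budget_mb: u32) -> Self {
        Self { budget_mb, loaded: VecDeque::new() }
    }

    fn used_mb(&self) -> u32 {
        self.loaded.iter().map(|(_, mb)| mb).sum()
    }

    /// Ensure `name` is resident, evicting LRU models until it fits.
    fn require(&mut self, name: &str, size_mb: u32) {
        if let Some(pos) = self.loaded.iter().position(|(n, _)| n == name) {
            // Already loaded: just mark it most recently used.
            let entry = self.loaded.remove(pos).unwrap();
            self.loaded.push_back(entry);
            return;
        }
        while self.used_mb() + size_mb > self.budget_mb {
            let (evicted, _) = self.loaded.pop_front().expect("model exceeds budget");
            println!("unloading {evicted}");
        }
        println!("loading {name}");
        self.loaded.push_back((name.to_string(), size_mb));
    }
}

fn main() {
    let mut rm = ResourceManager::new(5000);
    rm.require("vision", 2000); // loads vision
    rm.require("llm", 4000);    // evicts vision to make room, loads llm
    rm.require("vision", 2000); // evicts llm, swaps vision back in
}
```

A production version would also weigh priority and model size when choosing a victim, per the strategies listed above.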

The SharedBackendContext prevents redundant memory use. If multiple ports (e.g., ImageCaptioner and VisualQA) need the same model, it is loaded once and shared:

  • Memory Efficiency: Single model instance serves multiple ports
  • Performance: Shared KV cache for faster multi-turn LLM conversations
  • Consistency: All ports using the same model see identical behavior

Two supporting components round out the module:

  • InferenceConfig: Model paths, context window sizes, temperature, hardware preferences, fallback strategies
  • ModelProvider: Resolves model availability: local filesystem, remote download (HuggingFace), caching, integrity validation
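The single-instance sharing described above can be sketched as a small cache keyed by model path. `Rc` keeps the sketch single-threaded for brevity; a real implementation would use `Arc` behind a lock, and all names here are illustrative.

```rust
use std::collections::HashMap;
use std::rc::Rc;

struct Model {
    path: String,
}

#[derive(Default)]
struct SharedBackendContext {
    cache: HashMap<String, Rc<Model>>,
}

impl SharedBackendContext {
    /// Two ports requesting the same model path get the same instance.
    fn get_or_load(&mut self, path: &str) -> Rc<Model> {
        self.cache
            .entry(path.to_string())
            .or_insert_with(|| {
                println!("loading {path}"); // happens once per distinct path
                Rc::new(Model { path: path.to_string() })
            })
            .clone()
    }
}

fn main() {
    let mut ctx = SharedBackendContext::default();
    // e.g. ImageCaptioner and VisualQA both want the same multimodal model:
    let captioner = ctx.get_or_load("multimodal.rkllm");
    let vqa = ctx.get_or_load("multimodal.rkllm");
    assert!(Rc::ptr_eq(&captioner, &vqa)); // one instance, shared
    println!("loaded {} model(s): {}", ctx.cache.len(), captioner.path);
}
```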

Strict Separation: “Thinking” vs “Acting”


The LLM never executes tools directly. It generates structured JSON tool-calls that are routed through the ToolExecutionPort to external MCP servers.

Security: Direct execution of LLM output would expose the device to prompt injection and unpredictable behavior. The MCP Client enforces a clear security boundary and auditability.

| Feature | Adapter | Protocol |
| --- | --- | --- |
| infer_mcp_client | McpClientAdapter | MCP (JSON-RPC 2.0 via mcp-rs) |

Flow:

  1. LLM generates structured tool-call JSON
  2. InferenceController routes the call to the ToolExecutionPort (it does not execute the tool itself)
  3. McpClientAdapter resolves tool → MCP server, forwards via JSON-RPC
  4. External MCP server executes, returns result
  5. Result fed back to LLM for follow-up reasoning

MCP Server Examples: Home Assistant (smart home), SQLite Memory (long-term memory), Web Search, File System, Email.
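For illustration, the JSON-RPC 2.0 `tools/call` request forwarded in step 3 might look as follows. It is built with `format!` here to stay dependency-free; the real adapter would construct it via mcp-rs, and the tool name and arguments are hypothetical.

```rust
// Illustrative shape of an MCP `tools/call` request (JSON-RPC 2.0).
fn build_tool_call(id: u32, tool: &str, args_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"{tool}","arguments":{args_json}}}}}"#
    )
}

fn main() {
    // The LLM's structured tool-call becomes a request to, say, a
    // smart-home MCP server:
    let request = build_tool_call(1, "light.turn_on", r#"{"entity_id":"light.kitchen"}"#);
    println!("{request}");
}
```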

Benefits:

  • Boundless capabilities via standard MCP protocol without engine changes
  • Lightweight core: tool execution delegated to external systems
  • Security: tool calls gated by PermissionManager (HITL)
graph TD
    Core[Core SessionManager] --> IC[InferenceController]
    IC --> RM[ResourceManager<br/>memory check]
    IC --> MP[ModelProvider]
    MP --> SBC[SharedBackendContext]
    RM --> SBC

    SBC --> SA[SherpaAdapter]
    SBC --> RA[RknnAdapter]
    SBC --> RLA[RkllmAdapter]
    SBC --> LCA[LlamaCppAdapter]
    SBC --> MCA[McpClientAdapter]

    SA -.-> M1[Wake Word, STT, TTS<br/>CPU/ONNX]
    RA -.-> M2[Vision models<br/>NPU]
    RLA -.-> M3[LLMs<br/>NPU]
    LCA -.-> M4[LLMs<br/>CPU/GPU fallback]
    MCA -.-> M5[External MCP Servers<br/>tool execution]

Concrete crate usage per adapter. See Model Integration Guide for choosing the right format.

ONNX — ort crate (infer_sherpa / direct)

use ort::{Session, GraphOptimizationLevel};

let model = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("model.onnx")?;
let outputs = model.run(ort::inputs!["input" => input_tensor]?)?;

Recommended crates: ort (ONNX Runtime bindings), silero-vad-rs (VAD), ten-vad-rs (alternative VAD)

GGUF / LlamaCpp — llama-cpp-2 crate (infer_llamacpp_*)

use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::{params::LlamaModelParams, LlamaModel};
use llama_cpp_2::context::params::LlamaContextParams;

let backend = LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, "model.gguf", &LlamaModelParams::default())?;
let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;
// Generation then tokenizes the prompt, decodes batches via ctx.decode(), and samples from the logits.

Recommended crates: llama-cpp-2 (LLM + embeddings), whisper-rs (Whisper via GGUF on CPU)

RKNN — rknpu2 crate (infer_rknn)

use rknpu2::Model;

// Model must be pre-converted to .rknn format using the Rockchip RKNN Toolkit 2
let model = Model::load("model.rknn")?;
let output = model.inference(&input_tensor)?;

Recommended crates: rknpu2 (Rockchip NPU bindings for CNN/Vision models)

RKLLM — rkllm crate (infer_rkllm)

// Model must be pre-converted to .rkllm format using the Rockchip RKLLM Toolkit,
// then loaded via the RkllmAdapter (which wraps librkllmrt.so).

Recommended crates: rkllm (Rockchip NPU bindings for LLMs/Transformers via rk-llama.cpp)

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_model_inference() {
        // `load_model` / `create_test_input` are crate-local test helpers.
        let model = load_model("test.onnx").unwrap();
        let input = create_test_input();
        let output = model.inference(&input).unwrap();
        assert!(!output.is_empty());
    }
}

For benchmarking, use std::time::Instant around model.inference() calls, or cargo bench with the criterion crate.
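A minimal sketch of that timing approach, with a placeholder workload standing in for a real `model.inference(&input)` call:

```rust
use std::time::Instant;

// Placeholder inference workload for illustration only.
fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0).collect()
}

fn main() {
    let input = vec![0.5_f32; 1024];
    let _ = run_inference(&input); // warm-up so first-call overhead is excluded

    let iterations = 100u32;
    let start = Instant::now();
    for _ in 0..iterations {
        let _ = run_inference(&input);
    }
    // Average latency across iterations, per the Instant approach above.
    println!("avg latency: {:?}", start.elapsed() / iterations);
}
```

For statistically sound numbers (outlier detection, confidence intervals), prefer criterion under cargo bench.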