Inference Module (inference)
Overview
The inference crate acts as the “Brain” (Driven Adapter) of the system, handling all machine learning tasks. It implements the InferenceInterface defined by the Core, providing a unified abstraction for heterogeneous AI backends while maintaining strict separation between “thinking” (inference) and “acting” (tool execution).
Ports & Adapters (Feature Flags)
| Adapter | Implements Port(s) | Capability Feature | Technology / Supported Models |
|---|---|---|---|
| VisionModelAdapter | ObjectDetector, VisualQA, ImageCaptioner | infer_rknn | RKNN SDK → Rockchip NPU Vision/CNN models |
| LanguageModelAdapter | LLM | infer_rkllm | RKLLM SDK → Rockchip NPU Language models (Llama, Qwen) |
| LlamaCppAdapter | LLM, EmbeddingModel | infer_llamacpp_cpu | llama-cpp-2 → CPU-based fallback (ARM NEON) |
| LlamaCppAdapter | LLM, EmbeddingModel | infer_llamacpp_vulkan | llama-cpp-2 → GPU-accelerated (Vulkan/CUDA) |
| AudioMLAdapter | WakeWordDetector, SpeechToText, TextToSpeech, VoiceActivityDetector | infer_sherpa | Sherpa-ONNX → CPU-optimized Audio ML |
| ToolAdapter | ToolExecution | infer_mcp_client | mcp-rs → External action routing via JSON-RPC |
| MockAdapter | All Ports (Mock) | infer_mock | Simulated ML backend for CI/testing |
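As a sketch, the capability features above might be wired up in the crate's Cargo.toml roughly like this. The dependency names are inferred from the table and the real manifest may differ:

```toml
[features]
default = ["infer_mock"]

# Each capability feature pulls in only its backend dependency,
# so disabled adapters are compiled out entirely.
infer_rknn = ["dep:rknpu2"]
infer_rkllm = ["dep:rkllm"]
infer_llamacpp_cpu = ["dep:llama-cpp-2"]
infer_llamacpp_vulkan = ["dep:llama-cpp-2", "llama-cpp-2?/vulkan"]
infer_sherpa = ["dep:sherpa-rs"]   # hypothetical name for the Sherpa-ONNX binding
infer_mcp_client = ["dep:mcp-rs"]
infer_mock = []                    # no heavy dependencies
```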
Architecture Context / Relationships
The inference crate is the “Brain” of the system. It is a driven adapter that sits on the right side of the Hexagon. It is invoked purely by core to perform heavy ML tasks (STT, TTS, LLM generation, Wake Word detection). Interestingly, inference also implements an MCP Client adapter, meaning it can reach further out to external MCP Servers when the LLM decides to use a tool.
Crate Structure
```
crates/inference/
├── src/
│   ├── domain/           # InferenceController, ResourceManager
│   │   └── ports.rs      # Traits: `LLM`, `ObjectDetector`, `WakeWordDetector`, etc.
│   ├── adapters/         # Concrete ML backend implementations
│   │   ├── rknn.rs       # Rockchip NPU Vision/CNN models
│   │   ├── rkllm.rs      # Rockchip NPU LLMs
│   │   ├── llamacpp.rs   # CPU/GPU fallback for LLMs
│   │   ├── sherpa.rs     # CPU-optimized Audio ML (Wake word, STT)
│   │   └── mcp_client.rs # External Tool Execution via MCP
│   └── lib.rs            # Implements Core's `InferenceInterface`
└── Cargo.toml
```

The Inference Domain
Section titled “The Inference Domain”InferenceController
The central facade implementing InferenceInterface:

```rust
pub struct InferenceController {
    resource_manager: ResourceManager,
    adapters: InferenceAdapterRegistry,
}

impl InferenceInterface for InferenceController {
    fn execute_llm(&self, prompt: &str, context: &InferenceContext) -> Result<LLMResponse>;
    fn detect_wake_word(&self, audio: &[f32]) -> Result<WakeWordResult>;
    fn transcribe(&self, audio: &[f32]) -> Result<String>;
    // ... other inference methods
}
```

Inference Ports (Capability Traits)
The domain defines capability traits; adapters implement one or more:
| Category | Ports |
|---|---|
| Audio | WakeWordDetector, VoiceActivityDetector, SpeakerDiarizer, SpeechToText, TextToSpeech |
| Vision | ObjectDetector, ImageCaptioner (multimodal), VisualQA, SemanticChangeDetector |
| Language | LLM (text generation), EmbeddingModel (RAG), FunctionCaller (tool use / grammars) |
| Actions | ToolExecution (MCP client for external tool calls) |
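A minimal sketch of the capability-trait pattern: each port is a small trait, and one adapter may implement several. The trait and method signatures below are illustrative assumptions, not the crate's actual definitions.

```rust
// Illustrative port traits; names mirror the table above,
// but the signatures are assumptions for this sketch.
pub trait WakeWordDetector {
    fn detect(&self, audio: &[f32]) -> bool;
}

pub trait SpeechToText {
    fn transcribe(&self, audio: &[f32]) -> String;
}

// A single adapter can implement several ports:
pub struct SherpaAdapter;

impl WakeWordDetector for SherpaAdapter {
    fn detect(&self, _audio: &[f32]) -> bool {
        false // placeholder: a real adapter would run the ONNX model here
    }
}

impl SpeechToText for SherpaAdapter {
    fn transcribe(&self, _audio: &[f32]) -> String {
        String::new() // placeholder
    }
}
```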
Hardware Acceleration & Adapter Split
Rockchip NPU (Separated)
We strictly split Rockchip NPU acceleration into two adapters:
| Feature | Adapter | Purpose | Models |
|---|---|---|---|
| infer_rknn | RknnAdapter | Vision and CNNs | Object Detection, VisualQA, Image Classification |
| infer_rkllm | RkllmAdapter | Large Language Models | Llama, Mistral, Qwen (via rk-llama.cpp fork) |
Why separate: the two SDKs expose different C APIs and runtimes, with different load patterns and resource profiles, and separate feature flags prevent monolithic builds.
Audio ML (Sherpa-ONNX on CPU)
| Feature | Adapter | Purpose |
|---|---|---|
| infer_sherpa | SherpaAdapter | Wake Word, STT, TTS, VAD (ONNX-based, CPU-optimized) |
Architectural decision: CPU vs. NPU for Audio:
- Battery: Wake Word runs 24/7; keeping NPU powered on drains battery. Lightweight ONNX on “LITTLE” CPU cores saves power.
- Compatibility: Audio models are difficult to quantize for RKNN without accuracy loss. CPU-based ONNX guarantees high accuracy.
- Resource reservation: Running audio on the CPU keeps the NPU fully available for LLMs and Vision.
Fallback: LlamaCppAdapter
| Features | Adapter | Ports | Purpose |
|---|---|---|---|
| infer_llamacpp_cpu, infer_llamacpp_vulkan | LlamaCppAdapter | LLM, EmbeddingModel | Host development + fallback |
Essential for development on host/desktop where no NPU exists. Supports GGUF models with grammar/function-calling.
Adapter Selection
The InferenceController selects adapters based on:
- Model format (RKNN → NPU, GGUF → LlamaCpp, ONNX → Sherpa)
- Hardware availability (NPU cores, GPU memory, CPU load)
- Feature flags (only enabled adapters are considered)
- Selection modes: Auto (best available), Manual (user-specified), Fallback (graceful degradation: NPU → GPU → CPU)
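The selection rules above can be sketched as a match over the model format plus a fallback chain. Adapter names follow the earlier tables, but the logic is an illustrative assumption, not the real implementation:

```rust
// Illustrative adapter selection sketch (not the crate's actual code).
#[derive(Debug, PartialEq)]
pub enum Adapter {
    Rknn,     // NPU vision
    Rkllm,    // NPU LLM
    Sherpa,   // CPU audio
    LlamaCpp, // CPU/GPU fallback
}

/// Map a model file extension to the preferred adapter.
pub fn select_by_format(path: &str) -> Option<Adapter> {
    match path.rsplit('.').next()? {
        "rknn" => Some(Adapter::Rknn),
        "rkllm" => Some(Adapter::Rkllm),
        "onnx" => Some(Adapter::Sherpa),
        "gguf" => Some(Adapter::LlamaCpp),
        _ => None,
    }
}

/// Fallback mode for LLMs: walk NPU → GPU → CPU until a backend is available.
pub fn select_with_fallback(npu_ok: bool, gpu_ok: bool) -> Adapter {
    if npu_ok {
        Adapter::Rkllm
    } else if gpu_ok {
        Adapter::LlamaCpp // infer_llamacpp_vulkan
    } else {
        Adapter::LlamaCpp // infer_llamacpp_cpu
    }
}
```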
Resource Management (Edge-AI Critical)
InferenceResourceManager
Problem: The Rockchip SoC has unified but limited RAM (8–16 GB total). Loading LLM + Vision + Audio models simultaneously can cause OOM crashes.
Solution: Acts as the memory traffic controller:
- Monitors RAM/VRAM usage
- Dynamically unloads/swaps models using strategies: LRU, Priority-Based, or Size-Based
- Ensures only needed models are loaded at any time
Example scenario: User requests object detection (2GB Vision model loaded), then requests LLM query (needs 4GB). ResourceManager unloads Vision, loads LLM, executes. When Vision is needed again, it swaps back.
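The swap scenario can be sketched as an LRU eviction loop. The memory budget and bookkeeping below are illustrative assumptions, not the actual ResourceManager:

```rust
use std::collections::VecDeque;

// Hypothetical "unload to fit" sketch with LRU eviction.
pub struct ResourceManager {
    budget_mb: u64,
    loaded: VecDeque<(String, u64)>, // front = least recently used
}

impl ResourceManager {
    pub fn new(budget_mb: u64) -> Self {
        Self { budget_mb, loaded: VecDeque::new() }
    }

    fn used(&self) -> u64 {
        self.loaded.iter().map(|(_, mb)| *mb).sum()
    }

    /// Evict least-recently-used models until `size_mb` fits, then load.
    /// Returns the names of the models that were unloaded.
    pub fn load(&mut self, name: &str, size_mb: u64) -> Vec<String> {
        let mut evicted = Vec::new();
        while self.used() + size_mb > self.budget_mb {
            match self.loaded.pop_front() {
                Some((old, _)) => evicted.push(old),
                None => break, // model larger than the whole budget
            }
        }
        self.loaded.push_back((name.to_string(), size_mb));
        evicted
    }
}
```

With a 5 GB budget, loading a 2 GB vision model and then a 4 GB LLM evicts the vision model first, matching the scenario above.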
SharedBackendContext
Prevents redundant memory. If multiple ports (e.g., ImageCaptioner and VisualQA) need the same model, it’s loaded once and shared:
- Memory Efficiency: Single model instance serves multiple ports
- Performance: Shared KV cache for faster multi-turn LLM conversations
- Consistency: All ports using the same model see identical behavior
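A minimal sketch of the load-once/share-many pattern using Arc-cached handles. The Model type here is a stand-in for a real backend handle, not an actual type from the crate:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for a loaded backend model handle.
pub struct Model {
    pub path: String,
}

#[derive(Default)]
pub struct SharedBackendContext {
    cache: HashMap<String, Arc<Model>>,
}

impl SharedBackendContext {
    /// Load the model once; subsequent callers get a clone of the same Arc,
    /// so two ports (e.g. ImageCaptioner and VisualQA) share one instance.
    pub fn get_or_load(&mut self, path: &str) -> Arc<Model> {
        self.cache
            .entry(path.to_string())
            .or_insert_with(|| Arc::new(Model { path: path.to_string() }))
            .clone()
    }
}
```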
InferenceConfig & ModelProvider
- InferenceConfig: Model paths, context window sizes, temperature, hardware preferences, fallback strategies
- ModelProvider: Resolves model availability: local filesystem, remote download (HuggingFace), caching, integrity validation
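A hedged sketch of that resolution order (local cache first, otherwise flag a download). Directory layout and names are assumptions for illustration:

```rust
use std::path::{Path, PathBuf};

// Hypothetical resolution result: either a cached local file,
// or a remote identifier (e.g. a HuggingFace repo id) to fetch.
pub enum Resolved {
    Local(PathBuf),
    NeedsDownload(String),
}

/// Check the local cache first; fall back to reporting a required download.
pub fn resolve(cache_dir: &Path, file: &str, remote_id: &str) -> Resolved {
    let candidate = cache_dir.join(file);
    if candidate.exists() {
        Resolved::Local(candidate)
    } else {
        Resolved::NeedsDownload(remote_id.to_string())
    }
}
```

A real ModelProvider would additionally verify integrity (e.g. a checksum) before returning a local hit.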
MCP Client and Tool Execution
Strict Separation: “Thinking” vs “Acting”
The LLM never executes tools directly. It generates structured JSON tool-calls that are routed through the ToolExecutionPort to external MCP servers.
Security: Direct execution of LLM output would expose the device to prompt injection and unpredictable behavior. The MCP Client enforces a clear security boundary and auditability.
McpClientAdapter
| Feature | Adapter | Protocol |
|---|---|---|
| infer_mcp_client | McpClientAdapter | MCP (JSON-RPC 2.0 via mcp-rs) |
Flow:
- LLM generates structured tool-call JSON
- InferenceManager routes the call to ToolExecutionPort (does not execute)
- McpClientAdapter resolves tool → MCP server, forwards via JSON-RPC
- External MCP server executes, returns result
- Result fed back to LLM for follow-up reasoning
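The boundary in this flow can be sketched with a ToolExecution trait that the router forwards into but never bypasses. The types and return values below are illustrative assumptions, not the crate's API:

```rust
// A tool call as produced by the LLM: a name plus an opaque JSON payload.
pub struct ToolCall {
    pub tool: String,
    pub arguments: String, // JSON payload, passed through verbatim
}

// The "acting" side lives behind this port; the LLM never calls it directly.
pub trait ToolExecution {
    fn execute(&self, call: &ToolCall) -> String;
}

/// Stand-in for the MCP client adapter.
pub struct McpClientAdapter;

impl ToolExecution for McpClientAdapter {
    fn execute(&self, call: &ToolCall) -> String {
        // A real adapter would send a JSON-RPC 2.0 request to the MCP server.
        format!("forwarded {} to MCP server", call.tool)
    }
}

/// The router only hands the call to the port; it never interprets
/// or runs the LLM's output itself.
pub fn route(call: &ToolCall, port: &dyn ToolExecution) -> String {
    port.execute(call)
}
```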
MCP Server Examples: Home Assistant (smart home), SQLite Memory (long-term memory), Web Search, File System, Email.
Benefits:
- Boundless capabilities via standard MCP protocol without engine changes
- Lightweight core: tool execution delegated to external systems
- Security: tool calls gated by PermissionManager (HITL)
Data Flow
```mermaid
graph TD
    Core[Core SessionManager] --> IC[InferenceController]
    IC --> RM[ResourceManager<br/>memory check]
    IC --> MP[ModelProvider]
    MP --> SBC[SharedBackendContext]
    RM --> SBC
    SBC --> SA[SherpaAdapter]
    SBC --> RA[RknnAdapter]
    SBC --> RLA[RkllmAdapter]
    SBC --> LCA[LlamaCppAdapter]
    SBC --> MCA[McpClientAdapter]
    SA -.-> M1[Wake Word, STT, TTS<br/>CPU/ONNX]
    RA -.-> M2[Vision models<br/>NPU]
    RLA -.-> M3[LLMs<br/>NPU]
    LCA -.-> M4[LLMs<br/>CPU/GPU fallback]
    MCA -.-> M5[External MCP Servers<br/>tool execution]
```
Crates & Integration Reference
Concrete crate usage per adapter. See Model Integration Guide for choosing the right format.
ONNX — ort crate (infer_sherpa / direct)
```rust
use ort::{GraphOptimizationLevel, Session};

let model = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("model.onnx")?;
let outputs = model.run(ort::inputs!["input" => input_tensor]?)?;
```

Recommended crates: ort (ONNX Runtime bindings), silero-vad-rs (VAD), ten-vad-rs (alternative VAD)
GGUF / LlamaCpp — llama-cpp-2 crate (infer_llamacpp_*)
```rust
use llama_cpp_2::{context::LlamaContext, model::LlamaModel};

let model = LlamaModel::load_from_file("model.gguf", LlamaParams::default())?;
let mut ctx = model.new_context(&builder)?;
let output = ctx.predict(prompt, &params)?;
```

Recommended crates: llama-cpp-2 (LLM + embeddings), whisper-rs (Whisper via GGUF on CPU)
RKNN — rknpu2 crate (infer_rknn)
```rust
use rknpu2::{Model, Tensor};

// Model must be pre-converted to .rknn format using Rockchip RKNN Toolkit 2
let model = Model::load("model.rknn")?;
let output = model.inference(&input_tensor)?;
```

Recommended crates: rknpu2 (Rockchip NPU bindings for CNN/Vision models)
RKLLM — rkllm crate (infer_rkllm)
```rust
// Model must be pre-converted to .rkllm format using the Rockchip RKLLM Toolkit,
// then loaded via the RkllmAdapter (wraps librkllmrt.so)
```

Recommended crates: rkllm (Rockchip NPU bindings for LLMs/Transformers via rk-llama.cpp)
Testing Adapters
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_model_inference() {
        let model = load_model("test.onnx").unwrap();
        let input = create_test_input();
        let output = model.inference(&input).unwrap();
        assert!(!output.is_empty());
    }
}
```

For benchmarking, use std::time::Instant around model.inference() calls, or cargo bench with the criterion crate.
Related Documentation
- Model Integration Guide: Choosing the right model format and conversion path
- ADR-004: Engine Architecture: High-level architecture decisions
- Core Module: Orchestrator and InferenceInterface
- Vision Module: Consumes vision frames for object detection, captioning
- Audio Module: Provides PCM for Wake Word, STT, TTS
- API Module: MCP Server role (engine as tool provider)
- Workspace and Build: Feature flags and build configuration
- Security Architecture: PermissionManager, HITL, tool execution gating