
Inference Module (inference)

The inference crate acts as the “Brain” (Driven Adapter) of the system, handling all machine learning tasks. It implements the InferenceInterface defined by the Core, providing a unified abstraction for heterogeneous AI backends while maintaining strict separation between “thinking” (inference) and “acting” (tool execution).

| Adapter | Implements Port(s) | Capability Feature | Technology / Supported Models |
| --- | --- | --- | --- |
| VisionModelAdapter | ObjectDetector, VisualQA, ImageCaptioner | infer_rknn | RKNN SDK → Rockchip NPU Vision/CNN models |
| LanguageModelAdapter | LLM | infer_rkllm | RKLLM SDK → Rockchip NPU Language models (Llama, Qwen) |
| | LLM, EmbeddingModel | infer_llamacpp_cpu | llama-cpp-2 → CPU-based fallback (ARM NEON) |
| | LLM, EmbeddingModel | infer_llamacpp_vulkan | llama-cpp-2 → GPU-accelerated (Vulkan/CUDA) |
| AudioMLAdapter | WakeWordDetector, SpeechToText, TextToSpeech, VoiceActivityDetector | infer_sherpa | Sherpa-ONNX → CPU-optimized Audio ML |
| ToolAdapter | ToolExecution | infer_mcp_client | mcp-rs → External action routing via JSON-RPC |
| MockAdapter | All Ports (Mock) | infer_mock | Simulated ML backend for CI/testing |

As a driven adapter, the inference crate sits on the right side of the Hexagon: it is invoked only by core to perform heavy ML tasks (STT, TTS, LLM generation, wake-word detection). Notably, inference also implements an MCP Client adapter, meaning it can reach further out to external MCP Servers when the LLM decides to use a tool.

crates/inference/
├── src/
│   ├── domain/           # InferenceController, ResourceManager
│   │   └── ports.rs      # Traits: `LLM`, `ObjectDetector`, `WakeWordDetector`, etc.
│   ├── adapters/         # Concrete ML backend implementations
│   │   ├── rknn.rs       # Rockchip NPU Vision/CNN models
│   │   ├── rkllm.rs      # Rockchip NPU LLMs
│   │   ├── llamacpp.rs   # CPU/GPU fallback for LLMs
│   │   ├── sherpa.rs     # CPU-optimized Audio ML (Wake word, STT)
│   │   └── mcp_client.rs # External Tool Execution via MCP
│   └── lib.rs            # Implements Core's `InferenceInterface`
└── Cargo.toml

The central facade implementing InferenceInterface:

pub struct InferenceController {
    resource_manager: ResourceManager,
    adapters: InferenceAdapterRegistry,
}

impl InferenceInterface for InferenceController {
    fn execute_llm(&self, prompt: &str, context: &InferenceContext) -> Result<LLMResponse> { /* ... */ }
    fn detect_wake_word(&self, audio: &[f32]) -> Result<WakeWordResult> { /* ... */ }
    fn transcribe(&self, audio: &[f32]) -> Result<String> { /* ... */ }
    // ... other inference methods
}

The domain defines capability traits; adapters implement one or more:

| Category | Ports |
| --- | --- |
| Audio | WakeWordDetector, VoiceActivityDetector, SpeakerDiarizer, SpeechToText, TextToSpeech |
| Vision | ObjectDetector, ImageCaptioner (multimodal), VisualQA, SemanticChangeDetector |
| Language | LLM (text generation), EmbeddingModel (RAG), FunctionCaller (tool use / grammars) |
| Actions | ToolExecution (MCP client for external tool calls) |
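As an illustration, two of the audio ports and a multi-port adapter might look like this. This is a minimal sketch only: the actual trait signatures live in `domain/ports.rs` and may differ, and the error type here is simplified to `String`.

```rust
// Hypothetical sketch of two capability ports; real signatures may differ.
pub trait SpeechToText {
    /// Transcribe mono PCM samples into text.
    fn transcribe(&self, audio: &[f32]) -> Result<String, String>;
}

pub trait VoiceActivityDetector {
    /// Return true if the samples appear to contain speech.
    fn has_speech(&self, audio: &[f32]) -> bool;
}

// One adapter may implement several ports (as the Sherpa adapter does):
pub struct DummyAudioAdapter;

impl SpeechToText for DummyAudioAdapter {
    fn transcribe(&self, audio: &[f32]) -> Result<String, String> {
        if audio.is_empty() {
            return Err("empty audio buffer".into());
        }
        Ok(format!("<transcript of {} samples>", audio.len()))
    }
}

impl VoiceActivityDetector for DummyAudioAdapter {
    fn has_speech(&self, audio: &[f32]) -> bool {
        audio.iter().any(|s| s.abs() > 0.01) // naive energy check
    }
}

fn main() {
    let adapter = DummyAudioAdapter;
    let audio = vec![0.2_f32; 160];
    if adapter.has_speech(&audio) {
        println!("{}", adapter.transcribe(&audio).unwrap());
    }
}
```

Because the Core only sees the traits, any adapter that implements them can be swapped in behind a feature flag.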

We strictly split Rockchip NPU acceleration into two adapters:

| Feature | Adapter | Purpose | Models |
| --- | --- | --- | --- |
| infer_rknn | RknnAdapter | Vision and CNNs | Object Detection, VisualQA, Image Classification |
| infer_rkllm | RkllmAdapter | Large Language Models | Llama, Mistral, Qwen (via rk-llama.cpp fork) |

Why separate? The two SDKs expose different C APIs and runtimes, have different load patterns and resource profiles, and keeping them behind separate features prevents monolithic builds.
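In build terms, the split can be sketched as each runtime living behind its own Cargo feature, so a build that needs only vision never compiles or links the LLM runtime. The feature names come from the tables above; the module contents are placeholders.

```rust
// Build-configuration sketch: each NPU runtime sits behind its own Cargo
// feature, so enabling one never pulls in the other.
#[cfg(feature = "infer_rknn")]
pub mod rknn {
    // Would link against librknnrt (Vision/CNN models only).
    pub const ADAPTER: &str = "RknnAdapter";
}

#[cfg(feature = "infer_rkllm")]
pub mod rkllm {
    // Would link against librkllmrt (LLMs only).
    pub const ADAPTER: &str = "RkllmAdapter";
}

/// Lists whichever NPU adapters this build was compiled with.
pub fn enabled_adapters() -> Vec<&'static str> {
    let mut adapters = Vec::new();
    #[cfg(feature = "infer_rknn")]
    adapters.push(rknn::ADAPTER);
    #[cfg(feature = "infer_rkllm")]
    adapters.push(rkllm::ADAPTER);
    adapters
}

fn main() {
    println!("NPU adapters in this build: {:?}", enabled_adapters());
}
```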

| Feature | Adapter | Purpose |
| --- | --- | --- |
| infer_sherpa | SherpaAdapter | Wake Word, STT, TTS, VAD (ONNX-based, CPU-optimized) |

Architectural decision: CPU vs. NPU for Audio:

  • Battery: Wake Word runs 24/7; keeping NPU powered on drains battery. Lightweight ONNX on “LITTLE” CPU cores saves power.
  • Compatibility: Audio models are difficult to quantize for RKNN without accuracy loss. CPU-based ONNX guarantees high accuracy.
  • Resource reservation: Running audio on the CPU reserves 100% of the NPU for LLMs and Vision.

| Features | Adapter | Ports | Purpose |
| --- | --- | --- | --- |
| infer_llamacpp_cpu, infer_llamacpp_vulkan | LlamaCppAdapter | LLM, EmbeddingModel | Host development + fallback |

Essential for development on host/desktop where no NPU exists. Supports GGUF models with grammar/function-calling.

The InferenceController selects adapters based on:

  • Model format (RKNN → NPU, GGUF → LlamaCpp, ONNX → Sherpa)
  • Hardware availability (NPU cores, GPU memory, CPU load)
  • Feature flags (only enabled adapters are considered)
  • Selection modes: Auto (best available), Manual (user-specified), Fallback (graceful degradation: NPU → GPU → CPU)
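A minimal sketch of this routing logic follows; the enum, helper name, and signature are illustrative, not the InferenceController's actual API.

```rust
// Format-based routing with graceful degradation (illustrative only).
#[derive(Debug, PartialEq)]
enum Backend {
    RknnNpu,     // .rknn vision models
    RkllmNpu,    // .rkllm LLMs
    SherpaCpu,   // .onnx audio models
    LlamaCppGpu, // GGUF on Vulkan/CUDA
    LlamaCppCpu, // GGUF fallback
}

fn select_backend(model_path: &str, npu_free: bool, gpu_free: bool) -> Backend {
    match model_path.rsplit('.').next() {
        Some("rknn") => Backend::RknnNpu,
        Some("rkllm") if npu_free => Backend::RkllmNpu,
        Some("onnx") => Backend::SherpaCpu,
        // GGUF models (and LLM requests while the NPU is busy) degrade
        // gracefully: GPU if available, otherwise CPU.
        _ if gpu_free => Backend::LlamaCppGpu,
        _ => Backend::LlamaCppCpu,
    }
}

fn main() {
    println!("{:?}", select_backend("qwen.rkllm", true, true));
    println!("{:?}", select_backend("qwen.gguf", false, false));
}
```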

Problem: Rockchip SoCs have unified but limited RAM (8–16 GB total). Loading LLM, Vision, and Audio models simultaneously can cause OOM crashes.

Solution: The ResourceManager acts as the memory traffic controller:

  • Monitors RAM/VRAM usage
  • Dynamically unloads/swaps models using strategies: LRU, Priority-Based, or Size-Based
  • Ensures only needed models are loaded at any time

Example scenario: User requests object detection (2GB Vision model loaded), then requests LLM query (needs 4GB). ResourceManager unloads Vision, loads LLM, executes. When Vision is needed again, it swaps back.
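The swap behavior in that scenario can be modeled as a toy LRU cache with a fixed RAM budget. All sizes, names, and the `require` method below are illustrative, not the crate's actual API.

```rust
use std::collections::VecDeque;

// Toy LRU model cache: least-recently-used models are evicted first.
struct ResourceManager {
    budget_mb: u32,
    loaded: VecDeque<(String, u32)>, // front = least recently used
}

impl ResourceManager {
    fn new(budget_mb: u32) -> Self {
        Self { budget_mb, loaded: VecDeque::new() }
    }

    fn used_mb(&self) -> u32 {
        self.loaded.iter().map(|(_, mb)| mb).sum()
    }

    /// Ensure `name` is resident, evicting LRU models until it fits.
    fn require(&mut self, name: &str, size_mb: u32) {
        if let Some(pos) = self.loaded.iter().position(|(n, _)| n == name) {
            // Already loaded: just mark it most recently used.
            let entry = self.loaded.remove(pos).unwrap();
            self.loaded.push_back(entry);
            return;
        }
        while self.used_mb() + size_mb > self.budget_mb {
            let (evicted, _) = self.loaded.pop_front().expect("model exceeds budget");
            println!("unloading {evicted}");
        }
        println!("loading {name}");
        self.loaded.push_back((name.to_string(), size_mb));
    }
}

fn main() {
    let mut rm = ResourceManager::new(5000);
    rm.require("vision", 2000); // loads vision
    rm.require("llm", 4000);    // evicts vision to make room, loads llm
    rm.require("vision", 2000); // evicts llm, swaps vision back in
}
```

A production version would also weigh priority and model size when choosing a victim, per the strategies listed above.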

The SharedBackendContext prevents redundant memory use. If multiple ports (e.g., ImageCaptioner and VisualQA) need the same model, it is loaded once and shared:

  • Memory Efficiency: Single model instance serves multiple ports
  • Performance: Shared KV cache for faster multi-turn LLM conversations
  • Consistency: All ports using the same model see identical behavior

Two supporting components round out the module:

  • InferenceConfig: Model paths, context window sizes, temperature, hardware preferences, fallback strategies
  • ModelProvider: Resolves model availability: local filesystem, remote download (HuggingFace), caching, integrity validation
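The single-instance sharing described above can be sketched as a small cache keyed by model path. `Rc` keeps the sketch single-threaded for brevity; a real implementation would use `Arc` behind a lock, and all names here are illustrative.

```rust
use std::collections::HashMap;
use std::rc::Rc;

struct Model {
    path: String,
}

#[derive(Default)]
struct SharedBackendContext {
    cache: HashMap<String, Rc<Model>>,
}

impl SharedBackendContext {
    /// Two ports requesting the same model path get the same instance.
    fn get_or_load(&mut self, path: &str) -> Rc<Model> {
        self.cache
            .entry(path.to_string())
            .or_insert_with(|| {
                println!("loading {path}"); // happens once per distinct path
                Rc::new(Model { path: path.to_string() })
            })
            .clone()
    }
}

fn main() {
    let mut ctx = SharedBackendContext::default();
    // e.g. ImageCaptioner and VisualQA both want the same multimodal model:
    let captioner = ctx.get_or_load("multimodal.rkllm");
    let vqa = ctx.get_or_load("multimodal.rkllm");
    assert!(Rc::ptr_eq(&captioner, &vqa)); // one instance, shared
    println!("loaded {} model(s): {}", ctx.cache.len(), captioner.path);
}
```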

Strict Separation: “Thinking” vs “Acting”


The LLM never executes tools directly. It generates structured JSON tool-calls that are routed through the ToolExecutionPort to external MCP servers.

Security: Direct execution of LLM output would expose the device to prompt injection and unpredictable behavior. The MCP Client enforces a clear security boundary and auditability.

| Feature | Adapter | Protocol |
| --- | --- | --- |
| infer_mcp_client | McpClientAdapter | MCP (JSON-RPC 2.0 via mcp-rs) |

Flow:

  1. LLM generates structured tool-call JSON
  2. InferenceController routes the call to the ToolExecutionPort (it does not execute the tool itself)
  3. McpClientAdapter resolves tool → MCP server, forwards via JSON-RPC
  4. External MCP server executes, returns result
  5. Result fed back to LLM for follow-up reasoning

MCP Server Examples: Home Assistant (smart home), SQLite Memory (long-term memory), Web Search, File System, Email.
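For illustration, the JSON-RPC 2.0 `tools/call` request forwarded in step 3 might look as follows. It is built with `format!` here to stay dependency-free; the real adapter would construct it via mcp-rs, and the tool name and arguments are hypothetical.

```rust
// Illustrative shape of an MCP `tools/call` request (JSON-RPC 2.0).
fn build_tool_call(id: u32, tool: &str, args_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"{tool}","arguments":{args_json}}}}}"#
    )
}

fn main() {
    // The LLM's structured tool-call becomes a request to, say, a
    // smart-home MCP server:
    let request = build_tool_call(1, "light.turn_on", r#"{"entity_id":"light.kitchen"}"#);
    println!("{request}");
}
```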

Benefits:

  • Boundless capabilities via standard MCP protocol without engine changes
  • Lightweight core: tool execution delegated to external systems
  • Security: tool calls gated by PermissionManager (HITL)
graph TD
    Core[Core SessionManager] --> IC[InferenceController]
    IC --> RM[ResourceManager<br/>memory check]
    IC --> MP[ModelProvider]
    MP --> SBC[SharedBackendContext]
    RM --> SBC

    SBC --> SA[SherpaAdapter]
    SBC --> RA[RknnAdapter]
    SBC --> RLA[RkllmAdapter]
    SBC --> LCA[LlamaCppAdapter]
    SBC --> MCA[McpClientAdapter]

    SA -.-> M1[Wake Word, STT, TTS<br/>CPU/ONNX]
    RA -.-> M2[Vision models<br/>NPU]
    RLA -.-> M3[LLMs<br/>NPU]
    LCA -.-> M4[LLMs<br/>CPU/GPU fallback]
    MCA -.-> M5[External MCP Servers<br/>tool execution]

Concrete crate usage per adapter. See Model Integration Guide for choosing the right format.

ONNX — ort crate (infer_sherpa / direct)

use ort::{Session, GraphOptimizationLevel};

let model = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .commit_from_file("model.onnx")?;
let outputs = model.run(ort::inputs!["input" => input_tensor]?)?;

Recommended crates: ort (ONNX Runtime bindings), silero-vad-rs (VAD), ten-vad-rs (alternative VAD)

GGUF / LlamaCpp — llama-cpp-2 crate (infer_llamacpp_*)

use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::{params::LlamaModelParams, LlamaModel};
use llama_cpp_2::context::params::LlamaContextParams;

let backend = LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, "model.gguf", &LlamaModelParams::default())?;
let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;
// Generation then tokenizes the prompt, decodes batches via ctx.decode(), and samples from the logits.

Recommended crates: llama-cpp-2 (LLM + embeddings), whisper-rs (Whisper via GGUF on CPU)

RKNN — rknpu2 crate (infer_rknn)

use rknpu2::Model;

// Model must be pre-converted to .rknn format using the Rockchip RKNN Toolkit 2
let model = Model::load("model.rknn")?;
let output = model.inference(&input_tensor)?;

Recommended crates: rknpu2 (Rockchip NPU bindings for CNN/Vision models)

RKLLM — rkllm crate (infer_rkllm)

// Model must be pre-converted to .rkllm format using the Rockchip RKLLM Toolkit,
// then loaded via the RkllmAdapter (which wraps librkllmrt.so).

Recommended crates: rkllm (Rockchip NPU bindings for LLMs/Transformers via rk-llama.cpp)

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_model_inference() {
        // `load_model` / `create_test_input` are crate-local test helpers.
        let model = load_model("test.onnx").unwrap();
        let input = create_test_input();
        let output = model.inference(&input).unwrap();
        assert!(!output.is_empty());
    }
}

For benchmarking, use std::time::Instant around model.inference() calls, or cargo bench with the criterion crate.
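A minimal sketch of that timing approach, with a placeholder workload standing in for a real `model.inference(&input)` call:

```rust
use std::time::Instant;

// Placeholder inference workload for illustration only.
fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0).collect()
}

fn main() {
    let input = vec![0.5_f32; 1024];
    let _ = run_inference(&input); // warm-up so first-call overhead is excluded

    let iterations = 100u32;
    let start = Instant::now();
    for _ in 0..iterations {
        let _ = run_inference(&input);
    }
    // Average latency across iterations, per the Instant approach above.
    println!("avg latency: {:?}", start.elapsed() / iterations);
}
```

For statistically sound numbers (outlier detection, confidence intervals), prefer criterion under cargo bench.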