
Model Integration Guide

This guide helps you choose the right model format and integration path for your AI models in paiOS. For implementation details (crates, code examples), see the Inference Module.

```mermaid
flowchart TD
  Start[What type of model?] --> Audio{Audio Model?}
  Start --> LLM{Large Language Model?}
  Start --> Vision{Vision Model?}
  Start --> Other{Other?}

  Audio --> AudioSize{Less than 100 MB?}
  AudioSize -->|Yes| ONNX[Use ONNX]
  AudioSize -->|No| Whisper{Whisper?}
  Whisper -->|Yes - CPU| WhisperGGUF[Use GGUF via whisper-rs]
  Whisper -->|Yes - NPU| WhisperRKNN[Convert to RKNN]
  Whisper -->|No| ONNX

  LLM --> LLMTarget{Target NPU?}
  LLMTarget -->|Yes| RKLLM[Use RKLLM]
  LLMTarget -->|No| LLMFormat{Have GGUF?}
  LLMFormat -->|Yes| GGUF[Use GGUF]
  LLMFormat -->|No| ConvertGGUF[Convert to GGUF]

  Vision --> VisionTarget{Target NPU?}
  VisionTarget -->|Yes| VisionRKNN[Convert to RKNN]
  VisionTarget -->|No| VisionSize{Less than 500 MB?}
  VisionSize -->|Yes| ONNX
  VisionSize -->|No| Custom[Custom Adapter]

  Other --> OtherONNX[Try ONNX first]
```
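The audio branch of the flowchart can also be read as a small decision function. The sketch below is illustrative only; the function name and the returned backend labels are not part of paiOS.

```python
ONNX_AUDIO_LIMIT_MB = 100  # size threshold used in the flowchart

def pick_audio_backend(model_size_mb: float,
                       is_whisper: bool = False,
                       target_npu: bool = False) -> str:
    """Mirror the audio branch of the decision flowchart (illustrative only)."""
    if model_size_mb < ONNX_AUDIO_LIMIT_MB:
        return "onnx"
    if is_whisper:
        # Whisper on the NPU needs RKNN conversion; on CPU, GGUF via whisper-rs.
        return "rknn" if target_npu else "gguf (whisper-rs)"
    return "onnx"
```

For example, a 300 MB Whisper model targeting the NPU routes to RKNN, while a 50 MB VAD model stays on ONNX.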

Real-world use cases often require running multiple models at the same time: an LLM for reasoning, a wake word detector always listening, and vision for object detection. A single backend cannot serve all of these well simultaneously.

On Rockchip RK3588, the NPU has 3 cores (6 TOPS). An LLM typically occupies all of them, leaving no capacity for audio or vision tasks. Using the NPU for all inference would create a bottleneck whenever multiple models are needed at once.

The solution: mix backends across workloads.

A typical parallel configuration:

  • NPU via RKLLM: LLM inference (maximum efficiency)
  • CPU via ONNX (Sherpa): wake word + audio (lightweight, always-on)
  • GPU via GGUF/Vulkan or NPU via RKNN: vision when the LLM is idle

This is an explicit design goal of paiOS. For the full rationale and hardware allocation strategy, see ADR-004: Inference Flexibility and the Inference Module: Resource Management.
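The parallel configuration above can be written down as a simple allocation table. The adapter names follow this guide, but the data structure itself is a hypothetical Python sketch, not paiOS's actual configuration format (vision is shown on the NPU here; the GPU/Vulkan route is equally valid).

```python
# Hypothetical workload -> backend allocation (illustrative, not a paiOS API).
WORKLOAD_BACKENDS = {
    "llm":       {"adapter": "infer_rkllm",  "device": "npu"},  # maximum efficiency
    "wake_word": {"adapter": "infer_sherpa", "device": "cpu"},  # lightweight, always-on
    "vad":       {"adapter": "infer_sherpa", "device": "cpu"},
    "vision":    {"adapter": "infer_rknn",   "device": "npu"},  # runs when the LLM is idle
}

def device_for(workload: str) -> str:
    """Look up which compute device a workload is pinned to."""
    return WORKLOAD_BACKENDS[workload]["device"]
```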


Format Comparison

| Format | Best For | Ecosystem | Performance | Flexibility | Energy Efficiency |
|---|---|---|---|---|---|
| ONNX | Small models (< 100 MB), VAD, classifiers | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| GGUF | LLMs, Whisper (CPU/GPU) | ★★★★☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| RKNN | CNN-style models (YOLO, ResNet) on NPU | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| RKLLM | LLMs/Transformers on NPU (Rockchip-specific) | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| Custom Adapter | Model types not covered above; requires implementing a new inference adapter (contributor task) | - | - | - | - |

ONNX

Best for:

  • Voice Activity Detection (VAD)
  • Wake word detection
  • Small audio models (< 100 MB)
  • Vision classifiers
  • Embedding models

Ecosystem:

  • 10,000+ models on HuggingFace Hub
  • Pre-trained Silero VAD, OpenWakeWord, pyannote
  • Easy conversion from PyTorch/TensorFlow

Tips:

  • Use quantized models (INT8) for embedded devices
  • ARM Compute Library (ACL) backend for NPU acceleration is experimental

GGUF

Best for:

  • Large Language Models (Llama, Mistral, Qwen)
  • Whisper (when using CPU)
  • Models requiring flexible quantization

Ecosystem:

  • HuggingFace GGUF library (growing)
  • Community-quantized models (Q4_K_M, Q5_K_M, Q8_0)
  • Direct compatibility with the llama.cpp ecosystem

Tips:

  • Use the Vulkan backend for GPU acceleration on Rockchip (Mali-G610)
  • Choose the quantization level based on the accuracy/speed trade-off:
    • Q4_K_M: Fast, good for chat
    • Q5_K_M: Balanced
    • Q8_0: High accuracy

RKNN / RKLLM (NPU)

Rockchip provides two distinct NPU libraries:

  • RKNN: for CNN-style models (YOLO, ResNet, MobileNet, BERT, Whisper encoder). Converts from ONNX.
  • RKLLM: for LLMs and Transformer-based models. A separate library with its own conversion toolchain.

When to use the NPU:

  • When performance is critical: 3-5x faster than CPU
  • When energy efficiency is paramount: the NPU uses roughly 1/10th the power of the GPU
  • Production deployment with fixed, pre-converted models

Ecosystem:

  • Limited (manual conversion required for both RKNN and RKLLM)
  • Rockchip RKNN Toolkit 2 for ONNX → RKNN conversion
  • Rockchip RKLLM for LLM/Transformer → RKLLM conversion
  • Requires model architecture support (not all ops are supported)

When to avoid the NPU:

  • Rapid prototyping (conversion overhead slows iteration)
  • Experimental models not yet validated on the Rockchip toolchain
  • Models requiring frequent updates

Voice Activity Detection

Recommended: ONNX + Silero VAD (via the infer_sherpa adapter)

Speech-to-Text (Whisper)

Option A, CPU (flexible): GGUF via Whisper bindings. Standard models, easy updates.

Option B, NPU (fast): RKNN-converted Whisper encoder. Maximum performance, fixed model.

Decision: use the CPU for development and flexibility; use the NPU in production once the model is stable.

Wake Word Detection

Recommended: ONNX + OpenWakeWord or Silero (via the infer_sherpa adapter)


Large Language Models

CPU/GPU (flexible, recommended for development):

  1. Find model on HuggingFace (preferably pre-quantized GGUF)
  2. Download a GGUF variant (e.g., Q4_K_M)
  3. Run via the infer_llamacpp_cpu or infer_llamacpp_vulkan adapter

NPU (maximum performance on Rockchip, production): use RKLLM via the infer_rkllm adapter. This requires converting the model with the RKLLM Toolkit; it offers the best energy efficiency, but introduces vendor lock-in and conversion overhead.

| Quantization | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| Q4_K_M | ~70% | Low | Chat, general use |
| Q5_K_M | ~60% | Very low | Balanced |
| Q6_K | ~50% | Minimal | High-quality responses |
| Q8_0 | ~30% | Nearly none | Production critical |
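To make the table concrete, here is a rough on-disk size estimate from parameter count. The reduction figures mirror the table and are measured against an FP16 baseline (2 bytes per weight); they are approximations, not exact llama.cpp numbers.

```python
# Approximate size after quantization, relative to an FP16 baseline.
# Reductions mirror the table above (approximate, not exact llama.cpp values).
REDUCTION = {"Q4_K_M": 0.70, "Q5_K_M": 0.60, "Q6_K": 0.50, "Q8_0": 0.30}

def quantized_size_gb(params_billions: float, quant: str) -> float:
    """Estimate quantized model size in GB from parameter count."""
    fp16_gb = params_billions * 2.0  # 2 bytes per weight at FP16
    return fp16_gb * (1.0 - REDUCTION[quant])
```

For example, a 7B model at Q4_K_M comes out around 7 × 2 × 0.3 ≈ 4.2 GB, in line with the community-quantized files seen on HuggingFace.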

Vision Models

  • Small classifiers (< 100 MB) → ONNX (CPU, flexible)
  • Object detection (YOLO, ResNet) on the NPU → RKNN (convert from ONNX, use the infer_rknn adapter)
  • Vision-Language Models (LLaVA) → GGUF (CPU/GPU) or RKLLM (NPU)
  • Unsupported architectures → require a custom adapter (contributor task; see the Inference Module)

Model Conversion

Conversion is done offline using external tools, not at runtime.

PyTorch → ONNX: use torch.onnx.export. See the official PyTorch ONNX docs.

HuggingFace → GGUF: use the convert.py script from the llama.cpp repository.

ONNX → RKNN: use Rockchip RKNN Toolkit 2: configure target_platform='rk3588', load your ONNX model, and export to .rknn with quantization enabled.
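A conversion sketch using the RKNN Toolkit 2 Python API (this runs on an x86 host with rknn-toolkit2 installed, not on the device); the file names and normalization values below are placeholders for your model's real preprocessing.

```python
from rknn.api import RKNN

rknn = RKNN()
# Placeholder normalization values; match your model's actual preprocessing.
rknn.config(target_platform="rk3588",
            mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]])
rknn.load_onnx(model="model.onnx")
# INT8 quantization needs a calibration dataset (one sample path per line).
rknn.build(do_quantization=True, dataset="./calibration.txt")
rknn.export_rknn("model.rknn")
rknn.release()
```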

LLM → RKLLM: use the Rockchip RKLLM Toolkit to convert Llama-compatible models to the .rkllm format.


Troubleshooting

ONNX model fails to load:

  • Verify operator support: ort does not support all ONNX ops
  • Try exporting with an older ONNX opset version
  • Check the model's input/output shapes

GGUF model is slow or will not use the GPU:

  • Ensure the Vulkan backend is enabled (VULKAN_SDK env var)
  • Try a lower quantization level
  • Check GPU/NPU availability

RKNN conversion fails:

  • Verify the model architecture is supported by the RKNN Toolkit
  • Provide a calibration dataset for quantization
  • Check for unsupported operators