Model Integration Guide
This guide helps you choose the right model format and integration path for your AI models in paiOS. For implementation details (crates, code examples), see the Inference Module.
Quick Decision Tree
```mermaid
flowchart TD
    Start[What type of model?] --> Audio{Audio Model?}
    Start --> LLM{Large Language Model?}
    Start --> Vision{Vision Model?}
    Start --> Other{Other?}
    Audio --> AudioSize{Less than 100 MB?}
    AudioSize -->|Yes| ONNX[Use ONNX]
    AudioSize -->|No| Whisper{Whisper?}
    Whisper -->|Yes - CPU| WhisperGGUF[Use GGUF via whisper-rs]
    Whisper -->|Yes - NPU| WhisperRKNN[Convert to RKNN]
    Whisper -->|No| ONNX
    LLM --> LLMTarget{Target NPU?}
    LLMTarget -->|Yes| RKLLM[Use RKLLM]
    LLMTarget -->|No| LLMFormat{Have GGUF?}
    LLMFormat -->|Yes| GGUF[Use GGUF]
    LLMFormat -->|No| ConvertGGUF[Convert to GGUF]
    Vision --> VisionTarget{Target NPU?}
    VisionTarget -->|Yes| VisionRKNN[Convert to RKNN]
    VisionTarget -->|No| VisionSize{Less than 500 MB?}
    VisionSize -->|Yes| ONNX
    VisionSize -->|No| Custom[Custom Adapter]
    Other --> OtherONNX[Try ONNX first]
```
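The decision tree above can be sketched as a small helper function (hypothetical, for illustration only; this is not a paiOS API):

```python
def choose_format(model_type, size_mb=None, target_npu=False,
                  have_gguf=False, is_whisper=False):
    """Mirror of the decision tree: return a recommended model format."""
    if model_type == "audio":
        if size_mb is not None and size_mb < 100:
            return "ONNX"
        if is_whisper:
            return "RKNN" if target_npu else "GGUF (whisper-rs)"
        return "ONNX"
    if model_type == "llm":
        if target_npu:
            return "RKLLM"
        return "GGUF" if have_gguf else "convert to GGUF"
    if model_type == "vision":
        if target_npu:
            return "RKNN"
        if size_mb is not None and size_mb < 500:
            return "ONNX"
        return "custom adapter"
    return "ONNX (try first)"
```

For example, `choose_format("llm", target_npu=True)` returns `"RKLLM"`, while a 600 MB vision model without an NPU target falls through to a custom adapter.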
Why Multiple Formats?
Real-world use cases often require running multiple models at the same time: an LLM for reasoning, a wake word detector always listening, and vision for object detection. A single backend cannot serve all of these well simultaneously.
On Rockchip RK3588, the NPU has 3 cores (6 TOPS). An LLM typically occupies all of them, leaving no capacity for audio or vision tasks. Using the NPU for all inference would create a bottleneck whenever multiple models are needed at once.
The solution: mix backends across workloads.
A typical parallel configuration:
- NPU via RKLLM: LLM inference (maximum efficiency)
- CPU via ONNX (Sherpa): wake word + audio (lightweight, always-on)
- GPU via GGUF/Vulkan or NPU via RKNN: vision when the LLM is idle
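As a sketch, this allocation could be written down as a simple mapping (the names below are illustrative, not a real paiOS configuration format):

```python
# Illustrative workload-to-backend allocation for RK3588.
# Only the LLM occupies the NPU; always-on audio stays on the CPU.
PARALLEL_ALLOCATION = {
    "llm":       {"backend": "RKLLM",             "device": "NPU"},
    "wake_word": {"backend": "ONNX (Sherpa)",     "device": "CPU"},
    "vad":       {"backend": "ONNX (Sherpa)",     "device": "CPU"},
    "vision":    {"backend": "GGUF/Vulkan or RKNN", "device": "GPU or idle NPU"},
}

# The always-on audio workloads must never contend for the NPU.
assert all(PARALLEL_ALLOCATION[w]["device"] == "CPU" for w in ("wake_word", "vad"))
```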
This is an explicit design goal of paiOS. For the full rationale and hardware allocation strategy, see ADR-004: Inference Flexibility and the Inference Module: Resource Management.
Format Comparison
| Format | Best For | Ecosystem | Performance | Flexibility | Energy Efficiency |
|---|---|---|---|---|---|
| ONNX | Small models (< 100 MB), VAD, classifiers | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| GGUF | LLMs, Whisper (CPU/GPU) | ★★★★☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| RKNN | CNN-style models (YOLO, ResNet) on NPU | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| RKLLM | LLMs/Transformers on NPU (Rockchip-specific) | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| Custom Adapter | Model types not covered above: requires implementing a new inference adapter (contributor task) | - | - | - | - |
When to Use ONNX
Ideal Use Cases
- Voice Activity Detection (VAD)
- Wake word detection
- Small audio models (< 100 MB)
- Vision classifiers
- Embedding models
Ecosystem
- 10,000+ models on HuggingFace Hub
- Pre-trained Silero VAD, OpenWakeWord, pyannote
- Easy conversion from PyTorch/TensorFlow
Performance Tips
- Use quantized models (INT8) for embedded devices
- ARM Compute Library (ACL) backend for NPU acceleration is experimental
When to Use GGUF
Ideal Use Cases
- Large Language Models (Llama, Mistral, Qwen)
- Whisper (when using CPU)
- Models requiring flexible quantization
Ecosystem
- HuggingFace GGUF library (growing)
- Community-quantized models (Q4_K_M, Q5_K_M, Q8_0)
- Direct compatibility with llama.cpp ecosystem
Performance Tips
- Use Vulkan backend for GPU acceleration on Rockchip (Mali-G610)
- Choose quantization level based on accuracy/speed trade-off:
  - Q4_K_M: Fast, good for chat
  - Q5_K_M: Balanced
  - Q8_0: High accuracy
When to Use RKNN / RKLLM
Rockchip provides two distinct NPU libraries:
- RKNN: for CNN-style models (YOLO, ResNet, MobileNet, BERT, Whisper encoder). Converts from ONNX.
- RKLLM: for LLMs and Transformer-based models. A separate library with its own conversion toolchain.
Ideal Use Cases
- When performance is critical: 3-5x faster than CPU
- When energy efficiency is paramount: NPU uses ~1/10th power of GPU
- Production deployment with fixed, pre-converted models
Ecosystem
- Limited (manual conversion required for both RKNN and RKLLM)
- Rockchip RKNN Toolkit 2 for ONNX → RKNN conversion
- Rockchip RKLLM for LLM/Transformer → RKLLM conversion
- Requires model architecture support (not all ops supported)
When NOT to Use RKNN / RKLLM
- Rapid prototyping (conversion overhead slows iteration)
- Experimental models not yet validated on Rockchip toolchain
- Models requiring frequent updates
Audio Models
Section titled “Audio Models”Voice Activity Detection
Recommended: ONNX + Silero VAD (via the `infer_sherpa` adapter)
Speech-to-Text (Whisper)
- Option A, CPU (flexible): GGUF via Whisper bindings; standard models, easy updates.
- Option B, NPU (fast): RKNN-converted Whisper encoder; maximum performance, fixed model.
Decision: Use CPU for development and flexibility; use NPU for production where the model is stable.
Wake Word Detection
Recommended: ONNX + OpenWakeWord or Silero (via the `infer_sherpa` adapter)
Large Language Models
Section titled “Large Language Models”Recommended Path
CPU/GPU (flexible, recommended for development):
- Find the model on HuggingFace (preferably pre-quantized GGUF)
- Download a GGUF variant (e.g., `Q4_K_M`)
- Run via the `infer_llamacpp_cpu` or `infer_llamacpp_vulkan` adapter
NPU (maximum performance on Rockchip, production):
Use RKLLM via the `infer_rkllm` adapter: requires converting the model using the RKLLM Toolkit. Best energy efficiency, but introduces vendor lock-in and conversion overhead.
Quantization Guide
| Quantization | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| Q4_K_M | ~70 percent | Low | Chat, general use |
| Q5_K_M | ~60 percent | Very low | Balanced |
| Q6_K | ~50 percent | Minimal | High-quality responses |
| Q8_0 | ~30 percent | Nearly none | Production critical |
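As a rough worked example, on-disk size can be estimated from parameter count and bits per weight. The bits-per-weight figures below are approximate community values for llama.cpp quantizations, not exact numbers:

```python
# Approximate bits per weight for common llama.cpp quantizations (rounded).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk size in GB for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 7B model at Q4_K_M comes out to roughly 4 GB, versus ~14 GB at F16.
print(round(estimated_size_gb(7e9, "Q4_K_M"), 1))
print(round(estimated_size_gb(7e9, "F16"), 1))
```

This is only a sizing heuristic; actual GGUF files also contain embeddings, metadata, and mixed-precision tensors.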
Vision Models
Section titled “Vision Models”Preliminary Recommendations
- Small classifiers (< 100 MB) → ONNX (CPU, flexible)
- Object detection (YOLO, ResNet) on NPU → RKNN (convert from ONNX, use the `infer_rknn` adapter)
- Vision-Language Models (LLaVA) → GGUF (CPU/GPU) or RKLLM (NPU)
- Unsupported architectures → requires a custom adapter (contributor task, see Inference Module)
Converting Models
Conversion is done offline using external tools, not at runtime.
PyTorch → ONNX
Use `torch.onnx.export`. See the official PyTorch ONNX docs.
Safetensors/PyTorch → GGUF
Use the `convert.py` script from the llama.cpp repository.
ONNX → RKNN
Use Rockchip RKNN Toolkit 2: configure `target_platform='rk3588'`, load your ONNX model, and export to `.rknn` with quantization enabled.
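The toolkit flow can be sketched as below. This assumes rknn-toolkit2 is installed on the conversion host (an x86 Linux machine); file names and the calibration dataset path are placeholders:

```python
def convert_onnx_to_rknn(onnx_path="model.onnx",
                         rknn_path="model.rknn",
                         dataset="./dataset.txt"):
    """Offline ONNX -> RKNN conversion sketch (requires rknn-toolkit2)."""
    from rknn.api import RKNN  # imported lazily: only needed on the conversion host

    rknn = RKNN()
    # Target the RK3588 NPU; preprocessing options (mean/std) depend on your model.
    rknn.config(target_platform="rk3588")
    if rknn.load_onnx(model=onnx_path) != 0:
        raise RuntimeError("load_onnx failed")
    # INT8 quantization needs a calibration dataset (a text file listing inputs).
    if rknn.build(do_quantization=True, dataset=dataset) != 0:
        raise RuntimeError("build failed")
    if rknn.export_rknn(rknn_path) != 0:
        raise RuntimeError("export_rknn failed")
    rknn.release()
```

The toolkit's calls return 0 on success, hence the explicit checks; a missing calibration dataset is the most common cause of `build` failing.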
LLM → RKLLM
Use the Rockchip RKLLM Toolkit to convert Llama-compatible models to the `.rkllm` format.
Common Issues
Section titled “Common Issues”ONNX Models Not Loading
- Verify operator support: `ort` doesn't support all ONNX ops
- Try exporting with an older ONNX opset version
- Check model input/output shapes
GGUF Models Slow
- Ensure Vulkan backend is enabled (`VULKAN_SDK` env var)
- Try a lower quantization level
- Check GPU/NPU availability
RKNN Conversion Fails
- Verify model architecture is supported by RKNN Toolkit
- Provide a calibration dataset for quantization
- Check for unsupported operators
Getting Help
- Discord: paiOS Community
- GitHub Issues: Report bugs or request features
- Documentation: Architecture Overview
Related
- Inference Module: Adapter implementation details, crates, and code examples
- ADR-004: Engine Architecture: Hardware allocation and hybrid inference strategy
- Contributing Standards: Code quality guidelines