Model Integration Guide
This guide helps you choose the right model format and integration path for your AI models in paiOS. For implementation details (crates, code examples), see the Inference Module.
Quick Decision Tree
```mermaid
flowchart TD
    Start[What type of model?] --> Audio{Audio Model?}
    Start --> LLM{Large Language Model?}
    Start --> Vision{Vision Model?}
    Start --> Other{Other?}
    Audio --> AudioSize{Less than 100 MB?}
    AudioSize -->|Yes| ONNX[Use ONNX]
    AudioSize -->|No| Whisper{Whisper?}
    Whisper -->|Yes - CPU| WhisperGGUF[Use GGUF via whisper-rs]
    Whisper -->|Yes - NPU| WhisperRKNN[Convert to RKNN]
    Whisper -->|No| ONNX
    LLM --> LLMTarget{Target NPU?}
    LLMTarget -->|Yes| RKLLM[Use RKLLM]
    LLMTarget -->|No| LLMFormat{Have GGUF?}
    LLMFormat -->|Yes| GGUF[Use GGUF]
    LLMFormat -->|No| ConvertGGUF[Convert to GGUF]
    Vision --> VisionTarget{Target NPU?}
    VisionTarget -->|Yes| VisionRKNN[Convert to RKNN]
    VisionTarget -->|No| VisionSize{Less than 500 MB?}
    VisionSize -->|Yes| ONNX
    VisionSize -->|No| Custom[Custom Adapter]
    Other --> OtherONNX[Try ONNX first]
```
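The decision tree above can be sketched as a small helper function (hypothetical, for illustration only; this is not a paiOS API):

```python
def choose_format(model_type, size_mb=None, target_npu=False,
                  have_gguf=False, is_whisper=False):
    """Mirror of the decision tree: return a recommended model format."""
    if model_type == "audio":
        if size_mb is not None and size_mb < 100:
            return "ONNX"
        if is_whisper:
            return "RKNN" if target_npu else "GGUF (whisper-rs)"
        return "ONNX"
    if model_type == "llm":
        if target_npu:
            return "RKLLM"
        return "GGUF" if have_gguf else "convert to GGUF"
    if model_type == "vision":
        if target_npu:
            return "RKNN"
        if size_mb is not None and size_mb < 500:
            return "ONNX"
        return "custom adapter"
    return "ONNX (try first)"
```

For example, `choose_format("llm", target_npu=True)` returns `"RKLLM"`, while a 600 MB vision model without an NPU target falls through to a custom adapter.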
Why Multiple Formats?
Real-world use cases often require running multiple models at the same time: an LLM for reasoning, a wake word detector always listening, and vision for object detection. A single backend cannot serve all of these well simultaneously.
On Rockchip RK3588, the NPU has 3 cores (6 TOPS). An LLM typically occupies all of them, leaving no capacity for audio or vision tasks. Using the NPU for all inference would create a bottleneck whenever multiple models are needed at once.
The solution: mix backends across workloads.
A typical parallel configuration:
- NPU via RKLLM: LLM inference (maximum efficiency)
- CPU via ONNX (Sherpa): wake word + audio (lightweight, always-on)
- GPU via GGUF/Vulkan or NPU via RKNN: vision when the LLM is idle
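As a sketch, this allocation could be written down as a simple mapping (the names below are illustrative, not a real paiOS configuration format):

```python
# Illustrative workload-to-backend allocation for RK3588.
# Only the LLM occupies the NPU; always-on audio stays on the CPU.
PARALLEL_ALLOCATION = {
    "llm":       {"backend": "RKLLM",             "device": "NPU"},
    "wake_word": {"backend": "ONNX (Sherpa)",     "device": "CPU"},
    "vad":       {"backend": "ONNX (Sherpa)",     "device": "CPU"},
    "vision":    {"backend": "GGUF/Vulkan or RKNN", "device": "GPU or idle NPU"},
}

# The always-on audio workloads must never contend for the NPU.
assert all(PARALLEL_ALLOCATION[w]["device"] == "CPU" for w in ("wake_word", "vad"))
```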
This is an explicit design goal of paiOS. For the full rationale and hardware allocation strategy, see ADR-004: Inference Flexibility and the Inference Module: Resource Management.
Format Comparison
| Format | Best For | Ecosystem | Performance | Flexibility | Energy Efficiency |
|---|---|---|---|---|---|
| ONNX | Small models (< 100 MB), VAD, classifiers | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| GGUF | LLMs, Whisper (CPU/GPU) | ★★★★☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| RKNN | CNN-style models (YOLO, ResNet) on NPU | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| RKLLM | LLMs/Transformers on NPU (Rockchip-specific) | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | ★★★★★ |
| Custom Adapter | Model types not covered above: requires implementing a new inference adapter (contributor task) | - | - | - | - |
When to Use ONNX
Ideal Use Cases
- Voice Activity Detection (VAD)
- Wake word detection
- Small audio models (< 100 MB)
- Vision classifiers
- Embedding models
Ecosystem
- 10,000+ models on HuggingFace Hub
- Pre-trained Silero VAD, OpenWakeWord, pyannote
- Easy conversion from PyTorch/TensorFlow
Performance Tips
- Use quantized models (INT8) for embedded devices
- ARM Compute Library (ACL) backend for NPU acceleration is experimental
When to Use GGUF
Ideal Use Cases
- Large Language Models (Llama, Mistral, Qwen)
- Whisper (when using CPU)
- Models requiring flexible quantization
Ecosystem
- HuggingFace GGUF library (growing)
- Community-quantized models (Q4_K_M, Q5_K_M, Q8_0)
- Direct compatibility with llama.cpp ecosystem
Performance Tips
- Use Vulkan backend for GPU acceleration on Rockchip (Mali-G610)
- Choose quantization level based on accuracy/speed trade-off:
  - Q4_K_M: Fast, good for chat
  - Q5_K_M: Balanced
  - Q8_0: High accuracy
When to Use RKNN / RKLLM
Rockchip provides two distinct NPU libraries:
- RKNN: for CNN-style models (YOLO, ResNet, MobileNet, BERT, Whisper encoder). Converts from ONNX.
- RKLLM: for LLMs and Transformer-based models. A separate library with its own conversion toolchain.
Ideal Use Cases
- When performance is critical: 3-5x faster than CPU
- When energy efficiency is paramount: NPU uses ~1/10th power of GPU
- Production deployment with fixed, pre-converted models
Ecosystem
- Limited (manual conversion required for both RKNN and RKLLM)
- Rockchip RKNN Toolkit 2 for ONNX → RKNN conversion
- Rockchip RKLLM for LLM/Transformer → RKLLM conversion
- Requires model architecture support (not all ops supported)
When NOT to Use RKNN / RKLLM
- Rapid prototyping (conversion overhead slows iteration)
- Experimental models not yet validated on Rockchip toolchain
- Models requiring frequent updates
Audio Models
Section titled “Audio Models”Voice Activity Detection
Recommended: ONNX + Silero VAD (via the `infer_sherpa` adapter)
Speech-to-Text (Whisper)
- Option A, CPU (flexible): GGUF via Whisper bindings; standard models, easy updates.
- Option B, NPU (fast): RKNN-converted Whisper encoder; maximum performance, fixed model.
Decision: Use CPU for development and flexibility; use NPU for production where the model is stable.
Wake Word Detection
Recommended: ONNX + OpenWakeWord or Silero (via the `infer_sherpa` adapter)
Large Language Models
Section titled “Large Language Models”Recommended Path
CPU/GPU (flexible, recommended for development):
- Find the model on HuggingFace (preferably pre-quantized GGUF)
- Download a GGUF variant (e.g., `Q4_K_M`)
- Run via the `infer_llamacpp_cpu` or `infer_llamacpp_vulkan` adapter
NPU (maximum performance on Rockchip, production):
Use RKLLM via the `infer_rkllm` adapter: requires converting the model using the RKLLM Toolkit. Best energy efficiency, but introduces vendor lock-in and conversion overhead.
Quantization Guide
| Quantization | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| Q4_K_M | ~70 percent | Low | Chat, general use |
| Q5_K_M | ~60 percent | Very low | Balanced |
| Q6_K | ~50 percent | Minimal | High-quality responses |
| Q8_0 | ~30 percent | Nearly none | Production critical |
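As a rough worked example, on-disk size can be estimated from parameter count and bits per weight. The bits-per-weight figures below are approximate community values for llama.cpp quantizations, not exact numbers:

```python
# Approximate bits per weight for common llama.cpp quantizations (rounded).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk size in GB for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 7B model at Q4_K_M comes out to roughly 4 GB, versus ~14 GB at F16.
print(round(estimated_size_gb(7e9, "Q4_K_M"), 1))
print(round(estimated_size_gb(7e9, "F16"), 1))
```

This is only a sizing heuristic; actual GGUF files also contain embeddings, metadata, and mixed-precision tensors.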
Vision Models
Section titled “Vision Models”Preliminary Recommendations
- Small classifiers (< 100 MB) → ONNX (CPU, flexible)
- Object detection (YOLO, ResNet) on NPU → RKNN (convert from ONNX, use the `infer_rknn` adapter)
- Vision-Language Models (LLaVA) → GGUF (CPU/GPU) or RKLLM (NPU)
- Unsupported architectures → requires a custom adapter (contributor task, see Inference Module)
Converting Models
Conversion is done offline using external tools, not at runtime.
PyTorch → ONNX
Use `torch.onnx.export`. See the official PyTorch ONNX docs.
Safetensors/PyTorch → GGUF
Use the `convert.py` script from the llama.cpp repository.
ONNX → RKNN
Use Rockchip RKNN Toolkit 2: configure `target_platform='rk3588'`, load your ONNX model, and export to `.rknn` with quantization enabled.
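The toolkit flow can be sketched as below. This assumes rknn-toolkit2 is installed on the conversion host (an x86 Linux machine); file names and the calibration dataset path are placeholders:

```python
def convert_onnx_to_rknn(onnx_path="model.onnx",
                         rknn_path="model.rknn",
                         dataset="./dataset.txt"):
    """Offline ONNX -> RKNN conversion sketch (requires rknn-toolkit2)."""
    from rknn.api import RKNN  # imported lazily: only needed on the conversion host

    rknn = RKNN()
    # Target the RK3588 NPU; preprocessing options (mean/std) depend on your model.
    rknn.config(target_platform="rk3588")
    if rknn.load_onnx(model=onnx_path) != 0:
        raise RuntimeError("load_onnx failed")
    # INT8 quantization needs a calibration dataset (a text file listing inputs).
    if rknn.build(do_quantization=True, dataset=dataset) != 0:
        raise RuntimeError("build failed")
    if rknn.export_rknn(rknn_path) != 0:
        raise RuntimeError("export_rknn failed")
    rknn.release()
```

The toolkit's calls return 0 on success, hence the explicit checks; a missing calibration dataset is the most common cause of `build` failing.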
LLM → RKLLM
Use the Rockchip RKLLM Toolkit to convert Llama-compatible models to the `.rkllm` format.
Common Issues
Section titled “Common Issues”ONNX Models Not Loading
- Verify operator support: `ort` doesn't support all ONNX ops
- Try exporting with an older ONNX opset version
- Check model input/output shapes
GGUF Models Slow
- Ensure Vulkan backend is enabled (`VULKAN_SDK` env var)
- Try a lower quantization level
- Check GPU/NPU availability
RKNN Conversion Fails
- Verify model architecture is supported by RKNN Toolkit
- Provide a calibration dataset for quantization
- Check for unsupported operators
Getting Help
- Discord: paiOS Community
- GitHub Issues: Report bugs or request features
- Documentation: Architecture Overview
Related
- Inference Module: Adapter implementation details, crates, and code examples
- ADR-004: Engine Architecture: Hardware allocation and hybrid inference strategy
- Contributing Standards: Code quality guidelines