Vision Module
The Vision Module handles the high-performance image acquisition and preprocessing pipeline. While our initial implementation leverages the specific hardware capabilities of the Rockchip RK3588 SoC (ISP, RGA), the module’s architecture is entirely vendor-agnostic, allowing rapid porting to other SoCs while maintaining memory safety through Rust.
Ports & Adapters (Feature Flags)
| Adapter | Implements Port(s) | Capability Feature | Technology / Purpose |
|---|---|---|---|
| CameraAdapter | FrameSource | vision_v4l2 | V4l2r → rkisp (MIPI-CSI) or USB webcam |
| | | vision_mock | Simulated frames for CI testing |
| ImageProcessorAdapter | ImageProcessor | vision_rga | RGA (librga) → Radxa / Rockchip SoC 2D hardware acceleration |
| | | vision_image | image-rs → desktop CPU-based processing; also fallback on Radxa |
| | | vision_mock | Simulated image processing for CI |
| MotionAdapter | MotionDetector | vision_isp_motion | Hardware MD / V4L2 stats → Rockchip SoC |
| | | vision_cpu_motion | image-rs → CPU frame diff (desktop fallback) |
| | | vision_mock | Simulated motion detection for CI |
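As a rough illustration of how a capability feature from the table can gate adapter construction at compile time, here is a hedged Rust sketch. Only the `ImageProcessor` port name and the `vision_rga` feature come from the table; `CpuProcessor`, `make_processor`, and the conversion body are illustrative assumptions, not the crate's real API.

```rust
/// Port trait owned by the domain (name from the table above).
pub trait ImageProcessor {
    /// Convert an NV12 buffer to RGB (stubbed here for illustration).
    fn convert(&self, nv12: &[u8]) -> Vec<u8>;
}

/// CPU fallback processor (the `vision_image` path), always compiled
/// so this sketch stays self-contained.
pub struct CpuProcessor;

impl ImageProcessor for CpuProcessor {
    fn convert(&self, nv12: &[u8]) -> Vec<u8> {
        nv12.to_vec() // stand-in for a real NV12 -> RGB conversion
    }
}

/// With `vision_rga` enabled, an RGA-backed adapter would be returned.
#[cfg(feature = "vision_rga")]
pub fn make_processor() -> Box<dyn ImageProcessor> {
    Box::new(CpuProcessor) // placeholder for an RGA-backed processor
}

/// Without the feature, fall back to the CPU path.
#[cfg(not(feature = "vision_rga"))]
pub fn make_processor() -> Box<dyn ImageProcessor> {
    Box::new(CpuProcessor)
}
```

Callers only ever see `Box<dyn ImageProcessor>`, so the feature choice never leaks into the domain logic.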
Architecture Context / Relationships
The Vision Module is the reference implementation of Hexagonal Architecture in pai-engine. It is driven by the core orchestrator and split into domain (logic + ports + types) and adapters (platform-specific implementations). The system architecture diagram (and C4 Architecture) shows the Vision block with the same ports (FrameSource, ImageProcessor, MotionDetector), adapters, and feature flags as on this page. See Architecture Overview and ADR-004 for context.
Crate Structure
Like other domain crates, vision separates its internal domain logic from its hardware adapters.
```
crates/vision/
├── src/
│   ├── domain/        # VisionManager, FramePool, StreamProfile
│   │   └── ports.rs   # Internal Traits: `FrameSource`, `ImageProcessor`
│   ├── adapters/      # Hardware implementations
│   │   ├── v4l2.rs    # `V4l2Adapter` for Linux cameras
│   │   ├── mock.rs    # Mock adapter for testing
│   │   └── rga.rs     # Rockchip RGA hardware acceleration
│   └── lib.rs         # Implements Core's `VisionInterface`
└── Cargo.toml
```

Driven Sensor Philosophy
The camera is a Driven Adapter (right side of the Hexagon). It does not push frames or dictate the system tick; the Core’s SessionManager requests or polls frames when needed. Why: predictable resource use, simpler testing, and alignment with the Hexagonal pattern. The Orchestrator (in core) calls into the Vision domain; Vision’s VisionManager uses the ports, which are implemented by the adapters (Camera, RGA, with Rockchip/x86/Mock variants).
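The pull model described above can be sketched as a domain-owned port that the orchestrator polls on its own tick. The `FrameSource` name follows this page; the `poll_frame` method and the tick loop are illustrative assumptions.

```rust
/// Port trait owned by the Vision domain: the camera adapter implements
/// it, but never pushes frames on its own.
pub trait FrameSource {
    /// Return the next available frame, or None if nothing is ready.
    fn poll_frame(&mut self) -> Option<Vec<u8>>;
}

/// Sketch of the core-driven loop: the orchestrator decides when to ask
/// for frames; the camera cannot dictate the system tick.
pub fn run_ticks(source: &mut dyn FrameSource, ticks: usize) -> usize {
    let mut received = 0;
    for _ in 0..ticks {
        if let Some(_frame) = source.poll_frame() {
            received += 1;
        }
    }
    received
}
```

Because the loop owns the cadence, resource use stays predictable and tests can drive the port with a trivial fake.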
Internal Domain Components
The Vision Module consists of six core components, each with a specific responsibility:
1. StreamProfile (Configuration)
Role: Configuration & State Definition
Description: Deserializes the runtime configuration (JSON) into strict Rust structures. It defines the initial state of the pipeline, including resolution, framerate, and pixel formats (e.g., “Initialize 1080p stream @ 30fps”).
Responsibilities:
- Parse and validate configuration from JSON
- Define pipeline state (resolution, framerate, pixel format)
- Provide type-safe configuration structures
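A minimal sketch of what such a type-safe configuration structure might look like, assuming field names like `width`, `fps`, and `pixel_format` (the real crate deserializes these from JSON; the validation rules here are illustrative):

```rust
/// Hypothetical pipeline configuration; in the real crate this would be
/// deserialized from the runtime JSON configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamProfile {
    pub width: u32,
    pub height: u32,
    pub fps: u32,
    pub pixel_format: PixelFormat,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PixelFormat {
    Nv12,
    Rgb888,
}

impl StreamProfile {
    /// Reject configurations the pipeline cannot run. The exact limits
    /// are assumptions for illustration.
    pub fn validate(&self) -> Result<(), String> {
        if self.width == 0 || self.height == 0 {
            return Err("resolution must be non-zero".into());
        }
        if self.fps == 0 || self.fps > 120 {
            return Err("unsupported framerate".into());
        }
        Ok(())
    }
}
```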
2. FramePool (Memory Management)
Role: DMA-BUF Allocator
Description: Manages a pool of pre-allocated DMA buffers. This enables a zero-copy architecture where image data is shared directly between the Camera ISP and the RGA hardware without expensive CPU memory copies. The pool has a fixed, bounded size: if the frame consumer (Inference) cannot keep up with the capture rate, the oldest frames are evicted. This backpressure behaviour prevents memory exhaustion under load.
Responsibilities:
- Pre-allocate DMA buffers for zero-copy operations
- Manage buffer lifecycle (allocation, deallocation, recycling)
- Provide buffer access to FrameSource and ImageProcessor
- Drop oldest frames when the pool is full (ring-buffer semantics), preventing OOM
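The ring-buffer eviction behaviour can be sketched with a plain `VecDeque` standing in for DMA-BUF handles; the real pool manages hardware buffers, so this is only a model of the backpressure policy.

```rust
use std::collections::VecDeque;

/// Illustrative bounded frame pool with ring-buffer semantics: when the
/// pool is full, the oldest frame is evicted so memory stays bounded.
/// `Vec<u8>` is a stand-in for a DMA-BUF handle.
pub struct FramePool {
    capacity: usize,
    frames: VecDeque<Vec<u8>>,
}

impl FramePool {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, frames: VecDeque::with_capacity(capacity) }
    }

    /// Push a frame, dropping the oldest one if the pool is full.
    pub fn push(&mut self, frame: Vec<u8>) {
        if self.frames.len() == self.capacity {
            self.frames.pop_front(); // evict oldest (backpressure)
        }
        self.frames.push_back(frame);
    }

    /// Hand the oldest queued frame to the consumer.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        self.frames.pop_front()
    }

    pub fn len(&self) -> usize {
        self.frames.len()
    }
}
```

If Inference stalls, the pool silently sheds the stalest frames instead of growing without bound.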
3. FrameSource (Port) & CameraAdapter
Role: Port = trait in domain; Adapter = CameraAdapter (feature-gated: see Ports & Adapters (Feature Flags) table in this section).
Description: The domain defines a FrameSource port (start/stop stream, get frames). The CameraAdapter implements it: e.g. V4L2 for Linux (MIPI CSI on Rockchip, webcam on x86) and a Mock for tests.
Responsibilities:
- Domain: define the port interface (no hardware).
- Adapters: V4L2 lifecycle, NV12 frames, error handling; Mock for CI.
- Platform selection via `#[cfg(target_arch = "...")]` or similar in `adapters/mod.rs`.
4. ImageProcessor (Port) & ImageProcessorAdapter
Role: Port = trait in domain; Adapter = ImageProcessorAdapter (feature-gated: see Ports & Adapters (Feature Flags) table in this section).
Description: The domain defines an ImageProcessor port (format conversion, resize, crop). The ImageProcessorAdapter implements it: e.g. RGA on Radxa/Rockchip SoC, image-rs on desktop and as CPU fallback on Radxa, and Mock for CI.
Responsibilities:
- Domain: port interface only.
- Adapters: NV12→RGB, resize, crop; hardware-accelerated path on target, software path on desktop and as fallback.
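As a sketch of the software path, here is a nearest-neighbour resize of a single-channel plane (e.g. the Y plane of an NV12 frame). The real adapters would use RGA on Rockchip or image-rs on desktop; this function and its name are illustrative assumptions.

```rust
/// Nearest-neighbour resize of one 8-bit plane from (sw x sh) to
/// (dw x dh). Purely a CPU-fallback illustration, not the crate's API.
pub fn resize_plane(src: &[u8], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<u8> {
    assert_eq!(src.len(), sw * sh, "source plane has wrong size");
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        let sy = y * sh / dh; // map destination row to source row
        for x in 0..dw {
            let sx = x * sw / dw; // map destination column to source column
            dst[y * dw + x] = src[sy * sw + sx];
        }
    }
    dst
}
```

On target hardware the same operation would be a single RGA job on DMA buffers, with no CPU touch of the pixels.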
5. MotionDetector (Port) & MotionAdapter
Role: Port = trait in domain; Adapter = MotionAdapter (feature-gated: see Ports & Adapters (Feature Flags) table in this section).
Description: Motion gating for power saving. The ISP can output a low-resolution thumbnail stream (e.g., 64×64) alongside the full-resolution stream. The MotionAdapter implements the MotionDetector port: hardware thumbnail diff on target, CPU frame diff on desktop, and Mock for CI. The VisionManager uses motion detection to enable full-resolution capture and inference only when needed.
Why this matters: Full-resolution capture + RGA + inference consumes significant power. When a battery-powered AI device is idle (no motion detected), the system stays in a low-power state with only the ISP thumbnail stream active. Motion triggers the full pipeline.
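The CPU frame-diff path (`vision_cpu_motion`) can be sketched as a mean-absolute-difference check over two thumbnails; the threshold value and function name are assumptions for illustration.

```rust
/// Report motion when the mean absolute pixel difference between two
/// equally sized thumbnails (e.g. 64x64 luma planes) exceeds a
/// threshold. Illustrative sketch of a CPU frame diff, not the real API.
pub fn motion_detected(prev: &[u8], curr: &[u8], threshold: u32) -> bool {
    assert_eq!(prev.len(), curr.len(), "thumbnails must match in size");
    assert!(!prev.is_empty(), "thumbnails must be non-empty");
    let sum: u32 = prev
        .iter()
        .zip(curr)
        .map(|(a, b)| a.abs_diff(*b) as u32)
        .sum();
    (sum / prev.len() as u32) > threshold
}
```

Because the check runs on a tiny thumbnail, it stays cheap enough to run continuously while the full-resolution pipeline sleeps.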
6. VisionManager (Domain)
Role: Pipeline coordinator (domain logic).
Description: The central unit in the domain. It runs the event loop and orchestrates StreamProfile, FramePool, and the ports (FrameSource, ImageProcessor, MotionDetector). It has no hardware knowledge; adapters are injected (e.g. in main.rs).
Responsibilities:
- Coordinate data flow between domain types and ports
- Event loop and resource lifecycle
- Thread safety and cleanup
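The injection pattern described above can be sketched as a manager that is generic over its ports, so adapters are supplied from outside (e.g. in main.rs). Trait and method names here are illustrative, not the crate's real API.

```rust
/// Ports the manager depends on; hardware-free trait definitions.
pub trait MotionDetector {
    fn motion_detected(&mut self) -> bool;
}
pub trait FrameSource {
    fn poll_frame(&mut self) -> Option<Vec<u8>>;
}

/// Domain coordinator: owns only port implementations, injected by the
/// composition root, and has no hardware knowledge of its own.
pub struct VisionManager<M, F> {
    motion: M,
    camera: F,
}

impl<M: MotionDetector, F: FrameSource> VisionManager<M, F> {
    pub fn new(motion: M, camera: F) -> Self {
        Self { motion, camera }
    }

    /// One event-loop tick: poll a full-resolution frame only when the
    /// motion gate fires (the power-saving behaviour described above).
    pub fn tick(&mut self) -> Option<Vec<u8>> {
        if self.motion.motion_detected() {
            self.camera.poll_frame()
        } else {
            None
        }
    }
}
```

Swapping a V4L2 camera for a mock is then just a different pair of constructor arguments, which is what makes the domain testable in CI.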
Cancellation-Safe Interface (Saga Rollback Support)
vision is a participant in Core’s Saga rollback. When the SessionManager cancels an in-progress flow (e.g., a user interrupts a visual query), it issues compensating commands to Vision. VisionManager must support clean cancellation at any point:
- `stop_stream()`: halts V4L2 capture, returns all FramePool buffers to the free-list, and releases the camera device handle. Safe to call even if streaming was never started.
- `cancel_processing()`: signals any in-flight RGA operations to abort; does not block.
Failure to support these cleanly will leave the camera device open after the Core has moved back to Idle, blocking other processes and causing a Zombie State. See Core: User-Interruption & State Rollback for the full rollback sequence.
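The key contract is that `stop_stream()` is idempotent, so a Saga rollback can always issue it unconditionally. A minimal sketch of that contract, with an explicit state flag standing in for the real V4L2 and FramePool teardown:

```rust
/// Illustrative camera handle; the state enum stands in for real
/// resources (device handle, queued DMA buffers).
#[derive(Debug, PartialEq)]
enum StreamState {
    Idle,
    Streaming,
}

pub struct Camera {
    state: StreamState,
}

impl Camera {
    pub fn new() -> Self {
        Self { state: StreamState::Idle }
    }

    pub fn start_stream(&mut self) {
        self.state = StreamState::Streaming;
    }

    /// Idempotent stop: releases resources when streaming, and is a safe
    /// no-op when streaming was never started, so rollback cannot leave
    /// the device open (the "Zombie State" described above).
    pub fn stop_stream(&mut self) {
        if self.state == StreamState::Streaming {
            // ...return buffers to the FramePool, release device handle...
            self.state = StreamState::Idle;
        }
    }

    pub fn is_streaming(&self) -> bool {
        self.state == StreamState::Streaming
    }
}
```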
Data Flow
Internal pipeline (Vision domain)
Within the Vision domain crate, frames move through a fixed sequence. This ordering is the current design and may remain hardcoded for simplicity; any future per-flow or user-configurable ordering would be an extension.
```mermaid
flowchart LR
    A["FrameSource<br/>(NV12)"] --> B["FramePool<br/>(DMA-BUF)"] --> C["ImageProcessor<br/>(RGB/RGBA)"]
```
- FrameSource (camera) retrieves ISP-processed frames (NV12).
- FramePool provides DMA buffers for zero-copy transfer.
- ImageProcessor converts NV12 to RGB/RGBA and performs any required resizing.
Processed frames are then handed out via the VisionInterface to the caller. The Vision domain does not decide what happens next.
What happens after Vision: Core flows decide
Whether processed frames are passed to Inference (or used for streaming, snapshots, or discarded) is not defined by the Vision domain. It is defined by the active flow in the Core: the flows component (FlowRunner, e.g. Voice, Stream, Chat, Interaction). The flow (and, in future, the user or agent) decides if and when to call Inference, relay frames to a client, or drop them.
So the final end-to-end flow (e.g. Vision → Inference, or Vision → stream only) is always determined by the flows in Core; Vision only exposes the internal pipeline described earlier and implements the VisionInterface.
Hardware Integration
The Vision Module leverages specific RK3588 hardware:
| Component | Hardware | Purpose |
|---|---|---|
| ISP | Image Signal Processor | Camera sensor processing, auto-exposure, white balance |
| RGA | Raster Graphics Accelerator | Hardware-accelerated format conversion and scaling |
| DMA | Direct Memory Access | Zero-copy buffer sharing between components |
Motion gating (ISP thumbnail stream for power saving) is implemented by the MotionDetector (Port) & MotionAdapter in the Internal Domain Components section.
Related Documentation
- ADR-004: Engine Architecture: High-level architecture decisions
- Inference Module: AI model execution (consumes vision frames)
- Workspace and Build: Feature flags and build configuration
- OS & Infrastructure: System layer overview
- C4 Architecture: System context and containers