Vision Module
The Vision Module handles the high-performance image acquisition and preprocessing pipeline. While our initial implementation leverages the specific hardware capabilities of the Rockchip RK3588 SoC (ISP, RGA), the module’s architecture is entirely vendor-agnostic, allowing rapid porting to other SoCs while maintaining memory safety through Rust.
Ports & Adapters (Feature Flags)
| Adapter | Implements Port(s) | Capability Feature | Technology / Purpose |
|---|---|---|---|
| CameraAdapter | FrameSource | vision_v4l2 | V4l2r → rkisp (MIPI-CSI) or USB webcam |
| | | vision_mock | Simulated frames for CI testing |
| ImageProcessorAdapter | ImageProcessor | vision_rga | RGA (librga) → Radxa / Rockchip SoC 2D hardware acceleration |
| | | vision_image | image-rs → desktop CPU-based processing; also fallback on Radxa |
| | | vision_mock | Simulated image processing for CI |
| MotionAdapter | MotionDetector | vision_isp_motion | Hardware MD / V4L2 stats → Rockchip SoC |
| | | vision_cpu_motion | image-rs → CPU frame diff (desktop fallback) |
| | | vision_mock | Simulated motion detection for CI |
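As a rough illustration of how a capability feature from the table can gate adapter construction at compile time, here is a hedged Rust sketch. Only the `ImageProcessor` port name and the `vision_rga` feature come from the table; `CpuProcessor`, `make_processor`, and the conversion body are illustrative assumptions, not the crate's real API.

```rust
/// Port trait owned by the domain (name from the table above).
pub trait ImageProcessor {
    /// Convert an NV12 buffer to RGB (stubbed here for illustration).
    fn convert(&self, nv12: &[u8]) -> Vec<u8>;
}

/// CPU fallback processor (the `vision_image` path), always compiled
/// so this sketch stays self-contained.
pub struct CpuProcessor;

impl ImageProcessor for CpuProcessor {
    fn convert(&self, nv12: &[u8]) -> Vec<u8> {
        nv12.to_vec() // stand-in for a real NV12 -> RGB conversion
    }
}

/// With `vision_rga` enabled, an RGA-backed adapter would be returned.
#[cfg(feature = "vision_rga")]
pub fn make_processor() -> Box<dyn ImageProcessor> {
    Box::new(CpuProcessor) // placeholder for an RGA-backed processor
}

/// Without the feature, fall back to the CPU path.
#[cfg(not(feature = "vision_rga"))]
pub fn make_processor() -> Box<dyn ImageProcessor> {
    Box::new(CpuProcessor)
}
```

Callers only ever see `Box<dyn ImageProcessor>`, so the feature choice never leaks into the domain logic.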
Architecture Context / Relationships
The Vision Module is the reference implementation of Hexagonal Architecture in pai-engine. It is driven by the core orchestrator and split into domain (logic + ports + types) and adapters (platform-specific implementations). The system architecture diagram (and C4 Architecture) shows the Vision block with the same ports (FrameSource, ImageProcessor, MotionDetector), adapters, and feature flags as on this page. See Architecture Overview and ADR-004 for context.
Crate Structure
Like other domain crates, vision separates its internal domain logic from its hardware adapters.
```
crates/vision/
├── src/
│   ├── domain/        # VisionManager, FramePool, StreamProfile
│   │   └── ports.rs   # Internal Traits: `FrameSource`, `ImageProcessor`
│   ├── adapters/      # Hardware implementations
│   │   ├── v4l2.rs    # `V4l2Adapter` for Linux cameras
│   │   ├── mock.rs    # Mock adapter for testing
│   │   └── rga.rs     # Rockchip RGA hardware acceleration
│   └── lib.rs         # Implements Core's `VisionInterface`
└── Cargo.toml
```

Driven Sensor Philosophy
The camera is a Driven Adapter (right side of the Hexagon). It does not push frames or dictate the system tick; the Core’s SessionManager requests or polls frames when needed. Why: predictable resource use, simpler testing, and alignment with the Hexagonal pattern. The Orchestrator (in core) calls into the Vision domain; Vision’s VisionManager uses the ports, which are implemented by the adapters (Camera, RGA, with Rockchip/x86/Mock variants).
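The pull model described above can be sketched as a domain-owned port that the orchestrator polls on its own tick. The `FrameSource` name follows this page; the `poll_frame` method and the tick loop are illustrative assumptions.

```rust
/// Port trait owned by the Vision domain: the camera adapter implements
/// it, but never pushes frames on its own.
pub trait FrameSource {
    /// Return the next available frame, or None if nothing is ready.
    fn poll_frame(&mut self) -> Option<Vec<u8>>;
}

/// Sketch of the core-driven loop: the orchestrator decides when to ask
/// for frames; the camera cannot dictate the system tick.
pub fn run_ticks(source: &mut dyn FrameSource, ticks: usize) -> usize {
    let mut received = 0;
    for _ in 0..ticks {
        if let Some(_frame) = source.poll_frame() {
            received += 1;
        }
    }
    received
}
```

Because the loop owns the cadence, resource use stays predictable and tests can drive the port with a trivial fake.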
Internal Domain Components
The Vision Module consists of six core components, each with a specific responsibility:
1. StreamProfile (Configuration)
Role: Configuration & State Definition
Description: Deserializes the runtime configuration (JSON) into strict Rust structures. It defines the initial state of the pipeline, including resolution, framerate, and pixel formats (e.g., “Initialize 1080p stream @ 30fps”).
Responsibilities:
- Parse and validate configuration from JSON
- Define pipeline state (resolution, framerate, pixel format)
- Provide type-safe configuration structures
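A minimal sketch of what such a type-safe configuration structure might look like, assuming field names like `width`, `fps`, and `pixel_format` (the real crate deserializes these from JSON; the validation rules here are illustrative):

```rust
/// Hypothetical pipeline configuration; in the real crate this would be
/// deserialized from the runtime JSON configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamProfile {
    pub width: u32,
    pub height: u32,
    pub fps: u32,
    pub pixel_format: PixelFormat,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PixelFormat {
    Nv12,
    Rgb888,
}

impl StreamProfile {
    /// Reject configurations the pipeline cannot run. The exact limits
    /// are assumptions for illustration.
    pub fn validate(&self) -> Result<(), String> {
        if self.width == 0 || self.height == 0 {
            return Err("resolution must be non-zero".into());
        }
        if self.fps == 0 || self.fps > 120 {
            return Err("unsupported framerate".into());
        }
        Ok(())
    }
}
```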
2. FramePool (Memory Management)
Role: DMA-BUF Allocator
Description: Manages a pool of pre-allocated DMA buffers. This enables a zero-copy architecture where image data is shared directly between the Camera ISP and the RGA hardware without expensive CPU memory copies. The pool has a fixed, bounded size: if the frame consumer (Inference) cannot keep up with the capture rate, the oldest frames are evicted. This backpressure behaviour prevents memory exhaustion under load.
Responsibilities:
- Pre-allocate DMA buffers for zero-copy operations
- Manage buffer lifecycle (allocation, deallocation, recycling)
- Provide buffer access to FrameSource and ImageProcessor
- Drop oldest frames when the pool is full (ring-buffer semantics), preventing OOM
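The ring-buffer eviction behaviour can be sketched with a plain `VecDeque` standing in for DMA-BUF handles; the real pool manages hardware buffers, so this is only a model of the backpressure policy.

```rust
use std::collections::VecDeque;

/// Illustrative bounded frame pool with ring-buffer semantics: when the
/// pool is full, the oldest frame is evicted so memory stays bounded.
/// `Vec<u8>` is a stand-in for a DMA-BUF handle.
pub struct FramePool {
    capacity: usize,
    frames: VecDeque<Vec<u8>>,
}

impl FramePool {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, frames: VecDeque::with_capacity(capacity) }
    }

    /// Push a frame, dropping the oldest one if the pool is full.
    pub fn push(&mut self, frame: Vec<u8>) {
        if self.frames.len() == self.capacity {
            self.frames.pop_front(); // evict oldest (backpressure)
        }
        self.frames.push_back(frame);
    }

    /// Hand the oldest queued frame to the consumer.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        self.frames.pop_front()
    }

    pub fn len(&self) -> usize {
        self.frames.len()
    }
}
```

If Inference stalls, the pool silently sheds the stalest frames instead of growing without bound.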
3. FrameSource (Port) & CameraAdapter
Role: Port = trait in domain; Adapter = CameraAdapter (feature-gated: see Ports & Adapters (Feature Flags) table in this section).
Description: The domain defines a FrameSource port (start/stop stream, get frames). The CameraAdapter implements it: e.g. V4L2 for Linux (MIPI CSI on Rockchip, webcam on x86) and a Mock for tests.
Responsibilities:
- Domain: define the port interface (no hardware).
- Adapters: V4L2 lifecycle, NV12 frames, error handling; Mock for CI.
- Platform selection via `#[cfg(target_arch = "...")]` or similar in `adapters/mod.rs`.
4. ImageProcessor (Port) & ImageProcessorAdapter
Role: Port = trait in domain; Adapter = ImageProcessorAdapter (feature-gated: see Ports & Adapters (Feature Flags) table in this section).
Description: The domain defines an ImageProcessor port (format conversion, resize, crop). The ImageProcessorAdapter implements it: e.g. RGA on Radxa/Rockchip SoC, image-rs on desktop and as CPU fallback on Radxa, and Mock for CI.
Responsibilities:
- Domain: port interface only.
- Adapters: NV12→RGB, resize, crop; hardware-accelerated path on target, software path on desktop and as fallback.
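As a sketch of the software path, here is a nearest-neighbour resize of a single-channel plane (e.g. the Y plane of an NV12 frame). The real adapters would use RGA on Rockchip or image-rs on desktop; this function and its name are illustrative assumptions.

```rust
/// Nearest-neighbour resize of one 8-bit plane from (sw x sh) to
/// (dw x dh). Purely a CPU-fallback illustration, not the crate's API.
pub fn resize_plane(src: &[u8], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<u8> {
    assert_eq!(src.len(), sw * sh, "source plane has wrong size");
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        let sy = y * sh / dh; // map destination row to source row
        for x in 0..dw {
            let sx = x * sw / dw; // map destination column to source column
            dst[y * dw + x] = src[sy * sw + sx];
        }
    }
    dst
}
```

On target hardware the same operation would be a single RGA job on DMA buffers, with no CPU touch of the pixels.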
5. MotionDetector (Port) & MotionAdapter
Role: Port = trait in domain; Adapter = MotionAdapter (feature-gated: see Ports & Adapters (Feature Flags) table in this section).
Description: Motion gating for power saving. The ISP can output a low-resolution thumbnail stream (e.g., 64×64) alongside the full-resolution stream. The MotionAdapter implements the MotionDetector port: hardware thumbnail diff on target, CPU frame diff on desktop, and Mock for CI. The VisionManager uses motion detection to enable full-resolution capture and inference only when needed.
Why this matters: Full-resolution capture + RGA + inference consumes significant power. When a battery-powered AI device is idle (no motion detected), the system stays in a low-power state with only the ISP thumbnail stream active. Motion triggers the full pipeline.
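The CPU frame-diff path (`vision_cpu_motion`) can be sketched as a mean-absolute-difference check over two thumbnails; the threshold value and function name are assumptions for illustration.

```rust
/// Report motion when the mean absolute pixel difference between two
/// equally sized thumbnails (e.g. 64x64 luma planes) exceeds a
/// threshold. Illustrative sketch of a CPU frame diff, not the real API.
pub fn motion_detected(prev: &[u8], curr: &[u8], threshold: u32) -> bool {
    assert_eq!(prev.len(), curr.len(), "thumbnails must match in size");
    assert!(!prev.is_empty(), "thumbnails must be non-empty");
    let sum: u32 = prev
        .iter()
        .zip(curr)
        .map(|(a, b)| a.abs_diff(*b) as u32)
        .sum();
    (sum / prev.len() as u32) > threshold
}
```

Because the check runs on a tiny thumbnail, it stays cheap enough to run continuously while the full-resolution pipeline sleeps.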
6. VisionManager (Domain)
Role: Pipeline coordinator (domain logic).
Description: The central unit in the domain. It runs the event loop and orchestrates StreamProfile, FramePool, and the ports (FrameSource, ImageProcessor, MotionDetector). It has no hardware knowledge; adapters are injected (e.g. in main.rs).
Responsibilities:
- Coordinate data flow between domain types and ports
- Event loop and resource lifecycle
- Thread safety and cleanup
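The injection pattern described above can be sketched as a manager that is generic over its ports, so adapters are supplied from outside (e.g. in main.rs). Trait and method names here are illustrative, not the crate's real API.

```rust
/// Ports the manager depends on; hardware-free trait definitions.
pub trait MotionDetector {
    fn motion_detected(&mut self) -> bool;
}
pub trait FrameSource {
    fn poll_frame(&mut self) -> Option<Vec<u8>>;
}

/// Domain coordinator: owns only port implementations, injected by the
/// composition root, and has no hardware knowledge of its own.
pub struct VisionManager<M, F> {
    motion: M,
    camera: F,
}

impl<M: MotionDetector, F: FrameSource> VisionManager<M, F> {
    pub fn new(motion: M, camera: F) -> Self {
        Self { motion, camera }
    }

    /// One event-loop tick: poll a full-resolution frame only when the
    /// motion gate fires (the power-saving behaviour described above).
    pub fn tick(&mut self) -> Option<Vec<u8>> {
        if self.motion.motion_detected() {
            self.camera.poll_frame()
        } else {
            None
        }
    }
}
```

Swapping a V4L2 camera for a mock is then just a different pair of constructor arguments, which is what makes the domain testable in CI.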
Cancellation-Safe Interface (Saga Rollback Support)
vision is a participant in Core’s Saga rollback. When the SessionManager cancels an in-progress flow (e.g., a user interrupts a visual query), it issues compensating commands to Vision. VisionManager must support clean cancellation at any point:
- `stop_stream()`: halts V4L2 capture, returns all FramePool buffers to the free-list, and releases the camera device handle. Safe to call even if streaming was never started.
- `cancel_processing()`: signals any in-flight RGA operations to abort; does not block.
Failure to support these cleanly will leave the camera device open after the Core has moved back to Idle, blocking other processes and causing a Zombie State. See Core: User-Interruption & State Rollback for the full rollback sequence.
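The key contract is that `stop_stream()` is idempotent, so a Saga rollback can always issue it unconditionally. A minimal sketch of that contract, with an explicit state flag standing in for the real V4L2 and FramePool teardown:

```rust
/// Illustrative camera handle; the state enum stands in for real
/// resources (device handle, queued DMA buffers).
#[derive(Debug, PartialEq)]
enum StreamState {
    Idle,
    Streaming,
}

pub struct Camera {
    state: StreamState,
}

impl Camera {
    pub fn new() -> Self {
        Self { state: StreamState::Idle }
    }

    pub fn start_stream(&mut self) {
        self.state = StreamState::Streaming;
    }

    /// Idempotent stop: releases resources when streaming, and is a safe
    /// no-op when streaming was never started, so rollback cannot leave
    /// the device open (the "Zombie State" described above).
    pub fn stop_stream(&mut self) {
        if self.state == StreamState::Streaming {
            // ...return buffers to the FramePool, release device handle...
            self.state = StreamState::Idle;
        }
    }

    pub fn is_streaming(&self) -> bool {
        self.state == StreamState::Streaming
    }
}
```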
Data Flow
Internal pipeline (Vision domain)
Within the Vision domain crate, frames move through a fixed sequence. This ordering is the current design and may remain hardcoded for simplicity; any future per-flow or user-configurable ordering would be an extension.
```mermaid
flowchart LR
    A["FrameSource<br/>(NV12)"] --> B["FramePool<br/>(DMA-BUF)"] --> C["ImageProcessor<br/>(RGB/RGBA)"]
```
- FrameSource (camera) retrieves ISP-processed frames (NV12).
- FramePool provides DMA buffers for zero-copy transfer.
- ImageProcessor converts NV12 to RGB/RGBA and performs any required resizing.
Processed frames are then handed out via the VisionInterface to the caller. The Vision domain does not decide what happens next.
What happens after Vision: Core flows decide
Whether processed frames are passed to Inference (or used for streaming, snapshots, or discarded) is not defined by the Vision domain. It is defined by the active flow in the Core: the flows component (FlowRunner, e.g. Voice, Stream, Chat, Interaction). The flow (and, in future, the user or agent) decides if and when to call Inference, relay frames to a client, or drop them.
So the final end-to-end flow (e.g. Vision → Inference, or Vision → stream only) is always determined by the flows in Core; Vision only exposes the internal pipeline described earlier and implements the VisionInterface.
Hardware Integration
The Vision Module leverages specific RK3588 hardware:
| Component | Hardware | Purpose |
|---|---|---|
| ISP | Image Signal Processor | Camera sensor processing, auto-exposure, white balance |
| RGA | Raster Graphics Accelerator | Hardware-accelerated format conversion and scaling |
| DMA | Direct Memory Access | Zero-copy buffer sharing between components |
Motion gating (ISP thumbnail stream for power saving) is implemented by the MotionDetector (Port) & MotionAdapter in the Internal Domain Components section.
Related Documentation
- ADR-004: Engine Architecture: High-level architecture decisions
- Inference Module: AI model execution (consumes vision frames)
- Workspace and Build: Feature flags and build configuration
- OS & Infrastructure: System layer overview
- C4 Architecture: System context and containers