RFC-0001: ORCA Runtime Blueprint
Status: Draft
Authors: Runtime Architecture Working Session
Date: 2026-05-26
Related: docs/ADR.md, docs/RUNNER_GUIDE.md, docs/ASYNC_EXECUTION.md, docs/OBSERVABILITY.md
1. Context
The current runtime already provides:
- Deterministic skill execution and DAG scheduling.
- Safety gating and trust-level checks.
- Structured observability and audit records.
- HTTP/MCP/SDK adapters.
The primary gap for production-grade agent deployment is durable operational semantics:
- Checkpoint/resume/replay/fork as first-class runtime operations.
- Side-effect-safe re-execution with explicit idempotency semantics.
- Human-in-the-loop as persisted run state (not exception-only flow).
This RFC defines the target architecture for evolving ORCA into an agent runtime platform through a clean-core migration: temporary legacy compatibility, explicit cutover, and explicit retirement.
2. Product Thesis
ORCA should operate as an execution operating system for intelligent workflows:
- Durable execution like workflow runtimes.
- Contract/version/policy rigor like data platforms.
- Developer ergonomics and deployment flow like modern platform tooling.
- Side-effect safety stronger than typical graph runtimes.
3. Architecture Planes
3.1 Execution Plane
Components:
- AsyncExecutionEngine (canonical execution core).
- Sync adapter over async core (temporary compatibility surface).
- Scheduler.
- Step lifecycle orchestrator.
Responsibilities:
- Skill and nested-skill execution.
- Control-flow semantics.
- Retry, timeout, cancellation.
- Step boundary hooks for checkpoints and side-effect recording.
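The relationship between the async core, the sync adapter, and step boundary hooks can be sketched as follows. This is a minimal illustration under stated assumptions, not the ORCA implementation: the `Step` and `StepHook` types, the `on_step_end` parameter, and the linear step list (instead of a DAG) are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Tuple

# Hypothetical types for illustration; not the real ORCA signatures.
Step = Callable[[Any], Awaitable[Any]]
StepHook = Callable[[str, Any], None]


class AsyncExecutionEngine:
    """Stand-in for the canonical async core."""

    def __init__(self, on_step_end: Optional[StepHook] = None) -> None:
        self._on_step_end = on_step_end

    async def run(self, steps: List[Tuple[str, Step]], value: Any,
                  timeout_s: float = 5.0) -> Any:
        for step_id, step in steps:
            # wait_for cancels the step and raises asyncio.TimeoutError on expiry.
            value = await asyncio.wait_for(step(value), timeout=timeout_s)
            if self._on_step_end is not None:
                # Step boundary: a checkpoint or side-effect record
                # would be written here.
                self._on_step_end(step_id, value)
        return value


class SyncAdapter:
    """Thin compatibility surface: drives the async core, owns no semantics."""

    def __init__(self, engine: AsyncExecutionEngine) -> None:
        self._engine = engine

    def run(self, steps: List[Tuple[str, Step]], value: Any) -> Any:
        # asyncio.run raises RuntimeError when called from a running event
        # loop, so this adapter serves plain synchronous callers only.
        return asyncio.run(self._engine.run(steps, value))
```

Because the adapter only forwards to the engine, retry, timeout, and hook behavior stay defined in one place, which is the property the migration relies on.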
3.2 Durability Plane
Components:
- RunStore v2.
- CheckpointStore.
- EventStore.
- ReplayEngine.
Responsibilities:
- Persist run lifecycle.
- Persist and load consistent state snapshots.
- Persist ordered runtime event stream.
- Support resume, replay, and fork workflows.
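To make the snapshot, resume, and fork responsibilities concrete, here is a small in-memory sketch. It assumes a linear list of steps that mutate a state dict; `InMemoryCheckpointStore`, `run_steps`, and `resume` are hypothetical names, and a real CheckpointStore would persist to durable storage.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

StepFn = Callable[[Dict[str, Any]], None]  # hypothetical: a step mutates state


@dataclass(frozen=True)
class Checkpoint:
    run_id: str
    step_index: int          # number of steps completed at snapshot time
    state: Dict[str, Any]


class InMemoryCheckpointStore:
    def __init__(self) -> None:
        self._by_run: Dict[str, List[Checkpoint]] = {}

    def save(self, run_id: str, step_index: int,
             state: Dict[str, Any]) -> Checkpoint:
        # Deep-copy so later mutation of live state cannot alter the snapshot.
        cp = Checkpoint(run_id, step_index, copy.deepcopy(state))
        self._by_run.setdefault(run_id, []).append(cp)
        return cp

    def latest(self, run_id: str) -> Optional[Checkpoint]:
        cps = self._by_run.get(run_id, [])
        return cps[-1] if cps else None

    def fork(self, source: Checkpoint, new_run_id: str) -> Checkpoint:
        # A fork seeds a new run from a copy of an existing snapshot.
        return self.save(new_run_id, source.step_index, source.state)


def run_steps(run_id: str, steps: List[StepFn], store: InMemoryCheckpointStore,
              state: Optional[Dict[str, Any]] = None,
              start: int = 0) -> Dict[str, Any]:
    """Run steps[start:], writing a checkpoint after each completed step."""
    state = {} if state is None else copy.deepcopy(state)
    for i in range(start, len(steps)):
        steps[i](state)
        store.save(run_id, i + 1, state)
    return state


def resume(run_id: str, steps: List[StepFn],
           store: InMemoryCheckpointStore) -> Dict[str, Any]:
    """Continue from the latest checkpoint, or from the start if none exists."""
    cp = store.latest(run_id)
    if cp is None:
        return run_steps(run_id, steps, store)
    return run_steps(run_id, steps, store, cp.state, cp.step_index)
```

In this sketch a step that fails leaves no checkpoint for itself, so `resume` re-runs only that step and the ones after it; completed steps are not repeated.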
3.3 Side-Effect Safety Plane
Components:
- SideEffectLedger.
- IdempotencyKeyResolver.
- ReplayPolicyResolver.
- Optional CompensationHandlers.
Responsibilities:
- Ensure replay does not duplicate external actions by default.
- Record request/response hashes and effect state.
- Resolve behavior for re-execution under each replay policy.
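One way the ledger, idempotency keys, and replay policies could interact is sketched below. The policy names (`REUSE_RECORDED`, `RE_EXECUTE`, `FAIL`), the key derivation, and the record fields are assumptions made for illustration; RFC-0003 is where the actual semantics are to be defined. The sketch also assumes requests and responses are JSON-serializable.

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class ReplayPolicy(Enum):  # hypothetical policy names
    REUSE_RECORDED = "reuse_recorded"  # default: return recorded response
    RE_EXECUTE = "re_execute"          # call the external system again
    FAIL = "fail"                      # refuse to replay this effect


def _sha256_json(value: Any) -> str:
    # Canonical JSON (sorted keys) so logically equal values hash identically.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def idempotency_key(run_id: str, step_id: str, request: Dict[str, Any]) -> str:
    return _sha256_json({"run": run_id, "step": step_id, "request": request})


@dataclass
class SideEffectRecord:
    key: str
    response: Any
    response_hash: str
    state: str = "committed"


class SideEffectLedger:
    def __init__(self) -> None:
        self._records: Dict[str, SideEffectRecord] = {}

    def perform(self, key: str, action: Callable[[], Any],
                policy: ReplayPolicy = ReplayPolicy.REUSE_RECORDED) -> Any:
        existing = self._records.get(key)
        if existing is not None:
            if policy is ReplayPolicy.REUSE_RECORDED:
                return existing.response  # no external call on replay
            if policy is ReplayPolicy.FAIL:
                raise RuntimeError(f"effect {key[:8]} is already recorded")
        response = action()
        self._records[key] = SideEffectRecord(key, response,
                                              _sha256_json(response))
        return response
```

Making `REUSE_RECORDED` the default mirrors the stated goal that replay does not duplicate external actions unless a policy explicitly asks for it.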
3.4 Policy and Trust Plane
Components:
- PolicyManager.
- ApprovalManager.
- TenantPolicyResolver.
Responsibilities:
- Enforce trust/safety constraints.
- Persist human approval requests and decisions.
- Resolve effective policy by environment and tenant context.
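Resolving an effective policy from environment and tenant context could look like the layered merge below. The precedence order (tenant over environment over base) and the flat key/value policy shape are assumptions of this sketch; the RFC does not fix either, and a stricter "most restrictive wins" rule is an equally plausible design.

```python
from typing import Any, Dict, Mapping, Optional


def resolve_effective_policy(
    base: Mapping[str, Any],
    env_overrides: Mapping[str, Mapping[str, Any]],
    tenant_overrides: Mapping[str, Mapping[str, Any]],
    environment: str,
    tenant_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Merge policy layers; later layers win (assumed precedence)."""
    effective: Dict[str, Any] = dict(base)
    effective.update(env_overrides.get(environment, {}))
    if tenant_id is not None:
        effective.update(tenant_overrides.get(tenant_id, {}))
    return effective


# Usage with hypothetical policy keys:
policy = resolve_effective_policy(
    base={"max_trust_level": "standard", "require_approval": False},
    env_overrides={"prod": {"require_approval": True}},
    tenant_overrides={"acme": {"max_trust_level": "restricted"}},
    environment="prod",
    tenant_id="acme",
)
```

Returning a fresh dict keeps the inputs untouched, so the resolved result can be stored with the run as its policy snapshot.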
3.5 Control Plane
Components:
- Registry/metadata integration.
- Binding activation and rollout manager.
- Promotion and environment control.
Responsibilities:
- Version snapshots per run.
- Binding rollout controls (including shadow mode later).
- Stable operational governance.
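A per-run version snapshot can be illustrated as an immutable record taken when the run starts. The field names and the `take_snapshot` helper are hypothetical; the sketch only shows the idea that later rollout changes must not alter what a run recorded.

```python
import hashlib
import json
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class RunVersionSnapshot:
    skill_version: str
    binding_versions: Mapping[str, str]  # capability name -> binding version
    policy_fingerprint: str              # hash of the effective policy


def take_snapshot(skill_version: str,
                  active_bindings: Mapping[str, str],
                  policy: Mapping[str, Any]) -> RunVersionSnapshot:
    # Hash canonical JSON so equal policies produce the same fingerprint.
    text = json.dumps(dict(policy), sort_keys=True, separators=(",", ":"))
    fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Copy, then wrap read-only: later rollout changes cannot leak in.
    frozen = MappingProxyType(dict(active_bindings))
    return RunVersionSnapshot(skill_version, frozen, fingerprint)
```

A replay or fork can then compare the stored snapshot with the currently active bindings and decide whether the run is still reproducible.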
3.6 API and Streaming Plane
Components:
- Python runtime API.
- HTTP Runtime API.
- Event streaming API.
Responsibilities:
- Expose run lifecycle operations.
- Expose checkpoint and trace navigation.
- Expose approval and replay operations.
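The event streaming surface depends on an ordered, versioned event stream that clients can read from a known position. A minimal in-memory sketch follows; the event fields, the `after_seq` cursor, and the legacy log format are assumptions for illustration, with the real contract deferred to RFC-0004.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List


@dataclass(frozen=True)
class RuntimeEvent:
    seq: int                 # store-wide, strictly increasing
    run_id: str
    event_type: str
    schema_version: int
    payload: Dict[str, Any]


class InMemoryEventStore:
    """Hypothetical stand-in for EventStore; append-only and ordered."""

    def __init__(self) -> None:
        self._events: List[RuntimeEvent] = []

    def append(self, run_id: str, event_type: str, payload: Dict[str, Any],
               schema_version: int = 1) -> RuntimeEvent:
        event = RuntimeEvent(len(self._events) + 1, run_id, event_type,
                             schema_version, dict(payload))
        self._events.append(event)
        return event

    def read(self, run_id: str, after_seq: int = 0) -> Iterator[RuntimeEvent]:
        # A streaming client reconnects by passing the last seq it saw.
        for event in self._events:
            if event.run_id == run_id and event.seq > after_seq:
                yield event


def to_legacy_log_line(event: RuntimeEvent) -> str:
    # Projection only: legacy logs are derived from events, never the source.
    return f"{event.event_type} run={event.run_id}"
```

Reading with `after_seq` gives resumable streaming without server-side session state, at the cost of clients tracking their own cursor.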
3.7 DX and Delivery Plane
Components:
- CLI runtime lifecycle commands.
- Test/eval harness.
- Replay-pack tooling.
Responsibilities:
- Shorten deploy-debug loop.
- Improve reproducibility for operators and contributors.
- Keep migration friction controlled while converging to a single clean architecture.
4. Canonical Runtime Entities
- Skill.
- Capability.
- Binding.
- Run.
- Checkpoint.
- SideEffectRecord.
- PolicySnapshot.
- Artifact.
5. Clean-Core Migration Strategy
- Define one canonical runtime core (async-first) as target architecture.
- Keep legacy runtime surfaces only during a bounded migration window.
- Introduce durability, side-effect safety, and HITL directly in canonical core.
- Keep compatibility shims thin and non-authoritative.
- Remove legacy paths after cutover gates are met.
Migration constraints:
- No indefinite dual-engine operation.
- No new feature development on legacy path after cutover start.
- Canonical semantics are owned by EventStore + RunStore v2 + CheckpointStore.
6. Known Collisions with Current Architecture
The following collisions with the current architecture are expected. Each needs an explicit decision before implementation; the proposed decision is stated with each item:
- The ADR-001 scheduler decision rejects asyncio-first scheduling, while this blueprint introduces AsyncExecutionEngine. Decision: retain current scheduler logic where possible, but execution ownership moves to the async core; sync becomes an adapter.
- docs/ASYNC_EXECUTION.md declares in-memory run store semantics and a non-durable lifecycle. Decision: preserve endpoint compatibility temporarily; migrate backend behavior to durable semantics and retire non-durable mode.
- The safety flow currently expresses human confirmation primarily through exceptions. Decision: persisted waiting_for_human is canonical; exception-only behavior remains as a temporary compatibility projection.
- Existing observability event names are log-centric and unversioned. Decision: the versioned EventStore contract is the source of truth; legacy logs remain as a projection during migration.
- Existing docs frame checkpoint.py as a state serialization utility, not a lifecycle primitive. Decision: checkpoint semantics are promoted to a lifecycle primitive while keeping the serializer format upgrade-compatible.
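The human-in-the-loop decision above (persisted state as canonical, exceptions as a projection) can be sketched as follows. Only the waiting_for_human status comes from this RFC; the other status names, `ApprovalFlow`, and `HumanConfirmationRequired` are hypothetical, and the dict stands in for a durable run store.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class RunStatus(Enum):
    RUNNING = "running"                      # assumed name
    WAITING_FOR_HUMAN = "waiting_for_human"  # from this RFC
    REJECTED = "rejected"                    # assumed name


class HumanConfirmationRequired(Exception):
    """Hypothetical stand-in for the legacy exception-based signal."""


@dataclass
class RunRecord:
    run_id: str
    status: RunStatus = RunStatus.RUNNING
    pending_reason: Optional[str] = None


class ApprovalFlow:
    def __init__(self) -> None:
        self.runs: Dict[str, RunRecord] = {}  # stand-in for a durable store

    def request_approval(self, run_id: str, reason: str) -> RunRecord:
        record = self.runs.setdefault(run_id, RunRecord(run_id))
        record.status = RunStatus.WAITING_FOR_HUMAN
        record.pending_reason = reason
        return record

    def decide(self, run_id: str, approved: bool) -> RunRecord:
        record = self.runs[run_id]
        if record.status is not RunStatus.WAITING_FOR_HUMAN:
            raise ValueError(f"run {run_id} is not waiting for a human")
        record.status = RunStatus.RUNNING if approved else RunStatus.REJECTED
        record.pending_reason = None
        return record


def legacy_request_approval(flow: ApprovalFlow, run_id: str,
                            reason: str) -> None:
    # Compatibility projection: persist the waiting state first, then raise
    # for callers that still expect the exception-based flow.
    flow.request_approval(run_id, reason)
    raise HumanConfirmationRequired(reason)
```

Persisting before raising means a process restart between the request and the decision loses nothing: the pending approval is recoverable from run state rather than from a stack frame.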
7. Non-Goals in This RFC
- Distributed scheduler redesign.
- Visual studio UI productization.
- Marketplace ecosystem expansion.
- Broad connector expansion beyond current priority set.
8. Success Criteria
- Run lifecycle supports pause/resume/replay/fork without breaking existing run/trace flows.
- Side-effect replays are safe by policy defaults.
- Every run captures reproducibility metadata (versions and policy snapshot).
- Canonical runtime core is singular (no indefinite dual path).
- Legacy path is retired after agreed cutover gates.
9. Follow-Up RFCs
- RFC-0002: Durable Execution State Machine.
- RFC-0003: Side-Effect Ledger and Replay Safety.
- RFC-0004: Runtime APIs and Event Contract.
- RFC-0005: Pseudocode and Integration Mapping.
- RFC-0006: Legacy Retirement Matrix and Cutover Gates.