RFC-0002: Durable Execution State Machine
Status: Draft
Date: 2026-05-26
Depends on: docs/rfcs/RFC-0001-ORCA-RUNTIME-BLUEPRINT.md
1. Scope
Define canonical run lifecycle semantics for durable execution:
- Run status model.
- Allowed state transitions.
- Checkpoint boundaries.
- Resume/replay/fork semantics.
2. Canonical Run Statuses
- pending
- running
- waiting_for_human
- waiting_for_signal
- replaying
- completed
- failed
- canceled
3. Transition Rules
Allowed transitions:
- pending -> running
- running -> waiting_for_human
- waiting_for_human -> running
- running -> waiting_for_signal
- waiting_for_signal -> running
- running -> replaying
- replaying -> running
- running -> completed
- running -> failed
- running -> canceled
- replaying -> failed
- replaying -> canceled
Terminal states:
- completed
- failed
- canceled
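The status list and transition rules above can be encoded as a lookup table. This is a minimal sketch rather than the runtime's actual implementation: the names `RunStatus`, `ALLOWED_TRANSITIONS`, `InvalidTransition`, `can_transition`, and `transition` are assumptions introduced for illustration.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Canonical run statuses from Section 2 (class name is hypothetical)."""
    PENDING = "pending"
    RUNNING = "running"
    WAITING_FOR_HUMAN = "waiting_for_human"
    WAITING_FOR_SIGNAL = "waiting_for_signal"
    REPLAYING = "replaying"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Allowed transitions, copied one-to-one from the list in Section 3.
ALLOWED_TRANSITIONS = {
    RunStatus.PENDING: {RunStatus.RUNNING},
    RunStatus.RUNNING: {
        RunStatus.WAITING_FOR_HUMAN,
        RunStatus.WAITING_FOR_SIGNAL,
        RunStatus.REPLAYING,
        RunStatus.COMPLETED,
        RunStatus.FAILED,
        RunStatus.CANCELED,
    },
    RunStatus.WAITING_FOR_HUMAN: {RunStatus.RUNNING},
    RunStatus.WAITING_FOR_SIGNAL: {RunStatus.RUNNING},
    RunStatus.REPLAYING: {
        RunStatus.RUNNING,
        RunStatus.FAILED,
        RunStatus.CANCELED,
    },
}

# Terminal statuses have no entry in the table, so nothing leaves them.
TERMINAL_STATUSES = frozenset(
    {RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.CANCELED}
)


class InvalidTransition(ValueError):
    """Raised when a transition is not in the allowed table."""


def can_transition(current, target):
    """Return True if current -> target appears in the allowed table."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current, target):
    """Return the new status, or raise InvalidTransition if not allowed."""
    if not can_transition(current, target):
        raise InvalidTransition(
            f"{current.value} -> {target.value} is not allowed"
        )
    return target
```

Keeping the rules in a single table means the validation logic stays the same if the transition list changes.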
4. Checkpoint Boundary Policy
Checkpoint creation is mandatory at these boundaries:
- Run started.
- Step completed (including degraded and skipped).
- Transition to waiting_for_human.
- Transition to waiting_for_signal.
- Run finished (completed, failed, canceled).
Checkpoint creation is optional at these boundaries:
- Step started.
- Retry scheduled.
Rationale: keep durable semantics strong while avoiding unnecessary persistence overhead.
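One way to express the boundary policy is a small predicate that always checkpoints at mandatory boundaries and checkpoints at optional ones only when the caller opts in. The `Boundary` enum, its member names, and `should_checkpoint` are hypothetical; the RFC does not name these.

```python
from enum import Enum


class Boundary(str, Enum):
    """Checkpoint boundaries from Section 4 (member names are assumptions)."""
    RUN_STARTED = "run_started"
    STEP_STARTED = "step_started"
    STEP_COMPLETED = "step_completed"  # includes degraded and skipped steps
    RETRY_SCHEDULED = "retry_scheduled"
    WAITING_FOR_HUMAN = "waiting_for_human"
    WAITING_FOR_SIGNAL = "waiting_for_signal"
    RUN_FINISHED = "run_finished"  # completed, failed, or canceled


MANDATORY = frozenset({
    Boundary.RUN_STARTED,
    Boundary.STEP_COMPLETED,
    Boundary.WAITING_FOR_HUMAN,
    Boundary.WAITING_FOR_SIGNAL,
    Boundary.RUN_FINISHED,
})

OPTIONAL = frozenset({Boundary.STEP_STARTED, Boundary.RETRY_SCHEDULED})


def should_checkpoint(boundary, optional_enabled=frozenset()):
    """Mandatory boundaries always checkpoint; optional ones only if enabled."""
    if boundary in MANDATORY:
        return True
    return boundary in optional_enabled
```

For example, `should_checkpoint(Boundary.STEP_STARTED)` is false by default and becomes true when `Boundary.STEP_STARTED` is passed in `optional_enabled`.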
5. RunStore v2 Shape
Minimum fields:
- run_id
- thread_id
- skill_id
- skill_version
- status
- created_at
- started_at
- finished_at
- current_step_id
- checkpoint_head
- resume_from_checkpoint_id
- trace_id
- tenant_id
- environment
- versions.capabilities
- versions.bindings
- versions.registry_ref
- policy_snapshot_id
Compatibility note: existing fields (result/error/status) remain populated during migration, but RunStore v2 is the canonical internal model.
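The minimum field set could be modeled as a dataclass. The field names come from the list above; the class names `RunRecordV2` and `Versions`, all types, and all default values are assumptions, since the RFC specifies names only.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Versions:
    """Groups the versions.* fields (types are assumptions)."""
    capabilities: dict = field(default_factory=dict)
    bindings: dict = field(default_factory=dict)
    registry_ref: Optional[str] = None


@dataclass
class RunRecordV2:
    """Hypothetical record holding the minimum RunStore v2 fields."""
    run_id: str
    thread_id: str
    skill_id: str
    skill_version: str
    tenant_id: str
    environment: str
    trace_id: str
    status: str = "pending"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    current_step_id: Optional[str] = None
    checkpoint_head: Optional[str] = None
    resume_from_checkpoint_id: Optional[str] = None
    policy_snapshot_id: Optional[str] = None
    versions: Versions = field(default_factory=Versions)


record = RunRecordV2(
    run_id="run-1",
    thread_id="thread-1",
    skill_id="example.skill",
    skill_version="1.0.0",
    tenant_id="tenant-a",
    environment="dev",
    trace_id="trace-1",
)
# asdict() flattens nested dataclasses into plain dicts for serialization.
as_dict = asdict(record)
```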
6. Resume Semantics
Resume preconditions:
- run.status in {waiting_for_human, waiting_for_signal, failed, canceled}.
- checkpoint_head exists.
- policy re-evaluation succeeds unless forced recovery mode is explicitly enabled.
Resume behavior:
- Load checkpoint_head snapshot into execution state.
- Transition run to running.
- Emit run.resumed event.
- Continue scheduler from next eligible step boundary.
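The preconditions and resume steps above can be sketched as a single function. It follows the precondition list as written, including `failed` and `canceled` as resumable statuses. The `Run` dataclass, the in-memory `checkpoints` mapping, the `policy_ok` callback, and the `forced_recovery` flag are all assumptions made for illustration; continuing the scheduler is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Optional

# Statuses from which resume is permitted, per the preconditions above.
RESUMABLE = frozenset(
    {"waiting_for_human", "waiting_for_signal", "failed", "canceled"}
)


class ResumeError(RuntimeError):
    """Raised when a resume precondition is not met."""


@dataclass
class Run:
    """Minimal stand-in for a run record (hypothetical shape)."""
    run_id: str
    status: str
    checkpoint_head: Optional[str] = None
    state: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def resume(run, checkpoints, policy_ok, forced_recovery=False):
    """Check preconditions, load the head checkpoint, and mark the run running."""
    if run.status not in RESUMABLE:
        raise ResumeError(f"cannot resume from status {run.status!r}")
    if run.checkpoint_head is None or run.checkpoint_head not in checkpoints:
        raise ResumeError("checkpoint_head does not exist")
    # Policy is re-evaluated unless forced recovery is explicitly enabled.
    if not forced_recovery and not policy_ok(run):
        raise ResumeError("policy re-evaluation failed")

    # Copy the snapshot so later mutations do not alter the stored checkpoint.
    run.state = dict(checkpoints[run.checkpoint_head])
    run.status = "running"
    run.events.append({
        "type": "run.resumed",
        "run_id": run.run_id,
        "checkpoint_id": run.checkpoint_head,
    })
    return run
```

A caller would then hand the returned run to the scheduler, which continues from the next eligible step boundary.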
7. Replay Semantics
Replay behavior:
- Build a new run with source_run_id and replay metadata.
- Start from selected checkpoint or step boundary.
- Execute under replay policy for side effects.
- Record replay lineage (source run and source checkpoint).
Replay is never in-place mutation of an existing run.
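A minimal sketch of replay as "new run plus lineage", leaving the source run untouched. The `Run` and `ReplayLineage` shapes, the `start_replay` function, the `"use_recorded_results"` policy label, and the choice to start the new run in `pending` are all assumptions; the RFC fixes only that a new run is built and that lineage is recorded.

```python
import copy
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ReplayLineage:
    """Records where a replay came from (hypothetical shape)."""
    source_run_id: str
    source_checkpoint_id: str


@dataclass
class Run:
    """Minimal stand-in for a run record (hypothetical shape)."""
    run_id: str
    status: str
    state: dict = field(default_factory=dict)
    lineage: Optional[ReplayLineage] = None
    replay_policy: Optional[str] = None


def start_replay(source, checkpoint_id, checkpoints,
                 replay_policy="use_recorded_results"):
    """Build a new run from a source checkpoint without mutating the source."""
    if checkpoint_id not in checkpoints:
        raise KeyError(f"unknown checkpoint {checkpoint_id!r}")
    return Run(
        run_id=str(uuid.uuid4()),
        status="pending",
        # Deep copy so the replay cannot modify the stored checkpoint.
        state=copy.deepcopy(checkpoints[checkpoint_id]),
        lineage=ReplayLineage(source.run_id, checkpoint_id),
        replay_policy=replay_policy,
    )
```

The deep copy matters: it keeps the replay's state independent of the source checkpoint, which is what makes the "never in-place" rule hold in practice.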
8. Fork Semantics
Fork behavior:
- Clone a run from a selected checkpoint.
- Assign new run_id.
- Preserve source linkage metadata.
- Allow different runtime options/policies unless restricted by environment policy.
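The fork steps above might look like the following sketch, which merges option overrides into the source run's options and rejects any override the environment policy has locked. The dict-based run shape, the `fork_run` function, the `forked_from` key, and the `locked_options` parameter are hypothetical.

```python
import copy
import uuid


def fork_run(source_run, checkpoint_id, checkpoint_state,
             overrides=None, locked_options=frozenset()):
    """Clone a run from a checkpoint under a new run_id, keeping source linkage."""
    overrides = overrides or {}
    # Environment policy may restrict which options a fork is allowed to change.
    blocked = sorted(set(overrides) & set(locked_options))
    if blocked:
        raise PermissionError(f"environment policy locks options: {blocked}")
    return {
        "run_id": str(uuid.uuid4()),
        "status": "pending",
        "state": copy.deepcopy(checkpoint_state),
        # Overrides win over the options inherited from the source run.
        "options": {**source_run.get("options", {}), **overrides},
        "forked_from": {
            "run_id": source_run["run_id"],
            "checkpoint_id": checkpoint_id,
        },
    }
```

For instance, forking with `overrides={"timeout": 60}` keeps every other inherited option, while the same call with `locked_options={"timeout"}` raises `PermissionError`.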
9. Collision Notes and Decision Proposals
- The current run status model in docs/ASYNC_EXECUTION.md supports only running/completed/failed. Decision: the canonical runtime status model is adopted now; the legacy response shape becomes a temporary projection layer.
- The current run_store.py stores minimal execution metadata. Decision: RunStoreV2 becomes the authoritative run state. Existing RunStore methods remain as compatibility wrappers only.
- The current checkpoint.py serializes state but has no lifecycle orchestration. Decision: CheckpointManager becomes a mandatory lifecycle component and wraps the existing serializer with a versioned upgrade path.
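To illustrate what a projection onto the legacy response shape could look like, the sketch below collapses the eight canonical statuses onto the three legacy ones. The specific mapping is an assumption and is not defined by this RFC (in particular, reporting `canceled` as `failed` and every non-terminal status as `running`); `LEGACY_STATUS` and `project_legacy` are hypothetical names.

```python
# Hypothetical mapping from canonical statuses to the legacy three-value model.
LEGACY_STATUS = {
    "pending": "running",
    "running": "running",
    "waiting_for_human": "running",
    "waiting_for_signal": "running",
    "replaying": "running",
    "completed": "completed",
    "failed": "failed",
    "canceled": "failed",
}


def project_legacy(run):
    """Project a canonical run dict onto the legacy result/error/status shape."""
    return {
        "status": LEGACY_STATUS[run["status"]],
        "result": run.get("result"),
        "error": run.get("error"),
    }
```

Because the projection is read-only, the canonical record stays authoritative and the legacy shape can be dropped without data migration.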
10. Legacy Compatibility Window
- Compatibility layer remains only for migration releases.
- New run lifecycle features are not implemented on legacy-only structures.
- Retirement criteria are defined in RFC-0006.
11. Validation Scenarios
- Pause for confirmation and resume to completion.
- Replay from step boundary using recorded side-effect results.
- Fork from checkpoint and run with modified non-breaking options.
- Canonical run status is exposed consistently across Python and HTTP runtime APIs.