DAG Scheduler
This document describes the step scheduler used by the execution engine.
Overview
The scheduler (runtime/scheduler.py) replaces the original sequential step loop
with a DAG-based execution model. Steps may run in parallel when their dependencies
allow it.
The scheduler is backward-compatible: existing skills that do not declare
depends_on continue to execute sequentially with identical semantics.
Dependency Rules
- If a step declares config.depends_on: [step_a, step_b], those are its explicit dependencies. The step runs only after all listed steps complete.
- If a step declares config.depends_on: [] (explicit empty list), it has no dependencies and may run in parallel with other independent steps.
- If a step does NOT declare depends_on at all, it implicitly depends on the immediately preceding step in declared order. This preserves sequential semantics for all existing skills without any changes.
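The three rules above can be sketched as a small resolver. This is a minimal illustration, not the code in runtime/scheduler.py: the function name resolve_dependencies and the dict-based step shape are hypothetical, and it assumes that the first step, having no predecessor, gets an empty dependency list.

```python
from __future__ import annotations

from typing import Any


def resolve_dependencies(steps: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each step id to the ids it waits for (hypothetical helper)."""
    deps: dict[str, list[str]] = {}
    previous_id: str | None = None
    for step in steps:
        config = step.get("config") or {}
        if "depends_on" in config:
            # Rules 1 and 2: an explicit list, possibly empty, is used as-is.
            deps[step["id"]] = list(config["depends_on"])
        else:
            # Rule 3: no declaration means "wait for the preceding step".
            # Assumption: the first step has no predecessor, so no dependencies.
            deps[step["id"]] = [previous_id] if previous_id is not None else []
        previous_id = step["id"]
    return deps


steps = [
    {"id": "step_a"},
    {"id": "step_b"},                                # implicit: step_a
    {"id": "step_c", "config": {"depends_on": []}},  # explicit: none
    {"id": "step_d", "config": {"depends_on": ["step_a", "step_c"]}},
]
resolved = resolve_dependencies(steps)
```

Here step_b inherits a dependency on step_a, while step_c opts out of the implicit rule by declaring an empty list.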
Examples
Sequential (default — no depends_on declared):
steps:
  - id: step_a
    uses: capability.one
  - id: step_b
    uses: capability.two  # implicitly depends on step_a
  - id: step_c
    uses: capability.three  # implicitly depends on step_b
Execution order: step_a → step_b → step_c
Parallel (explicit empty deps):
steps:
  - id: fetch_a
    uses: web.page.fetch
    config:
      depends_on: []  # no dependencies
  - id: fetch_b
    uses: web.page.fetch
    config:
      depends_on: []  # no dependencies
  - id: combine
    uses: text.content.template
    config:
      depends_on: [fetch_a, fetch_b]  # waits for both
Execution order: fetch_a ∥ fetch_b → combine
Mixed (explicit deps on prior step):
steps:
  - id: validate_events
    uses: data.schema.validate
  - id: analyze_trace
    uses: ops.trace.analyze
    config:
      depends_on: [validate_events]
  - id: monitor_trace
    uses: ops.trace.monitor
    config:
      depends_on: [analyze_trace]
Execution order: validate_events → analyze_trace → monitor_trace
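The execution orders listed in the examples above can be reproduced by grouping steps into waves, where every step in a wave has all of its dependencies in earlier waves. This is only an illustrative sketch: execution_waves is a hypothetical helper, and the real scheduler may dispatch each step as soon as its own dependencies finish rather than wave by wave.

```python
from __future__ import annotations


def execution_waves(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group step ids into waves of steps whose dependencies are all done."""
    done: set[str] = set()
    waves: list[list[str]] = []
    while len(done) < len(deps):
        ready = [
            step for step in deps
            if step not in done and all(d in done for d in deps[step])
        ]
        if not ready:
            # Nothing can make progress: a cycle or an unknown dependency.
            raise RuntimeError("unresolvable dependencies")
        waves.append(ready)
        done.update(ready)
    return waves


parallel = {"fetch_a": [], "fetch_b": [], "combine": ["fetch_a", "fetch_b"]}
mixed = {
    "validate_events": [],
    "analyze_trace": ["validate_events"],
    "monitor_trace": ["analyze_trace"],
}
print(execution_waves(parallel))  # [['fetch_a', 'fetch_b'], ['combine']]
print(execution_waves(mixed))
```

Steps that share a wave are the ones that may run in parallel; a chain such as the mixed example produces one step per wave.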
Thread Safety
The scheduler uses ThreadPoolExecutor for parallel step execution.
All mutations to shared ExecutionState (vars, outputs, events, working,
output, extensions, trace) are serialized through _StateLock, which is
attached to the execution context.
Key invariant: capability execution (LLM calls, HTTP requests, etc.) runs outside the lock. Only state mutations are serialized.
CognitiveState v1 namespaces (working, output, extensions, trace) follow the same thread-safety guarantees as legacy namespaces. See docs/COGNITIVE_STATE_V1.md.
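The key invariant can be illustrated with the standard library alone. In this sketch a plain threading.Lock stands in for _StateLock and a dict stands in for ExecutionState; slow_capability and run_step are hypothetical names, not part of the engine.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

state_lock = threading.Lock()  # stand-in for _StateLock (assumption)
outputs = {}                   # stand-in for shared ExecutionState (assumption)


def slow_capability(step_id):
    """Simulate an LLM call or HTTP request."""
    time.sleep(0.01)
    return f"result of {step_id}"


def run_step(step_id):
    # The slow work runs outside the lock, so steps overlap in time.
    result = slow_capability(step_id)
    # Only the mutation of shared state is serialized.
    with state_lock:
        outputs[step_id] = result


with ThreadPoolExecutor(max_workers=8) as pool:
    # list() drains the iterator so any worker exception is re-raised here.
    list(pool.map(run_step, ["fetch_a", "fetch_b", "fetch_c"]))
```

Holding the lock only around the write keeps contention low: with the lock held during the capability call, the three steps would effectively run one after another.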
Failure Handling
- If a step fails and fail_fast=True (default), all pending futures are cancelled and execution returns immediately.
- If a step fails and fail_fast=False, steps that depend on the failed step are marked as skipped with error_message: "Skipped: dependency failed". Independent steps continue executing.
- Circular or unsatisfied dependencies raise a RuntimeError with a deadlock message listing the unresolved steps.
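The fail_fast=False behavior and the deadlock check can be sketched with a single-threaded simulation. Everything here is illustrative: simulate is a hypothetical function, the status strings are placeholders, and the sketch assumes that skipping propagates transitively (a step that depends on a skipped step is also skipped), which the rules above do not state explicitly.

```python
from __future__ import annotations


def simulate(deps: dict[str, list[str]], failing: set[str]) -> dict[str, str]:
    """Return a status per step under fail_fast=False semantics (sketch)."""
    status: dict[str, str] = {}
    pending = list(deps)
    while pending:
        progressed = False
        for step in list(pending):
            dep_status = [status.get(d) for d in deps[step]]
            if any(s is None for s in dep_status):
                continue  # a dependency has not been resolved yet
            if any(s in ("failed", "skipped") for s in dep_status):
                # Assumption: skips propagate to transitive dependents.
                status[step] = "skipped"
            else:
                status[step] = "failed" if step in failing else "completed"
            pending.remove(step)
            progressed = True
        if not progressed:
            # No step became runnable: circular or unsatisfied dependencies.
            raise RuntimeError(f"Deadlock: unresolved steps {pending}")
    return status


result = simulate(
    {"a": [], "b": ["a"], "c": [], "d": ["b"]},
    failing={"a"},
)
```

In this run, a fails, b and d are skipped because they sit downstream of a, and the independent step c still completes.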
Validation
The scheduler validates that all depends_on references point to existing
step IDs. An InvalidSkillSpecError is raised for unknown step references.
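A reference check of this kind might look like the following sketch. The InvalidSkillSpecError class is redefined locally as a stand-in because its real import path is not given here, and validate_depends_on is a hypothetical name.

```python
from __future__ import annotations


class InvalidSkillSpecError(Exception):
    """Local stand-in for the project's exception class (assumption)."""


def validate_depends_on(steps: list[dict]) -> None:
    """Raise if any depends_on entry names a step id that does not exist."""
    known = {step["id"] for step in steps}
    for step in steps:
        config = step.get("config") or {}
        for ref in config.get("depends_on", []):
            if ref not in known:
                raise InvalidSkillSpecError(
                    f"step '{step['id']}' depends on unknown step '{ref}'"
                )


valid_steps = [
    {"id": "fetch_a", "config": {"depends_on": []}},
    {"id": "combine", "config": {"depends_on": ["fetch_a"]}},
]
validate_depends_on(valid_steps)  # passes silently
```

Running the check before any step is submitted means a misspelled step ID surfaces as a spec error up front instead of as a deadlock at run time.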
Configuration
- max_workers: maximum parallel threads (default: 8).
- The scheduler is instantiated by ExecutionEngine and used automatically for all skill executions.
Testing
- runtime/test_scheduler_functional.py: 5 functional tests covering sequential, parallel, mixed, and single-step scenarios.
- runtime/test_scheduler_stress.py: 5 stress tests covering fan-out, deep chains, diamond patterns, and concurrent safety.
Related Docs
- RUNNER_GUIDE.md: end-to-end execution flow
- SKILL_FORMAT.md: config.depends_on field spec
- OBSERVABILITY.md: step-level event tracing