Runner Guide¶

This document explains how the runtime runner works, from skill input to final outputs.

1) What the Runner Is¶

The runner is the execution subsystem in runtime/ that:

loads skill and capability definitions
builds an execution plan
resolves and executes each step
maps capability/service responses back into skill outputs
emits execution events and structured logs

Primary entrypoint for manual usage is cli/main.py.

2) End-to-End Flow¶

Execution path (high-level):

CLI builds ExecutionRequest
ExecutionEngine loads skill and initializes ExecutionState
Scheduler builds DAG from step config.depends_on declarations
Scheduler dispatches steps (parallel when dependencies allow, sequential by default):
InputMapper resolves step input from inputs/vars/outputs refs
If uses starts with skill:, NestedSkillRunner executes recursively
Else CapabilityExecutor delegates to BindingExecutor
BindingExecutor pipeline:
BindingResolver selects binding
ServiceResolver resolves service descriptor
RequestBuilder builds protocol payload from input.* template
ProtocolRouter dispatches to protocol invoker
ResponseMapper maps response.* into capability outputs
OutputMapper writes step outputs to vars. and outputs.
ExecutionEngine validates required final outputs
SkillExecutionResult is returned

3) Core Runtime Modules¶

State and model layer:

runtime/models.py: typed runtime data contracts (includes CognitiveState v1 structures)
runtime/execution_state.py: mutable execution state + runtime events
runtime/errors.py: typed runtime exceptions

Planning and orchestration:

runtime/skill_loader.py: loads and normalizes skill specs
runtime/capability_loader.py: loads and normalizes capability specs
runtime/execution_planner.py: prepares step order
runtime/scheduler.py: DAG-based step scheduler (parallel/sequential)
runtime/execution_engine.py: orchestrates whole run
runtime/nested_skill_runner.py: executes skill: steps

Step input/output mapping:

runtime/reference_resolver.py: resolves data references (7 namespaces with path traversal)
runtime/input_mapper.py: materializes step input
runtime/output_mapper.py: writes produced values to runtime targets (5 writable namespaces, 4 merge strategies)

Binding execution layer:

runtime/binding_registry.py: loads services, bindings, defaults
runtime/active_binding_map.py: active override map
runtime/binding_resolver.py: chooses effective binding
runtime/service_resolver.py: resolves service descriptor
runtime/request_builder.py: builds invocation payload
runtime/protocol_router.py: routes by protocol kind
runtime/response_mapper.py: maps invocation response to capability output
runtime/binding_executor.py: full binding execution pipeline
runtime/capability_executor.py: runtime adapter around binding executor

Protocol invokers:

runtime/pythoncall_invoker.py
runtime/openapi_invoker.py
runtime/openrpc_invoker.py
runtime/mcp_invoker.py

OpenAPI invoker runtime knobs (metadata-driven):

binding metadata:
method (default POST)
timeout_seconds (overrides service/default timeout)
headers (string-to-string map merged over service headers)
response_mode (json default, text, or raw)
service metadata:
timeout_seconds (used when binding does not override)
headers (base header map)

Observability:

runtime/observability.py: structured logs, trace context, redaction

4) CognitiveState v1¶

ExecutionState now includes four cognitive blocks (frame, working, output, trace) plus an extensions namespace, enabling structured multi-step reasoning without breaking the legacy vars/outputs pipeline.

Key additions:

FrameState: immutable reasoning context (goal, constraints, success_criteria)
WorkingState: mutable working memory with 10 typed cognitive slots
OutputState: structured result metadata (result_type, summary, status_reason)
TraceState: per-step data lineage (reads/writes) and live aggregate metrics
extensions: open namespace for plugins

Reference resolution now supports 7 namespaces with path traversal through dataclass attributes, dict keys, and list indices:

inputs.*, vars.*, outputs.*, frame.*, working.*, output.*, extensions.*

Output mapping supports 5 writable namespaces with 4 merge strategies (overwrite, append, deep_merge, replace).

All features are backward-compatible. Existing skills are unaffected.

Full reference: docs/COGNITIVE_STATE_V1.md

5) Trace and Events¶

Two complementary tracing surfaces exist:

Runtime events (in ExecutionState.events)
event type/message/timestamp/step_id/trace_id/data
Structured logs (JSON lines)
skill/step/capability lifecycle
service lifecycle for critical services
correlation through trace_id
CognitiveState v1 trace (in ExecutionState.trace)
TraceStep per step: step_id, capability_id, status, reads, writes, latency_ms
TraceMetrics aggregate: step_count, llm_calls, tool_calls, tokens_in/out, elapsed_ms
Automatically populated by the execution engine

Trace propagation:

trace_id can be passed in ExecutionRequest
CLI supports --trace-id in run and trace commands
nested skills inherit parent trace_id

6) How to Run¶

Basic execution:

python cli/main.py run
python cli/main.py run --input "{\"key\":\"value\"}"
python cli/main.py run --input-file input.json

Execution with trace correlation:

python cli/main.py run --trace-id trace-001
python cli/main.py trace --trace-id trace-001

System checks:

python cli/main.py doctor

OpenAPI checks from CLI:

python cli/main.py openapi verify-bindings --all
python cli/main.py openapi verify-bindings --scenario tooling/openapi_scenarios/data.schema.validate.mock.json
python cli/main.py openapi verify-invoker
python cli/main.py openapi verify-errors

7) Validation and Health Commands¶

Contracts:

python tooling/test_capability_contracts.py

Smoke:

python tooling/verify_smoke_capabilities.py --report-file artifacts/smoke_report.json

Coverage and consistency:

python tooling/compute_runtime_coverage.py
python tooling/compute_runtime_stats.py
python tooling/compute_skill_executability.py

Registry side:

python ../agent-skill-registry/tools/validate_registry.py
python ../agent-skill-registry/tools/generate_catalog.py
python ../agent-skill-registry/tools/registry_stats.py

8) Failure Model (Practical)¶

Common failure categories:

Input mapping errors: missing input.* fields required by a step
Binding resolution errors: no binding or invalid default selection
Service resolution errors: binding points to missing/invalid service
Request/response mapping errors: template points to missing fields
Protocol routing/invocation errors: unsupported or failing protocol path
Final output validation errors: required skill outputs not produced

Debug order that works well:

Re-run with trace command and fixed trace_id
Inspect structured logs filtered by trace_id
Inspect failing step mapping in skill yaml
Inspect binding request/response templates
Confirm service implementation output shape

9) Configuration Surfaces¶

Repository-level:

bindings/official/
services/official/
policies/official_default_selection.yaml

Host-level overrides (.agent-skills):

services.yaml
bindings/local/
bindings/candidate/
active_bindings.json
overrides.yaml

10) Design Constraints¶

Current runner behavior intentionally keeps:

DAG-based step scheduling with backward-compatible implicit sequential deps
explicit depends_on: [] to opt into parallel execution
explicit mapping instead of implicit field matching
strict response mapping (missing fields fail fast)
protocol abstraction via invoker routing
thread-safe state mutations via _StateLock during parallel execution

See docs/SCHEDULER.md for full scheduler documentation.

11) Current Baseline¶

As documented in docs/PROJECT_STATUS.md, runner baseline is currently stable with:

45/45 contract pass
8/8 smoke pass
full capability coverage and skill executability (45/45 capabilities, 36/36 skills)
DAG scheduler functional tests: 5/5
DAG scheduler stress tests: 5/5
CognitiveState v1 regression tests: 86/86
CognitiveState v1 integration tests: 99/99

This is the recommended baseline before starting MCP/OpenAPI adapter expansion.

12) Safety Enforcement¶

Capabilities that declare a safety block are subject to runtime enforcement during step execution. The enforcement runs inside _execute_step() and is transparent to the rest of the pipeline.

Enforcement sequence¶

Before capability execution:

Trust-level check — The capability's safety.trust_level rank is compared against ExecutionOptions.trust_level. If the context rank is lower, a SafetyTrustLevelError is raised. Ranks: sandbox=0, standard=1, elevated=2, privileged=3.
Confirmation check — If safety.requires_confirmation is true and the capability id is not in ExecutionOptions.confirmed_capabilities, a SafetyConfirmationRequiredError is raised.
Mandatory pre-gates — Each gate in safety.mandatory_pre_gates is executed as a capability. If the gate returns {"allowed": false}, the on_fail policy determines behavior:
block — raises SafetyGateFailedError (default)
warn — emits a safety_gate_warning event and continues
degrade — returns a degraded StepResult (step skipped)
require_human — raises SafetyConfirmationRequiredError

After capability execution:

Mandatory post-gates — Same mechanism as pre-gates, but receives the produced output as input.

Capabilities without safety¶

Capabilities that do not declare a safety block are executed without any additional checks. The enforcement is zero-cost for the ~80% of capabilities that have no side effects.

ExecutionOptions safety fields¶

trust_level: str — Runtime trust level for the execution context (default: "standard").
confirmed_capabilities: frozenset[str] — Set of capability ids pre-confirmed by the caller, bypassing requires_confirmation.

Error types¶

SafetyTrustLevelError — trust level insufficient
SafetyGateFailedError — a mandatory gate blocked execution
SafetyConfirmationRequiredError — human confirmation required

All three inherit from RuntimeErrorBase and carry capability_id and step_id context.