
Runner Guide

This document explains how the runtime runner works, from skill input to final outputs.

1) What the Runner Is

The runner is the execution subsystem in runtime/ that:

  • loads skill and capability definitions
  • builds an execution plan
  • resolves and executes each step
  • maps capability/service responses back into skill outputs
  • emits execution events and structured logs

The primary entrypoint for manual use is cli/main.py.

2) End-to-End Flow

Execution path (high-level):

  1. CLI builds ExecutionRequest
  2. ExecutionEngine loads skill and initializes ExecutionState
  3. Scheduler builds DAG from step config.depends_on declarations
  4. Scheduler dispatches steps (parallel when dependencies allow, sequential by default):
       • InputMapper resolves step input from inputs/vars/outputs refs
       • If uses starts with skill:, NestedSkillRunner executes recursively
       • Else CapabilityExecutor delegates to BindingExecutor
       • BindingExecutor pipeline:
           • BindingResolver selects binding
           • ServiceResolver resolves service descriptor
           • RequestBuilder builds protocol payload from input.* template
           • ProtocolRouter dispatches to protocol invoker
           • ResponseMapper maps response.* into capability outputs
       • OutputMapper writes step outputs to vars.* and outputs.*
  5. ExecutionEngine validates required final outputs
  6. SkillExecutionResult is returned
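
The request-building and response-mapping steps of the BindingExecutor pipeline can be illustrated with a minimal sketch. It assumes a template is a flat mapping from a target field to a dotted input.* or response.* reference; the real template format, and the names MappingError, lookup, build_request, map_response, and fake_invoker, are hypothetical and not taken from the runtime.

```python
from typing import Any, Mapping


class MappingError(Exception):
    """Raised when a template references a field that is not present."""


def lookup(root: Mapping[str, Any], ref: str) -> Any:
    """Walk a dotted reference such as 'input.document' through nested dicts."""
    current: Any = root
    for part in ref.split("."):
        if not isinstance(current, Mapping) or part not in current:
            raise MappingError(f"reference '{ref}' not found at '{part}'")
        current = current[part]
    return current


def build_request(template: Mapping[str, str], step_input: Mapping[str, Any]) -> dict:
    """Materialize a protocol payload from a template of input.* references."""
    root = {"input": step_input}
    return {field: lookup(root, ref) for field, ref in template.items()}


def map_response(template: Mapping[str, str], response: Mapping[str, Any]) -> dict:
    """Map response.* references into capability outputs, failing on missing fields."""
    root = {"response": response}
    return {field: lookup(root, ref) for field, ref in template.items()}


def fake_invoker(payload: dict) -> dict:
    """Stand-in for a protocol invoker; returns a nested response."""
    return {"result": {"summary": payload["text"].upper()}}


payload = build_request({"text": "input.document"}, {"document": "hello"})
outputs = map_response({"summary": "response.result.summary"}, fake_invoker(payload))
print(outputs)
```

In this sketch a missing field raises immediately instead of producing a partial payload, which mirrors the fail-fast mapping behavior this guide describes.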

3) Core Runtime Modules

State and model layer:

  • runtime/models.py: typed runtime data contracts (includes CognitiveState v1 structures)
  • runtime/execution_state.py: mutable execution state + runtime events
  • runtime/errors.py: typed runtime exceptions

Planning and orchestration:

  • runtime/skill_loader.py: loads and normalizes skill specs
  • runtime/capability_loader.py: loads and normalizes capability specs
  • runtime/execution_planner.py: prepares step order
  • runtime/scheduler.py: DAG-based step scheduler (parallel/sequential)
  • runtime/execution_engine.py: orchestrates whole run
  • runtime/nested_skill_runner.py: executes skill: steps

Step input/output mapping:

  • runtime/reference_resolver.py: resolves data references (7 namespaces with path traversal)
  • runtime/input_mapper.py: materializes step input
  • runtime/output_mapper.py: writes produced values to runtime targets (5 writable namespaces, 4 merge strategies)

Binding execution layer:

  • runtime/binding_registry.py: loads services, bindings, defaults
  • runtime/active_binding_map.py: active override map
  • runtime/binding_resolver.py: chooses effective binding
  • runtime/service_resolver.py: resolves service descriptor
  • runtime/request_builder.py: builds invocation payload
  • runtime/protocol_router.py: routes by protocol kind
  • runtime/response_mapper.py: maps invocation response to capability output
  • runtime/binding_executor.py: full binding execution pipeline
  • runtime/capability_executor.py: runtime adapter around binding executor

Protocol invokers:

  • runtime/pythoncall_invoker.py
  • runtime/openapi_invoker.py
  • runtime/openrpc_invoker.py
  • runtime/mcp_invoker.py

OpenAPI invoker runtime knobs (metadata-driven):

  • binding metadata:
      • method (default POST)
      • timeout_seconds (overrides service/default timeout)
      • headers (string-to-string map merged over service headers)
      • response_mode (json default, text, or raw)
  • service metadata:
      • timeout_seconds (used when binding does not override)
      • headers (base header map)
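
The precedence rules for these knobs can be sketched as a small pure function. This is an illustration only: effective_openapi_settings and FALLBACK_TIMEOUT_SECONDS are hypothetical names, and the 30-second fallback is an assumed value because this guide does not state the runtime's default timeout.

```python
from typing import Any, Mapping

# Hypothetical fallback; the runtime's actual default timeout is not stated in this guide.
FALLBACK_TIMEOUT_SECONDS = 30.0
ALLOWED_RESPONSE_MODES = {"json", "text", "raw"}


def effective_openapi_settings(
    service_meta: Mapping[str, Any], binding_meta: Mapping[str, Any]
) -> dict:
    """Combine service and binding metadata; binding values take precedence."""
    response_mode = binding_meta.get("response_mode", "json")
    if response_mode not in ALLOWED_RESPONSE_MODES:
        raise ValueError(f"unsupported response_mode: {response_mode!r}")
    timeout = binding_meta.get(
        "timeout_seconds",
        service_meta.get("timeout_seconds", FALLBACK_TIMEOUT_SECONDS),
    )
    return {
        "method": str(binding_meta.get("method", "POST")).upper(),
        "timeout_seconds": timeout,
        # Binding headers are merged over the service's base headers.
        "headers": {**service_meta.get("headers", {}), **binding_meta.get("headers", {})},
        "response_mode": response_mode,
    }


service = {"timeout_seconds": 10, "headers": {"Accept": "application/json", "X-Env": "dev"}}
binding = {"headers": {"X-Env": "prod"}, "response_mode": "text"}
settings = effective_openapi_settings(service, binding)
print(settings)
```

Here the binding overrides X-Env but inherits Accept and the service timeout, and method falls back to POST because the binding does not set it.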

Observability:

  • runtime/observability.py: structured logs, trace context, redaction

4) CognitiveState v1

ExecutionState now includes four cognitive blocks (frame, working, output, trace) plus an extensions namespace, enabling structured multi-step reasoning without breaking the legacy vars/outputs pipeline.

Key additions:

  • FrameState: immutable reasoning context (goal, constraints, success_criteria)
  • WorkingState: mutable working memory with 10 typed cognitive slots
  • OutputState: structured result metadata (result_type, summary, status_reason)
  • TraceState: per-step data lineage (reads/writes) and live aggregate metrics
  • extensions: open namespace for plugins

Reference resolution now supports 7 namespaces with path traversal through dataclass attributes, dict keys, and list indices:

inputs.*, vars.*, outputs.*, frame.*, working.*, output.*, extensions.*
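
Path traversal over these namespaces might look like the following sketch. It is not the code in runtime/reference_resolver.py: ToyState, the simplified FrameState, ReferenceResolutionError, and resolve are hypothetical, and addressing list items with a numeric path segment (frame.constraints.1) is an assumed syntax.

```python
from dataclasses import dataclass, field, is_dataclass
from typing import Any

NAMESPACES = ("inputs", "vars", "outputs", "frame", "working", "output", "extensions")


class ReferenceResolutionError(Exception):
    """Hypothetical error for references that cannot be resolved."""


@dataclass
class FrameState:
    goal: str = ""
    constraints: list = field(default_factory=list)


@dataclass
class ToyState:
    """Simplified stand-in for ExecutionState with one attribute per namespace."""
    inputs: dict = field(default_factory=dict)
    vars: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    frame: FrameState = field(default_factory=FrameState)
    working: dict = field(default_factory=dict)
    output: dict = field(default_factory=dict)
    extensions: dict = field(default_factory=dict)


def resolve(state: Any, ref: str) -> Any:
    """Resolve 'namespace.path.to.value' via dict keys, list indices, and dataclass attributes."""
    namespace, *path = ref.split(".")
    if namespace not in NAMESPACES:
        raise ReferenceResolutionError(f"unknown namespace in '{ref}'")
    current = getattr(state, namespace)
    for part in path:
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, (list, tuple)):
            try:
                current = current[int(part)]
            except (ValueError, IndexError):
                raise ReferenceResolutionError(f"bad list index '{part}' in '{ref}'") from None
        elif is_dataclass(current) and hasattr(current, part):
            current = getattr(current, part)
        else:
            raise ReferenceResolutionError(f"'{part}' not found while resolving '{ref}'")
    return current


state = ToyState(
    vars={"doc": {"title": "Runner Guide"}},
    frame=FrameState(goal="summarize", constraints=["short", "neutral"]),
)
print(resolve(state, "vars.doc.title"), resolve(state, "frame.constraints.1"))
```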

Output mapping supports 5 writable namespaces with 4 merge strategies (overwrite, append, deep_merge, replace).

All features are backward-compatible. Existing skills are unaffected.

Full reference: docs/COGNITIVE_STATE_V1.md

5) Trace and Events

Three complementary tracing surfaces exist:

  1. Runtime events (in ExecutionState.events)
       • event type/message/timestamp/step_id/trace_id/data
  2. Structured logs (JSON lines)
       • skill/step/capability lifecycle
       • service lifecycle for critical services
       • correlation through trace_id
  3. CognitiveState v1 trace (in ExecutionState.trace)
       • TraceStep per step: step_id, capability_id, status, reads, writes, latency_ms
       • TraceMetrics aggregate: step_count, llm_calls, tool_calls, tokens_in/out, elapsed_ms
       • Automatically populated by the execution engine

Trace propagation:

  • trace_id can be passed in ExecutionRequest
  • CLI supports --trace-id in run and trace commands
  • nested skills inherit parent trace_id
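
The CognitiveState v1 trace can be pictured with the sketch below. The field names come from this guide; the field types, the steps and metrics attributes, the record() helper, and the way elapsed_ms is accumulated are assumptions made for illustration (the real engine may measure wall-clock time, which differs from a sum of latencies when steps run in parallel).

```python
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    step_id: str
    capability_id: str
    status: str
    reads: list = field(default_factory=list)    # references read by the step
    writes: list = field(default_factory=list)   # targets written by the step
    latency_ms: float = 0.0


@dataclass
class TraceMetrics:
    step_count: int = 0
    llm_calls: int = 0
    tool_calls: int = 0
    tokens_in: int = 0
    tokens_out: int = 0
    elapsed_ms: float = 0.0


@dataclass
class TraceState:
    steps: list = field(default_factory=list)
    metrics: TraceMetrics = field(default_factory=TraceMetrics)

    def record(self, step: TraceStep, *, llm_calls: int = 0, tool_calls: int = 0,
               tokens_in: int = 0, tokens_out: int = 0) -> None:
        """Append one step and update the live aggregate metrics."""
        self.steps.append(step)
        m = self.metrics
        m.step_count += 1
        m.llm_calls += llm_calls
        m.tool_calls += tool_calls
        m.tokens_in += tokens_in
        m.tokens_out += tokens_out
        # Assumption: elapsed_ms is approximated as the sum of step latencies.
        m.elapsed_ms += step.latency_ms


trace = TraceState()
trace.record(
    TraceStep("s1", "text.summarize", "ok",
              reads=["inputs.document"], writes=["vars.summary"], latency_ms=120.0),
    llm_calls=1, tokens_in=300, tokens_out=40,
)
trace.record(TraceStep("s2", "data.validate", "ok", latency_ms=5.0), tool_calls=1)
print(trace.metrics)
```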

6) How to Run

Basic execution:

  • python cli/main.py run
  • python cli/main.py run --input "{\"key\":\"value\"}"
  • python cli/main.py run --input-file input.json

Execution with trace correlation:

  • python cli/main.py run --trace-id trace-001
  • python cli/main.py trace --trace-id trace-001

System checks:

  • python cli/main.py doctor

OpenAPI checks from CLI:

  • python cli/main.py openapi verify-bindings --all
  • python cli/main.py openapi verify-bindings --scenario tooling/openapi_scenarios/data.schema.validate.mock.json
  • python cli/main.py openapi verify-invoker
  • python cli/main.py openapi verify-errors

7) Validation and Health Commands

Contracts:

  • python tooling/test_capability_contracts.py

Smoke:

  • python tooling/verify_smoke_capabilities.py --report-file artifacts/smoke_report.json

Coverage and consistency:

  • python tooling/compute_runtime_coverage.py
  • python tooling/compute_runtime_stats.py
  • python tooling/compute_skill_executability.py

Registry side:

  • python ../agent-skill-registry/tools/validate_registry.py
  • python ../agent-skill-registry/tools/generate_catalog.py
  • python ../agent-skill-registry/tools/registry_stats.py

8) Failure Model (Practical)

Common failure categories:

  • Input mapping errors: missing input.* fields required by a step
  • Binding resolution errors: no binding or invalid default selection
  • Service resolution errors: binding points to missing/invalid service
  • Request/response mapping errors: template points to missing fields
  • Protocol routing/invocation errors: unsupported or failing protocol path
  • Final output validation errors: required skill outputs not produced

Debug order that works well:

  1. Re-run with the trace command and a fixed trace_id
  2. Inspect structured logs filtered by trace_id
  3. Inspect failing step mapping in skill yaml
  4. Inspect binding request/response templates
  5. Confirm service implementation output shape
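
Filtering the structured logs by trace_id (step 2) can be done with a few lines of Python. This sketch assumes each log line is a JSON object with a top-level trace_id key; the other keys in the sample lines (event, step_id) are made up for the example, and filter_by_trace is a hypothetical helper, not part of the CLI.

```python
import json
from typing import Iterable, Iterator


def filter_by_trace(lines: Iterable[str], trace_id: str) -> Iterator[dict]:
    """Yield parsed JSON-lines records whose trace_id matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise mixed into the log stream
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            yield record


sample_log = [
    '{"trace_id": "trace-001", "event": "step_started", "step_id": "s1"}',
    '{"trace_id": "trace-002", "event": "step_started", "step_id": "s1"}',
    "plain text line that is not JSON",
    '{"trace_id": "trace-001", "event": "step_failed", "step_id": "s2"}',
]
matches = list(filter_by_trace(sample_log, "trace-001"))
for record in matches:
    print(record["event"], record["step_id"])
```

The same function works on a file handle, for example filter_by_trace(open(path), "trace-001"), since it only needs an iterable of lines.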

9) Configuration Surfaces

Repository-level:

  • bindings/official/
  • services/official/
  • policies/official_default_selection.yaml

Host-level overrides (.agent-skills):

  • services.yaml
  • bindings/local/
  • bindings/candidate/
  • active_bindings.json
  • overrides.yaml

10) Design Constraints

Current runner behavior intentionally keeps:

  • DAG-based step scheduling with backward-compatible implicit sequential deps
  • explicit depends_on: [] to opt into parallel execution
  • explicit mapping instead of implicit field matching
  • strict response mapping (missing fields fail fast)
  • protocol abstraction via invoker routing
  • thread-safe state mutations via _StateLock during parallel execution

See docs/SCHEDULER.md for full scheduler documentation.
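
The dependency rules above can be sketched as follows. This is a simplified model, not the scheduler in runtime/scheduler.py: it assumes steps are dicts with an id and an optional config.depends_on list, treats a missing depends_on as an implicit dependency on the previous step, and groups runnable steps into "waves" purely for illustration. The function names are hypothetical.

```python
from typing import Dict, List, Set


def build_dependencies(steps: List[dict]) -> Dict[str, Set[str]]:
    """Derive each step's dependencies, applying the implicit sequential rule."""
    deps: Dict[str, Set[str]] = {}
    previous = None
    for step in steps:
        declared = step.get("config", {}).get("depends_on")
        if declared is None:
            # No declaration: depend on the preceding step (legacy sequential behavior).
            deps[step["id"]] = {previous} if previous is not None else set()
        else:
            # Explicit list, including []: use exactly what was declared.
            deps[step["id"]] = set(declared)
        previous = step["id"]
    return deps


def execution_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; steps within one wave have no unmet dependencies."""
    remaining = {step: set(d) for step, d in deps.items()}
    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(s for s, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cycle or unknown dependency among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return waves


legacy = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
parallel = [
    {"id": "load"},
    {"id": "lint", "config": {"depends_on": []}},
    {"id": "summarize", "config": {"depends_on": ["load"]}},
    {"id": "report", "config": {"depends_on": ["lint", "summarize"]}},
]
legacy_waves = execution_waves(build_dependencies(legacy))
parallel_waves = execution_waves(build_dependencies(parallel))
print(legacy_waves)
print(parallel_waves)
```

In the legacy list every wave holds a single step, so execution stays sequential; in the second list, lint declares depends_on: [] and can therefore run alongside load.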

11) Current Baseline

As documented in docs/PROJECT_STATUS.md, runner baseline is currently stable with:

  • 45/45 contract pass
  • 8/8 smoke pass
  • full capability coverage and skill executability (45/45 capabilities, 36/36 skills)
  • DAG scheduler functional tests: 5/5
  • DAG scheduler stress tests: 5/5
  • CognitiveState v1 regression tests: 86/86
  • CognitiveState v1 integration tests: 99/99

This is the recommended baseline before starting MCP/OpenAPI adapter expansion.

12) Safety Enforcement

Capabilities that declare a safety block are subject to runtime enforcement during step execution. The enforcement runs inside _execute_step() and is transparent to the rest of the pipeline.

Enforcement sequence

Before capability execution:

  1. Trust-level check: The capability's safety.trust_level rank is compared against ExecutionOptions.trust_level. If the context rank is lower, a SafetyTrustLevelError is raised. Ranks: sandbox=0, standard=1, elevated=2, privileged=3.

  2. Confirmation check: If safety.requires_confirmation is true and the capability id is not in ExecutionOptions.confirmed_capabilities, a SafetyConfirmationRequiredError is raised.

  3. Mandatory pre-gates: Each gate in safety.mandatory_pre_gates is executed as a capability. If the gate returns {"allowed": false}, the on_fail policy determines behavior:
       • block: raises SafetyGateFailedError (default)
       • warn: emits a safety_gate_warning event and continues
       • degrade: returns a degraded StepResult (step skipped)
       • require_human: raises SafetyConfirmationRequiredError

After capability execution:

  1. Mandatory post-gates: Same mechanism as pre-gates, but each gate receives the produced output as input.

Capabilities without safety

Capabilities that do not declare a safety block are executed without any additional checks. The enforcement is zero-cost for the ~80% of capabilities that have no side effects.

ExecutionOptions safety fields

  • trust_level: str — Runtime trust level for the execution context (default: "standard").
  • confirmed_capabilities: frozenset[str] — Set of capability ids pre-confirmed by the caller, bypassing requires_confirmation.

Error types

  • SafetyTrustLevelError — trust level insufficient
  • SafetyGateFailedError — a mandatory gate blocked execution
  • SafetyConfirmationRequiredError — human confirmation required

All three inherit from RuntimeErrorBase and carry capability_id and step_id context.
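
The pre-execution checks can be sketched as below. The rank table, option fields, policies, and error names follow this section; everything else is an assumption for illustration, including the simplified class bodies, the shape of a gate entry ({"capability": ..., "on_fail": ...}), the run_gate callback, the event dict layout, and the enforce_pre_execution function itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping

TRUST_RANKS = {"sandbox": 0, "standard": 1, "elevated": 2, "privileged": 3}


class RuntimeErrorBase(Exception):
    """Simplified stand-in carrying capability_id and step_id context."""

    def __init__(self, message: str, *, capability_id: str, step_id: str) -> None:
        super().__init__(message)
        self.capability_id = capability_id
        self.step_id = step_id


class SafetyTrustLevelError(RuntimeErrorBase): ...
class SafetyGateFailedError(RuntimeErrorBase): ...
class SafetyConfirmationRequiredError(RuntimeErrorBase): ...


@dataclass(frozen=True)
class ExecutionOptions:
    trust_level: str = "standard"
    confirmed_capabilities: frozenset = frozenset()


def enforce_pre_execution(
    safety: Mapping[str, Any],
    options: ExecutionOptions,
    capability_id: str,
    step_id: str,
    run_gate: Callable[[str], Mapping[str, Any]],
    events: List[dict],
) -> bool:
    """Return True to run the capability, False when a 'degrade' gate skips the step."""
    ctx = {"capability_id": capability_id, "step_id": step_id}
    required = safety.get("trust_level")
    if required is not None and TRUST_RANKS[options.trust_level] < TRUST_RANKS[required]:
        raise SafetyTrustLevelError(f"requires trust level '{required}'", **ctx)
    if safety.get("requires_confirmation") and capability_id not in options.confirmed_capabilities:
        raise SafetyConfirmationRequiredError("confirmation required", **ctx)
    for gate in safety.get("mandatory_pre_gates", []):
        if run_gate(gate["capability"]).get("allowed", False):
            continue
        policy = gate.get("on_fail", "block")
        if policy == "warn":
            events.append({"type": "safety_gate_warning", "gate": gate["capability"], **ctx})
        elif policy == "degrade":
            return False
        elif policy == "require_human":
            raise SafetyConfirmationRequiredError("gate requires a human decision", **ctx)
        else:  # "block", and (by assumption) any unrecognized policy
            raise SafetyGateFailedError(f"gate '{gate['capability']}' denied execution", **ctx)
    return True


events: List[dict] = []
safety_block = {
    "trust_level": "standard",
    "mandatory_pre_gates": [{"capability": "policy.check", "on_fail": "warn"}],
}
proceed = enforce_pre_execution(
    safety_block, ExecutionOptions(), "fs.write", "s1",
    run_gate=lambda gate_id: {"allowed": False}, events=events,
)
print(proceed, events)
```

With the warn policy the denied gate only records an event and execution proceeds; switching on_fail to block in the same call would raise SafetyGateFailedError instead.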