
ADR-001: DAG-Based Step Scheduler

Status

Accepted

Context

Skills in agent-skills can have multiple steps with dependencies between them. We needed a scheduling strategy that:

  • Supports parallel execution of independent steps
  • Preserves backward-compatible sequential ordering for legacy skills
  • Detects circular dependencies at plan time
  • Provides fail-fast and graceful degradation modes

Decision

We chose a DAG-based scheduler using Kahn's algorithm for topological ordering with a ThreadPoolExecutor for parallel dispatch.

Key design choices:

  1. Implicit sequential dependencies: Steps without an explicit depends_on implicitly depend on the immediately preceding step, preserving v1 sequential semantics.
  2. Sharded state locks: Namespace-level locks (vars/outputs/working/events) reduce contention compared with a single global lock.
  3. Pool saturation tracking: A metrics counter records when the number of ready steps exceeds the available workers.
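The implicit-dependency rule and plan-time cycle detection can be sketched as follows. This is a minimal illustration, not the project's scheduler: the step shape (dicts with "id" and an optional "depends_on" list), the function names, and the grouping of the topological order into "waves" are assumptions made for the example. It also assumes that an explicit empty depends_on means "no dependencies".

```python
def resolve_dependencies(steps):
    """Map each step id to the set of step ids it depends on.

    Assumed rule: a step without a "depends_on" key depends on the step
    immediately before it; the first step has no dependencies.
    """
    deps = {}
    previous_id = None
    for step in steps:
        step_id = step["id"]
        if "depends_on" in step:
            deps[step_id] = set(step["depends_on"])
        elif previous_id is not None:
            deps[step_id] = {previous_id}
        else:
            deps[step_id] = set()
        previous_id = step_id
    return deps


def plan_waves(steps):
    """Kahn's algorithm, grouped into waves of mutually independent steps.

    Raises ValueError for unknown dependencies or circular dependencies.
    """
    deps = resolve_dependencies(steps)
    dependents = {step_id: [] for step_id in deps}
    for step_id, parents in deps.items():
        for parent in sorted(parents):
            if parent not in deps:
                raise ValueError(f"step {step_id!r} depends on unknown step {parent!r}")
            dependents[parent].append(step_id)

    indegree = {step_id: len(parents) for step_id, parents in deps.items()}
    ready = [step_id for step_id, n in indegree.items() if n == 0]
    waves = []
    while ready:
        waves.append(ready)
        next_ready = []
        for step_id in ready:
            for child in dependents[step_id]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = next_ready

    # Any step never scheduled still has unmet dependencies: a cycle.
    if sum(len(wave) for wave in waves) != len(deps):
        stuck = sorted(s for s, n in indegree.items() if n > 0)
        raise ValueError(f"circular dependency among steps: {stuck}")
    return waves
```

With steps a, b (no depends_on), c (depends on a), and d (depends on b and c), this plan yields three waves: a, then b and c together, then d. Each wave is a set of steps that a ThreadPoolExecutor could dispatch concurrently; a legacy skill with no depends_on anywhere degenerates to one step per wave.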

Alternatives Considered

  • asyncio event loop: Rejected — ThreadPoolExecutor is simpler and works well with blocking binding calls (HTTP, subprocess MCP).
  • Celery/distributed queue: Rejected — adds infrastructure dependency for minimal benefit at single-instance scale.
  • Simple sequential loop: Too limiting — skills with independent steps (e.g., parallel data enrichment) benefit significantly from parallelism.

Consequences

  • (+) Independent steps execute concurrently, reducing wall-clock time
  • (+) Backward-compatible with all existing sequential skills
  • (-) Thread overhead for small skills (mitigated by configurable max_workers)
  • (-) _StateLock contention at high parallelism (mitigated by sharding in v2)

ADR-002: Safety Gate Architecture

Status

Accepted

Context

Agent skills can perform side-effecting operations (write files, send emails, call APIs). We needed a safety model that:

  • Prevents unauthorized capability execution
  • Supports human-in-the-loop confirmation
  • Allows declarative safety policies per capability

Decision

Safety enforcement uses a three-layer model:

  1. Trust levels: Each capability declares a minimum trust level (sandbox < standard < elevated < privileged); the execution context must meet or exceed it.
  2. Confirmation gates: A capability can require explicit human confirmation (requires_confirmation: true).
  3. Mandatory gates: Pre/post gate capabilities validate inputs/outputs with configurable failure policies (block/warn/degrade/require_human).

This enforcement logic is extracted into a pluggable PolicyEngine (ADR-002b) for extensibility.
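The three layers can be illustrated with a small sketch. Everything here other than the trust-level ordering, requires_confirmation, and the four failure-policy names is hypothetical: the authorize function, the capability dict shape ("min_trust", "pre_gates", "on_fail"), and the confirm/run_gate callbacks are assumptions, and so is the interpretation of each policy (warn and degrade record a note and continue, require_human falls back to the confirmation callback). Only pre-gates are shown.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    # Ordering taken from the ADR: sandbox < standard < elevated < privileged.
    SANDBOX = 0
    STANDARD = 1
    ELEVATED = 2
    PRIVILEGED = 3


def authorize(capability, context_trust, confirm, run_gate):
    """Return (allowed, notes) for one capability invocation.

    confirm(capability_id) -> bool asks a human; run_gate(gate_id) -> bool
    runs a gate capability. Both are hypothetical callbacks.
    """
    notes = []

    # Layer 1: the context must meet or exceed the declared minimum trust.
    if context_trust < capability["min_trust"]:
        return False, ["trust level too low"]

    # Layer 2: optional explicit human confirmation.
    if capability.get("requires_confirmation", False) and not confirm(capability["id"]):
        return False, ["confirmation denied"]

    # Layer 3: mandatory gates with a per-gate failure policy.
    for gate in capability.get("pre_gates", []):
        if run_gate(gate["id"]):
            continue
        policy = gate["on_fail"]
        if policy == "block":
            return False, notes + [f"blocked by gate {gate['id']}"]
        if policy == "require_human":
            if not confirm(capability["id"]):
                return False, notes + [f"human rejected after gate {gate['id']}"]
            notes.append(f"human override for gate {gate['id']}")
        elif policy in ("warn", "degrade"):
            # Assumed semantics: continue, but leave a trace for the caller.
            notes.append(f"{policy}: gate {gate['id']} failed")
        else:
            raise ValueError(f"unknown failure policy: {policy!r}")

    return True, notes
```

Because the checks run in a fixed order inside a single function, a caller cannot reach the gates without first passing the trust and confirmation layers, which mirrors the intent of embedding the checks in the execution pipeline.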

Consequences

  • (+) Safety chains cannot be bypassed because they are embedded in the execution pipeline
  • (+) Declarative per-capability — no central policy file to maintain
  • (-) Gate execution adds latency (mitigated by gate capability caching)

ADR-003: CognitiveState v1 Design

Status

Accepted

Context

Multi-step skills need structured working memory for reasoning patterns (evaluation, risk analysis, decision justification). Ad-hoc vars.* usage leads to naming collisions and opaque state.

Decision

Introduced CognitiveState v1 with typed slots:

  • FrameState (frozen): immutable goal/constraints/success_criteria
  • WorkingState (mutable): typed cognitive categories (entities, options, criteria, evidence, risks, hypotheses, uncertainties, decisions, messages)
  • OutputState: structured result metadata
  • TraceState: execution trace with data lineage

Auto-wiring: cognitive_hints.produces in capability specs maps outputs to CognitiveState slots automatically when no explicit output_mapping exists.
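A rough sketch of the frozen/mutable split and of auto-wiring is shown below. The class and field names follow the ADR, but the field types, the shape of cognitive_hints.produces (assumed here to map an output name to a WorkingState slot name), and the auto_wire helper are assumptions for illustration. OutputState and TraceState are omitted for brevity.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class FrameState:
    """Immutable framing of the task; assignment after creation raises."""
    goal: str
    constraints: tuple = ()
    success_criteria: tuple = ()


@dataclass
class WorkingState:
    """Mutable working memory, one list per cognitive category."""
    entities: list = field(default_factory=list)
    options: list = field(default_factory=list)
    criteria: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    uncertainties: list = field(default_factory=list)
    decisions: list = field(default_factory=list)
    messages: list = field(default_factory=list)


@dataclass
class CognitiveState:
    frame: FrameState
    working: WorkingState = field(default_factory=WorkingState)


WORKING_SLOTS = {f.name for f in fields(WorkingState)}


def auto_wire(state, capability_spec, outputs, output_mapping=None):
    """Append outputs to WorkingState slots named in cognitive_hints.produces.

    Does nothing when an explicit output_mapping exists. Returns True if
    at least one output was wired.
    """
    if output_mapping:
        return False
    produces = capability_spec.get("cognitive_hints", {}).get("produces", {})
    wired = False
    for output_name, slot in produces.items():
        if slot not in WORKING_SLOTS:
            raise ValueError(f"unknown WorkingState slot: {slot!r}")
        if output_name in outputs:
            getattr(state.working, slot).append(outputs[output_name])
            wired = True
    return wired
```

Restricting writes to a fixed set of named slots is what avoids the naming collisions that ad-hoc vars.* usage produces: a misspelled slot fails loudly instead of silently creating a new variable.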

Consequences

  • (+) Skills can compose rich reasoning patterns without custom state management
  • (+) Type-safe slot operations enable validation and tracing
  • (-) Learning curve for new developers (mitigated by tutorials)
  • (-) Memory footprint per run (acceptable — dies with the run)

ADR-004: Storage Abstraction

Status

Accepted

Context

Runtime components (audit, diagnostics, run_store) assume local filesystem via pathlib.Path. This prevents cloud-native deployment without code changes.

Decision

Introduced a StorageBackend protocol with LocalFileStorage as the default implementation. Components accept an optional storage_manager parameter. Path-traversal protection is built into the default implementation.
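One way such an abstraction might look is sketched below. Only the names StorageBackend, LocalFileStorage, and storage_manager come from the ADR; the method set (read_text/write_text), the key-based addressing, the AuditLog component, and its default "audit" directory are hypothetical, and the traversal check shown is one common approach rather than the project's actual code.

```python
from pathlib import Path
from typing import Protocol


class StorageBackend(Protocol):
    """Hypothetical minimal interface; the real protocol may differ."""

    def read_text(self, key: str) -> str: ...

    def write_text(self, key: str, data: str) -> None: ...


class LocalFileStorage:
    """Filesystem-backed default that confines all keys to one root."""

    def __init__(self, root):
        self._root = Path(root).resolve()

    def _resolve(self, key):
        # Resolve ".." segments and symlinks, then reject anything that
        # ends up outside the storage root (path-traversal protection).
        candidate = (self._root / key).resolve()
        try:
            candidate.relative_to(self._root)
        except ValueError:
            raise ValueError(f"key escapes storage root: {key!r}") from None
        return candidate

    def read_text(self, key):
        return self._resolve(key).read_text(encoding="utf-8")

    def write_text(self, key, data):
        path = self._resolve(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(data, encoding="utf-8")


class AuditLog:
    """Hypothetical component showing the optional storage_manager parameter."""

    def __init__(self, storage_manager=None):
        # Falling back to local storage keeps zero-config local development.
        self._storage = storage_manager or LocalFileStorage("audit")

    def record(self, run_id, line):
        self._storage.write_text(f"{run_id}.log", line)
```

Because StorageBackend is a structural Protocol, a cloud-backed class only needs to provide the same methods; it does not have to inherit from anything in the core runtime.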

Consequences

  • (+) Cloud backends (S3, GCS, Redis) can be plugged in without modifying the core runtime
  • (+) Default behavior unchanged — zero config for local development
  • (-) Migration effort for existing code to adopt the abstraction incrementally