Reliability Features¶

Retry, backoff, timeouts, concurrency, and audit durability.

1. HTTP Retry with Exponential Backoff¶

File: runtime/openapi_invoker.py

When an OpenAPI binding receives a transient HTTP error, the invoker retries automatically.

Transient status codes¶

429  Too Many Requests
502  Bad Gateway
503  Service Unavailable
504  Gateway Timeout

Default parameters¶

Parameter	Default	Override
Max retries	3	Binding YAML `retry_count` or service descriptor `retry_count`
Backoff base	1.0 s	Binding YAML `retry_backoff_base`
Backoff factor	2.0	Binding YAML `retry_backoff_factor`
Retry-After cap	60 s	Hardcoded (`_MAX_RETRY_AFTER_SECONDS`)

Backoff schedule (defaults)¶

Attempt	Delay
1st retry	1 s
2nd retry	2 s
3rd retry	4 s
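
The schedule above is consistent with a delay of base × factor^(n − 1) for the n-th retry. A minimal sketch of that arithmetic, assuming this formula (the function name `backoff_delay` is hypothetical and not taken from `runtime/openapi_invoker.py`):

```python
def backoff_delay(retry_number: int, base: float = 1.0, factor: float = 2.0) -> float:
    """Delay in seconds before the given retry (1-based), without jitter."""
    if retry_number < 1:
        raise ValueError("retry_number is 1-based")
    return base * factor ** (retry_number - 1)


# Defaults from the table: 3 retries, base 1.0 s, factor 2.0
schedule = [backoff_delay(n) for n in range(1, 4)]
print(schedule)  # [1.0, 2.0, 4.0]
```

With the defaults, a request that fails on every attempt waits 7 s in total before the invoker gives up.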

Retry-After header¶

If the server responds with a Retry-After header, the invoker honors it (capped at 60 s). If the header is absent or unparseable, exponential backoff is used.
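
One way to implement this rule is sketched below. It assumes the header may carry either delta-seconds or an HTTP-date, as the HTTP specification allows. Whether the invoker accepts both forms is not stated here. The names `retry_after_delay` and `MAX_RETRY_AFTER_SECONDS` are illustrative stand-ins.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

MAX_RETRY_AFTER_SECONDS = 60.0  # mirrors the documented 60 s cap


def retry_after_delay(header_value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return the capped delay in seconds, or None if absent or unparseable.

    A None result tells the caller to fall back to exponential backoff.
    """
    if not header_value:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        seconds = float(int(value))  # delta-seconds form, e.g. "120"
    else:
        try:
            target = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (target - now).total_seconds()
    # Never wait a negative time, never wait longer than the cap.
    return min(max(seconds, 0.0), MAX_RETRY_AFTER_SECONDS)
```

For example, `retry_after_delay("120")` yields the cap rather than the two minutes the server asked for, and `retry_after_delay("soon")` yields `None`.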

Per-binding override¶

# In a binding YAML
metadata:
  retry_count: 5
  retry_backoff_base: 0.5
  retry_backoff_factor: 3.0
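
To see what these overrides do, the sketch below merges a binding's `metadata` mapping with the documented defaults and prints the resulting delays. `RetrySettings` and `retry_settings_from_metadata` are hypothetical helpers, and the delay formula is the one implied by the default schedule rather than code copied from the invoker.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RetrySettings:
    retry_count: int = 3         # documented default
    backoff_base: float = 1.0    # seconds, documented default
    backoff_factor: float = 2.0  # documented default


def retry_settings_from_metadata(metadata: Mapping[str, Any]) -> RetrySettings:
    """Overlay binding metadata on the defaults; missing keys keep the default."""
    d = RetrySettings()
    return RetrySettings(
        retry_count=int(metadata.get("retry_count", d.retry_count)),
        backoff_base=float(metadata.get("retry_backoff_base", d.backoff_base)),
        backoff_factor=float(metadata.get("retry_backoff_factor", d.backoff_factor)),
    )


settings = retry_settings_from_metadata(
    {"retry_count": 5, "retry_backoff_base": 0.5, "retry_backoff_factor": 3.0}
)
delays = [settings.backoff_base * settings.backoff_factor ** n
          for n in range(settings.retry_count)]
print(delays)  # [0.5, 1.5, 4.5, 13.5, 40.5]
```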

2. Per-Step Timeout¶

File: runtime/execution_engine.py

Each skill step runs in a ThreadPoolExecutor(max_workers=1) with a timeout. If the step exceeds the timeout, StepTimeoutError is raised and the skill execution is aborted.
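
The pattern can be sketched as follows. This is an illustration of the technique, not the engine's code: the `StepTimeoutError` class here is a local stand-in, and `run_step_with_timeout` is a hypothetical name.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class StepTimeoutError(Exception):
    """Local stand-in for the runtime's StepTimeoutError."""


def run_step_with_timeout(step_fn, timeout_seconds):
    """Run step_fn in a single worker thread and give up after timeout_seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(step_fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        raise StepTimeoutError(f"step exceeded {timeout_seconds} s") from None
    finally:
        # Do not wait for the worker: a thread cannot be killed from outside,
        # so a timed-out step keeps running in the background until it returns.
        executor.shutdown(wait=False)


print(run_step_with_timeout(lambda: "done", timeout_seconds=1.0))  # done
```

The caveat in the `finally` block matters in practice: the timeout bounds how long the caller waits, not how long the step's thread stays alive.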

Timeout resolution order¶

1. step.config.timeout_seconds (per-step in skill YAML)
2. context.options.step_timeout_seconds (per-invocation option)
3. _DEFAULT_STEP_TIMEOUT_SECONDS = 60 s
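
A minimal sketch of this lookup, assuming the first source that provides a value wins and that both the step config and the invocation options behave like plain mappings (`resolve_step_timeout` is a hypothetical name):

```python
from typing import Any, Mapping

DEFAULT_STEP_TIMEOUT_SECONDS = 60.0  # mirrors _DEFAULT_STEP_TIMEOUT_SECONDS


def resolve_step_timeout(step_config: Mapping[str, Any],
                         options: Mapping[str, Any]) -> float:
    """Per-step config first, then per-invocation option, then the default."""
    candidates = (
        step_config.get("timeout_seconds"),
        options.get("step_timeout_seconds"),
    )
    for value in candidates:
        if value is not None:
            return float(value)
    return DEFAULT_STEP_TIMEOUT_SECONDS


print(resolve_step_timeout({"timeout_seconds": 120}, {"step_timeout_seconds": 30}))  # 120.0
print(resolve_step_timeout({}, {"step_timeout_seconds": 30}))                        # 30.0
print(resolve_step_timeout({}, {}))                                                  # 60.0
```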

Configuration in skill YAML¶

steps:
  - id: slow_step
    capability: web.page.fetch
    config:
      timeout_seconds: 120   # override to 2 minutes

Configuration at invocation time¶

engine.execute(request, options={"step_timeout_seconds": 30})

3. Worker Pool Sizing¶

File: runtime/scheduler.py

The DAG scheduler runs steps in a thread pool. The pool size is configurable via an environment variable:

Env var	Default	Purpose
`AGENT_SKILLS_MAX_WORKERS`	`min(32, os.cpu_count() + 4)`	Max concurrent step threads

export AGENT_SKILLS_MAX_WORKERS=8
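
Reading the variable might look like the sketch below. The fallback to the default on a missing, non-numeric, or non-positive value is an assumption made for illustration, as is the guard for `os.cpu_count()` returning `None`. How `runtime/scheduler.py` treats invalid values is not documented here.

```python
import os
from typing import Mapping, Optional


def max_workers_from_env(environ: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the pool size from AGENT_SKILLS_MAX_WORKERS, else the default."""
    environ = os.environ if environ is None else environ
    # os.cpu_count() may return None, so guard before adding 4.
    default = min(32, (os.cpu_count() or 1) + 4)
    raw = environ.get("AGENT_SKILLS_MAX_WORKERS", "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # assumption: ignore values that are not integers
    return value if value > 0 else default


print(max_workers_from_env({"AGENT_SKILLS_MAX_WORKERS": "8"}))  # 8
```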

A step that fails produces a structured StepResult with error details instead of crashing the entire skill execution.

4. Audit Durability¶

File: runtime/audit.py

File locking¶

All audit writes acquire an exclusive file lock (advisory on Linux/macOS, so it only guards against writers that take the same lock):

Windows: msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
Linux/macOS: fcntl.flock(f.fileno(), fcntl.LOCK_EX)

This prevents corruption from concurrent write access (e.g., multiple CLI invocations or the HTTP server processing parallel requests).
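
The two platform calls can be wrapped in a single context manager, as in the sketch below. The helper names (`locked_append`, `append_audit_record`) are hypothetical, the Windows branch uses the calls named above but has not been exercised here, and the seek to offset 0 is an assumption that makes every writer lock the same byte.

```python
import contextlib
import os
import sys
import tempfile

if sys.platform == "win32":
    import msvcrt

    def _lock(f):
        f.seek(0)  # msvcrt.locking locks from the current file position
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)

    def _unlock(f):
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def _lock(f):
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free

    def _unlock(f):
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)


@contextlib.contextmanager
def locked_append(path):
    """Open path for appending and hold an exclusive lock while in the block."""
    with open(path, "a", encoding="utf-8") as f:
        _lock(f)
        try:
            yield f
            f.flush()  # push buffered data out before releasing the lock
        finally:
            _unlock(f)


def append_audit_record(path, line):
    with locked_append(path) as f:
        f.write(line.rstrip("\n") + "\n")


audit_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_audit_record(audit_path, '{"event": "start"}')
append_audit_record(audit_path, '{"event": "end"}')
```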

Atomic purge¶

The purge() operation uses a safe read-filter-replace cycle:

1. Lock the audit file.
2. Read all lines and filter out matching records.
3. Write the kept records to a temp file (tempfile.mkstemp).
4. Call os.replace(tmp_path, audit_file), which is atomic on most OS/filesystem combinations.
5. Clean up the temp file on failure.
6. Unlock.
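
The read-filter-replace part of this cycle can be sketched as below. Locking is left out to keep the example short, and the signature (a `should_drop` predicate over raw lines) is an assumption rather than the actual `purge()` interface. The temp file is created in the audit file's own directory because a rename across filesystems would not be atomic.

```python
import contextlib
import os
import tempfile


def purge(audit_path, should_drop):
    """Rewrite audit_path without the lines matched by should_drop.

    Returns the number of removed lines.
    """
    directory = os.path.dirname(os.path.abspath(audit_path))
    with open(audit_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    kept = [line for line in lines if not should_drop(line)]

    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.writelines(kept)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the data durable before the swap
        os.replace(tmp_path, audit_path)
    except BaseException:
        # Clean up the temp file on failure; the original file is untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    return len(lines) - len(kept)
```

Readers see either the old file or the new one, never a half-written mix, because the only mutation of `audit_path` is the final `os.replace`.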

5. Graceful Degradation¶

Python baselines: Every capability has a local Python fallback that works without network access. Quality is degraded but execution proceeds.
Binding fallback: OpenAI bindings declare fallback_binding_id pointing to the Python baseline; the runtime can fall through on API failure.
Audit modes: off | standard | full — operators can disable audit overhead entirely for latency-critical deployments.