Reliability Features¶
Retry, backoff, timeouts, concurrency, and audit durability.
1. HTTP Retry with Exponential Backoff¶
File: runtime/openapi_invoker.py
When an OpenAPI binding receives a transient HTTP error, the invoker retries automatically.
Transient status codes¶
429 Too Many Requests
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
Default parameters¶
| Parameter | Default | Override |
|---|---|---|
| Max retries | 3 | Binding YAML retry_count or service descriptor retry_count |
| Backoff base | 1.0 s | Binding YAML retry_backoff_base |
| Backoff factor | 2.0 | Binding YAML retry_backoff_factor |
| Retry-After cap | 60 s | Hardcoded (_MAX_RETRY_AFTER_SECONDS) |
Backoff schedule (defaults)¶
| Attempt | Delay |
|---|---|
| 1st retry | 1 s |
| 2nd retry | 2 s |
| 3rd retry | 4 s |
Retry-After header¶
If the server responds with a Retry-After header, the invoker honors it
(capped at 60 s). If the header is absent or unparseable, exponential backoff
is used.
Per-binding override¶
# In a binding YAML
metadata:
retry_count: 5
retry_backoff_base: 0.5
retry_backoff_factor: 3.0
2. Per-Step Timeout¶
File: runtime/execution_engine.py
Each skill step runs in a ThreadPoolExecutor(max_workers=1) with a timeout.
If the step exceeds the timeout, StepTimeoutError is raised and the skill
execution is aborted.
Timeout resolution order¶
step.config.timeout_seconds(per-step in skill YAML)context.options.step_timeout_seconds(per-invocation option)_DEFAULT_STEP_TIMEOUT_SECONDS= 60 s
Configuration in skill YAML¶
steps:
- id: slow_step
capability: web.page.fetch
config:
timeout_seconds: 120 # override to 2 minutes
Configuration at invocation time¶
engine.execute(request, options={"step_timeout_seconds": 30})
3. Worker Pool Sizing¶
File: runtime/scheduler.py
The DAG scheduler runs steps using a thread pool. Pool size is configurable via environment variable:
| Env var | Default | Purpose |
|---|---|---|
AGENT_SKILLS_MAX_WORKERS |
min(32, os.cpu_count() + 4) |
Max concurrent step threads |
export AGENT_SKILLS_MAX_WORKERS=8
Steps that fail raise a structured StepResult with error details instead of
crashing the entire skill execution.
4. Audit Durability¶
File: runtime/audit.py
File locking¶
All audit writes acquire an exclusive advisory lock:
- Windows:
msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1) - Linux/macOS:
fcntl.flock(f.fileno(), fcntl.LOCK_EX)
This prevents corruption from concurrent write access (e.g., multiple CLI invocations or the HTTP server processing parallel requests).
Atomic purge¶
The purge() operation uses a safe read-filter-replace cycle:
- Lock the audit file.
- Read all lines, filter out matching records.
- Write kept records to a temp file (
tempfile.mkstemp). os.replace(tmp_path, audit_file)— atomic on most OS/filesystem combos.- Clean up temp file on failure.
- Unlock.
5. Graceful Degradation¶
- Python baselines: Every capability has a local Python fallback that works without network access. Quality is degraded but execution proceeds.
- Binding fallback: OpenAI bindings declare
fallback_binding_idpointing to the Python baseline; the runtime can fall through on API failure. - Audit modes:
off | standard | full— operators can disable audit overhead entirely for latency-critical deployments.