Skill Governance Manifesto
Date: 2026-03-13
Status: Proposed baseline for implementation
Scope: Product-level trust model for portable skills over abstract capabilities
1. Product Thesis
The product must guarantee two things at the same time:
- Portability:
  - Skills are high-level workflows built on abstract capabilities.
  - Capability contracts are the source of truth in registry YAML.
- Trust:
  - A skill should not be considered reliable only because it executes.
  - Reliability must be evidenced through internal validation and field behavior.
This manifesto defines how to keep portability while controlling quality uncertainty introduced by provider and binding flexibility.
2. Non-Negotiable Invariants
- Source of truth for semantics:
  - Capabilities are defined in registry YAML and remain canonical.
  - Runtime does not execute from generated catalogs.
- Separation of concerns:
  - Registry catalogs describe and index.
  - Operational quality artifacts evaluate behavior and trust.
- Override safety model:
  - User overrides are allowed and encouraged.
  - Last-resort fallback must always be the official default binding.
- Explainability:
  - Every effective execution path must be explainable to a human user.
3. Capability Leveling Model
Capabilities must be high-level enough for agent composition but narrow enough to minimize hallucination risk.
3.1 Risk classes
A. Deterministic operations (low epistemic risk)
  - Examples: data.schema.validate, data.json.parse, table.row.filter.
B. Retrieval operations (medium risk)
  - Examples: web.page.fetch, fs.file.read, pdf.document.read.
  - Variability comes from source quality and external system behavior.
C. Generative and decision operations (higher risk)
  - Examples: text.content.summarize, text.content.classify, agent.plan.generate, agent.input.route.
  - Variability comes from model behavior and provider differences.
3.2 Design rule
A capability is valid for this product if:
- It has a clear contract boundary.
- It can be tested for conformance independent of workflow context.
- It does not encode provider-specific semantics in its public contract.
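To make the design rule concrete, a registry entry satisfying all three criteria might look like the sketch below. Every field name here is an illustrative assumption, not a confirmed registry schema; the capability name and risk class follow the vocabulary of section 3.1.

```yaml
# Hypothetical registry entry -- field names are illustrative only.
capability: text.content.summarize
risk_class: generative            # deterministic | retrieval | generative
contract:                          # clear contract boundary
  input:
    text: string
    max_words: integer
  output:
    summary: string
  invariants:
    - output.summary is non-empty when input.text is non-empty
errors:                            # normalized, provider-agnostic classes
  - input_too_large
  - provider_unavailable
conformance:                       # testable independent of workflow context
  profiles: [strict, standard, experimental]
  vectors_ref: tests/text.content.summarize/vectors.yaml
```

Note that nothing in the public contract names a provider; provider-specific behavior is confined to bindings and their conformance profiles.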
4. Trust and Quality Model
4.1 Two evidence channels
- Internal evidence (cold-start readiness)
  - Contract tests
  - Smoke tests
  - Human review
  - Control point validation
- Field evidence (product maturity)
  - Execution success rates
  - Latency profiles
  - User ratings
  - User reports and severe incidents
4.2 Lifecycle states
- draft
- validated
- lab-verified
- trusted
- recommended
Cold-start is explicitly supported: skills can be lab-verified before field usage volume exists.
4.3 Scores
- readiness_score: internal evidence
- field_score: production behavior
- overall_score: weighted by sample volume
Weighting model by executions_30d:
- < 20: overall = readiness only
- 20 to 49: readiness-weighted mixed score
- >= 50: field-weighted mixed score
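The volume-based weighting can be sketched as a small function. The thresholds (20 and 50) come from the model above; the blend weights (0.7/0.3) are illustrative assumptions, since the manifesto fixes only the thresholds.

```python
def overall_score(readiness: float, field: float, executions_30d: int) -> float:
    """Blend internal readiness and field scores by 30-day sample volume.

    Thresholds follow the weighting model in section 4.3; the 0.7/0.3
    blend weights are illustrative, not normative values.
    """
    if executions_30d < 20:
        return readiness                       # cold start: internal evidence only
    if executions_30d < 50:
        return 0.7 * readiness + 0.3 * field   # readiness-weighted mixed score
    return 0.3 * readiness + 0.7 * field       # field-weighted mixed score
```

This keeps cold-start support explicit: a lab-verified skill with zero field volume still gets a meaningful score from readiness alone.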
5. Mitigating Execution Uncertainty with Portability
Portability introduces backend variance. This is controlled by explicit conformance and routing policy.
5.1 Binding conformance profile
Each binding/provider path is tagged by conformance profile:
- strict: strongest contract compliance and deterministic behavior under policy
- standard: acceptable for production with known variance bounds
- experimental: available for opt-in, not default for trusted paths
5.2 Runtime selection policy
- User-selected primary binding executes first.
- Optional user-defined fallback executes next.
- Official default binding is mandatory terminal fallback.
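A minimal sketch of this selection policy, assuming bindings are identified by name and `run` executes a capability against one binding (both names are hypothetical):

```python
from typing import Any, Callable, List, Optional

def resolve_binding_chain(user_primary: Optional[str],
                          user_fallback: Optional[str],
                          official_default: str) -> List[str]:
    """Order of attempts: user-selected primary, optional user-defined
    fallback, then the official default as the mandatory terminal fallback."""
    chain = [b for b in (user_primary, user_fallback) if b is not None]
    if official_default not in chain:
        chain.append(official_default)   # terminal fallback is always present
    return chain

def execute_with_fallback(chain: List[str], run: Callable[[str], Any]):
    """Try each binding in order; return (binding_used, result) on first success."""
    last_err: Optional[Exception] = None
    for binding in chain:
        try:
            return binding, run(binding)
        except Exception as err:  # in practice, normalize via the error taxonomy
            last_err = err
    raise RuntimeError(f"all bindings failed; last error: {last_err}")
```

Returning the binding actually used alongside the result is what later feeds the explainability surfaces: the effective path is recorded, not inferred.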
5.3 Capability assurance contract
For each capability, define:
- Invariants that must hold across providers.
- Error taxonomy and normalization rules.
- Safety constraints and side-effect boundaries.
- Conformance test vectors and expected outcomes.
6. Human UX Guardrails (Critical)
Quality controls must not degrade usability.
6.1 No-complexity default experience
If the user does nothing:
- Defaults are ready and executable.
- Credentials are clearly prompted when needed.
- Errors are actionable and short.
6.2 Optional complexity only when requested
Advanced controls (fallback chains, custom providers, strict profiles) are optional and progressive.
6.3 Explainability surfaces
For each capability or skill execution, the user can inspect:
- Effective binding and service.
- Why it was selected.
- Fallback path used, if any.
- Trust state of the skill and evidence source.
6.4 Honest trust labels
Always show trust evidence source:
- internal-evidence
- mixed-evidence
- field-evidence
Users must never confuse internal validation with broad field reliability.
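A sketch of how the label could be derived mechanically, so it can never drift from the underlying evidence. The thresholds mirror the section 4.3 weighting model; the mapping itself is an illustrative assumption.

```python
def trust_evidence_label(has_internal_evidence: bool, executions_30d: int) -> str:
    """Map available evidence to an honest trust label.

    Thresholds (20/50) follow the section 4.3 weighting model;
    the label mapping is illustrative, not a confirmed policy.
    """
    if executions_30d >= 50:
        return "field-evidence"
    if has_internal_evidence and executions_30d >= 20:
        return "mixed-evidence"
    return "internal-evidence"
```

Deriving the label from the same counters that drive scoring is one way to guarantee users see "internal-evidence" until real field volume exists.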
7. Architecture Changes Required
This section maps the strategy to implementation deltas.
7.1 Registry layer (agent-skill-registry)
Keep unchanged as semantic source:
- Capabilities remain canonical in YAML.
- Skills remain declarative workflows.
- Catalog generation remains descriptive.
Additions (non-breaking):
- Optional metadata conventions for trust-oriented documentation.
- Governance documentation for lifecycle semantics.
7.2 Runtime layer (agent-skills)
Add or extend:
- Binding conformance metadata and profile enforcement.
- Deterministic fallback resolution policy with official terminal fallback.
- Quality ingestion pipeline:
  - internal evidence
  - usage evidence
  - feedback evidence
- Skill quality catalog generation and exposure.
- Explainability endpoints or commands for effective path introspection.
7.3 Observability layer
Add normalized metrics for quality:
- success and failure classes by capability and binding
- latency percentiles by capability and binding
- fallback activation counts
- user feedback aggregates
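These four metric families can be sketched as a small in-memory recorder. Metric key shapes, the label set, and the nearest-rank percentile approximation are all illustrative choices, not a prescribed schema.

```python
from collections import defaultdict
from typing import Optional

class QualityMetrics:
    """Minimal in-memory sketch of the normalized quality metrics."""

    def __init__(self):
        self.outcomes = defaultdict(int)    # (capability, binding, class) -> count
        self.latencies = defaultdict(list)  # (capability, binding) -> samples (ms)
        self.fallbacks = defaultdict(int)   # capability -> fallback activations

    def record(self, capability: str, binding: str, outcome: str,
               latency_ms: float, used_fallback: bool = False) -> None:
        self.outcomes[(capability, binding, outcome)] += 1
        self.latencies[(capability, binding)].append(latency_ms)
        if used_fallback:
            self.fallbacks[capability] += 1

    def p95(self, capability: str, binding: str) -> Optional[float]:
        """Nearest-rank 95th percentile over recorded latency samples."""
        xs = sorted(self.latencies[(capability, binding)])
        if not xs:
            return None
        return xs[max(0, int(round(0.95 * len(xs))) - 1)]
```

Keying everything by (capability, binding) is what makes per-binding conformance regressions visible instead of being averaged away across providers.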
7.4 Consumer API and CLI
Add user-centric introspection operations:
- explain capability resolution
- explain skill trust state and evidence
- list recommended and trusted skills with evidence source
8. Rollout Plan
Phase 1: Foundation (short)
- Formalize lifecycle and scoring policy.
- Publish conformance profile definitions.
- Keep defaults simple, with no regressions for current users.
Phase 2: Assurance and explainability
- Add capability assurance contracts for prioritized capabilities.
- Add binding conformance checks.
- Expose explainability surfaces in CLI/API.
Phase 3: Product trust maturity
- Ingest real usage and feedback signals.
- Promote skills from lab-verified to trusted/recommended based on evidence.
- Use trust states in discovery and routing preferences.
9. Success Criteria
- Users can run defaults without configuration burden.
- Skills have transparent trust states with clear evidence source.
- Overrides do not compromise baseline reliability due to terminal default fallback.
- Trusted and recommended states correlate with real quality outcomes.
- Portability remains intact while uncertainty is bounded and communicated.