Test

Prove it works. Multi-dimensional quality validation across functional, non-functional, security, DevOps, DX, and observability. Run after /build.

Published Jun 4, 2026




---
description: Prove it works. Multi-dimensional quality validation across functional, non-functional, security, DevOps, DX, and observability. Run after /build.
allowed-tools: Task, Skill, Read, Glob, Grep, Bash(*)
argument-hint: [component-or-failure-description]
---

@CLAUDE.md

Test: $ARGUMENTS

If $ARGUMENTS is empty, test the current branch diff against the base branch.

## Step 0: Classify PR Type

Detect the base branch from `gh pr view --json baseRefName` or fall back to `main`. Run `git diff origin/<base-branch> --name-only` and classify changed files:

| Type | Patterns | Gates to Run |
|------|----------|--------------|
| CODE | `*.py`, `*.ps1`, `*.ts`, `*.js`, `*.cs` | All 6 gates |
| WORKFLOW | `*.yml` in `.github/workflows/` | Gates 1, 3, 4 |
| CONFIG | `*.json`, `*.yaml` (non-workflow) | Gates 3, 4 |
| DOCS | `*.md`, `*.txt`, `*.rst` | Gate 5 only |
| MIXED | Combination of the above | Apply each file's rule: run a gate only against the files whose type requires it |

Print: `PR TYPE: [type]. Running gates: [list].`

Skip non-applicable gates. Do not waste agent invocations on irrelevant dimensions.
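The classification table can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: `classify_file` and `classify_pr` are hypothetical helper names, treating `*.yaml` under `.github/workflows/` as WORKFLOW is an assumption, and so is reporting the union of gates for a MIXED PR. Files that match no pattern (for example a non-workflow `*.yml`) are ignored here because the table does not cover them.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Tuple

# Gate numbers per PR type, copied from the table above.
GATES_BY_TYPE = {
    "CODE": {1, 2, 3, 4, 5, 6},
    "WORKFLOW": {1, 3, 4},
    "CONFIG": {3, 4},
    "DOCS": {5},
}

CODE_SUFFIXES = {".py", ".ps1", ".ts", ".js", ".cs"}
CONFIG_SUFFIXES = {".json", ".yaml"}
DOCS_SUFFIXES = {".md", ".txt", ".rst"}


def classify_file(path: str) -> Optional[str]:
    """Return the PR type for one changed file, or None if no pattern matches."""
    suffix = PurePosixPath(path).suffix.lower()
    # Assumption: both .yml and .yaml under .github/workflows/ count as WORKFLOW.
    if path.startswith(".github/workflows/") and suffix in {".yml", ".yaml"}:
        return "WORKFLOW"
    if suffix in CODE_SUFFIXES:
        return "CODE"
    if suffix in CONFIG_SUFFIXES:
        return "CONFIG"
    if suffix in DOCS_SUFFIXES:
        return "DOCS"
    return None


def classify_pr(paths: Iterable[str]) -> Tuple[str, List[int]]:
    """Return (PR type, sorted gate numbers) for a list of changed paths."""
    types = {t for t in map(classify_file, paths) if t is not None}
    if not types:
        raise ValueError("no changed file matches a known pattern")
    # Assumption: a MIXED PR reports the union of the gates its files require.
    gates = sorted(set().union(*(GATES_BY_TYPE[t] for t in types)))
    pr_type = next(iter(types)) if len(types) == 1 else "MIXED"
    return pr_type, gates


pr_type, gates = classify_pr(["docs/guide.md", "settings/app.json"])
print(f"PR TYPE: {pr_type}. Running gates: {gates}.")
# prints: PR TYPE: MIXED. Running gates: [3, 4, 5].
```

In practice the path list would come from the `git diff origin/<base-branch> --name-only` output, one path per line.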

## Gate 1: Functional Testing

Invoke Skill(skill="code-qualities-assessment") for quality baseline.

Task(subagent_type="qa"): You are a senior QA engineer. Your job is to catch issues that will cause production incidents. Be skeptical. Cite specific file:line evidence for every finding. Evaluate:

1. **Unit coverage** - Each method in isolation, dependencies injected. Every new function has at least 1 test.
2. **Integration coverage** - Contracts between components verified. Cross-module boundaries exercised.
3. **Acceptance coverage** - Each requirement has a passing test. Map to acceptance criteria from /spec output.
4. **Edge cases** - Null/empty/boundary values, invalid types, concurrent access where applicable.
5. **Error paths** - Every catch/error branch tested. No silent swallowing. Resources cleaned up on failure.
6. **Regression risk** - High-risk areas (auth, data persistence, payments) require full coverage regardless of change size.

Output: `VERDICT: PASS|WARN|CRITICAL_FAIL` with findings array.
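As an illustration of what items 4 and 5 look for, the sketch below tests boundary values and every error branch of a small function. `parse_port` is a hypothetical function under test, invented for this example; the point is the shape of the tests, not the function itself.

```python
import unittest


def parse_port(value):
    """Hypothetical function under test: parse a TCP port from user input."""
    if value is None or str(value).strip() == "":
        raise ValueError("port is required")
    port = int(str(value).strip())  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTests(unittest.TestCase):
    def test_boundary_values_are_accepted(self):
        self.assertEqual(parse_port("1"), 1)
        self.assertEqual(parse_port(65535), 65535)

    def test_values_just_outside_the_range_are_rejected(self):
        for bad in (0, 65536, -1):
            with self.assertRaises(ValueError):
                parse_port(bad)

    def test_null_and_empty_inputs_are_rejected(self):
        for bad in (None, "", "   "):
            with self.assertRaises(ValueError):
                parse_port(bad)

    def test_non_numeric_input_is_rejected(self):
        for bad in ("http", "80.5"):
            with self.assertRaises(ValueError):
                parse_port(bad)
```

Saved as a test module, this runs with `python -m unittest`. Each `raise` in `parse_port` has a test that reaches it, which is the evidence the error-path criterion asks for.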

## Gate 2: Non-Functional Testing

Task(subagent_type="analyst"): You are a performance and reliability engineer. Focus on failure modes, not the happy path. Use measurable criteria, not subjective judgments. Evaluate:

1. **Performance** - No N+1 queries, no O(n*m) in hot paths, no blocking calls in async context.
2. **Scalability** - Will this bottleneck under load? Connection pooling, caching strategy, pagination.
3. **Reliability** - Retry logic, circuit breakers, graceful degradation. Failure modes tested.
4. **Complexity** - Cyclomatic complexity <=10. Methods <=60 lines. No deep nesting.
5. **Maintainability** - Readability, naming clarity, consistency with existing patterns.

Output: `VERDICT: PASS|WARN|CRITICAL_FAIL` with findings array.
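The complexity thresholds in item 4 are measurable, so they can be checked mechanically. The sketch below is a rough approximation for Python sources using the standard `ast` module; it is not the tool this skill uses. The counting rule (1 plus one per branch point) is a common simplification of cyclomatic complexity, nested functions are counted toward their parent, and the names `function_metrics` and `violations` are hypothetical.

```python
import ast

# Node types counted as one decision point each (a simplification).
DECISION_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                  ast.ExceptHandler, ast.IfExp)


def function_metrics(source):
    """Return (name, approx_complexity, line_count) for each function."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        complexity = 1
        for child in ast.walk(node):
            if isinstance(child, DECISION_NODES):
                complexity += 1
            elif isinstance(child, ast.BoolOp):
                # "a and b and c" adds two extra paths.
                complexity += len(child.values) - 1
        length = node.end_lineno - node.lineno + 1
        results.append((node.name, complexity, length))
    return results


def violations(source, max_complexity=10, max_lines=60):
    """Return the functions that exceed either threshold from item 4."""
    return [m for m in function_metrics(source)
            if m[1] > max_complexity or m[2] > max_lines]


sample = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
print(function_metrics(sample))  # prints: [('f', 4, 6)]
```

A finding backed by numbers like these satisfies the "measurable criteria" instruction better than a comment that a function "feels complex".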

## Gate 3: Security Testing

Invoke Skill(skill="security-scan") for CWE pattern detection.

Task(subagent_type="security"): You are a security auditor performing OWASP Top 10 review. Assume every input is malicious. Reference CWE numbers for every finding. Evaluate:

1. **Injection** - Shell (CWE-78), XSS (CWE-79), SQL (CWE-89). No string interpolation in queries.
2. **Authentication** - Session handling, credential storage, token validation.
3. **Secrets** - No hardcoded API keys, passwords, tokens in diff. Secrets via environment only.
4. **Input validation** - All user-facing inputs validated. LLM output treated as untrusted.
5. **Dependencies** - New packages reviewed for known vulnerabilities. Versions pinned.

Output: `VERDICT: PASS|WARN|CRITICAL_FAIL` with findings array including CWE references.
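To make the "no string interpolation in queries" rule concrete, here is a small self-contained comparison using Python's standard `sqlite3` module. The table, the function names, and the payload are made up for the example.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # CWE-89: the input is interpolated into the SQL text itself.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # The driver binds the value, so it is never parsed as SQL.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("alice",), ("bob",)])

payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # prints: 2 (every row leaks)
print(len(find_user_safe(conn, payload)))    # prints: 0 (no user has that name)
```

A finding for the first function would cite CWE-89 with its file:line and recommend the bound-parameter form shown in the second.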

## Gate 4: DevOps Testing

Task(subagent_type="devops"): You are a build and release engineer. Focus on pipeline safety, reproducibility, and supply chain security. Evaluate:

1. **Pipeline impact** - Do changes affect CI/CD? Are workflow files valid YAML?
2. **Actions security** - Pinned to SHA? Permissions scoped minimally? No secrets in logs?
3. **Shell quality** - Input sanitization, exit code handling, error propagation.
4. **Build reproducibility** - Deterministic builds, locked dependencies, no floating versions.
5. **Artifact integrity** - Correct upload/download, retention policy, no sensitive data in artifacts.

Output: `VERDICT: PASS|WARN|CRITICAL_FAIL` with findings array.
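The "Pinned to SHA?" question in item 2 can be approximated with a line-based scan. This is a sketch under stated assumptions: it looks for a 40-character hexadecimal commit SHA after `@`, it does not parse YAML (quoted `uses:` values are not handled), and it skips local (`./`) and `docker://` references, which are pinned by other means. `unpinned_actions` is a hypothetical helper name.

```python
import re

# Matches "uses: owner/repo@ref" lines, with or without a leading "- ".
USES_RE = re.compile(r"^[ \t]*-?[ \t]*uses:[ \t]*([^\s#]+)", re.MULTILINE)
# A full-length commit SHA is 40 hexadecimal characters.
SHA_RE = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for ref in USES_RE.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        if not SHA_RE.search(ref):
            unpinned.append(ref)
    return unpinned


workflow = """
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@0123456789abcdef0123456789abcdef01234567  # v5
      - uses: ./.github/actions/local
"""
print(unpinned_actions(workflow))  # prints: ['actions/checkout@v4']
```

The SHA in the sample is a placeholder, not a real commit. A tag such as `v4` can be moved by the action's owner, which is why the gate treats it as a supply chain risk.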

## Gate 5: Developer Experience (DX)

Invoke Skill(skill="orphan-ref-validator"). Reject the gate on `VERDICT: CRITICAL_FAIL` or `VERDICT: ERROR`. `VERDICT: WARN` is non-blocking and surfaces in the test summary. This mirrors `/build` Mandatory Exit Gate 4 (per `.claude/commands/build.md:56`), so a reference to a deleted skill or a missing script path is caught at `/test` as well as at `/build`.

To diagnose a failure, re-run the skill with `--output human`; each finding shows `path:line` plus a one-line recommendation. If pre-existing drift outside the PR's scope blocks the gate, fix it in the same PR. The `<!-- orphan-ref-ignore -->` and `<!-- orphan-ref-ignore-file -->` directives are documented in the skill's SKILL.md.

Manifest count drift is owned by the canonical `build/scripts/validate_marketplace_counts.py`. The skill's `COUNT_CLAIM_RE` mirrors that script but does not duplicate its emission; pass `--enforce-counts` only when you want single-plugin `count_claim` emission directly from the skill. The skill invocation is platform-agnostic; each platform mirror runs its own copy of `scan.py`.

Task(subagent_type="critic"): You are a developer advocate reviewing from the consumer perspective. Would a new contributor understand this code? Would the API frustrate or delight? Evaluate:

1. **API ergonomics** - Consumer perspective. Are signatures intuitive? Error messages helpful?
2. **Documentation** - Is changed behavior documented? Are code comments accurate (not stale)?
3. **Debuggability** - Can a developer diagnose failures from logs alone? Stack traces preserved?
4. **Onboarding** - Would a new contributor understand this code? Are conventions followed?
5. **Tooling** - Does this work with existing linters, formatters, IDE support?

Output: `VERDICT: PASS|WARN|CRITICAL_FAIL` with findings array.

## Gate 6: Observability and Monitoring

Task(subagent_type="architect"): You are an SRE reviewing production readiness. If this code fails at 3am, can oncall diagnose it without reading the source? Evaluate:

1. **Logging** - Are meaningful events logged? Structured logging with correlation IDs?
2. **Metrics** - Are SLIs defined for new features? Latency, error rate, throughput tracked?
3. **Alerting** - Would failures trigger alerts? Are thresholds appropriate?
4. **Tracing** - Are distributed traces propagated? Span context preserved across boundaries?
5. **Health checks** - New services have liveness/readiness probes? Degradation detectable?

Output: `VERDICT: PASS|WARN|CRITICAL_FAIL` with findings array.
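For item 1, "structured logging with correlation IDs" could look like the following sketch built on Python's standard `logging` module. The JSON field names, the `payments` logger name, and the `req-123` ID are illustrative assumptions; real services usually follow whatever schema their log pipeline expects.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set through LoggerAdapter's extra dict; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        if record.exc_info:
            # Keep the stack trace so failures can be diagnosed from logs.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream, name="payments"):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = logging.LoggerAdapter(build_logger(stream), {"correlation_id": "req-123"})
log.info("charge accepted")
print(stream.getvalue().strip())
```

Because every line carries the same `correlation_id`, oncall can filter one request's events across components without reading the source.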

## Principles

- **Testability is design feedback**: Hard to test means poor encapsulation, tight coupling, Law of Demeter violation, weak cohesion, or procedural code.
- **Tests are proof**: A passing test is evidence. A missing test is a gap in knowledge.
- **Hypothesis-driven debugging**: When a test fails, form a hypothesis before changing code. Verify the hypothesis. Then fix.
- **Defense in depth**: Assume the happy path works. Focus on failure modes.

## Process

1. Identify what changed (`git diff` against the base branch).
2. Classify PR type (Step 0). Skip non-applicable gates.
3. Run applicable gates sequentially. Each gate dispatches its own agent.
4. If any gate produces CRITICAL_FAIL: continue remaining gates (findings are additive). Mark overall verdict as CRITICAL_FAIL immediately.
5. For test failures: form a hypothesis, verify it, then fix. Never change code without understanding why the test failed.
6. Invoke Skill(skill="quality-grades") to synthesize gate verdicts into overall quality score.

## Output

Each gate MUST produce a verdict line and findings array:

```text
GATE: [name]
VERDICT: PASS|WARN|CRITICAL_FAIL
FINDINGS:
- [SEVERITY] (file:line) description: recommendation
```
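A consumer of this format might parse it as sketched below. The parser is an illustration only: `GateReport` and `parse_gate_report` are hypothetical names, the `HIGH` severity label in the sample is invented (this document does not define the severity values), and file paths containing a colon are not handled.

```python
import re
from dataclasses import dataclass, field

VALID_VERDICTS = {"PASS", "WARN", "CRITICAL_FAIL"}

# - [SEVERITY] (file:line) description: recommendation
FINDING_RE = re.compile(
    r"^- \[(?P<severity>[^\]]+)\] "
    r"\((?P<file>[^:()]+):(?P<line>\d+)\) "
    r"(?P<description>.*?): (?P<recommendation>.+)$"
)


@dataclass
class GateReport:
    gate: str
    verdict: str
    findings: list = field(default_factory=list)


def parse_gate_report(text):
    """Parse one gate block; raise ValueError if GATE or VERDICT is missing."""
    gate = verdict = None
    findings = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("GATE:"):
            gate = line[len("GATE:"):].strip()
        elif line.startswith("VERDICT:"):
            verdict = line[len("VERDICT:"):].strip()
        else:
            match = FINDING_RE.match(line)
            if match:
                findings.append(match.groupdict())
    if gate is None or verdict not in VALID_VERDICTS:
        raise ValueError("report needs a GATE line and a valid VERDICT line")
    return GateReport(gate, verdict, findings)


report = parse_gate_report(
    "GATE: Security\n"
    "VERDICT: WARN\n"
    "FINDINGS:\n"
    "- [HIGH] (src/db.py:42) SQL built with f-string (CWE-89): "
    "use parameterized queries\n"
)
print(report.gate, report.verdict, len(report.findings))
# prints: Security WARN 1
```

Rejecting a block without a valid verdict line keeps a malformed agent response from being counted as a pass.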

Synthesize into overall report:

| Gate | Verdict | Findings | Evidence |
|------|---------|----------|----------|
| Functional | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| Non-Functional | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| Security | PASS/WARN/CRITICAL_FAIL | Count | CWE references |
| DevOps | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| DX | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| Observability | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |

**Overall verdict**: CRITICAL_FAIL if any gate returns CRITICAL_FAIL. Otherwise WARN if any gate returns WARN. PASS only if every gate that ran returns PASS.
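The overall verdict rule amounts to taking the most severe verdict among the gates that ran. A minimal sketch, with `overall_verdict` as a hypothetical helper name:

```python
# Higher number means more severe.
SEVERITY_ORDER = {"PASS": 0, "WARN": 1, "CRITICAL_FAIL": 2}


def overall_verdict(gate_verdicts):
    """Return the most severe verdict from a {gate name: verdict} mapping."""
    if not gate_verdicts:
        raise ValueError("no gates were run")
    unknown = set(gate_verdicts.values()) - set(SEVERITY_ORDER)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    return max(gate_verdicts.values(), key=SEVERITY_ORDER.__getitem__)


print(overall_verdict({"Functional": "PASS", "Security": "WARN"}))
# prints: WARN
print(overall_verdict({"Functional": "CRITICAL_FAIL", "DX": "WARN"}))
# prints: CRITICAL_FAIL
```

Skipped gates are simply absent from the mapping, so a DOCS-only PR with a passing Gate 5 still resolves to PASS.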
