workflow-health-triage¶
Triages a red GitHub Actions workflow on develop/main and dispatches the best-matching specialised agent to fix it.
Triages a failing GitHub Actions workflow run on develop or main per spec/project/workflow-health/. Classifies the failure into one of defect / flake / infra / stale pin / secret drift / other, dispatches the most specialised Claude agent that matches the classification, records the classification plus the dispatched agent's name in the eventual fix PR's Risk / rollout notes, and verifies the standard fix/-PR flow. Invoke when the user asks to "triage this red workflow", "classify this CI failure", or equivalent German-language requests. Don't use to silence checks via continue-on-error shortcuts or by removing required-checks entries (forbidden by spec); don't use to bypass branch protection (enforce_admins on develop has no exception path); don't use to merge a fix PR (use pull-request-merge). Supports resume on re-invocation per spec/claude/resumable-work/.
- Plugin: nolte-shared
- Phase: 6 Quality (quality)
- Tags: audit, pull-request
- Source: skills/workflow-health-triage/SKILL.md
Use when¶
- you want to triage a red workflow run on develop or main
- you want to classify a CI failure (defect / flake / infra / stale pin / secret drift)
- you want the fix to land via the standard fix/-PR flow
Do not use when¶
- You want to merge the fix PR after triage → pull-request-merge
- You want a per-repo CVE audit, not CI triage → dependency-audit
See also¶
Referenced by¶
- quality-gate-enforcer
- continuous-improvement-triage
- issue-orchestrate
- portfolio-inflight-triage
- release-publish-trigger
- spec-drift-audit
Workflow Health Triage¶
Implements spec/project/workflow-health/ §Triage before remediation and §Specialised-agent dispatch for remediation. The skill is the generalist that classifies a failure and dispatches the hands-on fix to the most specialised available Claude agent; it never performs the fix itself when a matching specialised agent exists.
Why this is a skill, not an agent¶
- Externally-visible mutations gate on user confirmation. Classification ambiguity, agent-dispatch confirmation, and the fix-PR title / body are mid-flow user dialogues; an agent's fire-and-forget shape would miss them.
- Orchestrator pattern (per `skill-vs-agent`). The work itself is classify, dispatch, verify; the dispatched specialised agent does the editing. The orchestrator stays in the main thread and chains other skills (`pull-request-create` for the fix PR, optionally `pull-request-merge` after CI is green).
- Per-classification user gating. At least three of the six classes (`defect`, `flake`, `secret drift`) need a human "yes that's the right call" before the dispatcher commits to a remediation lane.
- Counter-dimension considered: a narrow agent could handle the classification step in isolation and gain context-window protection, but every downstream lane (PR creation, agent dispatch, verification) is interactive, so keeping the whole flow in one skill is simpler than splitting at the classification boundary.
User-language policy¶
Detect the user's language and respond in it. All git, gh, and Agent(subagent_type=…) invocations stay English so the audit trail (PR titles, classification labels, agent names) is grep-able portfolio-wide.
Preconditions¶
Before any classification:
- Confirm the working directory is a git repository and `gh auth status` reports authenticated.
- Confirm `spec/project/workflow-health/<canonical_language>.md` exists in the current project. If missing, stop and report: without it the classifications are ad-hoc; this skill is the spec's implementer, not its replacement.
- The user supplies the failing workflow run (a `gh run view <id>` URL, a workflow file name like `release-publish.yml`, or "the red one on develop"). If nothing is supplied, run `gh run list --status failure --branch develop --limit 5` and ask which run to triage.
Operations¶
1. Inspect the failing run¶
Run in parallel:
- `gh run view <id> --json name,headBranch,headSha,event,status,conclusion,workflowName,url,jobs`
- `gh api repos/<owner>/<repo>/actions/runs/<id>/jobs --jq '.jobs[] | {name, conclusion, html_url}'`
- `gh run view <id> --log-failed` (capture the failing-step output for classification)
- `git log --oneline -1 <headSha>` (resolve the commit under the run)
Confirm the run is on develop or main (the spec's scope) and is conclusion: failure. If it's still in_progress, stop and ask the user to wait for completion before triage; if it's cancelled, classify as other with a one-line note and stop.
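The scope check above can be sketched as a small decision function over the parsed `gh run view --json ...` payload. This is a minimal illustration, not the skill's actual implementation: the function name `precheck` and the action labels `triage` / `wait` / `stop` are hypothetical, and it assumes the payload carries the `headBranch`, `status`, and `conclusion` fields requested in the first command.

```python
from typing import Optional

# Branches the spec puts in scope for triage.
IN_SCOPE_BRANCHES = {"develop", "main"}


def precheck(run: dict) -> tuple[str, Optional[str]]:
    """Return (action, note) for a parsed `gh run view --json ...` payload.

    action is one of "triage", "wait", "stop" (hypothetical labels).
    """
    branch = run.get("headBranch")
    status = run.get("status")
    conclusion = run.get("conclusion")

    if branch not in IN_SCOPE_BRANCHES:
        return "stop", f"branch {branch!r} is outside the spec's scope"
    if status != "completed":
        # Covers in_progress (and queued) runs: triage only finished runs.
        return "wait", "run has not completed; triage after it finishes"
    if conclusion == "cancelled":
        return "stop", "classification: other (run was cancelled)"
    if conclusion == "failure":
        return "triage", None
    return "stop", f"conclusion {conclusion!r} is not a failure"
```

Only the `triage` outcome continues to classification; the other two outcomes end the invocation with the note shown to the user.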
2. Classify before any re-run¶
Apply the spec's six classes in order; stop at the first match:
| Signal in the failed-step output | Classification |
|---|---|
| Failing step references a file the head commit's diff modified | defect |
| Re-run of the same `headSha` would produce green (no infra signal, no code change in the failing step's surface) | flake |
| HTTP 5xx, rate-limit, registry-unreachable, GitHub status incident | infra |
| `uses:` pin in the workflow points to a `nolte/gh-plumbing` (or other reusable) tag, and a newer tag exists with the relevant fix | stale pin |
| Token, deploy key, or OIDC trust expired or rotated; failure references `401`, `403`, `expired`, `unauthorized` | secret drift |
| None of the above | other (with a short note explaining why) |
Confirm the chosen class with the user before proceeding—at minimum for defect, flake, and secret drift, where misclassification has the highest cost. Apply the classification by writing it to a scratch note that becomes the seed of the fix PR's Risk / rollout notes.
Hard: never call gh run rerun <id> more than once before a recorded classification exists; repeated blind re-runs are drift per spec/project/workflow-health/ §Triage before remediation.
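The first-match ordering of the table and the re-run limit can be sketched as follows. The `Signals` fields are hypothetical booleans standing in for whatever evidence the triage gathers from the failed-step output; how each signal is derived is left out. The sketch only shows the evaluation order and the "at most one re-run before a recorded classification" guard.

```python
from dataclasses import dataclass

# Classes that the skill confirms with the user before proceeding.
NEEDS_CONFIRMATION = {"defect", "flake", "secret drift"}


@dataclass
class Signals:
    """Hypothetical evidence flags extracted from the failed run."""
    touches_changed_file: bool = False   # failing step references a modified file
    rerun_green_same_sha: bool = False   # same headSha re-ran green
    infra_signal: bool = False           # 5xx, rate limit, registry outage, incident
    newer_pin_with_fix: bool = False     # uses: tag is behind a tag with the fix
    auth_signal: bool = False            # 401/403/expired/unauthorized


def classify(s: Signals) -> str:
    """Apply the six classes in table order and stop at the first match."""
    rules = [
        ("defect", s.touches_changed_file),
        ("flake", s.rerun_green_same_sha and not s.infra_signal),
        ("infra", s.infra_signal),
        ("stale pin", s.newer_pin_with_fix),
        ("secret drift", s.auth_signal),
    ]
    for label, matched in rules:
        if matched:
            return label
    return "other"


def may_rerun(reruns_so_far: int, classification_recorded: bool) -> bool:
    """Allow at most one re-run until a classification has been recorded."""
    return classification_recorded or reruns_so_far < 1
```

Because the rules are ordered, a run that shows both a changed-file reference and an infra signal still resolves to `defect`; the user confirmation step is where such a call gets challenged.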
3. Dispatch the most specialised available agent¶
The set of available agents changes over time; never freeze a snapshot of "which agents exist" inside this skill body. Resolve the dispatch target dynamically each invocation:
- Resolve the candidate set. `Glob` `agents/*.md` (plus `~/.claude/agents/*.md` for the project-distributed half), then `Read` the `description:` line of every candidate. Build a (`name`, `description`) table; the table is the runtime inventory.
- Match classification to candidate. Walk the candidates and pick the one whose `description` most closely matches the (classification, failing-artefact-area) pair: a `defect` in a markdown spec/skill/agent file maps to whichever agent's description names "spec-conformant authoring" or the equivalent; a `defect` in a workflow YAML maps to whichever agent's description names "workflow YAML" or "GitHub Actions"; a documentation `defect` maps to whichever agent names an audience-targeted documentation responsibility; and so on. The match is on the description's stated responsibility, not on the agent's name.
- Recognise the no-fix classifications. `flake` and `secret drift` produce no agent dispatch by design: the work is documenting the flake in the project's flake registry (`FLAKES.md` or the `flake`-labelled issue set, whichever the repository uses) for `flake`, or rotating the credential outside Claude for `secret drift`. The skill produces only the fix PR that re-references the rotated credential.
- No match is a portfolio gap. When the candidate walk produces no plausible match and the failure class has occurred three or more times historically (use `gh run list --status failure --branch develop --limit 50` plus a quick grep to estimate), surface this as a portfolio gap per `spec/project/workflow-health/` §Specialised-agent dispatch; the user is asked whether to author a new agent (via `claude-plugin-developer`) before the fix PR opens.

When a match exists, dispatch with `Agent(subagent_type="<plugin>:<agent>")` and pass the classification, the run URL, the failing-step excerpt, and the fix-PR-title hint. Wait for the agent's report.
The dynamic-lookup design means a new specialised agent that lands in agents/ becomes dispatchable immediately, without a coordinated edit to this skill — and a renamed or removed agent stops being a target the next time the skill runs, with no stale snapshot to mislead the dispatch.
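The runtime inventory described above might look roughly like this. It is a sketch under stated assumptions: each agent file is assumed to carry a single-line `description:` entry, the agent name is taken from the file stem, and "most closely matches" is approximated by counting case-insensitive phrase hits, which is a crude stand-in for the judgement the skill actually applies. The helper names `load_inventory` and `pick_agent` are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a single-line `description:` entry anywhere in the file.
DESCRIPTION_RE = re.compile(r"^description:\s*(.+)$", re.MULTILINE)


def load_inventory(*agent_dirs: Path) -> dict[str, str]:
    """Build the (name, description) table from every *.md in the given dirs."""
    inventory: dict[str, str] = {}
    for directory in agent_dirs:
        if not directory.is_dir():
            continue  # a missing directory contributes no candidates
        for path in sorted(directory.glob("*.md")):
            match = DESCRIPTION_RE.search(path.read_text(encoding="utf-8"))
            if match:
                inventory[path.stem] = match.group(1).strip()
    return inventory


def pick_agent(inventory: dict[str, str], phrases: list[str]) -> Optional[str]:
    """Return the agent whose description contains the most phrases, else None."""
    best_name, best_score = None, 0
    for name, description in inventory.items():
        text = description.lower()
        score = sum(1 for phrase in phrases if phrase.lower() in text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

A `None` result corresponds to the "no plausible match" branch: either generalist remediation or, when the failure class recurs, the portfolio-gap question to the user.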
Old patterns¶
Earlier revisions of this skill enumerated specific agent names inline (for example workflow-yaml-fixer, claude-plugin-developer, audience-doc-author) as the dispatch table. That snapshot rotted whenever a new agent landed or an existing one renamed; the runtime-Glob design above replaces it. The historical mapping is preserved here only so a reader who recognises the prior wording can spot the transition: defect in workflow YAML used to fall back to generalist (no matching agent), defect in spec / skill / agent files used to dispatch claude-plugin-developer, defect in documentation used to dispatch audience-doc-author. Use the runtime lookup above instead of this snapshot.
4. Verify the fix PR carries the audit trail¶
Whether the editing was done by a specialised agent or the generalist, the resulting fix PR MUST carry both pieces of evidence in its Risk / rollout notes section per the pull-request-workflow spec:
- Triage classification verbatim (one of the six labels above, with a one-line note for `other`)
- Dispatched agent name (the literal `subagent_type` argument), or the literal phrase "no matching specialised agent—generalist remediation"
If the user wants the skill to open the fix PR itself, dispatch pull-request-create with these two lines pre-populated in the Risk / rollout notes; the user still confirms the title and body before push, per the pull-request-create spec's externally-visible-action gate.
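One way to pre-populate and later verify the two audit-trail lines is sketched below. The line prefixes `Triage classification:` and `Dispatched agent:` are assumptions made for illustration; the source only fixes what the lines must contain, not their exact layout. The check also scans the whole PR body rather than isolating the Risk / rollout notes section, which a real verification would need to do.

```python
from typing import Optional

# The closed set of classifications the spec lists.
CLASSES = ("defect", "flake", "infra", "stale pin", "secret drift", "other")
# Literal fallback phrase from the skill; \u2014 is the dash it uses.
GENERALIST_NOTE = "no matching specialised agent\u2014generalist remediation"


def audit_lines(classification: str, agent: Optional[str] = None,
                note: str = "") -> list[str]:
    """Render the two evidence lines for the Risk / rollout notes."""
    if classification not in CLASSES:
        raise ValueError(f"unknown classification: {classification!r}")
    if classification == "other" and not note:
        raise ValueError("'other' needs a one-line note")
    label = f"{classification} ({note})" if note else classification
    return [
        f"Triage classification: {label}",
        f"Dispatched agent: {agent or GENERALIST_NOTE}",
    ]


def carries_audit_trail(pr_body: str, lines: list[str]) -> bool:
    """True when every evidence line appears verbatim in the PR body."""
    return all(line in pr_body for line in lines)
```

Rejecting labels outside `CLASSES` at render time keeps a typo from reaching the PR, where it would break portfolio-wide grepping of classification labels.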
5. Verify the standard PR gate¶
After the fix PR opens, the skill stops. The actual merge belongs to pull-request-merge, which re-validates the gate. The skill MUST NOT:
- merge the fix PR itself (out of scope; `pull-request-merge` owns that)
- pass `--admin` anywhere (`enforce_admins: true` on `develop` has no exception)
- waive a still-failing required check on the fix PR (the spec forbids `continue-on-error` masking and required-check removal)
Report back the run ID, the classification, the dispatched agent name (or "generalist"), the fix-PR URL, and a one-line "next action: invoke pull-request-merge after CI is green".
Examples¶
- Read `examples/01-defect-classification-dispatch.md` when triaging a failure that classifies as `defect` and dispatches to a specialised agent.
- Read `examples/02-stale-pin-portfolio-gap.md` when the failure root-causes to a stale pin in the portfolio plumbing.
- Read `examples/03-flake-no-fix-record-only.md` when the failure classifies as `flake` and the skill records the classification without opening a fix PR.
Resumability¶
Per spec/claude/resumable-work/, this skill is resumable: true. State is persisted to .resume/workflow-health-triage/<run-id>.yml after every successful user-approval gate and after each named phase boundary. On re-invocation, scan that directory for files with status: in_progress whose inputs: snapshot matches the current invocation; if one matches, prompt the operator with Resume run <run_id> from phase <phase> (last checkpoint <last_checkpoint_at>)? [resume / start-new / discard]. The state-file envelope (schema_version, run_id, inputs, phase, decisions[], status, ...) and the fail-closed semantics on schema or YAML errors are load-bearing in the spec; don't duplicate those rules here.
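The resume lookup can be illustrated on already-parsed state files. This sketch deliberately skips YAML loading and the fail-closed handling of schema or YAML errors, which the spec owns; it assumes each state is a dict carrying the envelope fields named above plus `last_checkpoint_at`, which appears in the prompt template. The function names are hypothetical.

```python
from typing import Optional


def find_resumable(states: list[dict], inputs: dict) -> Optional[dict]:
    """Return the first in-progress state whose inputs snapshot matches."""
    for state in states:
        if state.get("status") == "in_progress" and state.get("inputs") == inputs:
            return state
    return None


def resume_prompt(state: dict) -> str:
    """Render the operator prompt for a matching state file."""
    return (
        f"Resume run {state['run_id']} from phase {state['phase']} "
        f"(last checkpoint {state['last_checkpoint_at']})? "
        "[resume / start-new / discard]"
    )
```

Matching on the full `inputs` snapshot means a re-invocation against a different workflow run starts fresh instead of picking up an unrelated checkpoint.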
Hard rules¶
- Never re-run a failed required workflow run more than once before a recorded triage classification exists; the spec calls repeated blind re-runs drift.
- Never classify a failure as `flake` without reproducible evidence (re-run of the same `headSha` returned green and no infra signal explains the first failure).
- Never mask a failure: don't propose `continue-on-error: true` on a required job, don't propose removing a check from the required-checks set in `.github/settings.yml` without a tracking Issue, don't propose repointing a `nolte/gh-plumbing` pin from a tag to a branch.
- Never bypass branch protection on the fix PR; `enforce_admins: true` on `develop` has no exception path.
- Never dispatch a specialised agent with a classification the spec doesn't list. The closed set is `defect / flake / infra / stale pin / secret drift / other`.
- Never open a fix PR whose Risk / rollout notes don't carry the classification and the dispatched agent name (or the explicit "no matching specialised agent" note).
- Always prefer a plugin-distributed specialised agent over the generalist when one matches; the spec's §Specialised-agent dispatch makes this a SHOULD that this skill upgrades to a hard contract for the dispatch step.
- When `spec/project/workflow-health/` disagrees with this skill, the spec wins. Propose updating this skill rather than silently diverging.
Gotchas¶
Per spec/claude/skill-management/ §Gotchas—concrete corrections to non-obvious environment facts the executing agent would otherwise get wrong.
- `GITHUB_TOKEN`-cascade failures aren't `defect`. A `release-drafter.yml` that doesn't fire after an `automerge.yaml` squash-merge, or a `release-cd-refresh-master.yml` that doesn't fire after `release-publish.yml`, is the documented `infra` class per `spec/project/workflow-health/` §Known platform constraints, and the remediation is upstream in `nolte/gh-plumbing`, not in the consumer repo. Don't open a fix PR against the consumer's workflow YAML; document the `infra` classification and reference the `nolte/gh-plumbing` tracking Issue.
- `pascalgn/automerge-action` exits 0 on `mergeResult: 'merge_failed'`. An `automerge.yaml` run with conclusion `success` whose log carries `mergeResult: 'merge_failed'` or `Failed to merge PR:` is a `stale pin` failure (the reusable's `MERGE_METHOD` default doesn't match the repo's allowed strategy in some pre-fix versions). Triage the `automerge.yaml` `uses:` tag, not the workflow YAML itself.
- Renovate-generated bump PRs for `nolte/gh-plumbing` aren't automerged in this portfolio. A `stale pin` remediation that proposes to enable Renovate automerge for `nolte/gh-plumbing` violates `workflow-health` §Upstream drift and the AC against it. The remediation is a human-acknowledged Renovate PR, not an automerge rule.
- `flake` without reproducible evidence is `defect`. A "let's just re-run and hope" reflex is exactly what the spec forbids. If the same `headSha` doesn't re-run cleanly green and no infra signal explains the first failure, the class is `defect` and the work is a fix, not a tracking entry.
Multi-model testing¶
Examples and operations in this skill are verified on Claude Sonnet 4.6 as the default model; spot-checked on Haiku 4.5 for cost-sensitive runs; Opus 4.7 is appropriate for high-stakes audits that require deeper reasoning. The skill body has no model-specific assumptions beyond standard tool-call semantics.