feat(ci-logs): runner-side log mirror for agent fix-ci dispatch #1103

Closed

opened 2026-05-11 09:26:40 +00:00 by claude-desktop · 0 comments

claude-desktop commented

2026-05-11 09:26:40 +00:00

Collaborator

User story

As a code-lead or dev agent dispatched to fix a red CI run, I want the last ~200 lines of every failed step embedded in my task prompt, so that I can diagnose the failure without trying to fetch logs from a Forgejo API that does not exist — eliminating the "blocked on log access" stall that re-loops the fix-ci dispatch indefinitely.

Background

Forgejo 15 does not expose any REST endpoint to download workflow-run logs or artifacts. Confirmed by hitting https://forge.jacquin.app/swagger.v1.json (2026-05-11):

Action-related paths in swagger: runs, runs/{run_id}, runners, runners/jobs, secrets, variables, workflows/{name}/dispatches, tasks.
No …/logs, no …/artifacts/download, no …/jobs/{job_id}/logs.
The web UI uses an internal dbfs session-cookie route that returns dbfs open "actions_log/": file does not exist to API token callers.
Upstream tracking: go-gitea/gitea#35176 — open feature request for per-step log API. No ETA.

Symptom in our loop (the PR #1091 / #1080 cycle): the code-lead is dispatched, cannot read the CI logs, comments "blocked on log access", and the task ends. The next CI failure re-dispatches into the same blocker. The fix can live in user space, because claude-hooks owns the runner host: work around the missing API by mirroring log tails from inside the workflow itself.

Acceptance criteria

Endpoint (claude-hooks server)

POST /internal/ci-log accepts { repo, run_id, job_id, step_name, head_sha, conclusion, log_tail } JSON body.
repo is in <owner>/<name> form, validated against the watched-repos table; reject with 404 if unknown.
log_tail is plain text, ≤64 KiB on input. The server truncates it to the last 200 lines or 32 KiB (whichever is smaller) before storage; truncation on the workflow side is best-effort, and the server enforces the cap.
Returns 204 on success, 4xx on bad input. No body.
Path lives under /internal/* (already gated by the internal-only auth band — no public surface).
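
The server-side cap described above can be sketched as follows. This is a minimal illustration, not the claude-hooks implementation: the function name `truncateLogTail` and the constants are hypothetical, and it assumes the 32 KiB limit is measured in UTF-8 bytes. It applies the line cap first, then the byte cap, dropping content from the front so the most recent output survives.

```typescript
// Hypothetical constants mirroring the acceptance criteria.
const MAX_STORED_LINES = 200;
const MAX_STORED_BYTES = 32 * 1024;

// Keep the last `maxLines` lines, then enforce `maxBytes` (UTF-8) from the end.
function truncateLogTail(
  logTail: string,
  maxLines: number = MAX_STORED_LINES,
  maxBytes: number = MAX_STORED_BYTES,
): string {
  // Treat one trailing newline as a terminator, not as an extra empty line.
  const hasTrailingNewline = logTail.endsWith("\n");
  const body = hasTrailingNewline ? logTail.slice(0, -1) : logTail;

  let tail = body.split("\n").slice(-maxLines).join("\n");
  if (hasTrailingNewline) tail += "\n";

  const bytes = new TextEncoder().encode(tail);
  if (bytes.length <= maxBytes) return tail;

  // Drop bytes from the front, then skip UTF-8 continuation bytes (10xxxxxx)
  // so the result never starts in the middle of a multi-byte character.
  let start = bytes.length - maxBytes;
  while (start < bytes.length && (bytes[start] & 0xc0) === 0x80) start++;
  return new TextDecoder().decode(bytes.subarray(start));
}
```

Applying the line cap before the byte cap means a log with a few very long lines is still bounded, while a log with many short lines keeps exactly its last 200.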

Auth

Endpoint requires the shared token held in the CLAUDE_HOOKS_INTERNAL_TOKEN env var (same band as existing /internal/* routes). Workflow reads it from a repo-level secret of the same name (CLAUDE_HOOKS_INTERNAL_TOKEN) and sends it via Authorization: Bearer ….
Reject any request that does not originate on the host itself (the systemd unit already binds the server to 127.0.0.1; no change here, just document it).
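
A minimal sketch of the bearer check, assuming the handler receives the raw Authorization header and the configured token as strings. The names `isAuthorized` and `constantTimeEqual` are hypothetical, and the existing /internal/* band may already provide an equivalent guard. The comparison is written by hand to keep the example dependency-free; in a Node or Bun server, `crypto.timingSafeEqual` is the usual choice.

```typescript
// Compare two strings without returning early on the first mismatch.
// This is only approximately constant-time in a JavaScript runtime.
function constantTimeEqual(a: string, b: string): boolean {
  const encoder = new TextEncoder();
  const x = encoder.encode(a);
  const y = encoder.encode(b);
  let diff = x.length ^ y.length;
  const len = Math.max(x.length, y.length);
  for (let i = 0; i < len; i++) {
    diff |= (x[i] ?? 0) ^ (y[i] ?? 0);
  }
  return diff === 0;
}

// Returns true only for "Authorization: Bearer <token>" with the expected token.
function isAuthorized(
  authorizationHeader: string | null | undefined,
  expectedToken: string,
): boolean {
  // Refuse everything if the server was started without a token configured.
  if (!expectedToken) return false;
  if (!authorizationHeader) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (!match) return false;
  return constantTimeEqual(match[1], expectedToken);
}
```

A route handler would answer 401 whenever `isAuthorized` returns false, which matches the "401 on missing auth" case in the test list.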

Storage

New table ci_logs (Drizzle migration): id INTEGER PK, repo TEXT, head_sha TEXT, run_id TEXT, job_id TEXT, step_name TEXT, conclusion TEXT, log_tail TEXT, created_at INTEGER (epoch ms).
Index on (repo, head_sha) — fast lookup at dispatch_fix_ci time.
Index on created_at — used by retention sweeper.
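
To make the column list concrete, here is a sketch of the row shape and of the mapping from the POST body to a row. The type and function names (`CiLogPayload`, `NewCiLogRow`, `toCiLogRow`) are hypothetical, and accepting numeric run_id / job_id values on input is an assumption; the actual Drizzle schema would be defined in the migration, not here.

```typescript
// Body of POST /internal/ci-log, as listed in the acceptance criteria.
// Allowing numbers for run_id / job_id is an assumption of this sketch.
interface CiLogPayload {
  repo: string;
  run_id: string | number;
  job_id: string | number;
  step_name: string;
  head_sha: string;
  conclusion: string;
  log_tail: string;
}

// One ci_logs row before insertion; `id` is omitted because the database
// assigns the INTEGER primary key.
interface NewCiLogRow {
  repo: string;
  head_sha: string;
  run_id: string;
  job_id: string;
  step_name: string;
  conclusion: string;
  log_tail: string;
  created_at: number; // epoch milliseconds
}

// Map a validated payload to a row, stamping created_at at receipt time.
function toCiLogRow(payload: CiLogPayload, nowMs: number = Date.now()): NewCiLogRow {
  return {
    repo: payload.repo,
    head_sha: payload.head_sha,
    run_id: String(payload.run_id), // stored as TEXT
    job_id: String(payload.job_id), // stored as TEXT
    step_name: payload.step_name,
    conclusion: payload.conclusion,
    log_tail: payload.log_tail,
    created_at: nowMs,
  };
}
```

Stamping created_at on the server rather than trusting a client-supplied time keeps the retention sweep independent of runner clocks.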

Workflow integration (forge-base/qa-bun.yml)

Add a final step with if: failure() to qa-bun.yml (and any sibling reusable workflows we own) that:
- Captures stdout/stderr of every prior step that failed (Forgejo exposes step status via ${{ steps.<id>.outcome }}; iterate over the known step IDs: setup, typecheck, lint, fmt-check, test).
- Tails each failed step's log file from /var/run/act/... (path documented in forgejo-runner config) to ~200 lines.
- POSTs the JSON payload to ${{ secrets.CLAUDE_HOOKS_LOG_URL }} with the bearer token from ${{ secrets.CLAUDE_HOOKS_INTERNAL_TOKEN }}.
- Best-effort — never fail the job because the mirror endpoint is unreachable. Wrap in continue-on-error: true.
Pin the new step's bash invocation to a specific forge-base tag bump (v0.2.3).
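
The step's logic can be sketched as below. The issue calls for a bash invocation; the sketch is in TypeScript only to keep this document's examples in one language, and every name in it (`failedStepIds`, `buildPayloads`, `postBestEffort`, the `readTail` and `send` callbacks) is hypothetical. Reading the log file is left to the injected `readTail` callback because the on-disk path is specific to the runner configuration.

```typescript
type StepOutcome = "success" | "failure" | "cancelled" | "skipped";

// Step IDs named in the acceptance criteria.
const KNOWN_STEP_IDS = ["setup", "typecheck", "lint", "fmt-check", "test"] as const;

interface RunContext {
  repo: string;
  run_id: string;
  job_id: string;
  head_sha: string;
}

interface MirrorPayload extends RunContext {
  step_name: string;
  conclusion: string;
  log_tail: string;
}

// Select the known steps whose outcome (from steps.<id>.outcome) is "failure".
function failedStepIds(outcomes: Record<string, StepOutcome | undefined>): string[] {
  return KNOWN_STEP_IDS.filter((id) => outcomes[id] === "failure");
}

// Build one payload per failed step; `readTail` supplies that step's log tail.
function buildPayloads(
  ctx: RunContext,
  outcomes: Record<string, StepOutcome | undefined>,
  readTail: (stepId: string) => string,
): MirrorPayload[] {
  return failedStepIds(outcomes).map((id) => ({
    ...ctx,
    step_name: id,
    conclusion: "failure",
    log_tail: readTail(id),
  }));
}

// Best-effort delivery: a failing `send` is swallowed so the job result is
// never changed by an unreachable mirror endpoint. Returns the delivered count.
async function postBestEffort(
  payloads: MirrorPayload[],
  send: (payload: MirrorPayload) => Promise<void>,
): Promise<number> {
  let delivered = 0;
  for (const payload of payloads) {
    try {
      await send(payload);
      delivered++;
    } catch {
      // Intentionally ignored: mirroring must not fail the workflow.
    }
  }
  return delivered;
}
```

Swallowing errors inside the sender complements `continue-on-error: true`: even if the step itself were not marked as such, a network failure would not surface as a non-zero exit.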

dispatch_fix_ci integration

domain/workflow/post-ci.ts::dispatchFixCi reads ci_logs rows for (repo, head_sha) before constructing the task prompt.

If at least one row exists, append a fenced code block to the task prompt:

## CI failure context (last <N> lines of <step_name>, run <run_id>)
<log_tail>

If zero rows exist (older runs, mirror endpoint down, workflow without the step), fall back to current behaviour — task prompt without log context. The op must remain functional with no logs.
No behavioural change to dispatch_fix_ci's dedup map — keyed by head SHA as today.
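
The prompt augmentation above can be sketched as a pure function, which also makes the zero-row fallback easy to unit-test. `appendCiContext` and `CiLogContextRow` are hypothetical names, and emitting one fenced block per stored row is an assumption; the issue does not say how multiple failed steps are combined.

```typescript
// The subset of ci_logs columns the prompt needs.
interface CiLogContextRow {
  step_name: string;
  run_id: string;
  log_tail: string;
}

// Built at runtime so this example can itself live inside a fenced block.
const FENCE = "`".repeat(3);

// Count lines, ignoring a single trailing newline.
function countLines(text: string): number {
  if (text === "") return 0;
  return text.replace(/\n$/, "").split("\n").length;
}

// Append one fenced block per row; with zero rows the prompt is returned
// unchanged, which is the documented fallback behaviour.
function appendCiContext(prompt: string, rows: CiLogContextRow[]): string {
  if (rows.length === 0) return prompt;
  const blocks = rows.map((row) => {
    const tail = row.log_tail.replace(/\n$/, "");
    const heading =
      `## CI failure context (last ${countLines(tail)} lines of ${row.step_name}, run ${row.run_id})`;
    return [FENCE, heading, tail, FENCE].join("\n");
  });
  return [prompt, ...blocks].join("\n\n");
}
```

One caveat with this sketch: a log tail that itself contains a triple-backtick line would close the fence early, so a real implementation may want a longer fence or escaping.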

Retention

Background job (extend existing janitor.ts): delete ci_logs rows older than 14 days. Cheap; runs every 10 min.
No per-repo quota: each row is capped at 200 lines / 32 KiB, so a handful of failed steps across a few PRs per day, kept for 14 days, stays MB-scale, not GB.
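
A small sketch of the retention rule and of the sizing argument. The helper names are hypothetical, the sweep is shown over an in-memory array rather than as the actual janitor.ts query, and the figure of 10 failed runs per day is an illustrative assumption, not a measurement.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const RETENTION_DAYS = 14;

// Rows with created_at strictly below this value are due for deletion.
// In SQL terms this would be roughly: DELETE FROM ci_logs WHERE created_at < ?
function retentionCutoff(nowMs: number): number {
  return nowMs - RETENTION_DAYS * DAY_MS;
}

// Split rows into those that survive the sweep and those that expire.
function partitionByRetention<T extends { created_at: number }>(
  rows: T[],
  nowMs: number,
): { kept: T[]; expired: T[] } {
  const cutoff = retentionCutoff(nowMs);
  return {
    kept: rows.filter((row) => row.created_at >= cutoff),
    expired: rows.filter((row) => row.created_at < cutoff),
  };
}

// Upper bound on stored bytes: every run fails every known step at the full
// per-row cap, for the whole retention window.
function worstCaseBytes(
  failedRunsPerDay: number,
  stepsPerRun: number = 5,
  rowCapBytes: number = 32 * 1024,
): number {
  return failedRunsPerDay * stepsPerRun * rowCapBytes * RETENTION_DAYS;
}
```

With the assumed 10 failed runs per day, `worstCaseBytes(10)` is 10 × 5 × 32768 × 14 = 22,937,600 bytes, about 22 MiB, which supports the decision to skip a per-repo quota.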

Tests

Server: unit test on the route handler — happy path, 4xx on bad payload, 401 on missing auth.
Server: unit test on dispatchFixCi — fresh-log path injects the fenced block; zero-log path falls through.
Server: integration test on the retention sweeper — old rows deleted, fresh rows survive.
Workflow: dry-run by triggering a deliberate red CI on a throwaway branch and confirming a row lands in ci_logs within 10s of step failure.

Out of scope

Per-step log API on the Forgejo server itself (upstream issue #35176 — track, don't fork).
Mirroring successful runs. The mirror step runs only under if: failure(); successful runs need no fix-ci dispatch.
Artifact mirroring (binary uploads, test reports, screenshots). Different problem, different storage shape.
Live log streaming. The agent gets a post-mortem tail at dispatch time, not a live feed.
Forgejo runner config changes — the runner already produces step logs on disk; the workflow step just reads them.
Web UI surfacing of ci_logs. Stored data is for agents, not operators; operators read logs via the existing Forgejo web UI.

References

Forgejo Actions reference (log retention): https://forgejo.org/docs/latest/user/actions/reference/
Forgejo Actions admin guide (storage.actions_log): https://forgejo.org/docs/latest/admin/actions/
Upstream feature request (per-step log API): https://github.com/go-gitea/gitea/issues/35176
PR #1091 incident — "Blocked on log access" comment chain demonstrating the stall.
docs/architect.md / docs/api.md — /internal/* endpoint band reference.
docs/database.md — Drizzle migration runner conventions.
forge-base repo — home of qa-bun.yml reusable workflow that ships the new step.
Global memory: "Forgejo Actions: workflow & job naming convention" (CLAUDE.md) — generic step names, project-agnostic workflow files.


claude-desktop added the area:agents and type:user-story labels 2026-05-11 09:26:50 +00:00