feat(ci-logs): runner-side log mirror for agent fix-ci dispatch #1103

Closed
opened 2026-05-11 09:26:40 +00:00 by claude-desktop · 0 comments
Collaborator

User story

As a code-lead or dev agent dispatched to fix a red CI run, I want the last ~200 lines of every failed step embedded in my task prompt, so that I can diagnose the failure without trying to fetch logs from a Forgejo API that does not exist — eliminating the "blocked on log access" stall that re-loops the fix-ci dispatch indefinitely.

Background

Forgejo 15 does not expose any REST endpoint to download workflow-run logs or artifacts. Confirmed by hitting https://forge.jacquin.app/swagger.v1.json (2026-05-11):

  • Action-related paths in swagger: runs, runs/{run_id}, runners, runners/jobs, secrets, variables, workflows/{name}/dispatches, tasks.
  • No …/logs, no …/artifacts/download, no …/jobs/{job_id}/logs.
  • The web UI uses an internal dbfs session-cookie route; for API-token callers it returns the error: dbfs open "actions_log/": file does not exist.
  • Upstream tracking: go-gitea/gitea#35176 — open feature request for per-step log API. No ETA.

Symptom in our loop: PR #1091 / #1080 cycle. Code-lead dispatched, can't read CI logs, comments "blocked on log access", task ends. Next CI failure re-dispatches with same blocker. The fix exists in user-space (claude-hooks owns the runner host) — short-circuit the missing API by mirroring log tails from inside the workflow itself.

Acceptance criteria

Endpoint (claude-hooks server)

  • POST /internal/ci-log accepts { repo, run_id, job_id, step_name, head_sha, conclusion, log_tail } JSON body.
  • repo is in <owner>/<name> form, validated against the watched-repos table; reject with 404 if unknown.
  • log_tail is plain text, ≤64 KiB. Server truncates to last 200 lines or 32 KiB (whichever is smaller) for storage — workflow side is best-effort, server is the cap.
  • Returns 204 on success, 4xx on bad input. No body.
  • Path lives under /internal/* (already gated by the internal-only auth band — no public surface).
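
The validation and truncation rules above could be sketched as follows. This is a minimal illustration, not the claude-hooks implementation: statusFor, truncateLogTail and the in-memory watchedRepos set are hypothetical names, every payload field is assumed to be a string, and 400 is assumed for the generic "4xx on bad input" case.

```typescript
// Limits taken from the acceptance criteria; all names here are hypothetical.
const MAX_INPUT_BYTES = 64 * 1024; // request-side limit for log_tail
const MAX_STORED_BYTES = 32 * 1024; // storage cap
const MAX_STORED_LINES = 200;

const FIELDS = [
  "repo", "run_id", "job_id", "step_name", "head_sha", "conclusion", "log_tail",
] as const;

const utf8Length = (text: string): number => new TextEncoder().encode(text).length;

// Returns the HTTP status the route would answer with (assumption: all fields are strings).
export function statusFor(body: unknown, watchedRepos: ReadonlySet<string>): 204 | 400 | 404 {
  if (typeof body !== "object" || body === null) return 400;
  const record = body as Record<string, unknown>;
  for (const field of FIELDS) {
    if (typeof record[field] !== "string") return 400;
  }
  const repo = record["repo"] as string;
  if (!/^[^/\s]+\/[^/\s]+$/.test(repo)) return 400; // must look like <owner>/<name>
  if (utf8Length(record["log_tail"] as string) > MAX_INPUT_BYTES) return 400;
  if (!watchedRepos.has(repo)) return 404; // unknown repo
  return 204;
}

// Keeps the last `maxLines` lines, then the last `maxBytes` bytes of that tail.
export function truncateLogTail(
  logTail: string,
  maxLines: number = MAX_STORED_LINES,
  maxBytes: number = MAX_STORED_BYTES,
): string {
  // Ignore a single trailing newline so it does not count as an empty line.
  const text = logTail.endsWith("\n") ? logTail.slice(0, -1) : logTail;
  let tail = text.split("\n").slice(-maxLines).join("\n");

  const bytes = new TextEncoder().encode(tail);
  if (bytes.length > maxBytes) {
    // Cutting inside a multi-byte character decodes to U+FFFD; drop those leading marks.
    tail = new TextDecoder()
      .decode(bytes.slice(bytes.length - maxBytes))
      .replace(/^\uFFFD+/, "");
  }
  return tail;
}
```

Applying the line cap before the byte cap means the stored text never exceeds either limit, which is one reading of "whichever is smaller".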

Auth

  • Endpoint requires the shared CLAUDE_HOOKS_INTERNAL_TOKEN env var (same band as existing /internal/* routes). Workflow reads it from a repo-level secret (CLAUDE_HOOKS_INTERNAL_TOKEN) and sends it via Authorization: Bearer ….
  • Reject any request that does not originate on the host itself. The systemd unit already binds the server to 127.0.0.1, so no change is needed here; just document it.
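
A bearer check along these lines would satisfy the token requirement. It is a sketch only: isAuthorized is a hypothetical helper, and the existing /internal/* middleware may already do the equivalent. The constant-time comparison is an added precaution, not something the issue asks for.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical helper: true only when the header carries exactly the expected bearer token.
export function isAuthorized(
  authorizationHeader: string | null | undefined,
  expectedToken: string,
): boolean {
  // An unset server-side token must never authorize anything.
  if (!authorizationHeader || expectedToken.length === 0) return false;

  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (!match) return false;

  const encoder = new TextEncoder();
  const presented = encoder.encode(match[1] ?? "");
  const expected = encoder.encode(expectedToken);

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (presented.length !== expected.length) return false;
  return timingSafeEqual(presented, expected);
}
```

A route handler would call this with the request's Authorization header and the value of CLAUDE_HOOKS_INTERNAL_TOKEN, answering 401 when it returns false.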

Storage

  • New table ci_logs (Drizzle migration): id INTEGER PK, repo TEXT, head_sha TEXT, run_id TEXT, job_id TEXT, step_name TEXT, conclusion TEXT, log_tail TEXT, created_at INTEGER (epoch ms).
  • Index on (repo, head_sha) — fast lookup at dispatch_fix_ci time.
  • Index on created_at — used by retention sweeper.
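
For orientation, the column list and the two indexes correspond roughly to the SQL below, held in a TypeScript constant next to a matching row type. The real migration would come from Drizzle. The SQLite dialect, the index names and the CiLogRow type name are assumptions made for this sketch.

```typescript
// Row shape matching the columns listed above (type name is hypothetical).
export interface CiLogRow {
  id: number;
  repo: string;
  head_sha: string;
  run_id: string;
  job_id: string;
  step_name: string;
  conclusion: string;
  log_tail: string;
  created_at: number; // epoch milliseconds
}

// Assumed SQLite-style DDL; index names are illustrative only.
export const CI_LOGS_DDL: readonly string[] = [
  `CREATE TABLE ci_logs (
    id INTEGER PRIMARY KEY,
    repo TEXT,
    head_sha TEXT,
    run_id TEXT,
    job_id TEXT,
    step_name TEXT,
    conclusion TEXT,
    log_tail TEXT,
    created_at INTEGER
  )`,
  // Serves the lookup done at dispatch_fix_ci time.
  "CREATE INDEX ci_logs_repo_head_sha_idx ON ci_logs (repo, head_sha)",
  // Serves the retention sweeper's range delete.
  "CREATE INDEX ci_logs_created_at_idx ON ci_logs (created_at)",
];
```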

Workflow integration (forge-base/qa-bun.yml)

  • Add a final step with if: failure() to qa-bun.yml (and any sibling reusable workflows we own) that:
    • Captures stdout/stderr of every prior step that failed (Forgejo exposes step status via ${{ steps.<id>.outcome }}; iterate over the known step IDs: setup, typecheck, lint, fmt-check, test).
    • Tails each failed step's log file from /var/run/act/... (path documented in forgejo-runner config) to ~200 lines.
    • POSTs the JSON payload to ${{ secrets.CLAUDE_HOOKS_LOG_URL }} with the bearer token from ${{ secrets.CLAUDE_HOOKS_INTERNAL_TOKEN }}.
    • Best-effort — never fail the job because the mirror endpoint is unreachable. Wrap in continue-on-error: true.
  • Pin the new step's bash invocation to a specific forge-base tag bump (v0.2.3).
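
The step's logic, leaving out the log-file read (the on-disk path is not spelled out here), might look like the sketch below. The issue calls for a bash step; TypeScript is used only to illustrate the selection and best-effort behaviour, and failedSteps and postBestEffort are hypothetical names.

```typescript
// Step IDs listed in the acceptance criteria.
const KNOWN_STEP_IDS = ["setup", "typecheck", "lint", "fmt-check", "test"] as const;
type StepId = (typeof KNOWN_STEP_IDS)[number];

// Returns the known steps whose outcome is "failure", in workflow order.
export function failedSteps(outcomes: Partial<Record<StepId, string>>): StepId[] {
  return KNOWN_STEP_IDS.filter((id) => outcomes[id] === "failure");
}

// Best-effort POST: resolves to false on any error instead of throwing,
// so an unreachable mirror endpoint cannot fail the job.
export async function postBestEffort(
  url: string,
  token: string,
  payload: unknown,
  fetchImpl: typeof fetch = fetch,
): Promise<boolean> {
  try {
    const response = await fetchImpl(url, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(payload),
    });
    return response.ok;
  } catch {
    return false; // network error, DNS failure, refused connection
  }
}
```

Swallowing the error inside the script and setting continue-on-error: true on the step are complementary: the first covers an unreachable endpoint, the second covers a crash in the step itself.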

dispatch_fix_ci integration

  • domain/workflow/post-ci.ts::dispatchFixCi reads ci_logs rows for (repo, head_sha) before constructing the task prompt.

  • If at least one row exists, append a fenced-code block to the task prompt:

    ## CI failure context (last <N> lines of <step_name>, run <run_id>)
    <log_tail>
    
  • If zero rows exist (older runs, mirror endpoint down, workflow without the step), fall back to current behaviour — task prompt without log context. The op must remain functional with no logs.

  • No behavioural change to dispatch_fix_ci's dedup map — keyed by head SHA as today.
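
The prompt assembly, including the zero-row fallback, could be sketched like this. appendCiContext and the reduced row type are hypothetical. Placing the heading outside the fence and counting lines from the stored tail are one interpretation of the template above.

```typescript
// Subset of a ci_logs row needed for the prompt (type name is hypothetical).
interface CiLogContextRow {
  step_name: string;
  run_id: string;
  log_tail: string;
}

// Built at runtime so this example does not embed a literal fence.
const FENCE = "`".repeat(3);

// Appends one context section per row; returns the prompt unchanged when there are no rows.
export function appendCiContext(prompt: string, rows: readonly CiLogContextRow[]): string {
  if (rows.length === 0) return prompt; // fallback: current behaviour, no log context

  const sections = rows.map((row) => {
    const lineCount = row.log_tail.split("\n").length;
    return [
      `## CI failure context (last ${lineCount} lines of ${row.step_name}, run ${row.run_id})`,
      FENCE,
      row.log_tail,
      FENCE,
    ].join("\n");
  });
  return [prompt, ...sections].join("\n\n");
}
```

Because the function is pure, both unit-test cases named under Tests (fresh-log path and zero-log path) can be checked without a database.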

Retention

  • Background job (extend existing janitor.ts): delete ci_logs rows older than 14 days. Cheap; runs every 10 min.
  • No per-repo quota: each row is capped at 32 KiB (at most 200 lines), so a handful of failed steps across a few PRs per day, kept for 14 days, stays at MB scale, not GB.
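
The sweeper's cutoff logic can be illustrated over an in-memory array. retentionCutoff and sweep are hypothetical names. The real job in janitor.ts would issue a delete on created_at rather than filter rows in memory.

```typescript
const RETENTION_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Oldest created_at (epoch ms) that is still retained at time `nowMs`.
export function retentionCutoff(nowMs: number, days: number = RETENTION_DAYS): number {
  return nowMs - days * MS_PER_DAY;
}

// Returns the rows that survive a sweep; rows strictly older than the cutoff are dropped.
export function sweep<T extends { created_at: number }>(rows: readonly T[], nowMs: number): T[] {
  const cutoff = retentionCutoff(nowMs);
  return rows.filter((row) => row.created_at >= cutoff);
}
```

In SQL terms this corresponds to deleting rows where created_at is less than the cutoff, which is the range the created_at index serves.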

Tests

  • Server: unit test on the route handler — happy path, 4xx on bad payload, 401 on missing auth.
  • Server: unit test on dispatchFixCi — fresh-log path injects the fenced block; zero-log path falls through.
  • Server: integration test on the retention sweeper — old rows deleted, fresh rows survive.
  • Workflow: dry-run by triggering a deliberate red CI on a throwaway branch and confirming a row lands in ci_logs within 10s of step failure.

Out of scope

  • Per-step log API on the Forgejo server itself (upstream issue #35176 — track, don't fork).
  • Mirroring successful runs — only if: failure(). Successful runs need no fix-ci dispatch.
  • Artifact mirroring (binary uploads, test reports, screenshots). Different problem, different storage shape.
  • Live log streaming. The agent gets a post-mortem tail at dispatch time, not a live feed.
  • Forgejo runner config changes — the runner already produces step logs on disk; the workflow step just reads them.
  • Web UI surfacing of ci_logs. Stored data is for agents, not operators; operators read logs via the existing Forgejo web UI.

References

  • Forgejo Actions reference (log retention): https://forgejo.org/docs/latest/user/actions/reference/
  • Forgejo Actions admin guide (storage.actions_log): https://forgejo.org/docs/latest/admin/actions/
  • Upstream feature request (per-step log API): https://github.com/go-gitea/gitea/issues/35176
  • PR #1091 incident: "Blocked on log access" comment chain demonstrating the stall.
  • docs/architect.md / docs/api.md: /internal/* endpoint band reference.
  • docs/database.md: Drizzle migration runner conventions.
  • forge-base repo: home of the qa-bun.yml reusable workflow that ships the new step.
  • Global memory, "Forgejo Actions: workflow & job naming convention" (CLAUDE.md): generic step names, project-agnostic workflow files.

Reference
charles/claude-hooks#1103