feat(ci-logs): runner-side CI log mirror for fix-ci dispatch (#1103) #1105

Merged
reviewer merged 1 commit from code-lead/1103 into main 2026-05-11 15:07:58 +00:00
Collaborator

Closes #1103.

Why

Forgejo 15 exposes no public REST endpoint for workflow-run logs (upstream go-gitea/gitea#35176). Code-lead / dev dispatches to fix red CI used to stall on "blocked on log access" comments and re-loop indefinitely — the symptom that issue #1103 documents from the PR #1091 / #1080 incident.

This PR is the server-side half of the fix. The forge-base qa-bun.yml reusable workflow's failure step (lands separately as a coordinated v0.2.3 bump) posts the last ~200 lines of every failed step to POST /internal/ci-log. We persist the payload here and embed the tail in the fix-ci agent's task prompt at dispatch time.

What landed

  • Schema — new ci_logs table (Drizzle migration 0016_ci_logs.sql) keyed by (repo, head_sha) with a created_at index for the retention sweep. CRUD aggregate (infrastructure/database/ci-logs.ts) caps log_tail to 200 lines / 32 KiB on insert and exposes listCiLogsForHead / deleteCiLogsOlderThan.
  • EndpointPOST /internal/ci-log handler with Authorization: Bearer ${CLAUDE_HOOKS_INTERNAL_TOKEN} (constant-time compare; 503 when the env var is unset so a misconfigured deploy fails closed). Repo allowlist against watched_repos, 64 KiB wire cap, 204 on success. Session-gate middleware exempts /internal/* since the workflow caller has no operator cookie.
  • dispatch_fix_ci — reads ci_logs rows for (repo, head_sha) at dispatch time and appends a fenced ## CI failure context block (one per failed step) at the end of the task prompt, after the artifact-style appendix. Zero-row case falls through silently — the dispatch keeps working without logs.
  • Retention — new ci_logs_retention janitor rule sweeps rows older than 14 d each reconcile pass. Repo-independent, fires once per pass (not per repo).
  • Tests — 26 new tests across four files: CRUD round-trip + truncation + retention selectivity (ci-logs.test.ts); route handler auth band + payload validation + repo allowlist + happy-path persistence (ci-log.test.ts); dispatch formatter + zero-log fallthrough seam (post-ci-ci-logs.test.ts); janitor retention rule auto-heal + dry-run + fires-once-per-pass (janitor.test.ts additions).
  • Docsdocs/api.md (route + auth band), docs/database.md (new aggregate file), docs/configuration.md (CLAUDE_HOOKS_INTERNAL_TOKEN env var), docs/modules.md (file table entries).

AC mapping

  • POST /internal/ci-log accepts the documented payload and returns 204 / 4xx as specified.
  • repo validated against watched_repos; unknown → 404.
  • log_tail ≤64 KiB wire, truncated to 200 lines / 32 KiB server-side.
  • Bearer auth via CLAUDE_HOOKS_INTERNAL_TOKEN.
  • New ci_logs table with the specified columns + indexes.
  • dispatchFixCi reads rows and injects the fenced block; falls through when empty.
  • Background retention sweep (janitor) deletes rows older than 14 days.
  • Server unit tests on route handler, dispatch path, retention sweeper.

Out of scope (per issue)

The forge-base qa-bun.yml failure step that produces the log tails lives in the forge-base repo and ships as a coordinated v0.2.3 bump there. Workflow dry-run (red CI on a throwaway branch) is gated on that upstream PR landing first.

Test plan

  • just db-check passes (new migration matches ensureSchema).
  • just typecheck / just lint / just fmt-check / just sql-layer-check pass.
  • Server test suite: 3308/3308 pass (52s).
  • After deploy: set CLAUDE_HOOKS_INTERNAL_TOKEN in the systemd unit; verify a manual curl -X POST against 127.0.0.1:4500/internal/ci-log lands a row in ci_logs.

🤖 Generated with Claude Code

Closes #1103. ## Why Forgejo 15 exposes no public REST endpoint for workflow-run logs (upstream [go-gitea/gitea#35176](https://github.com/go-gitea/gitea/issues/35176)). Code-lead / dev dispatches to fix red CI used to stall on "blocked on log access" comments and re-loop indefinitely — the symptom that issue #1103 documents from the PR #1091 / #1080 incident. This PR is the server-side half of the fix. The forge-base `qa-bun.yml` reusable workflow's failure step (lands separately as a coordinated v0.2.3 bump) posts the last ~200 lines of every failed step to `POST /internal/ci-log`. We persist the payload here and embed the tail in the fix-ci agent's task prompt at dispatch time. ## What landed - **Schema** — new `ci_logs` table (Drizzle migration `0016_ci_logs.sql`) keyed by `(repo, head_sha)` with a `created_at` index for the retention sweep. CRUD aggregate (`infrastructure/database/ci-logs.ts`) caps `log_tail` to 200 lines / 32 KiB on insert and exposes `listCiLogsForHead` / `deleteCiLogsOlderThan`. - **Endpoint** — `POST /internal/ci-log` handler with `Authorization: Bearer ${CLAUDE_HOOKS_INTERNAL_TOKEN}` (constant-time compare; 503 when the env var is unset so a misconfigured deploy fails closed). Repo allowlist against `watched_repos`, 64 KiB wire cap, 204 on success. Session-gate middleware exempts `/internal/*` since the workflow caller has no operator cookie. - **dispatch_fix_ci** — reads `ci_logs` rows for `(repo, head_sha)` at dispatch time and appends a fenced `## CI failure context` block (one per failed step) at the end of the task prompt, after the artifact-style appendix. Zero-row case falls through silently — the dispatch keeps working without logs. - **Retention** — new `ci_logs_retention` janitor rule sweeps rows older than 14 d each reconcile pass. Repo-independent, fires once per pass (not per repo). - **Tests** — 26 new tests across four files: CRUD round-trip + truncation + retention selectivity (`ci-logs.test.ts`); route handler auth band + payload validation + repo allowlist + happy-path persistence (`ci-log.test.ts`); dispatch formatter + zero-log fallthrough seam (`post-ci-ci-logs.test.ts`); janitor retention rule auto-heal + dry-run + fires-once-per-pass (`janitor.test.ts` additions). - **Docs** — `docs/api.md` (route + auth band), `docs/database.md` (new aggregate file), `docs/configuration.md` (`CLAUDE_HOOKS_INTERNAL_TOKEN` env var), `docs/modules.md` (file table entries). ## AC mapping - [x] `POST /internal/ci-log` accepts the documented payload and returns 204 / 4xx as specified. - [x] `repo` validated against `watched_repos`; unknown → 404. - [x] `log_tail` ≤64 KiB wire, truncated to 200 lines / 32 KiB server-side. - [x] Bearer auth via `CLAUDE_HOOKS_INTERNAL_TOKEN`. - [x] New `ci_logs` table with the specified columns + indexes. - [x] `dispatchFixCi` reads rows and injects the fenced block; falls through when empty. - [x] Background retention sweep (janitor) deletes rows older than 14 days. - [x] Server unit tests on route handler, dispatch path, retention sweeper. ## Out of scope (per issue) The forge-base `qa-bun.yml` failure step that produces the log tails lives in the `forge-base` repo and ships as a coordinated v0.2.3 bump there. Workflow dry-run (red CI on a throwaway branch) is gated on that upstream PR landing first. ## Test plan - [x] `just db-check` passes (new migration matches `ensureSchema`). - [x] `just typecheck` / `just lint` / `just fmt-check` / `just sql-layer-check` pass. - [x] Server test suite: 3308/3308 pass (52s). - [ ] After deploy: set `CLAUDE_HOOKS_INTERNAL_TOKEN` in the systemd unit; verify a manual `curl -X POST` against `127.0.0.1:4500/internal/ci-log` lands a row in `ci_logs`. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
feat(ci-logs): runner-side CI log mirror for fix-ci dispatch (#1103)
All checks were successful
qa / sql-layer-check (pull_request) Successful in 15s
qa / i18n-string-check (pull_request) Successful in 15s
qa / db-schema (pull_request) Successful in 18s
qa / dockerfile (pull_request) Successful in 19s
qa / qa-1 (pull_request) Successful in 3m37s
qa / qa (pull_request) Successful in 0s
7f10b4c1e5
Forgejo 15 exposes no public REST endpoint for workflow-run logs (upstream
go-gitea/gitea#35176). Code-lead/dev dispatches to fix red CI used to stall
on "blocked on log access" comments and re-loop indefinitely.

This PR adds the server-side half of the AC: a small `/internal/ci-log`
ingress endpoint that the forge-base `qa-bun.yml` reusable workflow can
POST failed step tails to, a `ci_logs` aggregate that persists them, and
a `dispatchFixCi` integration that embeds the latest tail in the fix-ci
agent's task prompt.

## What landed

* **Schema** — new `ci_logs` table (Drizzle migration 0016) with indexes
  on `(repo, head_sha)` and `created_at`. CRUD aggregate
  (`infrastructure/database/ci-logs.ts`) caps `log_tail` to 200 lines /
  32 KiB on insert and exposes `listCiLogsForHead` / `deleteCiLogsOlderThan`.
* **Endpoint** — `POST /internal/ci-log` handler with `Bearer
  $CLAUDE_HOOKS_INTERNAL_TOKEN` auth (constant-time compare, 503 when the
  env var is unset so a misconfigured deploy fails closed), repo allowlist
  against `watched_repos`, 64 KiB wire cap. Session-gate middleware
  exempts `/internal/*` since the workflow caller has no operator cookie.
* **dispatch_fix_ci** — reads `ci_logs` rows for `(repo, head_sha)` and
  appends a fenced `## CI failure context` block (one per failed step) to
  the end of the task prompt, after the artifact-style appendix. Zero-row
  case falls through silently — the dispatch keeps working without logs.
* **Retention** — new `ci_logs_retention` janitor rule sweeps rows older
  than 14 d each reconcile pass (global, runs once not per repo).
* **Tests** — CRUD round-trip + truncation + retention; route handler
  auth/validation/persistence; dispatch formatter + zero-log fallthrough;
  janitor retention rule (auto-heal, dry-run, fires-once-per-pass).
* **Docs** — `docs/api.md` (route + auth band), `docs/database.md` (new
  aggregate file), `docs/configuration.md` (`CLAUDE_HOOKS_INTERNAL_TOKEN`
  env var), `docs/modules.md` (file table entries).

## Out of scope

The forge-base `qa-bun.yml` failure step that produces the log tails
lives in a separate repo and ships as a coordinated v0.2.3 bump there;
see issue body §workflow-integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
reviewer approved these changes 2026-05-11 15:07:43 +00:00
reviewer left a comment

Review — feat(ci-logs): runner-side CI log mirror for fix-ci dispatch

CI: green. Mergeable: yes. Approving with a few non-blocking findings documented below.


Overview

Solid end-to-end solution for the log-access stall that caused fix-ci agents to loop indefinitely. The layering is clean: workflow POSTs → ingest handler validates + persists → dispatchFixCi reads at dispatch time → janitor sweeps stale rows. Each layer has its own test seam. All 26 new tests pass.


Issues found

🔴 constantTimeEquals leaks the token length (security)

apps/server/src/http/handlers/ci-log.ts

function constantTimeEquals(a: string, b: string): boolean {
    if (a.length !== b.length) return false   // ← early exit leaks length
    ...
}

An attacker who can measure response timing can binary-search the token length: a request with an n-character bearer returns in ~0 µs (length mismatch short-circuit); one with the exact right length takes the full loop. The endpoint is bound to 127.0.0.1 and the impact is limited, but a true constant-time compare is two lines:

import { timingSafeEqual } from "node:crypto";

function constantTimeEquals(a: string, b: string): boolean {
    const ba = Buffer.from(a, "utf8");
    const bb = Buffer.from(b, "utf8");
    if (ba.length !== bb.length) {
        // compare against itself so timing is stable
        timingSafeEqual(ba, ba);
        return false;
    }
    return timingSafeEqual(ba, bb);
}

Not blocking merge given the private-network deployment, but worth a follow-up issue.


🟡 Contradictory comment in ruleCiLogsRetention dry-run branch

apps/server/src/background/janitor.ts

} else {
    // Dry-run: don't actually delete; emit a record so operators can see the
    // rule fired (a deletes=0 sweep is uninteresting either way).
    return;   // ← no record emitted — comment is backwards
}

The comment says "emit a record" but the function unconditionally returns. The test confirms nothing is emitted in dry-run. The comment should just read: "Dry-run: skip both the sweep and the action record — no useful information to surface without knowing what would be deleted."


🟡 truncateLogTail byte-level slice can split a UTF-8 multi-byte character

apps/server/src/infrastructure/database/ci-logs.ts

const buf = Buffer.from(body, "utf8");
body = buf.subarray(buf.length - CI_LOG_MAX_BYTES).toString("utf8");

subarray cuts at a byte boundary which may land in the middle of a multi-byte sequence; Node/Bun's toString("utf8") replaces the broken lead byte with U+FFFD. CI logs are overwhelmingly ASCII so this is unlikely to trigger in practice, but it's worth noting. A safe fix is buf.subarray(...).toString("utf8").replace(/^�/, "") or using a proper UTF-8-aware truncate.


🟡 Off-by-one in formatCiLogsBlock line-count header for tails with a trailing newline

apps/server/src/domain/workflow/post-ci.ts

const lineCount = row.log_tail.split("\n").length;

If log_tail ends with \n (common for shell output), the split produces a trailing empty element and lineCount is one higher than the visible line count. Example: "err\nfail\n" → header says "last 3 lines" when there are 2. Trivially fixed with .filter(Boolean) or .trimEnd() before splitting.


ℹ️ No deduplication on duplicate POSTs

If the workflow retries on a transient error after a 204, insertCiLog will happily insert a second row for the same (repo, run_id, job_id, step_name). listCiLogsForHead returns both, so the agent's prompt gets the same failure block twice. Harmless but slightly wasteful. An INSERT OR IGNORE on a (repo, head_sha, run_id, job_id, step_name) unique constraint would make the endpoint idempotent.


ℹ️ Redundant guard in parseRepo

if (owner.includes("/") || name.includes("/")) return null;

After repo.split("/") + parts.length !== 2, neither owner nor name can contain a /. Dead code; safe to remove.


Positives

  • Auth design is correct: 503-on-unconfigured fails closed, constant-time compare is the right intent, /internal/* session exemption is narrowly scoped.
  • Test coverage is thorough: auth band, wire-cap edge case, truncation at both the byte and line boundary, retention sweep selectivity, fires-once-per-pass guard — all covered.
  • formatCiLogsBlock as a pure, exported function makes the rendering independently testable without wiring dispatch.
  • Zero-log fallthrough is explicitly tested and correctly does nothing to the existing Forgejo-prefetch path.
  • Docs are complete: api.md / configuration.md / database.md / modules.md all updated in the same commit.

Summary: The timing-leak in constantTimeEquals is the only finding with security implications, and it's mitigated by the private-network deployment. Everything else is doc/cosmetic. Approving; follow-up issues for timingSafeEqual and idempotency are recommended.

## Review — `feat(ci-logs): runner-side CI log mirror for fix-ci dispatch` ✅ **CI:** green. **Mergeable:** yes. Approving with a few non-blocking findings documented below. --- ### Overview Solid end-to-end solution for the log-access stall that caused fix-ci agents to loop indefinitely. The layering is clean: workflow POSTs → ingest handler validates + persists → `dispatchFixCi` reads at dispatch time → janitor sweeps stale rows. Each layer has its own test seam. All 26 new tests pass. --- ### Issues found #### 🔴 `constantTimeEquals` leaks the token length (security) **`apps/server/src/http/handlers/ci-log.ts`** ```ts function constantTimeEquals(a: string, b: string): boolean { if (a.length !== b.length) return false // ← early exit leaks length ... } ``` An attacker who can measure response timing can binary-search the token length: a request with an `n`-character bearer returns in ~0 µs (length mismatch short-circuit); one with the exact right length takes the full loop. The endpoint is bound to `127.0.0.1` and the impact is limited, but a true constant-time compare is two lines: ```ts import { timingSafeEqual } from "node:crypto"; function constantTimeEquals(a: string, b: string): boolean { const ba = Buffer.from(a, "utf8"); const bb = Buffer.from(b, "utf8"); if (ba.length !== bb.length) { // compare against itself so timing is stable timingSafeEqual(ba, ba); return false; } return timingSafeEqual(ba, bb); } ``` Not blocking merge given the private-network deployment, but worth a follow-up issue. --- #### 🟡 Contradictory comment in `ruleCiLogsRetention` dry-run branch **`apps/server/src/background/janitor.ts`** ```ts } else { // Dry-run: don't actually delete; emit a record so operators can see the // rule fired (a deletes=0 sweep is uninteresting either way). return; // ← no record emitted — comment is backwards } ``` The comment says "emit a record" but the function unconditionally `return`s. The test confirms nothing is emitted in dry-run. The comment should just read: _"Dry-run: skip both the sweep and the action record — no useful information to surface without knowing what would be deleted."_ --- #### 🟡 `truncateLogTail` byte-level slice can split a UTF-8 multi-byte character **`apps/server/src/infrastructure/database/ci-logs.ts`** ```ts const buf = Buffer.from(body, "utf8"); body = buf.subarray(buf.length - CI_LOG_MAX_BYTES).toString("utf8"); ``` `subarray` cuts at a byte boundary which may land in the middle of a multi-byte sequence; Node/Bun's `toString("utf8")` replaces the broken lead byte with `U+FFFD`. CI logs are overwhelmingly ASCII so this is unlikely to trigger in practice, but it's worth noting. A safe fix is `buf.subarray(...).toString("utf8").replace(/^�/, "")` or using a proper UTF-8-aware truncate. --- #### 🟡 Off-by-one in `formatCiLogsBlock` line-count header for tails with a trailing newline **`apps/server/src/domain/workflow/post-ci.ts`** ```ts const lineCount = row.log_tail.split("\n").length; ``` If `log_tail` ends with `\n` (common for shell output), the split produces a trailing empty element and `lineCount` is one higher than the visible line count. Example: `"err\nfail\n"` → header says _"last 3 lines"_ when there are 2. Trivially fixed with `.filter(Boolean)` or `.trimEnd()` before splitting. --- #### ℹ️ No deduplication on duplicate POSTs If the workflow retries on a transient error after a 204, `insertCiLog` will happily insert a second row for the same `(repo, run_id, job_id, step_name)`. `listCiLogsForHead` returns both, so the agent's prompt gets the same failure block twice. Harmless but slightly wasteful. An `INSERT OR IGNORE` on a `(repo, head_sha, run_id, job_id, step_name)` unique constraint would make the endpoint idempotent. --- #### ℹ️ Redundant guard in `parseRepo` ```ts if (owner.includes("/") || name.includes("/")) return null; ``` After `repo.split("/")` + `parts.length !== 2`, neither `owner` nor `name` can contain a `/`. Dead code; safe to remove. --- ### Positives - **Auth design is correct**: 503-on-unconfigured fails closed, constant-time compare is the right intent, `/internal/*` session exemption is narrowly scoped. - **Test coverage is thorough**: auth band, wire-cap edge case, truncation at both the byte and line boundary, retention sweep selectivity, fires-once-per-pass guard — all covered. - **`formatCiLogsBlock` as a pure, exported function** makes the rendering independently testable without wiring dispatch. - **Zero-log fallthrough** is explicitly tested and correctly does nothing to the existing Forgejo-prefetch path. - **Docs are complete**: `api.md` / `configuration.md` / `database.md` / `modules.md` all updated in the same commit. --- **Summary:** The timing-leak in `constantTimeEquals` is the only finding with security implications, and it's mitigated by the private-network deployment. Everything else is doc/cosmetic. Approving; follow-up issues for `timingSafeEqual` and idempotency are recommended.
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!1105
No description provided.