feat(ci-logs): runner-side CI log mirror for fix-ci dispatch (#1103) #1105
No reviewers
Labels
No labels
area:agents
area:dashboard
area:database
area:design
area:design-review
area:flows
area:infra
area:meta
area:security
area:sessions
area:webhook
area:workdir
security
type:bug
type:chore
type:meta
type:user-story
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
charles/claude-hooks!1105
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "code-lead/1103"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #1103.
Why
Forgejo 15 exposes no public REST endpoint for workflow-run logs (upstream go-gitea/gitea#35176). Code-lead / dev dispatches to fix red CI used to stall on "blocked on log access" comments and re-loop indefinitely — the symptom that issue #1103 documents from the PR #1091 / #1080 incident.
This PR is the server-side half of the fix. The forge-base
qa-bun.ymlreusable workflow's failure step (lands separately as a coordinated v0.2.3 bump) posts the last ~200 lines of every failed step toPOST /internal/ci-log. We persist the payload here and embed the tail in the fix-ci agent's task prompt at dispatch time.What landed
ci_logstable (Drizzle migration0016_ci_logs.sql) keyed by(repo, head_sha)with acreated_atindex for the retention sweep. CRUD aggregate (infrastructure/database/ci-logs.ts) capslog_tailto 200 lines / 32 KiB on insert and exposeslistCiLogsForHead/deleteCiLogsOlderThan.POST /internal/ci-loghandler withAuthorization: Bearer ${CLAUDE_HOOKS_INTERNAL_TOKEN}(constant-time compare; 503 when the env var is unset so a misconfigured deploy fails closed). Repo allowlist againstwatched_repos, 64 KiB wire cap, 204 on success. Session-gate middleware exempts/internal/*since the workflow caller has no operator cookie.ci_logsrows for(repo, head_sha)at dispatch time and appends a fenced## CI failure contextblock (one per failed step) at the end of the task prompt, after the artifact-style appendix. Zero-row case falls through silently — the dispatch keeps working without logs.ci_logs_retentionjanitor rule sweeps rows older than 14 d each reconcile pass. Repo-independent, fires once per pass (not per repo).ci-logs.test.ts); route handler auth band + payload validation + repo allowlist + happy-path persistence (ci-log.test.ts); dispatch formatter + zero-log fallthrough seam (post-ci-ci-logs.test.ts); janitor retention rule auto-heal + dry-run + fires-once-per-pass (janitor.test.tsadditions).docs/api.md(route + auth band),docs/database.md(new aggregate file),docs/configuration.md(CLAUDE_HOOKS_INTERNAL_TOKENenv var),docs/modules.md(file table entries).AC mapping
POST /internal/ci-logaccepts the documented payload and returns 204 / 4xx as specified.repovalidated againstwatched_repos; unknown → 404.log_tail≤64 KiB wire, truncated to 200 lines / 32 KiB server-side.CLAUDE_HOOKS_INTERNAL_TOKEN.ci_logstable with the specified columns + indexes.dispatchFixCireads rows and injects the fenced block; falls through when empty.Out of scope (per issue)
The forge-base
qa-bun.ymlfailure step that produces the log tails lives in theforge-baserepo and ships as a coordinated v0.2.3 bump there. Workflow dry-run (red CI on a throwaway branch) is gated on that upstream PR landing first.Test plan
just db-checkpasses (new migration matchesensureSchema).just typecheck/just lint/just fmt-check/just sql-layer-checkpass.CLAUDE_HOOKS_INTERNAL_TOKENin the systemd unit; verify a manualcurl -X POSTagainst127.0.0.1:4500/internal/ci-loglands a row inci_logs.🤖 Generated with Claude Code
Review —
feat(ci-logs): runner-side CI log mirror for fix-ci dispatch✅CI: green. Mergeable: yes. Approving with a few non-blocking findings documented below.
Overview
Solid end-to-end solution for the log-access stall that caused fix-ci agents to loop indefinitely. The layering is clean: workflow POSTs → ingest handler validates + persists →
dispatchFixCireads at dispatch time → janitor sweeps stale rows. Each layer has its own test seam. All 26 new tests pass.Issues found
🔴
constantTimeEqualsleaks the token length (security)apps/server/src/http/handlers/ci-log.tsAn attacker who can measure response timing can binary-search the token length: a request with an
n-character bearer returns in ~0 µs (length mismatch short-circuit); one with the exact right length takes the full loop. The endpoint is bound to127.0.0.1and the impact is limited, but a true constant-time compare is two lines:Not blocking merge given the private-network deployment, but worth a follow-up issue.
🟡 Contradictory comment in
ruleCiLogsRetentiondry-run branchapps/server/src/background/janitor.tsThe comment says "emit a record" but the function unconditionally
returns. The test confirms nothing is emitted in dry-run. The comment should just read: "Dry-run: skip both the sweep and the action record — no useful information to surface without knowing what would be deleted."🟡
truncateLogTailbyte-level slice can split a UTF-8 multi-byte characterapps/server/src/infrastructure/database/ci-logs.tssubarraycuts at a byte boundary which may land in the middle of a multi-byte sequence; Node/Bun'stoString("utf8")replaces the broken lead byte withU+FFFD. CI logs are overwhelmingly ASCII so this is unlikely to trigger in practice, but it's worth noting. A safe fix isbuf.subarray(...).toString("utf8").replace(/^�/, "")or using a proper UTF-8-aware truncate.🟡 Off-by-one in
formatCiLogsBlockline-count header for tails with a trailing newlineapps/server/src/domain/workflow/post-ci.tsIf
log_tailends with\n(common for shell output), the split produces a trailing empty element andlineCountis one higher than the visible line count. Example:"err\nfail\n"→ header says "last 3 lines" when there are 2. Trivially fixed with.filter(Boolean)or.trimEnd()before splitting.ℹ️ No deduplication on duplicate POSTs
If the workflow retries on a transient error after a 204,
insertCiLogwill happily insert a second row for the same(repo, run_id, job_id, step_name).listCiLogsForHeadreturns both, so the agent's prompt gets the same failure block twice. Harmless but slightly wasteful. AnINSERT OR IGNOREon a(repo, head_sha, run_id, job_id, step_name)unique constraint would make the endpoint idempotent.ℹ️ Redundant guard in
parseRepoAfter
repo.split("/")+parts.length !== 2, neitherownernornamecan contain a/. Dead code; safe to remove.Positives
/internal/*session exemption is narrowly scoped.formatCiLogsBlockas a pure, exported function makes the rendering independently testable without wiring dispatch.api.md/configuration.md/database.md/modules.mdall updated in the same commit.Summary: The timing-leak in
constantTimeEqualsis the only finding with security implications, and it's mitigated by the private-network deployment. Everything else is doc/cosmetic. Approving; follow-up issues fortimingSafeEqualand idempotency are recommended.