feat(harness): HTTP health polling on start (#3) #21

Merged
charles merged 1 commit from feat/3-health-polling into main 2026-04-11 18:42:23 +00:00
Owner

Closes #3.

Stacked on #20 (feat/2-harness-lifecycle).

Summary

Extends ProcessHarness::start() with the HTTP health-poll loop. After the child is spawned, if health_url is set, the harness polls it until 2xx, the deadline elapses, or the child exits.

What's in this PR

  • wait_for_health() private method invoked at the end of start(). Builds one reqwest::Client per start() call (not one per request).
  • Polls every health_poll_interval (default 100 ms) until 2xx.
  • Connection errors during polling (refused / DNS / TLS) are retried silently — those are expected during startup.
  • On timeout: kills the child via stop(), returns TestError::Connection with {url, last status, stderr tail}.
  • On early child exit: short-circuits immediately, returns TestError::Connection with {exit status, url, stderr tail}. A 50 ms drain delay before composing the error gives the background stdout/stderr reader tasks time to flush.
  • If health_url is None: silent no-op (start returns as soon as the child spawns).
  • Drops the #[allow(dead_code)] from stderr_tail() introduced in #2 — now actually used.

Checklist (from issue #3)

  • Polls health_url via reqwest GET at health_poll_interval (default 100 ms)
  • 2xx marks the process ready and resolves start() with Ok(())
  • On startup_timeout: kills child, returns TestError::Connection with URL + last HTTP status / transport error + stderr tail
  • If child exits early: short-circuits and reports exit status + stderr (no waiting out the full timeout)
  • Polling honours cancellation (drops cleanly when the future is cancelled)
  • Single reusable reqwest::Client (not one per request)
  • Connection errors silently retried until the timeout

Test plan

  • just qa green locally — 20/20 unit tests pass (4 new in this PR)
  • health_passes_when_endpoint_returns_200 — in-process HTTP responder, success path
  • health_fails_on_timeout_when_no_listener — 1 s deadline, asserts URL + error wording
  • health_fails_when_child_exits_immediately — child writes to stderr then exits 7, asserts the captured stderr line appears in the error
  • health_check_skipped_when_no_url_set — health is optional
  • CI green on Forgejo

Notes for the reviewer

  • The in-process HTTP responder is a tiny tokio::net::TcpListener that writes a literal HTTP/1.1 response. Avoids pulling in axum/warp/hyper just for test fixtures.
  • The 50 ms drain delay before composing the early-exit error is there because the stdout/stderr reader tasks run in separate tokio tasks; without the yield, the test was racing the reader and missing the captured stderr line.
  • reqwest::Client::builder().timeout(...) uses health_poll_interval.max(1s) so we don't try to set a sub-1 s HTTP timeout (some reqwest versions panic on tiny timeouts).
  • The stop() call inside the timeout-error branch will issue another SIGTERM and reap; the early-exit branch already has the child reaped via try_wait, so it just clears self.child.
Closes #3. **Stacked on #20** (feat/2-harness-lifecycle). ## Summary Extends `ProcessHarness::start()` with the HTTP health-poll loop. After the child is spawned, if `health_url` is set, the harness polls it until 2xx, the deadline elapses, or the child exits. ## What's in this PR - `wait_for_health()` private method invoked at the end of `start()`. Builds **one** `reqwest::Client` per `start()` call (not one per request). - Polls every `health_poll_interval` (default 100 ms) until 2xx. - Connection errors during polling (refused / DNS / TLS) are retried silently — those are expected during startup. - On timeout: kills the child via `stop()`, returns `TestError::Connection` with `{url, last status, stderr tail}`. - On early child exit: short-circuits immediately, returns `TestError::Connection` with `{exit status, url, stderr tail}`. A 50 ms drain delay before composing the error gives the background stdout/stderr reader tasks time to flush. - If `health_url` is `None`: silent no-op (start returns as soon as the child spawns). - Drops the `#[allow(dead_code)]` from `stderr_tail()` introduced in #2 — now actually used. ## Checklist (from issue #3) - [x] Polls `health_url` via reqwest GET at `health_poll_interval` (default 100 ms) - [x] 2xx marks the process ready and resolves `start()` with `Ok(())` - [x] On `startup_timeout`: kills child, returns `TestError::Connection` with URL + last HTTP status / transport error + stderr tail - [x] If child exits early: short-circuits and reports exit status + stderr (no waiting out the full timeout) - [x] Polling honours cancellation (drops cleanly when the future is cancelled) - [x] Single reusable `reqwest::Client` (not one per request) - [x] Connection errors silently retried until the timeout ## Test plan - [x] `just qa` green locally — 20/20 unit tests pass (4 new in this PR) - [x] `health_passes_when_endpoint_returns_200` — in-process HTTP responder, success path - [x] `health_fails_on_timeout_when_no_listener` — 1 s deadline, asserts URL + error wording - [x] `health_fails_when_child_exits_immediately` — child writes to stderr then exits 7, asserts the captured stderr line appears in the error - [x] `health_check_skipped_when_no_url_set` — health is optional - [ ] CI green on Forgejo ## Notes for the reviewer - The in-process HTTP responder is a tiny `tokio::net::TcpListener` that writes a literal HTTP/1.1 response. Avoids pulling in axum/warp/hyper just for test fixtures. - The 50 ms drain delay before composing the early-exit error is there because the stdout/stderr reader tasks run in separate tokio tasks; without the yield, the test was racing the reader and missing the captured stderr line. - `reqwest::Client::builder().timeout(...)` uses `health_poll_interval.max(1s)` so we don't try to set a sub-1 s HTTP timeout (some reqwest versions panic on tiny timeouts). - The `stop()` call inside the timeout-error branch will issue another SIGTERM and reap; the early-exit branch already has the child reaped via `try_wait`, so it just clears `self.child`.
Extends ProcessHarness::start() with the health-poll loop:

- After spawning the child, if the builder set health_url, poll it via
  a single reusable reqwest::Client at health_poll_interval (default
  100 ms).
- 2xx → ready, return Ok.
- Connection errors (refused / DNS / TLS) are silently retried until
  the deadline — those are expected during startup.
- If startup_timeout elapses, kill the child and return
  TestError::Connection containing the URL polled, the last HTTP
  status / transport error observed, and the last 50 lines of stderr.
- If the child exits before becoming ready, short-circuit immediately
  and report the exit status + stderr tail (50 ms drain delay so the
  background reader tasks can flush before composing the error).
- No health URL configured ⇒ start() returns Ok as soon as the child
  is spawned (silent no-op).

Removes the #[allow(dead_code)] from stderr_tail now that the health
path uses it.

Tests (4 new in src/harness.rs):
- health_passes_when_endpoint_returns_200 (in-process responder)
- health_fails_on_timeout_when_no_listener (1 s deadline)
- health_fails_when_child_exits_immediately (no listener, child exits
  during polling, asserts both the "child exited" wording and that
  the captured stderr line is in the error message)
- health_check_skipped_when_no_url_set

The in-process HTTP responder is a tiny tokio::TcpListener that writes
a literal HTTP/1.1 response — keeps the test hermetic and avoids
pulling in axum/warp/hyper just for fixtures.

just qa green: 20/20 unit tests, fmt and clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
charles left a comment

Review — HTTP health polling on start (#3)

Concern : sleep de 50ms après exit du child

tokio::time::sleep(Duration::from_millis(50)).await;
self.child = None;
return Err(self.compose_health_error(...));

C'est un heuristique "laisse les tâches I/O se vider". Si le child produit beaucoup de sortie juste avant de crasher, 50ms peut ne pas suffire. Pour les usages tests courants ça passe, mais c'est fragile. Une approche plus robuste serait d'attendre la fermeture d'un channel que la tâche reader signale à l'EOF. À noter dans un commentaire au minimum.

Minor : ordre des vérifications dans la boucle

Le check started.elapsed() >= startup_timeout intervient après le sleep. Si la deadline expire exactement pendant le sleep, une requête HTTP supplémentaire part avant le timeout. Inoffensif mais le timeout n'est pas parfaitement serré.

Ce qui est bien

  • Un seul reqwest::Client par appel start() — correct (évite les handshakes TLS répétés).
  • compose_health_error inclut la queue stderr — excellent pour diagnostiquer les crashs au démarrage.
  • max(health_poll_interval, 1s) comme timeout HTTP — évite des requêtes avec deadline absurde.
  • Fixtures de test en TCP pur, sans vrai serveur HTTP — rapides et déterministes.
## Review — HTTP health polling on start (#3) ### Concern : sleep de 50ms après exit du child ```rust tokio::time::sleep(Duration::from_millis(50)).await; self.child = None; return Err(self.compose_health_error(...)); ``` C'est un heuristique "laisse les tâches I/O se vider". Si le child produit beaucoup de sortie juste avant de crasher, 50ms peut ne pas suffire. Pour les usages tests courants ça passe, mais c'est fragile. Une approche plus robuste serait d'attendre la fermeture d'un channel que la tâche reader signale à l'EOF. À noter dans un commentaire au minimum. ### Minor : ordre des vérifications dans la boucle Le check `started.elapsed() >= startup_timeout` intervient *après* le `sleep`. Si la deadline expire exactement pendant le sleep, une requête HTTP supplémentaire part avant le timeout. Inoffensif mais le timeout n'est pas parfaitement serré. ### Ce qui est bien - Un seul `reqwest::Client` par appel `start()` — correct (évite les handshakes TLS répétés). - `compose_health_error` inclut la queue stderr — excellent pour diagnostiquer les crashs au démarrage. - `max(health_poll_interval, 1s)` comme timeout HTTP — évite des requêtes avec deadline absurde. - Fixtures de test en TCP pur, sans vrai serveur HTTP — rapides et déterministes.
charles left a comment

Commentaires sans bloquant. Le sleep de 50ms après exit du child est le seul point un peu fragile, à documenter au minimum.

Commentaires sans bloquant. Le sleep de 50ms après exit du child est le seul point un peu fragile, à documenter au minimum.
charles changed target branch from feat/2-harness-lifecycle to main 2026-04-11 18:42:15 +00:00
charles deleted branch feat/3-health-polling 2026-04-11 18:42:24 +00:00
Sign in to join this conversation.
No description provided.