fix: v1.9.15 full-mode warm-up race during h2 init (#924, #1029)

Fixes #924 — the canonical tracking thread for the v1.9.15 Full-mode regression cluster that spanned ~3 weeks and 18+ duplicate reports. **Root cause** (rigorously bisected to PR #799's `warm()`): PR #799 added HTTP/2 multiplexing on the relay leg, and gated the h1 socket-pool prewarm behind `ensure_h2().await`. `ensure_h2()` is bounded by `H2_OPEN_TIMEOUT_SECS = 8s` but can take the full window on a cold connection. During that window the h1 fallback pool is empty, so any request that arrives gets: 1. `Err((Relay("h2 unavailable"), No))` immediately → falls back to h1 2. h1 path calls `acquire()` → empty pool → cold `open()` → fresh TCP+TLS handshake to `connect_host:443` 3. Same network conditions that stalled h2 also stall h1; cold open exceeds the 30s `batch_timeout` enforced in `dispatch_full_tunnel` 4. User sees `batch timed out after 30s` while apps_script mode keeps working **Fix** (two commits, both `domain_fronter.rs`-only): Commit 1 — `warm h1 pool in parallel with h2`: Spawn h2 prewarm in a separate task so the h1 prewarm loop runs concurrently. Full `n` h1 sockets are warm before user traffic, even when the h2 handshake stalls or hits its 8s timeout. `run_pool_refill` trims the pool back to `POOL_MIN_H2_FALLBACK = 2` within 5s once h2 lands as the fast path. Commit 2 — `bound h1 open() + detect dead h2 cells synchronously`: - `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()` so a stuck handshake to a blackholed `connect_host:443` doesn't block `acquire()` until the outer 30s batch budget elapses (same symptom #924 hits during the warm-race window). - `H2Cell.dead: Arc<AtomicBool>` flipped by the connection driver task when `Connection::await` ends (GOAWAY, network error, normal close). `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check both consult the flag — known-dead cells are rejected within ≤5s instead of waiting for `H2_CONN_TTL_SECS = 540s` to expire or for a request to discover the breakage via `ready()` failure. **API impact**: `h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`. One existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`) tweaks its `Ok` arm to match — panic message unchanged. **Verified locally on top of v1.9.19**: - `cargo test --lib --release`: 209/209 (was 208; +1 new test `ensure_h2_rejects_dead_cell_within_ttl`) - `cargo build --release --features ui --bin mhrv-rs-ui`: clean - `(cd tunnel-node && cargo test --release)`: 36/36 **Live end-to-end** (from PR description): - 5 cold restarts (warm-up race window): 5/5 pass, 9.6-22.5s - Concurrent burst (5 simultaneous SOCKS5 streams): 5/5 - Default full.json baseline: 200 OK in 13.3s - `force_http1: true` sanity: 200 OK in 17.7s **A/B vs PR #903** (per-session pipelining): commits land in disjoint functions, cherry-picked clean on top. If #903 lands first, this needs a mechanical rebase only. Exemplary debugging work — bisect with concrete probe data, root cause identification down to specific commit + line, working fix with bounded timeouts on the two adjacent paths the same stall pattern could recur through, +1 regression test. The kind of PR that lands fast. Closes #924. Reviewed via Anthropic Claude. Co-Authored-By: rezaisrad <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:24:48 +03:00 · 2026-05-10 17:36:49 -04:00
parent 24534f7827
commit 4d3e62195c
1 changed files with 161 additions and 46 deletions
@@ -131,6 +131,14 @@ const H2_OPEN_TIMEOUT_SECS: u64 = 8;
 /// long. Prevents every concurrent caller during an h2 outage from
 /// paying its own full handshake-timeout cost in turn.
 const H2_OPEN_FAILURE_BACKOFF_SECS: u64 = 15;
+/// Same idea as `H2_OPEN_TIMEOUT_SECS` but for the legacy h1 socket
+/// path. Without this, a stuck TCP connect or TLS handshake to a
+/// blackholed `connect_host:443` would block `acquire()` (and the
+/// `warm()` prewarm loop) until the outer batch budget elapsed —
+/// the same symptom #924 hit during the warm-race window. Bounded
+/// here so a single hung handshake aborts fast and the loop / caller
+/// makes progress on the next attempt.
+const H1_OPEN_TIMEOUT_SECS: u64 = 8;
 /// Cadence for Apps Script container keepalive pings. Apps Script
 /// containers go cold after ~5min idle and cost 1-3s on the first
 /// request to wake back up — most painful on YouTube / streaming where
@@ -156,10 +164,23 @@ struct PoolEntry {
 /// `generation` is monotonic per fronter and lets `poison_h2_if_gen`
 /// avoid the race where task A's stale failure clears task B's
 /// freshly-reopened healthy cell.
+///
+/// `dead` is set by the spawned connection-driver task when the h2
+/// `Connection` future ends (GOAWAY, network error, normal close).
+/// Without this, the cell silently held a dead `SendRequest` after a
+/// mid-session disconnect — the next request paid a wasted h2 round
+/// trip to detect it via `ready()` failure, AND `run_pool_refill`
+/// kept maintaining the small `POOL_MIN_H2_FALLBACK` (2-socket) pool
+/// instead of expanding to `POOL_MIN` (8). With the flag,
+/// `run_pool_refill` notices h2 is dead within one tick (≤5 s) and
+/// pre-warms the larger fallback pool before the next request burst,
+/// and `ensure_h2` short-circuits the `H2_CONN_TTL_SECS`-based
+/// liveness check on a known-dead cell.
 struct H2Cell {
    send: h2::client::SendRequest<Bytes>,
    created: Instant,
    generation: u64,
+    dead: Arc<AtomicBool>,
 }

 /// "Did this request reach Apps Script?" signal carried out of every
@@ -864,6 +885,8 @@ impl DomainFronter {
    }

    async fn open(&self) -> Result<PooledStream, FronterError> {
+        // Bounded TCP+TLS open. See `H1_OPEN_TIMEOUT_SECS`.
+        let work = async {
            let tcp = TcpStream::connect((self.connect_host.as_str(), 443u16)).await?;
            let _ = tcp.set_nodelay(true);
            let sni = self.next_sni();
@@ -874,34 +897,51 @@ impl DomainFronter {
            // negotiate h2, which would then misframe every fallback
            // request that lands on them.
            let tls = self.tls_connector_h1.connect(name, tcp).await?;
-        Ok(tls)
+            Ok::<_, FronterError>(tls)
+        };
+        match tokio::time::timeout(Duration::from_secs(H1_OPEN_TIMEOUT_SECS), work).await {
+            Ok(r) => r,
+            Err(_) => Err(FronterError::Relay(format!(
+                "h1 open timed out after {}s",
+                H1_OPEN_TIMEOUT_SECS
+            ))),
+        }
    }

    /// Open outbound TLS connections eagerly so the first relay request
    /// doesn't pay a cold handshake.
    ///
-    /// When h2 is enabled, attempts to open the multiplexed h2 cell
-    /// first. Success there means one TCP/TLS handshake serves all
-    /// future requests, so we only need a tiny fallback h1 pool
-    /// (clamped to 2) instead of the full `n` requested. On h2 failure
-    /// (ALPN refusal, network error), falls back to the legacy
-    /// behavior: warm the full `n` h1 sockets.
+    /// h2 and h1 prewarm run in parallel: a request that arrives while
+    /// the h2 handshake is still in flight (or has just hit its 8 s
+    /// timeout) needs a warm h1 socket waiting for it, otherwise the
+    /// h1 fallback path pays a cold handshake on the same slow network
+    /// and the 30 s outer batch budget elapses (#924). v1.9.14 warmed
+    /// h1 unconditionally; v1.9.15 (PR #799) accidentally gated the h1
+    /// prewarm behind `ensure_h2()` so the h1 pool stayed empty during
+    /// the h2 init window.
    ///
-    /// Staggered 500 ms apart so we don't burst N TLS handshakes at the
-    /// Google edge simultaneously, and each connection gets an 8 s
-    /// expiry offset so they roll off gradually instead of all hitting
-    /// POOL_TTL_SECS at once.
+    /// The spawned h2 handshake races h1[0] — boot fires two TLS
+    /// handshakes back-to-back. The 500 ms stagger only applies between
+    /// h1[i] and h1[i+1] for i ≥ 1, so we don't burst the remaining
+    /// h1[1..n] handshakes at the Google edge simultaneously. Each
+    /// connection gets an 8 s expiry offset so they roll off gradually
+    /// instead of all hitting POOL_TTL_SECS at once. If h2 ends up the
+    /// active fast path, `run_pool_refill` trims the pool back down to
+    /// `POOL_MIN_H2_FALLBACK` on the next tick — the extra warm h1
+    /// sockets just age out naturally instead of being kept alive.
    pub async fn warm(self: &Arc<Self>, n: usize) {
-        // Try to bring up the h2 fast path first. If that succeeds,
-        // shrink the h1 pool warm count to the fallback minimum — the
-        // multiplexed h2 conn handles all real traffic, so the h1 pool
-        // only needs to cover the rare case where h2 dies mid-session.
-        let h2_alive = !self.h2_disabled.load(Ordering::Relaxed)
-            && self.ensure_h2().await.is_some();
-        let h1_target = if h2_alive { 2.min(n) } else { n };
+        // Spawn the h2 prewarm in parallel so the h1 prewarm loop
+        // below isn't blocked on it. Capturing the join handle lets
+        // us still log "h2 fast path active" / "h1 fallback only"
+        // accurately at the end.
+        let h2_self = self.clone();
+        let h2_handle = tokio::spawn(async move {
+            !h2_self.h2_disabled.load(Ordering::Relaxed)
+                && h2_self.ensure_h2().await.is_some()
+        });

        let mut warmed = 0usize;
-        for i in 0..h1_target {
+        for i in 0..n {
            if i > 0 {
                tokio::time::sleep(Duration::from_millis(500)).await;
            }
@@ -922,6 +962,17 @@ impl DomainFronter {
                }
            }
        }
+        // Join the h2 prewarm here only to log whether it landed; the
+        // h1 pool above is already populated either way. A panic in
+        // the spawned task surfaces as `JoinError` — log it explicitly
+        // so it isn't indistinguishable from a clean ALPN refusal.
+        let h2_alive = match h2_handle.await {
+            Ok(v) => v,
+            Err(e) => {
+                tracing::warn!("h2 prewarm task failed to join: {}", e);
+                false
+            }
+        };
        if h2_alive {
            tracing::info!(
                "h2 fast path active; h1 fallback pool pre-warmed with {} connection(s)",
@@ -970,7 +1021,10 @@ impl DomainFronter {
                let cell = self.h2_cell.lock().await;
                let h2_alive = cell
                    .as_ref()
-                    .map(|c| c.created.elapsed().as_secs() < H2_CONN_TTL_SECS)
+                    .map(|c| {
+                        c.created.elapsed().as_secs() < H2_CONN_TTL_SECS
+                            && !c.dead.load(Ordering::Relaxed)
+                    })
                    .unwrap_or(false);
                if h2_alive { POOL_MIN_H2_FALLBACK } else { POOL_MIN }
            };
@@ -1115,16 +1169,18 @@ impl DomainFronter {
            return None;
        }

-        // Fast path: existing cell, within TTL. Clone (Arc bump) and
-        // return without touching the open machinery. We can't peek at
-        // SendRequest liveness directly (h2 0.4 doesn't expose
-        // `is_closed`), so a request against a dead conn fails at
-        // `ready()`/`send_request` and the caller poisons by
-        // generation from there.
+        // Fast path: existing cell, within TTL and not flagged dead by
+        // the connection driver. We can't peek at SendRequest liveness
+        // synchronously (h2 0.4 doesn't expose `is_closed`), but the
+        // driver task does flip `dead` when the underlying connection
+        // ends — so a known-dead cell is rejected here without paying
+        // a wasted h2 round trip to discover it.
        {
            let cell = self.h2_cell.lock().await;
            if let Some(c) = cell.as_ref() {
-                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS {
+                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS
+                    && !c.dead.load(Ordering::Relaxed)
+                {
                    return Some((c.send.clone(), c.generation));
                }
            }
@@ -1155,7 +1211,9 @@ impl DomainFronter {
        {
            let cell = self.h2_cell.lock().await;
            if let Some(c) = cell.as_ref() {
-                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS {
+                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS
+                    && !c.dead.load(Ordering::Relaxed)
+                {
                    return Some((c.send.clone(), c.generation));
                }
            }
@@ -1168,8 +1226,8 @@ impl DomainFronter {
            tokio::time::timeout(Duration::from_secs(H2_OPEN_TIMEOUT_SECS), self.open_h2())
                .await;

-        let send = match open_result {
-            Ok(Ok(s)) => s,
+        let (send, dead) = match open_result {
+            Ok(Ok(pair)) => pair,
            Ok(Err(OpenH2Error::AlpnRefused)) => {
                // Definitive: this peer doesn't speak h2. Sticky-disable
                // so we never re-attempt the handshake.
@@ -1206,6 +1264,7 @@ impl DomainFronter {
            send: send.clone(),
            created: Instant::now(),
            generation,
+            dead,
        });
        Some((send, generation))
    }
@@ -1213,7 +1272,11 @@ impl DomainFronter {
    /// Open one TLS connection and run the h2 handshake. Returns a
    /// typed `OpenH2Error` so the caller can recognize ALPN refusal
    /// (sticky disable) without string-matching across boundaries.
-    async fn open_h2(&self) -> Result<h2::client::SendRequest<Bytes>, OpenH2Error> {
+    /// The returned `Arc<AtomicBool>` is the death flag the connection
+    /// driver flips when the h2 `Connection` future ends.
+    async fn open_h2(
+        &self,
+    ) -> Result<(h2::client::SendRequest<Bytes>, Arc<AtomicBool>), OpenH2Error> {
        let tcp = TcpStream::connect((self.connect_host.as_str(), 443u16)).await?;
        let _ = tcp.set_nodelay(true);
        let sni = self.next_sni();
@@ -1228,7 +1291,7 @@ impl DomainFronter {
    /// bypassing the hard-coded `connect_host:443` target.
    async fn h2_handshake_post_tls(
        tls: PooledStream,
-    ) -> Result<h2::client::SendRequest<Bytes>, OpenH2Error> {
+    ) -> Result<(h2::client::SendRequest<Bytes>, Arc<AtomicBool>), OpenH2Error> {
        let alpn_h2 = tls
            .get_ref()
            .1
@@ -1251,15 +1314,19 @@ impl DomainFronter {
            .map_err(|e| OpenH2Error::Handshake(e.to_string()))?;
        // The connection task drives frame I/O independently of any
        // SendRequest handle. When it ends (GOAWAY, network error, TTL),
-        // existing handles will start failing on `ready()` / `send_request`
-        // and `ensure_h2` will reopen on the next call.
+        // we flip the `dead` flag so `ensure_h2` and `run_pool_refill`
+        // can react within one refill tick instead of waiting for a
+        // request to discover the breakage via `ready()` failure.
+        let dead = Arc::new(AtomicBool::new(false));
+        let dead_for_driver = dead.clone();
        tokio::spawn(async move {
            if let Err(e) = conn.await {
                tracing::debug!("h2 connection closed: {}", e);
            }
+            dead_for_driver.store(true, Ordering::Relaxed);
        });
        tracing::info!("h2 connection established to relay edge");
-        Ok(send)
+        Ok((send, dead))
    }

    /// React to an h2-fronting-incompatibility HTTP response (status
@@ -5120,6 +5187,7 @@ hello";
                send: send_v2.clone(),
                created: Instant::now(),
                generation: 2,
+                dead: Arc::new(AtomicBool::new(false)),
            });
        }
        // Task A poisons with stale gen=1.
@@ -5141,6 +5209,52 @@ hello";
        server_handle.abort();
    }

+    #[tokio::test(flavor = "current_thread")]
+    async fn ensure_h2_rejects_dead_cell_within_ttl() {
+        // Cell is within H2_CONN_TTL_SECS but the connection driver
+        // already flipped `dead` (e.g., upstream sent GOAWAY). Without
+        // the dead-flag check `ensure_h2` would happily hand out the
+        // stale SendRequest and the next request would pay a wasted
+        // h2 round trip to discover the breakage. With the check in
+        // place a second pre-existing healthy cell still works fine —
+        // the dead one is replaced via the open-lock path.
+        let (addr, server_handle) = spawn_h2c_server(|_req| {
+            let resp = http::Response::builder().status(200).body(()).unwrap();
+            (resp, Vec::new())
+        })
+        .await;
+        let send = h2c_client(addr).await;
+
+        let fronter = fronter_for_test(false);
+        let dead = Arc::new(AtomicBool::new(true)); // simulate driver having exited
+        {
+            let mut cell = fronter.h2_cell.lock().await;
+            *cell = Some(H2Cell {
+                send,
+                created: Instant::now(), // well within TTL
+                generation: 1,
+                dead: dead.clone(),
+            });
+        }
+
+        // The fast path normally returns Some(send, gen) when the cell
+        // is within TTL. With dead=true it must NOT return the stale
+        // SendRequest. Pre-set the failure-backoff timestamp so
+        // ensure_h2 short-circuits at the backoff check (no network
+        // I/O) regardless of whatever's bound on 127.0.0.1:443 on the
+        // dev/CI host. This isolates the assertion to the new
+        // dead-flag check.
+        *fronter.h2_open_failed_at.lock().await = Some(Instant::now());
+
+        let result = fronter.ensure_h2().await;
+        assert!(
+            result.is_none(),
+            "ensure_h2 must not serve a cell whose driver flipped `dead`"
+        );
+
+        server_handle.abort();
+    }
+
    #[tokio::test(flavor = "current_thread")]
    async fn ensure_h2_skips_reopen_during_failure_backoff() {
        // After an open failure, ensure_h2 must return None for at
@@ -5566,6 +5680,7 @@ hello";
                send: send.clone(),
                created: Instant::now(),
                generation: 7,
+                dead: Arc::new(AtomicBool::new(false)),
            });
        }
        // Pretend a round-trip just incremented h2_calls (which is
@@ -5682,7 +5797,7 @@ hello";
        match result {
            Err(OpenH2Error::AlpnRefused) => {} // expected
            Err(other) => panic!("expected AlpnRefused, got {:?}", other),
-            Ok(_) => panic!("expected AlpnRefused, got Ok"),
+            Ok((_send, _dead)) => panic!("expected AlpnRefused, got Ok"),
        }
        server.await.unwrap();
    }