Mirror of https://github.com/therealaleph/MasterHttpRelayVPN-RUST.git, synced 2026-05-17 21:24:48 +03:00
Fixes #924, the canonical tracking thread for the v1.9.15 Full-mode regression cluster that spanned ~3 weeks and 18+ duplicate reports.

**Root cause** (rigorously bisected to PR #799's `warm()`): PR #799 added HTTP/2 multiplexing on the relay leg, and gated the h1 socket-pool prewarm behind `ensure_h2().await`. `ensure_h2()` is bounded by `H2_OPEN_TIMEOUT_SECS = 8s` but can take the full window on a cold connection. During that window the h1 fallback pool is empty, so any request that arrives goes through the following sequence:

1. `Err((Relay("h2 unavailable"), No))` immediately → falls back to h1
2. h1 path calls `acquire()` → empty pool → cold `open()` → fresh TCP+TLS handshake to `connect_host:443`
3. Same network conditions that stalled h2 also stall h1; cold open exceeds the 30s `batch_timeout` enforced in `dispatch_full_tunnel`
4. User sees `batch timed out after 30s` while apps_script mode keeps working

**Fix** (two commits, both `domain_fronter.rs`-only):

Commit 1, `warm h1 pool in parallel with h2`: Spawn the h2 prewarm in a separate task so the h1 prewarm loop runs concurrently. The full `n` h1 sockets are warm before user traffic, even when the h2 handshake stalls or hits its 8s timeout. `run_pool_refill` trims the pool back to `POOL_MIN_H2_FALLBACK = 2` within 5s once h2 lands as the fast path.

Commit 2, `bound h1 open() + detect dead h2 cells synchronously`:

- `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()` so a stuck handshake to a blackholed `connect_host:443` doesn't block `acquire()` until the outer 30s batch budget elapses (same symptom #924 hits during the warm-race window).
- `H2Cell.dead: Arc<AtomicBool>` is flipped by the connection driver task when `Connection::await` ends (GOAWAY, network error, normal close). `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check both consult the flag, so known-dead cells are rejected within ≤5s instead of waiting for `H2_CONN_TTL_SECS = 540s` to expire or for a request to discover the breakage via `ready()` failure.
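The way the two commits interact can be illustrated with a small std-only sketch. It is not the project's code: the real implementation uses tokio tasks and `tokio::time::timeout`, whereas this version substitutes OS threads and `recv_timeout`, and `open_with_timeout` and `warm_in_parallel` are hypothetical names. The point it tries to show is that the h2 attempt runs beside the h1 loop and that every open has its own deadline.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a helper thread and stop waiting after `limit`.
/// (Hypothetical helper; the real code wraps a future in `tokio::time::timeout`.)
fn open_with_timeout<T, F>(work: F, limit: Duration) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> Result<T, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up; ignore the send error.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(result) => result,
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(format!("open timed out after {}ms", limit.as_millis()))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err("open worker exited without a result".to_string())
        }
    }
}

/// Warm `n` simulated h1 sockets while a simulated h2 handshake that takes
/// `h2_delay` runs on its own thread. Returns (warm h1 count, h2 came up).
fn warm_in_parallel(n: usize, h2_delay: Duration, h2_limit: Duration) -> (usize, bool) {
    // Started first, but on a separate thread, so a stalled h2 handshake
    // cannot delay the h1 loop below.
    let h2_handle = thread::spawn(move || {
        open_with_timeout(
            move || {
                thread::sleep(h2_delay);
                Ok(())
            },
            h2_limit,
        )
        .is_ok()
    });

    let mut warmed = 0usize;
    for _ in 0..n {
        // Each simulated h1 open succeeds at once but still has a deadline.
        if open_with_timeout(|| Ok(()), Duration::from_secs(2)).is_ok() {
            warmed += 1;
        }
    }

    // Join only to learn whether h2 landed; a panicked thread counts as "no h2".
    let h2_alive = h2_handle.join().unwrap_or(false);
    (warmed, h2_alive)
}
```

With a simulated h2 handshake that outlasts its deadline, for example `warm_in_parallel(3, Duration::from_millis(300), Duration::from_millis(30))`, the sketch still reports three warm h1 sockets and `false` for h2, which is the situation the fix targets.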
**API impact**: `h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`. One existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`) tweaks its `Ok` arm to match; the panic message is unchanged.

**Verified locally on top of v1.9.19**:

- `cargo test --lib --release`: 209/209 (was 208; +1 new test `ensure_h2_rejects_dead_cell_within_ttl`)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean
- `(cd tunnel-node && cargo test --release)`: 36/36

**Live end-to-end** (from PR description):

- 5 cold restarts (warm-up race window): 5/5 pass, 9.6-22.5s
- Concurrent burst (5 simultaneous SOCKS5 streams): 5/5
- Default full.json baseline: 200 OK in 13.3s
- `force_http1: true` sanity: 200 OK in 17.7s

**A/B vs PR #903** (per-session pipelining): commits land in disjoint functions, cherry-picked clean on top. If #903 lands first, this needs a mechanical rebase only.

Exemplary debugging work: bisect with concrete probe data, root cause identification down to specific commit + line, working fix with bounded timeouts on the two adjacent paths the same stall pattern could recur through, +1 regression test. The kind of PR that lands fast.

Closes #924.

Reviewed via Anthropic Claude.

Co-Authored-By: rezaisrad <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
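The `(SendRequest, Arc<AtomicBool>)` return shape noted under API impact can be sketched without the `h2` crate. This is an assumption-heavy stand-in, not the project's code: a plain channel plays the connection, a thread plays the driver task, and `open_connection` and `cell_usable` are hypothetical names. Only the flag-flipping pattern mirrors the commit.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Stand-in for the h2 handshake: returns a request handle together with
/// the death flag, plus the driver's join handle so callers can wait on it.
fn open_connection() -> (Sender<()>, Arc<AtomicBool>, JoinHandle<()>) {
    let (close_tx, close_rx) = mpsc::channel::<()>();
    let dead = Arc::new(AtomicBool::new(false));
    let dead_for_driver = Arc::clone(&dead);
    let driver = thread::spawn(move || {
        // Blocks until a message arrives or every sender is dropped,
        // standing in for the connection future running to completion.
        let _ = close_rx.recv();
        // However the connection ended, mark the cell as dead.
        dead_for_driver.store(true, Ordering::Relaxed);
    });
    (close_tx, dead, driver)
}

/// Fast-path check in the spirit of the one in the diff: a cell is usable
/// only while it is inside its TTL and not flagged dead.
fn cell_usable(age_secs: u64, ttl_secs: u64, dead: &AtomicBool) -> bool {
    age_secs < ttl_secs && !dead.load(Ordering::Relaxed)
}
```

Dropping the `Sender` plays the role of the connection ending; once the driver thread has been joined, `cell_usable` returns `false` even though the simulated cell is still well inside its TTL.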
151 additions, 36 deletions
@@ -131,6 +131,14 @@ const H2_OPEN_TIMEOUT_SECS: u64 = 8;
 /// long. Prevents every concurrent caller during an h2 outage from
 /// paying its own full handshake-timeout cost in turn.
 const H2_OPEN_FAILURE_BACKOFF_SECS: u64 = 15;
+/// Same idea as `H2_OPEN_TIMEOUT_SECS` but for the legacy h1 socket
+/// path. Without this, a stuck TCP connect or TLS handshake to a
+/// blackholed `connect_host:443` would block `acquire()` (and the
+/// `warm()` prewarm loop) until the outer batch budget elapsed —
+/// the same symptom #924 hit during the warm-race window. Bounded
+/// here so a single hung handshake aborts fast and the loop / caller
+/// makes progress on the next attempt.
+const H1_OPEN_TIMEOUT_SECS: u64 = 8;
 /// Cadence for Apps Script container keepalive pings. Apps Script
 /// containers go cold after ~5min idle and cost 1-3s on the first
 /// request to wake back up — most painful on YouTube / streaming where
@@ -156,10 +164,23 @@ struct PoolEntry {
 /// `generation` is monotonic per fronter and lets `poison_h2_if_gen`
 /// avoid the race where task A's stale failure clears task B's
 /// freshly-reopened healthy cell.
+///
+/// `dead` is set by the spawned connection-driver task when the h2
+/// `Connection` future ends (GOAWAY, network error, normal close).
+/// Without this, the cell silently held a dead `SendRequest` after a
+/// mid-session disconnect — the next request paid a wasted h2 round
+/// trip to detect it via `ready()` failure, AND `run_pool_refill`
+/// kept maintaining the small `POOL_MIN_H2_FALLBACK` (2-socket) pool
+/// instead of expanding to `POOL_MIN` (8). With the flag,
+/// `run_pool_refill` notices h2 is dead within one tick (≤5 s) and
+/// pre-warms the larger fallback pool before the next request burst,
+/// and `ensure_h2` short-circuits the `H2_CONN_TTL_SECS`-based
+/// liveness check on a known-dead cell.
 struct H2Cell {
     send: h2::client::SendRequest<Bytes>,
     created: Instant,
     generation: u64,
+    dead: Arc<AtomicBool>,
 }

 /// "Did this request reach Apps Script?" signal carried out of every
@@ -864,6 +885,8 @@ impl DomainFronter {
     }

     async fn open(&self) -> Result<PooledStream, FronterError> {
+        // Bounded TCP+TLS open. See `H1_OPEN_TIMEOUT_SECS`.
+        let work = async {
         let tcp = TcpStream::connect((self.connect_host.as_str(), 443u16)).await?;
         let _ = tcp.set_nodelay(true);
         let sni = self.next_sni();
@@ -874,34 +897,51 @@ impl DomainFronter {
         // negotiate h2, which would then misframe every fallback
         // request that lands on them.
         let tls = self.tls_connector_h1.connect(name, tcp).await?;
-        Ok(tls)
+        Ok::<_, FronterError>(tls)
+        };
+        match tokio::time::timeout(Duration::from_secs(H1_OPEN_TIMEOUT_SECS), work).await {
+            Ok(r) => r,
+            Err(_) => Err(FronterError::Relay(format!(
+                "h1 open timed out after {}s",
+                H1_OPEN_TIMEOUT_SECS
+            ))),
+        }
     }

     /// Open outbound TLS connections eagerly so the first relay request
     /// doesn't pay a cold handshake.
     ///
-    /// When h2 is enabled, attempts to open the multiplexed h2 cell
-    /// first. Success there means one TCP/TLS handshake serves all
-    /// future requests, so we only need a tiny fallback h1 pool
-    /// (clamped to 2) instead of the full `n` requested. On h2 failure
-    /// (ALPN refusal, network error), falls back to the legacy
-    /// behavior: warm the full `n` h1 sockets.
+    /// h2 and h1 prewarm run in parallel: a request that arrives while
+    /// the h2 handshake is still in flight (or has just hit its 8 s
+    /// timeout) needs a warm h1 socket waiting for it, otherwise the
+    /// h1 fallback path pays a cold handshake on the same slow network
+    /// and the 30 s outer batch budget elapses (#924). v1.9.14 warmed
+    /// h1 unconditionally; v1.9.15 (PR #799) accidentally gated the h1
+    /// prewarm behind `ensure_h2()` so the h1 pool stayed empty during
+    /// the h2 init window.
     ///
-    /// Staggered 500 ms apart so we don't burst N TLS handshakes at the
-    /// Google edge simultaneously, and each connection gets an 8 s
-    /// expiry offset so they roll off gradually instead of all hitting
-    /// POOL_TTL_SECS at once.
+    /// The spawned h2 handshake races h1[0] — boot fires two TLS
+    /// handshakes back-to-back. The 500 ms stagger only applies between
+    /// h1[i] and h1[i+1] for i ≥ 1, so we don't burst the remaining
+    /// h1[1..n] handshakes at the Google edge simultaneously. Each
+    /// connection gets an 8 s expiry offset so they roll off gradually
+    /// instead of all hitting POOL_TTL_SECS at once. If h2 ends up the
+    /// active fast path, `run_pool_refill` trims the pool back down to
+    /// `POOL_MIN_H2_FALLBACK` on the next tick — the extra warm h1
+    /// sockets just age out naturally instead of being kept alive.
     pub async fn warm(self: &Arc<Self>, n: usize) {
-        // Try to bring up the h2 fast path first. If that succeeds,
-        // shrink the h1 pool warm count to the fallback minimum — the
-        // multiplexed h2 conn handles all real traffic, so the h1 pool
-        // only needs to cover the rare case where h2 dies mid-session.
-        let h2_alive = !self.h2_disabled.load(Ordering::Relaxed)
-            && self.ensure_h2().await.is_some();
-        let h1_target = if h2_alive { 2.min(n) } else { n };
+        // Spawn the h2 prewarm in parallel so the h1 prewarm loop
+        // below isn't blocked on it. Capturing the join handle lets
+        // us still log "h2 fast path active" / "h1 fallback only"
+        // accurately at the end.
+        let h2_self = self.clone();
+        let h2_handle = tokio::spawn(async move {
+            !h2_self.h2_disabled.load(Ordering::Relaxed)
+                && h2_self.ensure_h2().await.is_some()
+        });

         let mut warmed = 0usize;
-        for i in 0..h1_target {
+        for i in 0..n {
             if i > 0 {
                 tokio::time::sleep(Duration::from_millis(500)).await;
             }
@@ -922,6 +962,17 @@ impl DomainFronter {
                 }
             }
         }
+        // Join the h2 prewarm here only to log whether it landed; the
+        // h1 pool above is already populated either way. A panic in
+        // the spawned task surfaces as `JoinError` — log it explicitly
+        // so it isn't indistinguishable from a clean ALPN refusal.
+        let h2_alive = match h2_handle.await {
+            Ok(v) => v,
+            Err(e) => {
+                tracing::warn!("h2 prewarm task failed to join: {}", e);
+                false
+            }
+        };
         if h2_alive {
             tracing::info!(
                 "h2 fast path active; h1 fallback pool pre-warmed with {} connection(s)",
@@ -970,7 +1021,10 @@ impl DomainFronter {
                 let cell = self.h2_cell.lock().await;
                 let h2_alive = cell
                     .as_ref()
-                    .map(|c| c.created.elapsed().as_secs() < H2_CONN_TTL_SECS)
+                    .map(|c| {
+                        c.created.elapsed().as_secs() < H2_CONN_TTL_SECS
+                            && !c.dead.load(Ordering::Relaxed)
+                    })
                     .unwrap_or(false);
                 if h2_alive { POOL_MIN_H2_FALLBACK } else { POOL_MIN }
             };
@@ -1115,16 +1169,18 @@ impl DomainFronter {
             return None;
         }

-        // Fast path: existing cell, within TTL. Clone (Arc bump) and
-        // return without touching the open machinery. We can't peek at
-        // SendRequest liveness directly (h2 0.4 doesn't expose
-        // `is_closed`), so a request against a dead conn fails at
-        // `ready()`/`send_request` and the caller poisons by
-        // generation from there.
+        // Fast path: existing cell, within TTL and not flagged dead by
+        // the connection driver. We can't peek at SendRequest liveness
+        // synchronously (h2 0.4 doesn't expose `is_closed`), but the
+        // driver task does flip `dead` when the underlying connection
+        // ends — so a known-dead cell is rejected here without paying
+        // a wasted h2 round trip to discover it.
         {
             let cell = self.h2_cell.lock().await;
             if let Some(c) = cell.as_ref() {
-                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS {
+                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS
+                    && !c.dead.load(Ordering::Relaxed)
+                {
                     return Some((c.send.clone(), c.generation));
                 }
             }
@@ -1155,7 +1211,9 @@ impl DomainFronter {
         {
             let cell = self.h2_cell.lock().await;
             if let Some(c) = cell.as_ref() {
-                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS {
+                if c.created.elapsed().as_secs() < H2_CONN_TTL_SECS
+                    && !c.dead.load(Ordering::Relaxed)
+                {
                     return Some((c.send.clone(), c.generation));
                 }
             }
@@ -1168,8 +1226,8 @@ impl DomainFronter {
             tokio::time::timeout(Duration::from_secs(H2_OPEN_TIMEOUT_SECS), self.open_h2())
                 .await;

-        let send = match open_result {
-            Ok(Ok(s)) => s,
+        let (send, dead) = match open_result {
+            Ok(Ok(pair)) => pair,
             Ok(Err(OpenH2Error::AlpnRefused)) => {
                 // Definitive: this peer doesn't speak h2. Sticky-disable
                 // so we never re-attempt the handshake.
@@ -1206,6 +1264,7 @@ impl DomainFronter {
             send: send.clone(),
             created: Instant::now(),
             generation,
+            dead,
         });
         Some((send, generation))
     }
@@ -1213,7 +1272,11 @@ impl DomainFronter {
     /// Open one TLS connection and run the h2 handshake. Returns a
     /// typed `OpenH2Error` so the caller can recognize ALPN refusal
     /// (sticky disable) without string-matching across boundaries.
-    async fn open_h2(&self) -> Result<h2::client::SendRequest<Bytes>, OpenH2Error> {
+    /// The returned `Arc<AtomicBool>` is the death flag the connection
+    /// driver flips when the h2 `Connection` future ends.
+    async fn open_h2(
+        &self,
+    ) -> Result<(h2::client::SendRequest<Bytes>, Arc<AtomicBool>), OpenH2Error> {
         let tcp = TcpStream::connect((self.connect_host.as_str(), 443u16)).await?;
         let _ = tcp.set_nodelay(true);
         let sni = self.next_sni();
@@ -1228,7 +1291,7 @@ impl DomainFronter {
     /// bypassing the hard-coded `connect_host:443` target.
     async fn h2_handshake_post_tls(
         tls: PooledStream,
-    ) -> Result<h2::client::SendRequest<Bytes>, OpenH2Error> {
+    ) -> Result<(h2::client::SendRequest<Bytes>, Arc<AtomicBool>), OpenH2Error> {
         let alpn_h2 = tls
             .get_ref()
             .1
@@ -1251,15 +1314,19 @@ impl DomainFronter {
             .map_err(|e| OpenH2Error::Handshake(e.to_string()))?;
         // The connection task drives frame I/O independently of any
         // SendRequest handle. When it ends (GOAWAY, network error, TTL),
-        // existing handles will start failing on `ready()` / `send_request`
-        // and `ensure_h2` will reopen on the next call.
+        // we flip the `dead` flag so `ensure_h2` and `run_pool_refill`
+        // can react within one refill tick instead of waiting for a
+        // request to discover the breakage via `ready()` failure.
+        let dead = Arc::new(AtomicBool::new(false));
+        let dead_for_driver = dead.clone();
         tokio::spawn(async move {
             if let Err(e) = conn.await {
                 tracing::debug!("h2 connection closed: {}", e);
             }
+            dead_for_driver.store(true, Ordering::Relaxed);
         });
         tracing::info!("h2 connection established to relay edge");
-        Ok(send)
+        Ok((send, dead))
     }

     /// React to an h2-fronting-incompatibility HTTP response (status
@@ -5120,6 +5187,7 @@ hello";
                 send: send_v2.clone(),
                 created: Instant::now(),
                 generation: 2,
+                dead: Arc::new(AtomicBool::new(false)),
             });
         }
         // Task A poisons with stale gen=1.
@@ -5141,6 +5209,52 @@ hello";
         server_handle.abort();
     }

+    #[tokio::test(flavor = "current_thread")]
+    async fn ensure_h2_rejects_dead_cell_within_ttl() {
+        // Cell is within H2_CONN_TTL_SECS but the connection driver
+        // already flipped `dead` (e.g., upstream sent GOAWAY). Without
+        // the dead-flag check `ensure_h2` would happily hand out the
+        // stale SendRequest and the next request would pay a wasted
+        // h2 round trip to discover the breakage. With the check in
+        // place a second pre-existing healthy cell still works fine —
+        // the dead one is replaced via the open-lock path.
+        let (addr, server_handle) = spawn_h2c_server(|_req| {
+            let resp = http::Response::builder().status(200).body(()).unwrap();
+            (resp, Vec::new())
+        })
+        .await;
+        let send = h2c_client(addr).await;
+
+        let fronter = fronter_for_test(false);
+        let dead = Arc::new(AtomicBool::new(true)); // simulate driver having exited
+        {
+            let mut cell = fronter.h2_cell.lock().await;
+            *cell = Some(H2Cell {
+                send,
+                created: Instant::now(), // well within TTL
+                generation: 1,
+                dead: dead.clone(),
+            });
+        }
+
+        // The fast path normally returns Some(send, gen) when the cell
+        // is within TTL. With dead=true it must NOT return the stale
+        // SendRequest. Pre-set the failure-backoff timestamp so
+        // ensure_h2 short-circuits at the backoff check (no network
+        // I/O) regardless of whatever's bound on 127.0.0.1:443 on the
+        // dev/CI host. This isolates the assertion to the new
+        // dead-flag check.
+        *fronter.h2_open_failed_at.lock().await = Some(Instant::now());
+
+        let result = fronter.ensure_h2().await;
+        assert!(
+            result.is_none(),
+            "ensure_h2 must not serve a cell whose driver flipped `dead`"
+        );
+
+        server_handle.abort();
+    }
+
     #[tokio::test(flavor = "current_thread")]
     async fn ensure_h2_skips_reopen_during_failure_backoff() {
         // After an open failure, ensure_h2 must return None for at
@@ -5566,6 +5680,7 @@ hello";
                 send: send.clone(),
                 created: Instant::now(),
                 generation: 7,
+                dead: Arc::new(AtomicBool::new(false)),
             });
         }
         // Pretend a round-trip just incremented h2_calls (which is
@@ -5682,7 +5797,7 @@ hello";
         match result {
             Err(OpenH2Error::AlpnRefused) => {} // expected
             Err(other) => panic!("expected AlpnRefused, got {:?}", other),
-            Ok(_) => panic!("expected AlpnRefused, got Ok"),
+            Ok((_send, _dead)) => panic!("expected AlpnRefused, got Ok"),
         }
         server.await.unwrap();
     }