fix(tunnel-node): raise long-poll to 15s, adaptive straggler settle up to 500ms

Apps like Telegram maintain persistent MTProto connections (:5222), and
Google Push uses :5228. Both rely on long-lived sessions with
periodic heartbeats. At the previous 5s long-poll deadline, the
tunnel-node returned empty responses often enough that Telegram
interpreted them as connection instability and rotated sessions. Each
reconnect costs a full TLS handshake (~4s through Apps Script),
causing visible video/voice interruptions and buffering.

Raising the long-poll deadline to 15s keeps these persistent
connections alive: the tunnel-node holds the response open until
server data actually arrives (push notification, chat message, media
chunk) or the 15s deadline expires, rather than returning empty every
5s. Tested on censored networks in Iran, where users reported smoother
Telegram video playback and fewer session resets.
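
The hold-until-data behaviour can be sketched in isolation. This is a minimal illustration using only the standard library, not the tunnel-node's actual code: `PollResult` and `long_poll` are hypothetical names, a `std::sync::mpsc` channel stands in for the upstream session, and the deadlines are scaled down from 15s so the example finishes quickly.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of one long-poll (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum PollResult {
    Data(Vec<u8>),
    Empty,
}

/// Hold the poll open until data arrives or `deadline` elapses.
fn long_poll(rx: &mpsc::Receiver<Vec<u8>>, deadline: Duration) -> PollResult {
    match rx.recv_timeout(deadline) {
        Ok(bytes) => PollResult::Data(bytes),
        // Deadline hit or upstream closed: answer empty so the client re-polls.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => PollResult::Empty,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Simulated push that arrives 50 ms into the poll.
    let producer = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        tx.send(b"push".to_vec()).unwrap();
    });
    // The poll returns as soon as the push lands, well before the deadline.
    let first = long_poll(&rx, Duration::from_millis(500));
    producer.join().unwrap();
    println!("{:?}", first);
    // Nothing further arrives, so the next poll comes back empty.
    let second = long_poll(&rx, Duration::from_millis(20));
    println!("{:?}", second);
}
```

With a longer deadline, an idle session produces fewer empty responses per minute, which is the effect the 5s to 15s change relies on.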

The straggler settle is now adaptive (40ms steps, 500ms max): after
the first session in a batch gets data, re-check every 40ms whether
more data is still arriving for the batch's sessions. Break early as
soon as a step passes with no new data (the burst is over), so there
is no fixed 500ms wait when nothing more is coming.
On high-latency relays where each Apps Script call costs ~1.5s
overhead, packing more session responses into one batch saves quota
and reduces total round-trips.
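
The settle loop can be illustrated without the async runtime. The sketch below is a simplified stand-in, not the code in this commit: `adaptive_settle` is a hypothetical helper, the `measure` closure replaces summing the per-session TCP and UDP buffers, and `std::thread::sleep` replaces `tokio::time::sleep`.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Sleep in `step` increments until `measure` stops growing or `max` elapses.
/// Returns the number of steps slept.
fn adaptive_settle<F: FnMut() -> usize>(mut measure: F, step: Duration, max: Duration) -> u32 {
    let settle_end = Instant::now() + max;
    // Snapshot of buffered data before the first step.
    let mut prev = measure();
    let mut steps = 0;
    loop {
        let now = Instant::now();
        if now >= settle_end {
            break; // hard cap reached
        }
        // Never sleep past the cap.
        thread::sleep(step.min(settle_end.duration_since(now)));
        steps += 1;
        let current = measure();
        if current == prev {
            break; // nothing new in the last step: the burst is over
        }
        prev = current;
    }
    steps
}

fn main() {
    // Simulated buffered-byte totals: the snapshot, then one reading per step.
    let mut readings = vec![100usize, 180, 240, 240, 999].into_iter();
    let steps = adaptive_settle(
        move || readings.next().unwrap_or(999),
        Duration::from_millis(5),
        Duration::from_millis(500),
    );
    println!("settled after {} steps", steps);
}
```

In this run the totals grow for two steps and hold steady on the third, so the loop exits after three short sleeps instead of waiting out the full cap.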

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yyoyoian-pixel
2026-04-28 21:05:05 +02:00
parent 7e8e467d3d
commit ca76fe91d0
+56 -26
@@ -42,16 +42,13 @@ const CODE_UNSUPPORTED_OP: &str = "UNSUPPORTED_OP";
 /// milliseconds — once any session in the batch fires its notify.
 const ACTIVE_DRAIN_DEADLINE: Duration = Duration::from_millis(350);
-/// After the first session in an active batch wakes the wait, we sleep
-/// briefly so neighboring sessions whose responses land just after the
-/// first one don't get reported empty and pay an extra round-trip. Only
-/// applies to active batches — for long-poll batches the wake event IS
-/// the data we want, so we deliver it immediately.
-///
-/// 30 ms is much shorter than the legacy two-pass retry (150 + 200 ms)
-/// but covers the typical case of co-located upstreams whose RTTs
-/// cluster within a few tens of ms of each other.
-const STRAGGLER_SETTLE: Duration = Duration::from_millis(30);
+/// Adaptive straggler settle: after the first session in an active batch
+/// wakes the drain, keep checking in STEP increments whether new data is
+/// still arriving. Stops when no new data arrived in the last STEP (the
+/// burst is over) or MAX is reached. Packing more session responses into
+/// one batch saves quota on high-latency relays (~1.5s Apps Script overhead).
+const STRAGGLER_SETTLE_STEP: Duration = Duration::from_millis(40);
+const STRAGGLER_SETTLE_MAX: Duration = Duration::from_millis(500);
 /// Drain-phase deadline when the batch is a pure poll (no writes, no new
 /// connections — clients just asking "any push data?"). Holding the
@@ -65,18 +62,16 @@ const STRAGGLER_SETTLE: Duration = Duration::from_millis(30);
 /// op per session), so any local bytes that arrive while the poll is
 /// being held are stuck in the kernel until the poll returns.
 ///
-/// * Lower (e.g. 2 s) — interactive shells / typing-burst flows feel
-/// snappier, but push-only sessions pay more empty round-trips.
-/// * Higher (e.g. 20 s) — push delivery is near-RTT and round-trip
-/// count is minimal, but a thinking pause between keystrokes can
-/// tax the next keystroke by up to the chosen value.
-///
-/// 5 s is a middle ground: a typing user pausing mid-thought pays at
-/// most a 5 s nudge before their next keystroke flows, while idle
-/// sessions still get the bulk of the long-poll benefit. Must also
-/// stay safely below the client's `BATCH_TIMEOUT` (30 s) and Apps
-/// Script's UrlFetch ceiling (~60 s).
-const LONGPOLL_DEADLINE: Duration = Duration::from_secs(5);
+/// 15 s keeps persistent connections (Telegram on :5222, Google
+/// Push on :5228) alive without forcing frequent reconnects. At 5 s,
+/// apps like Telegram interpreted the frequent empty returns as
+/// connection instability and rotated sessions — each reconnect costs
+/// a full TLS handshake (~4 s through Apps Script), causing visible
+/// video/voice interruptions. 15 s is well below the client's
+/// `BATCH_TIMEOUT` (30 s) and Apps Script's UrlFetch ceiling (~60 s).
+/// Tested on censored networks in Iran where users reported smoother
+/// Telegram video playback and fewer session resets at this value.
+const LONGPOLL_DEADLINE: Duration = Duration::from_secs(15);
 /// Bound on each UDP session's inbound queue. Beyond this we drop oldest
 /// to keep recent voice/media packets moving — a stale RTP frame is
@@ -914,7 +909,6 @@ async fn handle_batch(
             .collect()
         };
-        let wait_start = Instant::now();
         // Wait for either side to wake. Running both concurrently means
         // a TCP-only batch isn't slowed by a stale UDP watch list, and
         // vice versa.
@@ -924,9 +918,45 @@ async fn handle_batch(
         );
         if had_writes_or_connects {
-            let remaining = deadline.saturating_sub(wait_start.elapsed());
-            if !remaining.is_zero() {
-                tokio::time::sleep(STRAGGLER_SETTLE.min(remaining)).await;
+            // Adaptive settle: keep waiting in steps while new data
+            // keeps arriving. Break when:
+            // 1. No new data arrived in the last step (burst is over)
+            // 2. 500ms max reached
+            let settle_end = Instant::now() + STRAGGLER_SETTLE_MAX;
+            let mut prev_tcp_bytes: usize = 0;
+            let mut prev_udp_pkts: usize = 0;
+            // Snapshot current buffer sizes.
+            for inner in &tcp_inners {
+                prev_tcp_bytes += inner.read_buf.lock().await.len();
+            }
+            for inner in &udp_inners {
+                prev_udp_pkts += inner.packets.lock().await.len();
+            }
+            loop {
+                let now = Instant::now();
+                if now >= settle_end {
+                    break;
+                }
+                let remaining = settle_end.duration_since(now);
+                tokio::time::sleep(STRAGGLER_SETTLE_STEP.min(remaining)).await;
+                // Measure current buffer sizes.
+                let mut tcp_bytes: usize = 0;
+                let mut udp_pkts: usize = 0;
+                for inner in &tcp_inners {
+                    tcp_bytes += inner.read_buf.lock().await.len();
+                }
+                for inner in &udp_inners {
+                    udp_pkts += inner.packets.lock().await.len();
+                }
+                // No new data since last step — burst is over.
+                if tcp_bytes == prev_tcp_bytes && udp_pkts == prev_udp_pkts {
+                    break;
+                }
+                prev_tcp_bytes = tcp_bytes;
+                prev_udp_pkts = udp_pkts;
             }
         }