fix(tunnel-node): raise long-poll to 15s, adaptive straggler settle up to 500ms

Apps like Telegram maintain persistent XMPP connections (:5222) and
Google Push uses :5228 — both rely on long-lived sessions with
periodic heartbeats. At the previous 5s long-poll deadline, the
tunnel-node returned empty responses frequently enough that Telegram
interpreted it as connection instability and rotated sessions. Each
reconnect costs a full TLS handshake (~4s through Apps Script),
causing visible video/voice interruptions and buffering.

Raising the long-poll deadline to 15s keeps these persistent
connections alive: the tunnel-node holds the response open until
server data actually arrives (push notification, chat message, media
chunk) rather than returning empty every 5s. Tested on censored
networks in Iran where users reported smoother Telegram video
playback and fewer session resets.

The straggler settle is now adaptive (40ms steps, 500ms max): after
the first session in a batch gets data, keep checking every 40ms
whether neighboring sessions also have data. Break early when all
sessions are ready — no fixed 500ms wait when data is already there.
On high-latency relays where each Apps Script call costs ~1.5s
overhead, packing more session responses into one batch saves quota
and reduces total round-trips.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
yyoyoian-pixel
2026-04-28 21:05:05 +02:00
parent 7e8e467d3d
commit ca76fe91d0
+56 -26
View File
@@ -42,16 +42,13 @@ const CODE_UNSUPPORTED_OP: &str = "UNSUPPORTED_OP";
/// milliseconds — once any session in the batch fires its notify.
const ACTIVE_DRAIN_DEADLINE: Duration = Duration::from_millis(350);
/// After the first session in an active batch wakes the wait, we sleep
/// briefly so neighboring sessions whose responses land just after the
/// first one don't get reported empty and pay an extra round-trip. Only
/// applies to active batches — for long-poll batches the wake event IS
/// the data we want, so we deliver it immediately.
///
/// 30 ms is much shorter than the legacy two-pass retry (150 + 200 ms)
/// but covers the typical case of co-located upstreams whose RTTs
/// cluster within a few tens of ms of each other.
const STRAGGLER_SETTLE: Duration = Duration::from_millis(30);
/// Adaptive straggler settle: after the first session in an active batch
/// wakes the drain, keep checking in STEP increments whether new data is
/// still arriving. Stops when no new data arrived in the last STEP (the
/// burst is over) or MAX is reached. Packing more session responses into
/// one batch saves quota on high-latency relays (~1.5s Apps Script overhead).
const STRAGGLER_SETTLE_STEP: Duration = Duration::from_millis(40);
const STRAGGLER_SETTLE_MAX: Duration = Duration::from_millis(500);
/// Drain-phase deadline when the batch is a pure poll (no writes, no new
/// connections — clients just asking "any push data?"). Holding the
@@ -65,18 +62,16 @@ const STRAGGLER_SETTLE: Duration = Duration::from_millis(30);
/// op per session), so any local bytes that arrive while the poll is
/// being held are stuck in the kernel until the poll returns.
///
/// * Lower (e.g. 2 s) — interactive shells / typing-burst flows feel
/// snappier, but push-only sessions pay more empty round-trips.
/// * Higher (e.g. 20 s) — push delivery is near-RTT and round-trip
/// count is minimal, but a thinking pause between keystrokes can
/// tax the next keystroke by up to the chosen value.
///
/// 5 s is a middle ground: a typing user pausing mid-thought pays at
/// most a 5 s nudge before their next keystroke flows, while idle
/// sessions still get the bulk of the long-poll benefit. Must also
/// stay safely below the client's `BATCH_TIMEOUT` (30 s) and Apps
/// Script's UrlFetch ceiling (~60 s).
const LONGPOLL_DEADLINE: Duration = Duration::from_secs(5);
/// 15 s keeps persistent connections (Telegram XMPP on :5222, Google
/// Push on :5228) alive without forcing frequent reconnects. At 5 s,
/// apps like Telegram interpreted the frequent empty returns as
/// connection instability and rotated sessions — each reconnect costs
/// a full TLS handshake (~4 s through Apps Script), causing visible
/// video/voice interruptions. 15 s is well below the client's
/// `BATCH_TIMEOUT` (30 s) and Apps Script's UrlFetch ceiling (~60 s).
/// Tested on censored networks in Iran where users reported smoother
/// Telegram video playback and fewer session resets at this value.
const LONGPOLL_DEADLINE: Duration = Duration::from_secs(15);
/// Bound on each UDP session's inbound queue. Beyond this we drop oldest
/// to keep recent voice/media packets moving — a stale RTP frame is
@@ -914,7 +909,6 @@ async fn handle_batch(
.collect()
};
let wait_start = Instant::now();
// Wait for either side to wake. Running both concurrently means
// a TCP-only batch isn't slowed by a stale UDP watch list, and
// vice versa.
@@ -924,9 +918,45 @@ async fn handle_batch(
);
if had_writes_or_connects {
let remaining = deadline.saturating_sub(wait_start.elapsed());
if !remaining.is_zero() {
tokio::time::sleep(STRAGGLER_SETTLE.min(remaining)).await;
// Adaptive settle: keep waiting in steps while new data
// keeps arriving. Break when:
// 1. No new data arrived in the last step (burst is over)
// 2. 500ms max reached
let settle_end = Instant::now() + STRAGGLER_SETTLE_MAX;
let mut prev_tcp_bytes: usize = 0;
let mut prev_udp_pkts: usize = 0;
// Snapshot current buffer sizes.
for inner in &tcp_inners {
prev_tcp_bytes += inner.read_buf.lock().await.len();
}
for inner in &udp_inners {
prev_udp_pkts += inner.packets.lock().await.len();
}
loop {
let now = Instant::now();
if now >= settle_end {
break;
}
let remaining = settle_end.duration_since(now);
tokio::time::sleep(STRAGGLER_SETTLE_STEP.min(remaining)).await;
// Measure current buffer sizes.
let mut tcp_bytes: usize = 0;
let mut udp_pkts: usize = 0;
for inner in &tcp_inners {
tcp_bytes += inner.read_buf.lock().await.len();
}
for inner in &udp_inners {
udp_pkts += inner.packets.lock().await.len();
}
// No new data since last step — burst is over.
if tcp_bytes == prev_tcp_bytes && udp_pkts == prev_udp_pkts {
break;
}
prev_tcp_bytes = tcp_bytes;
prev_udp_pkts = udp_pkts;
}
}