fix: v1.9.8 — Android disconnect crash + UI test-button gate for non-apps_script modes

Android (#666 from @ilok67 with full root cause):
- MainActivity.onStop sent ACTION_STOP via startService() and then immediately called stopService() on the same service. ACTION_STOP runs teardown() on a background thread that calls stopSelf() when it finishes; the redundant stopService() triggered onDestroy() in parallel, racing the service lifecycle and crashing on every Disconnect tap. Removed the stopService() call: ACTION_STOP alone is sufficient for both the live-service and the zombie-after-process-death cases. The tornDown AtomicBoolean already guards against double teardown of native state, but it cannot protect against the OS-level race between stopSelf() and stopService().
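
As a rough illustration of what an idempotency guard like tornDown does and does not cover, here is a minimal sketch in Rust. The service itself is Kotlin, so this is an analogue and not the project's code. `Teardown`, `run_once` and `runs` are hypothetical names, and the compare-and-swap is an assumption about how such a guard is typically written:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical stand-in for the service's teardown state.
pub struct Teardown {
    /// Flips from false to true exactly once (analogue of `tornDown`).
    torn_down: AtomicBool,
    /// Counts how many times the teardown body actually ran.
    runs: AtomicUsize,
}

impl Teardown {
    pub fn new() -> Self {
        Self {
            torn_down: AtomicBool::new(false),
            runs: AtomicUsize::new(0),
        }
    }

    /// Runs the teardown body at most once, however many threads call it.
    /// Returns true only for the single caller that won the race.
    pub fn run_once(&self) -> bool {
        let won = self
            .torn_down
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok();
        if won {
            // Placeholder for the real work (stop the tunnel, close the fd,
            // shut down the runtime). Here we only count the invocation.
            self.runs.fetch_add(1, Ordering::Relaxed);
        }
        won
    }

    /// Number of times the teardown body ran.
    pub fn runs(&self) -> usize {
        self.runs.load(Ordering::Relaxed)
    }
}
```

Any number of threads can call `run_once`, but only one performs the teardown body. That covers double teardown of native state; it does nothing about the platform destroying the service while the winning thread is still working, which is why the stopService() call had to go.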

UI (#665 from @cmptrnb):
- The Test Relay button showed a red "test result: fail" status when used in full or direct mode. The underlying test_cmd::run deliberately refuses in those modes (in full mode, probing Apps Script directly while the data plane goes via tunnel-node would give a misleading result; in direct mode there is no Apps Script relay to probe), but the refusal was being translated into a generic "test failed". The UI now checks the mode before running and shows a mode-specific explainer instead: for full mode, load https://whatismyipaddress.com in the browser via the proxy; for direct mode, load https://www.google.com via the proxy.
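
The gate can be pictured as a three-state result: pass, fail, or skipped. Below is a minimal sketch under stated assumptions. `Mode` is a local stand-in for `mhrv_rs::config::Mode` (the `AppsScript` variant name is assumed), `TestStatus` and `press_test` are hypothetical, `run_probe` stands in for the real test command, and the "test result: pass" string is assumed by symmetry with the fail message:

```rust
/// Local stand-in for the project's mode enum; `AppsScript` is an assumed name.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    AppsScript,
    Full,
    Direct,
}

/// Hypothetical UI-facing result. `ok == None` means "skipped", which a UI
/// can render as a neutral notice rather than a red failure.
#[derive(Debug, PartialEq, Eq)]
pub struct TestStatus {
    pub ok: Option<bool>,
    pub msg: String,
}

/// Checks the mode first and only runs the probe where it is meaningful.
pub fn press_test(mode: Mode, run_probe: impl FnOnce() -> bool) -> TestStatus {
    match mode {
        Mode::Full => TestStatus {
            ok: None,
            msg: "Test Relay is wired only for apps_script mode. \
                  In full mode the data plane is the tunnel-node."
                .to_string(),
        },
        Mode::Direct => TestStatus {
            ok: None,
            msg: "Test Relay is wired only for apps_script mode. \
                  In direct mode there is no Apps Script relay."
                .to_string(),
        },
        Mode::AppsScript => {
            // Only this mode actually runs the probe.
            let ok = run_probe();
            let msg = if ok { "test result: pass" } else { "test result: fail" };
            TestStatus {
                ok: Some(ok),
                msg: msg.to_string(),
            }
        }
    }
}
```

Keeping "skipped" distinct from "failed" is the point of the fix: a refusal in full or direct mode says nothing about proxy health, so it should not share the failure state.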

Includes already-merged PR #674 from @yyoyoian-pixel: lowers the client coalesce_step and the tunnel-node straggler settle_step from 40 ms → 10 ms, and raises the tunnel-node settle max from 500 ms → 1000 ms. The tuning is asymmetric: fire quickly when nothing else is queued, but keep coalescing adaptively on bursts. Backwards compatible: existing configs with an explicit `coalesce_step_ms: 40` keep the old behavior.
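
The step/max interaction can be sketched as a small timing model. This is an assumption about the general shape of an adaptive coalesce window (each new op pushes the flush deadline out by one step, capped at max after the first op), not the code from PR #674. `flush_time_ms` and its parameter names are hypothetical:

```rust
/// Given sorted op arrival times in milliseconds, returns the time at which
/// the batch that starts with the first op would be flushed.
///
/// Model (an assumption, not the project's implementation):
/// - the first op opens a window of `step_ms`;
/// - every op that arrives before the current deadline resets the deadline
///   to `arrival + step_ms`;
/// - the deadline never exceeds `first_arrival + max_ms`.
pub fn flush_time_ms(arrivals: &[u64], step_ms: u64, max_ms: u64) -> Option<u64> {
    let first = *arrivals.first()?;
    let cap = first + max_ms;
    let mut deadline = (first + step_ms).min(cap);
    for &t in &arrivals[1..] {
        if t > deadline {
            // Arrived after the flush; it belongs to the next batch.
            break;
        }
        deadline = (t + step_ms).min(cap);
    }
    Some(deadline)
}
```

In this model a lone op flushes after one step, so moving from 40 ms to 10 ms removes 30 ms of waiting, while a steady burst keeps extending the window until it reaches the cap.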

Tests: 179 lib + 33 tunnel-node green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
therealaleph
2026-05-03 15:57:53 +03:00
parent 994dd0b23c
commit 677ec26bee
5 changed files with 76 additions and 21 deletions
+1 -1
@@ -2222,7 +2222,7 @@ dependencies = [
[[package]]
name = "mhrv-rs"
-version = "1.9.7"
+version = "1.9.8"
dependencies = [
"base64 0.22.1",
"bytes",
+1 -1
@@ -1,6 +1,6 @@
[package]
name = "mhrv-rs"
-version = "1.9.7"
+version = "1.9.8"
edition = "2021"
description = "Rust port of MasterHttpRelayVPN -- DPI bypass via Google Apps Script relay with domain fronting"
license = "MIT"
@@ -173,30 +173,36 @@ class MainActivity : AppCompatActivity() {
 }
 },
 onStop = {
-// Three-step teardown. Each step is defensive against a
-// different failure mode we've actually hit in testing:
+// Single-step graceful teardown. ACTION_STOP delivered via
+// startService() reaches MhrvVpnService.onStartCommand,
+// which spawns the `mhrv-teardown` background thread that
+// tears down tun2proxy + the Rust runtime and then calls
+// stopSelf() at the end of teardown. Service stops on its
+// own — we don't need (and must not) follow up with
+// stopService().
 //
-// 1. ACTION_STOP — graceful path. The service receives it,
-// runs its teardown (stops tun2proxy, closes the TUN
-// fd, shuts down the Rust runtime) and stopSelf()'s.
-// This is what we want 99% of the time.
+// History (#666 from @ilok67): we used to call stopService()
+// immediately after startService(stopAction), as belt-and-
+// suspenders against a "force-closed then reopened zombie"
+// case. That second call was firing onDestroy() while the
+// mhrv-teardown thread was still running, racing two threads
+// through the lifecycle and crashing on tap-to-disconnect.
+// The teardown thread's idempotency guard (tornDown
+// AtomicBoolean) protects against double-teardown of native
+// state, but it can't protect against OS-level lifecycle
+// races on stopSelf vs stopService. ACTION_STOP alone is
+// enough for both the live-service and zombie cases —
+// startService creates a fresh service in the new process
+// for zombies, runs teardown (no-op on already-clean state)
+// and stops it.
 //
-// 2. stopService() — covers the "force-closed then
-// reopened" zombie case. Android may auto-restart our
-// START_STICKY service in a fresh process after the
-// user swipes us away from Recents, and the user's
-// next Stop tap needs to actually unbind even if our
-// in-memory TUN fd reference is gone. stopService is
-// idempotent so it's safe to follow the graceful path.
-//
-// 3. We do NOT touch the VpnService permission — that's
-// the OS-wide VPN grant and the user approved it
-// deliberately. Revoking it would force a re-prompt
-// on next Start, which is worse UX.
+// We do NOT touch the VpnService permission — that's the
+// OS-wide VPN grant and the user approved it deliberately.
+// Revoking it would force a re-prompt on next Start, which
+// is worse UX.
 val stopAction = Intent(this, MhrvVpnService::class.java)
 .setAction(MhrvVpnService.ACTION_STOP)
 startService(stopAction)
-stopService(Intent(this, MhrvVpnService::class.java))
 },
 onInstallCaConfirmed = {
 // The flow is (1) export cert, (2) copy it to Downloads so
+14
@@ -0,0 +1,14 @@
<!-- see docs/changelog/v1.1.0.md for the file format: Persian, then `---`, then English. -->
• Fix v1.9.7 Android: کرش روی tap Disconnect ([#666](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/666) از @ilok67 با root cause + fix کامل): `MainActivity.onStop` بعد از `startService(ACTION_STOP)` بلافاصله `stopService()` رو هم می‌زد. ACTION_STOP داخل `MhrvVpnService` یک thread پس‌زمینه به نام `mhrv-teardown` می‌سازه که `teardown()` (بستن tun2proxy، fd TUN، runtime) رو اجرا می‌کنه و در پایانش `stopSelf()` رو فرامی‌خونه. ولی `stopService()` بلافاصله `onDestroy()` رو روی همان service trigger می‌کرد — دو thread همزمان دارن از lifecycle می‌گذرن، و OS process service رو می‌کشه قبل از اینکه teardown تمام بشه. crash بعد از تب Disconnect، در حدود ۹۹٪ از تستها قابل reproduce. حالا `stopService()` حذف شده — `ACTION_STOP` تنها کافی است (هم برای service زنده هم برای حالت زامبی). idempotency guard `tornDown` AtomicBoolean قبلاً موجود بود ولی محافظت OS-level lifecycle race رو نمی‌کرد. تشکر از @ilok67 برای triage عالی.
• Fix v1.9.7 UI: دکمهٔ Test Relay در حالت `full` (و `direct`) "test result: fail" قرمز نشون می‌داد ([#665](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/665) از @cmptrnb). `mhrv-rs test` فقط برای حالت apps_script سیم‌کشی شده — در `full` mode عمداً refuse می‌کنه چون probe مستقیم Apps Script در حالی که data plane از tunnel-node رد می‌شه گمراه‌کننده است. ولی پیام refuse توسط UI به‌عنوان test failure ترجمه می‌شد + کاربر فکر می‌کرد proxy خراب است. حالا UI mode رو قبل از اجرای test چک می‌کنه + برای حالت‌های نامناسب پیام explainer می‌ده به‌جای fail قرمز:
> Test Relay is wired only for apps_script mode. In full mode the data plane is the tunnel-node — to verify it end-to-end, start the proxy and load https://whatismyipaddress.com in your browser via 127.0.0.1:8085. The IP shown should be your tunnel-node's VPS IP.
• Tune adaptive batch coalesce (PR [#674](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/674) از @yyoyoian-pixel): از 40 ms → **10 ms** برای client coalesce step و tunnel-node straggler settle step. tunnel-node settle max از 500 ms → **1000 ms**. منطق asymmetric: وقتی هیچ op دیگری نیست، fast-fire (10 ms کافی برای catch کردن op‌هایی که در همان event-loop tick می‌رسن مثل ۶ موازی parallel browser connection)؛ ولی وقتی هر دو طرف data دارن (uploads، page load بستی)، adaptive reset همچنان batch می‌کنه تا 1 s cap. در short: «وقتی چیزی برای انتظار نیست منتظر نباش، وقتی هست با تمام توان batch کن.» سازگار به عقب: کاربران با `coalesce_step_ms: 40` در config.json رفتار قدیمی رو نگه می‌دارن.
• تست: ۱۷۹ lib + ۳۳ tunnel-node test همه pass.
---
• Fix Android crash on tap-Disconnect from v1.9.7 ([#666](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/666) by @ilok67 with full root cause + fix): `MainActivity.onStop` was calling `stopService()` immediately after `startService(ACTION_STOP)`. ACTION_STOP inside `MhrvVpnService` spawns the `mhrv-teardown` background thread that runs `teardown()` (stops tun2proxy, closes the TUN fd, shuts down the Rust runtime) and then calls `stopSelf()` at the end. But `stopService()` immediately triggered `onDestroy()` on the same service, so two threads raced through the lifecycle and the OS would kill the process before teardown finished. The crash followed a Disconnect tap and was ~99% reproducible. Removed the `stopService()` call: `ACTION_STOP` alone is sufficient for both the live-service and the zombie-after-process-death cases. The existing `tornDown` AtomicBoolean idempotency guard protects against double-teardown of native state, but it can't protect against OS-level lifecycle races between stopSelf and stopService. Thanks @ilok67 for the precise triage.
• Fix UI showing a red "test result: fail" status for `full` (and `direct`) modes from v1.9.7 ([#665](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/665) by @cmptrnb). `mhrv-rs test` is wired only for the apps_script relay path. It deliberately refuses in `full` mode because probing Apps Script directly while the actual data plane goes via tunnel-node would give a misleading green result. But the UI translated that refusal into a generic "test failed" with red status, scaring users into thinking their proxy was broken. Now the UI checks the mode before running the test and shows a friendly explainer for `full`/`direct`:
> Test Relay is wired only for apps_script mode. In full mode the data plane is the tunnel-node — to verify it end-to-end, start the proxy and load https://whatismyipaddress.com in your browser via 127.0.0.1:8085. The IP shown should be your tunnel-node's VPS IP.
• Tune adaptive batch coalesce (PR [#674](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/674) from @yyoyoian-pixel): client coalesce step + tunnel-node straggler settle step from 40 ms → **10 ms**, tunnel-node settle max from 500 ms → **1000 ms**. The asymmetric design — small step, generous max — picks up "fire-and-forget when nothing else is queued" without giving up batching on bursts. The 10 ms still catches ops that arrive in the same event-loop tick (e.g. a browser opening 6 parallel connections on page load), so we don't degenerate into single-op batches; but on a download where the client is just waiting for the next chunk, the per-batch dead-air shrinks by ~30 ms. Backwards-compatible: existing configs with explicit `coalesce_step_ms: 40` keep the old behaviour.
• Tests: 179 lib + 33 tunnel-node tests all passing.
+35
@@ -2171,6 +2171,41 @@ fn background_thread(shared: Arc<Shared>, rx: Receiver<Cmd>) {
Ok(Cmd::Test(cfg)) => {
let shared2 = shared.clone();
// Short-circuit modes where `test_cmd::run` deliberately
// refuses (full mode, direct mode). Those return false
// even when the proxy is healthy, which surfaced as
// "Test failed" + alarming red status — see #665. Show
// a friendly notice instead and skip the test path.
let mode_kind = cfg.mode_kind().ok();
let mode_explainer = match mode_kind {
Some(mhrv_rs::config::Mode::Full) => Some(
"Test Relay is wired only for apps_script mode. \
In full mode the data plane is the tunnel-node — \
to verify it end-to-end, start the proxy and load \
https://whatismyipaddress.com in your browser \
via 127.0.0.1:8085. The IP shown should be your \
tunnel-node's VPS IP. Tracking a real Full-mode \
test in #160."
),
Some(mhrv_rs::config::Mode::Direct) => Some(
"Test Relay is wired only for apps_script mode. \
In direct mode there is no Apps Script relay — \
every request goes through the SNI-rewrite tunnel \
straight to Google's edge. Verify by loading \
https://www.google.com via the proxy."
),
_ => None,
};
if let Some(msg) = mode_explainer {
{
let mut st = shared.state.lock().unwrap();
st.last_test_ok = None;
st.last_test_msg = msg.into();
st.last_test_msg_at = Some(Instant::now());
}
push_log(&shared, &format!("[ui] test skipped: {}", msg));
continue;
}
push_log(&shared, "[ui] running test...");
rt.spawn(async move {
let ok = test_cmd::run(&cfg).await;