mirror of
https://github.com/therealaleph/MasterHttpRelayVPN-RUST.git
synced 2026-05-17 21:24:48 +03:00
fix: v1.9.9 — Android second disconnect crash + tunnel-node drain correctness
Android (#700 from @ilok67): - Reordered MhrvVpnService.teardown() to call Native.stopProxy() FIRST. The previous order (tun2proxy.stop → tun.close → join → stopProxy) crashed SIGSEGV ~2s after Disconnect: tun2proxy's worker thread was blocked in native code on a SOCKS5 socket read; after the 2s+4s timeouts expired with the worker still alive, Native.stopProxy freed the runtime including that socket, and the worker hit use-after-free in the next read. The old comment claimed "runtime shutdown will knock the rest of the world over" — wrong, Native.stopProxy can't forcibly terminate a separate native thread, it just frees memory the other thread is still using. New order closes the socket first, the worker's blocking read returns with EOF, the worker exits cleanly through its error path, and the join is then near-instant. tunnel-node (PR #695 from @dazzling-no-more, merged): - Cleanup now tracks eof'd sids from drain_now's return value, not the raw atomic — was silently dropping the tail on >16 MiB buffers when EOF arrived between polls. - Phase-1 `data` op no longer holds the sessions map across upstream write/flush — was head-of-line-blocking every other batch op. - Mixed TCP+UDP batch wait switched from tokio::join! to tokio::select! — was paying the UDP LONGPOLL_DEADLINE (15 s) on TCP-ready bursts. - Watcher tasks now wrapped in AbortOnDrop newtype — was leaking Arc<Inner> permits when select!'s loser arm dropped its future. - 2 new regression tests, 35/35 pass. Example configs: - config.exit-node.example.json: added aistudio.google.com + ai.google.dev to default hosts (#701 — AI Studio sanctions Iran IPs). - config.fronting-groups.example.json: PR #696 from @Shjpr9 added Reddit/Fastly/Pinterest/CNN/BuzzFeed family domains on the Fastly 151.101.x.x edge. Tests: 179 lib + 35 tunnel-node green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,22 @@
|
||||
<!-- see docs/changelog/v1.1.0.md for the file format: Persian, then `---`, then English. -->
|
||||
• Fix v1.9.8 Android: کرش جدید ~۲ ثانیه بعد از Disconnect ([#700](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/700) از @ilok67 با root cause + fix کامل): علیرغم fix v1.9.8 برای race lifecycle (#666)، crash جداگانه در `MhrvVpnService.teardown()` باقی مانده بود. ترتیب قبلی: tun2proxy.stop → tun.close → join → Native.stopProxy. مشکل: tun2proxy worker thread در native code blocked روی socket read از SOCKS5 proxy است. وقتی Tun2proxy.stop کالد میشه + 2s timeout میگذره + 4s join timeout میگذره (worker هنوز alive)، Native.stopProxy runtime Rust رو shutdown میکنه شامل listener socket — worker thread که در native blocking read از همان socket است → use-after-free → SIGSEGV. comment کد قدیمی ادعا میکرد "runtime shutdown will knock the rest of the world over" که اشتباه بود — Native.stopProxy نمیتونه force-terminate یک thread native دیگه. ترتیب جدید: **Native.stopProxy اول** (socket رو میبنده → blocking read worker با error برمیگرده → worker پاک exit میکنه از error path)، بعد Tun2proxy.stop (cooperative، redundant ولی ارزان) → tun.close → join (تقریباً همیشه فوری چون worker از قبل تموم شده). تشکر بیشتر از @ilok67 برای triage دقیق دومین crash.
|
||||
• Fix tunnel-node batch drain correctness + lock contention (PR [#695](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/695) از @dazzling-no-more): چهار باگ، دو correctness، دو latency.
|
||||
- **Cleanup race tail-bytes drop میکرد:** session با buffer > ۱۶ MiB + EOF — `drain_now` صحیح `eof=false` برمیگردوند تا tail tail رو در poll بعدی drain کنه، ولی cleanup loop همان atomic رو میخوند، `true` میدید + session رو حذف میکرد + `reader_task` رو abort + tail هدر میرفت. حالا cleanup از مقدار return `drain_now` پیروی میکنه — session فقط بعد از shipped شدن drain که `eof=true` میفرسته، حذف میشه. data loss silent در 1Gbps+ VPS که buffer بین pollها پر میشد، fix شد.
|
||||
- **Sessions-map lock روی upstream await نگه میداشت:** phase-1 `data` op global sessions map رو نگه میداشت روی `last_active.lock`، `writer.lock`، `write_all`، و `flush` — head-of-line-block برای هر batch + connect/close op دیگه. حالا (مثل `udp_data` که قبلاً درست بود) Arc از under map clone میشه، lock drop، بعد write/flush.
|
||||
- **TCP+UDP batch deadline UDP رو میپرداخت:** `tokio::join!(wait_tcp, wait_udp)` conjunctive هست — TCP-ready burst هنوز LONGPOLL_DEADLINE 15 ثانیهای UDP رو میپرداخت قبل از پاسخ. comment میگفت "either side"، code "both sides" انجام میداد. تغییر به `select!`. test جدید `batch_tcp_ready_does_not_pay_udp_longpoll_deadline` این رد رو حفظ میکنه.
|
||||
- **Watcher tasks تحت `select!` cancellation leak میکرد:** `wait_for_any_drainable` فقط در trailing loop watcherها رو abort میکرد — past همه cancel pointها. با تبدیل phase-2 wait به `select!`، loser arm's future drop میشه و watcherهاش *detach* میشن (drop کردن `JoinHandle` abort نمیکنه). هر orphan یک `Arc<...Inner>` نگه میداشت + میتوانست `notify_one()` permit از batch بعدی بدزده. fix: `AbortOnDrop` newtype روی همه `JoinHandle` watcher.
|
||||
۲ test جدید + 35/35 pass.
|
||||
• Example config exit-node با `aistudio.google.com` و `ai.google.dev` — درخواست از [#701](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/701). AI Studio روی Iran IP sanction میخوره (نه Apps Script طرف ما). exit-node IP val.town رو میبینه که نه Iran است نه Google datacenter.
|
||||
• Example config fronting-groups با Reddit / Fastly / Pinterest / CNN / BuzzFeed family domains اضافه شد (PR [#696](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/696) از @Shjpr9). همه روی Fastly Anycast 151.101.x.x — کاربران میتونن از example بیشتر دامنه برداشت کنن، اونی که در شبکهشان کار میکنه نگه دارن.
|
||||
• تست: ۱۷۹ lib + ۳۵ tunnel-node test همه pass.
|
||||
---
|
||||
• Fix Android `~2-second-delayed` crash on Disconnect from v1.9.8 ([#700](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/700) by @ilok67 with full root cause + fix): despite the v1.9.8 fix for the lifecycle race (#666), a separate crash inside `MhrvVpnService.teardown()` remained. Old order was tun2proxy.stop → tun.close → join → Native.stopProxy. Problem: tun2proxy's worker thread is blocked in native code on a socket read from the proxy's SOCKS5 port. After `Tun2proxy.stop()`'s 2s timeout and the 4s thread join both expire (worker still alive), `Native.stopProxy()` shuts down the Rust runtime — including the listener socket — and the worker, still reading from that socket in native code, hits use-after-free → SIGSEGV. The old code comment claimed "the runtime shutdown will knock the rest of the world over," which was wrong: `Native.stopProxy` cannot forcibly terminate a separate native thread. New order: **`Native.stopProxy` FIRST** (closes the socket → worker's blocking read returns with EOF/error → worker exits cleanly through its error path), then `Tun2proxy.stop` (cooperative, mostly redundant but cheap), `tun.close`, then `join` (almost always immediate now). Thanks @ilok67 again for the precise root-cause work on the second crash.
|
||||
• Fix tunnel-node batch drain correctness + lock contention (PR [#695](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/695) from @dazzling-no-more): four bugs, two correctness + two latency.
|
||||
- **Cleanup race dropped tail bytes:** when a session's read buffer > 16 MiB and upstream signaled EOF, `drain_now` correctly returned `eof=false` and left the tail for the next poll, but the cleanup loop read the raw atomic, saw `true`, removed the session, aborted `reader_task`, dropped the tail. Cleanup now tracks eof'd sids from `drain_now`'s return value — the session is only removed once the drain that returned `eof=true` has shipped to the client. Silent data loss on 1Gbps+ VPS that filled the buffer between polls — fixed.
|
||||
- **Sessions-map lock held across upstream awaits:** phase-1 `data` op held the global sessions map across `last_active.lock`, `writer.lock`, `write_all`, and `flush` — head-of-line-blocking every other batch and connect/close op. Now (mirroring `udp_data`'s already-correct shape) it clones the `Arc` under the map lock, drops the lock, then awaits.
|
||||
- **Mixed TCP+UDP batch paid the slower side's deadline:** `tokio::join!(wait_tcp, wait_udp)` is conjunctive — a TCP-ready burst still paid the UDP `LONGPOLL_DEADLINE` (15 s) before responding. Comment said "either side", code did "both sides". Switched to `tokio::select!`. New test `batch_tcp_ready_does_not_pay_udp_longpoll_deadline` locks down the regression.
|
||||
- **Watcher tasks leaked under `select!` cancellation:** `wait_for_any_drainable` only aborted its watcher tasks in a trailing loop, past every cancellation point. With phase-2 wait flipped to `select!`, the loser arm's future drops and *detaches* its watchers (dropping a `JoinHandle` doesn't abort). Each orphan held an `Arc<...Inner>` and could steal a `notify_one()` permit from a future batch. Fix: `AbortOnDrop` newtype wraps every watcher `JoinHandle`.
|
||||
2 new tests + 35/35 pass.
|
||||
• Example config exit-node now lists `aistudio.google.com` and `ai.google.dev` — requested in [#701](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/issues/701). AI Studio sanctions Iran IPs (independently of any Apps Script issue on our side). Routing it through the exit-node makes the destination see val.town's IP, which is neither Iran nor a Google datacenter.
|
||||
• Example config fronting-groups gained Reddit / Fastly / Pinterest / CNN / BuzzFeed family domains (PR [#696](https://github.com/therealaleph/MasterHttpRelayVPN-RUST/pull/696) from @Shjpr9). All on the Fastly Anycast `151.101.x.x` edge — gives users a richer starter list to trim down based on what works in their network.
|
||||
• Tests: 179 lib + 35 tunnel-node tests all passing.
|
||||
Reference in New Issue
Block a user