MasterHttpRelayVPN-RUST/docs/changelog/v1.9.9.md
therealaleph 49b6fbfae7 fix: v1.9.9 — Android second disconnect crash + tunnel-node drain correctness
Android (#700 from @ilok67):
- Reordered MhrvVpnService.teardown() to call Native.stopProxy() FIRST. The previous order (tun2proxy.stop → tun.close → join → stopProxy) crashed SIGSEGV ~2s after Disconnect: tun2proxy's worker thread was blocked in native code on a SOCKS5 socket read; after the 2s+4s timeouts expired with the worker still alive, Native.stopProxy freed the runtime including that socket, and the worker hit use-after-free in the next read. The old comment claimed "runtime shutdown will knock the rest of the world over" — wrong, Native.stopProxy can't forcibly terminate a separate native thread, it just frees memory the other thread is still using. New order closes the socket first, the worker's blocking read returns with EOF, the worker exits cleanly through its error path, and the join is then near-instant.

tunnel-node (PR #695 from @dazzling-no-more, merged):
- Cleanup now tracks eof'd sids from drain_now's return value, not the raw atomic — was silently dropping the tail on >16 MiB buffers when EOF arrived between polls.
- Phase-1 `data` op no longer holds the sessions map across upstream write/flush — was head-of-line-blocking every other batch op.
- Mixed TCP+UDP batch wait switched from tokio::join! to tokio::select! — was paying the UDP LONGPOLL_DEADLINE (15 s) on TCP-ready bursts.
- Watcher tasks now wrapped in AbortOnDrop newtype — was leaking Arc<Inner> permits when select!'s loser arm dropped its future.
- 2 new regression tests, 35/35 pass.

Example configs:
- config.exit-node.example.json: added aistudio.google.com + ai.google.dev to default hosts (#701 — AI Studio sanctions Iran IPs).
- config.fronting-groups.example.json: PR #696 from @Shjpr9 added Reddit/Fastly/Pinterest/CNN/BuzzFeed family domains on the Fastly 151.101.x.x edge.

Tests: 179 lib + 35 tunnel-node green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 03:44:20 +03:00


• Fix Android crash ~2 seconds after Disconnect that remained in v1.9.8 (#700 by @ilok67, with full root cause + fix): despite the v1.9.8 fix for the lifecycle race (#666), a separate crash inside MhrvVpnService.teardown() remained. The old order was tun2proxy.stop → tun.close → join → Native.stopProxy. Problem: tun2proxy's worker thread is blocked in native code on a socket read from the proxy's SOCKS5 port. After Tun2proxy.stop()'s 2s timeout and the 4s thread join both expire (worker still alive), Native.stopProxy() shuts down the Rust runtime, including the listener socket, and the worker, still reading from that socket in native code, hits use-after-free → SIGSEGV. The old code comment claimed "the runtime shutdown will knock the rest of the world over," which was wrong: Native.stopProxy cannot forcibly terminate a separate native thread. New order: Native.stopProxy FIRST (closes the socket → worker's blocking read returns with EOF/error → worker exits cleanly through its error path), then Tun2proxy.stop (cooperative, mostly redundant but cheap), then tun.close, then join (almost always immediate now). Thanks again to @ilok67 for the precise root-cause work on the second crash.
• Fix tunnel-node batch drain correctness + lock contention (PR #695 from @dazzling-no-more): four bugs, two correctness + two latency.

  • Cleanup race dropped tail bytes: when a session's read buffer exceeded 16 MiB and upstream signaled EOF, drain_now correctly returned eof=false and left the tail for the next poll, but the cleanup loop read the raw atomic, saw true, removed the session, aborted reader_task, and dropped the tail. Cleanup now tracks eof'd sids from drain_now's return value, so a session is removed only once the drain that returned eof=true has shipped to the client. This fixes silent data loss on 1 Gbps+ VPSes that filled the buffer between polls.
  • Sessions-map lock held across upstream awaits: the phase-1 data op held the global sessions map lock across last_active.lock, writer.lock, write_all, and flush, head-of-line-blocking every other batch and connect/close op. Now (mirroring udp_data's already-correct shape) it clones the Arc under the map lock, drops the lock, then awaits.
  • Mixed TCP+UDP batch paid the slower side's deadline: tokio::join!(wait_tcp, wait_udp) is conjunctive, so a TCP-ready burst still paid the UDP LONGPOLL_DEADLINE (15 s) before responding. The comment said "either side"; the code did "both sides". Switched to tokio::select!. New test batch_tcp_ready_does_not_pay_udp_longpoll_deadline locks down the regression.
  • Watcher tasks leaked under select! cancellation: wait_for_any_drainable only aborted its watcher tasks in a trailing loop, past every cancellation point. With the phase-2 wait switched to select!, the losing arm's future is dropped and its watchers are detached (dropping a JoinHandle doesn't abort the task). Each orphan held an Arc<...Inner> and could steal a notify_one() permit from a future batch. Fix: an AbortOnDrop newtype wraps every watcher JoinHandle. 2 new tests, 35/35 pass.

• Example config exit-node now lists aistudio.google.com and ai.google.dev, as requested in #701. AI Studio blocks Iranian IPs under sanctions (independently of any Apps Script issue on our side). Routing it through the exit-node makes the destination see val.town's IP, which is neither Iranian nor a Google datacenter.
• Example config fronting-groups gained Reddit / Fastly / Pinterest / CNN / BuzzFeed family domains (PR #696 from @Shjpr9). All are on the Fastly Anycast 151.101.x.x edge, giving users a richer starter list to trim down based on what works in their network.
• Tests: 179 lib + 35 tunnel-node tests, all passing.
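
The tail-bytes cleanup fix in the tunnel-node item above can be illustrated with a minimal, std-only sketch. This is not the tunnel-node code: `Session`, its fields, `poll_once`, and the `max` per-drain cap are hypothetical stand-ins, and only the name `drain_now` and the rule "remove a session only when its drain returned eof=true" come from the changelog.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a tunnel-node session (hypothetical, not the real struct).
struct Session {
    /// Bytes read from upstream that have not been shipped to the client yet.
    buf: Vec<u8>,
    /// Stands in for the raw atomic flag the reader task sets on upstream EOF.
    upstream_eof: bool,
}

impl Session {
    /// Drains at most `max` bytes. Reports eof=true only when upstream has
    /// closed AND nothing is left buffered after this drain.
    fn drain_now(&mut self, max: usize) -> (Vec<u8>, bool) {
        let n = self.buf.len().min(max);
        let chunk: Vec<u8> = self.buf.drain(..n).collect();
        let eof = self.upstream_eof && self.buf.is_empty();
        (chunk, eof)
    }
}

/// One poll: drain every session, then remove only the sessions whose drain
/// reported eof=true. Returns the bytes shipped per session id.
fn poll_once(sessions: &mut HashMap<u64, Session>, max: usize) -> HashMap<u64, Vec<u8>> {
    let mut shipped = HashMap::new();
    let mut eof_sids = Vec::new();
    for (sid, session) in sessions.iter_mut() {
        let (chunk, eof) = session.drain_now(max);
        if eof {
            eof_sids.push(*sid);
        }
        shipped.insert(*sid, chunk);
    }
    // The buggy shape would instead be `sessions.retain(|_, s| !s.upstream_eof)`,
    // which removes a session while its tail is still buffered.
    for sid in eof_sids {
        sessions.remove(&sid);
    }
    shipped
}
```

With a cap smaller than the buffered data and `upstream_eof` already set, the session survives each poll until the drain that empties the buffer, so no tail bytes are lost.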