Files
MasterHttpRelayVPN-RUST/docs/maintainer/references/diagnostic-taxonomy.md
T

162 lines
10 KiB
Markdown

# Diagnostic taxonomy: the placeholder body
## What this is
Multiple distinct conditions cause Apps Script (or our own scripts on Apps Script) to return an HTML body that mhrv-rs's batch parser sees as `bad response: no json in batch response: <body prefix>`. Through user reports and iteration we've narrowed the body strings to **6 candidate causes**. Distinguishing them requires both client-side detection (string-match on body content) and server-side disambiguation (`DIAGNOSTIC_MODE` flag in Code.gs).
This taxonomy is the post-mortem evolution of v1.8.0 → v1.8.1 → v1.8.2 → v1.8.3 detection. v1.8.1 falsely asserted "AUTH_KEY mismatch" on body match; v1.8.2 softened to enumerate 4 candidates; v1.8.3 added the Persian-localized cause and the Workspace landing HTML cause for account-flagged deployments — bringing the count to 6.
## The 6 candidate causes
### 1. AUTH_KEY mismatch (intentional decoy)
**Body**:
```html
<!DOCTYPE html>
<html>
<head><title>Web App</title></head>
<body><p>The script completed but did not return anything.</p></body>
</html>
```
**Source**: Our `Code.gs` / `CodeFull.gs` returns this when `request.k !== AUTH_KEY` and `DIAGNOSTIC_MODE = false`. It mimics Apps Script's stock placeholder for empty-return scripts.
**Trigger**: User edited AUTH_KEY in Apps Script editor but didn't redeploy as new version, OR user has different AUTH_KEY in `config.json` than in `Code.gs`, OR user is using Code.gs deployment ID with `mode: full` (which expects CodeFull.gs).
**Disambiguator**: Set `DIAGNOSTIC_MODE = true` in Code.gs / CodeFull.gs + redeploy as new version. Then this case returns `{"e":"unauthorized"}` (explicit JSON) instead of the HTML. The other 5 cases are independent of DIAGNOSTIC_MODE and still return their natural body.
**Fix**: Align AUTH_KEY values + redeploy as new version.
### 2. Apps Script execution timeout
**Body**: same `"The script completed but did not return anything"` HTML, but emitted by Apps Script itself (not our script) when the execution exceeded the per-invocation cap.
**Source**: Apps Script's runtime kills the script after 6-min hard cap or 30s soft cap on Web App responses, then serves the placeholder body.
**Trigger**: Slow upstream destination, large response payload, network stall mid-fetch.
**Disambiguator**: With `DIAGNOSTIC_MODE = true`, AUTH_KEY mismatch (cause 1) goes away; if the placeholder body still appears for some batches, it's likely cause 2/3/4/5/6.
**Fix**: Lower `parallel_concurrency` in `config.json`, retry, accept some intermittent failures.
### 3. Apps Script soft-quota tear
**Body**: same placeholder HTML. Sometimes a different short HTML page mentioning Apps Script's quota system.
**Source**: Apps Script's per-100s rolling soft quota or per-account daily quota hit. Apps Script kills the request mid-execution.
**Trigger**: Account-aggregate UrlFetchApp throughput exceeded per-100s threshold (~30 concurrent or so). Common with multi-device single-deployment users during page load events (browsers fire 50+ requests in a burst).
**Disambiguator**: Same as 2 — DIAGNOSTIC_MODE rules out AUTH_KEY but doesn't distinguish 2 from 3 from 4. Check the per-script_id error rate over a few minutes — if a deployment has 30%+ failure rate during peak browser activity but works fine when idle, it's quota-related (3 or possibly 5).
**Fix**: Lower `parallel_concurrency`, add more deployments to `script_ids` rotation, distribute deployments across multiple Google accounts.
### 4. Iran ISP-side response truncation
**Body**: typically truncated mid-stream — the body that arrives at mhrv-rs is missing the trailing JSON envelope. The early bytes look like a valid Apps Script response prefix but the request was cut by an ISP-side TCP RST mid-flight.
**Source**: Iran's ISP infrastructure (especially TCI/مخابرات) actively RST-injects on TLS connections to specific Google IPs (the #313 pattern).
**Trigger**: Network-conditional. Active throttle periods (sometimes hours, sometimes days). Worse on certain Google IPs. Worse on certain Iranian ISPs.
**Disambiguator**: Direct curl test from the user's network (see `issue-patterns.md` Pattern 3). If curl-to-Apps-Script also gets timeouts/RST, confirmed ISP-side. The HTML body in this case is partial/truncated — sometimes just `<!DOCT...` rather than the full placeholder.
**Fix**: Workarounds in Pattern 3 — `disable_padding`, rotate `google_ip`, switch network, multi-deployment, Full mode + VPS.
### 5. Apps Script Persian-localized soft-quota body
**Body**:
```html
<html lang="fa" dir="rtl">
<head>
<meta name="description" content="پردازش کلمه وب، ارائه‌ها و صفحات گسترده">
...
```
May also include phrases like `از سهمیه پهنای باند مجاز فراتر رفته‌اید` ("you exceeded the allowed bandwidth quota") and `مقدار انتقال داده را کمتر کنید` ("reduce data transfer volume").
**Source**: Apps Script itself. Apps Script localizes its system error pages based on the deploying Google account's locale (fa-IR for Persian) and/or the request-origin IP.
**Trigger**: Account is Persian-locale (common for Iranian users) AND hit a quota threshold (cause 3) OR an internal Google-side hiccup.
**Disambiguator**: With `DIAGNOSTIC_MODE = true`, cause 1 returns explicit JSON; if Persian HTML still appears, it's not our script — it's Apps Script's own response.
**Important**: w0l4i's case in #404 traced through several wrong hypotheses before landing here:
- Initially diagnosed as AUTH_KEY mismatch → no, mixed success/failure on same `script_id`
- Then diagnosed as third-party relay (`g.workstream.ir` looks Iranian) → no, w0l4i clarified it's his own tunnel
- Then diagnosed as Iranian VPS provider appliance → no, Hetzner Nuremberg
- Final landing: Apps Script's own Persian-localized quota response based on Google account locale
This iteration is documented because the false starts are instructive — don't lock in on the first hypothesis.
**Fix**: Same as cause 3 (it's a quota issue presenting as Persian HTML).
### 6. Workspace landing HTML for account-flagged deployments
**Body**:
```html
<html lang="fa" dir="rtl">
<head>
<meta name="description" content="پردازش کلمه وب، ارائه‌ها و صفحات گسترده"...
<title>...</title>
```
The body is Google Workspace's landing page (the description "Word web processing, presentations, and spreadsheets" is the standard tagline for Google Docs/Sheets/Slides). It's served by Apps Script when the deployment owner's Google account is in a flagged state (post-warning, pre-suspension).
**Source**: Apps Script refuses to execute the deployed script when the owning account is restricted, and serves the Workspace landing page as a "log in" prompt instead.
**Trigger**: Account is in stage 1b or stage 2 of the suspension progression (see `issue-patterns.md` Pattern 8). Often correlates with phone-less new accounts that ignored the "action required" prompt.
**Disambiguator**: Owner of the deployment can log in to Google → see if there are pending warnings or restrictions. If yes → fix the account (add phone) or rotate the deployment to a healthier account.
**Fix**: Account-side, not config-side. Add phone verification, OR move to a different deployment owner via #325 workflow.
## v1.8.3 detection logic
```rust
// In src/tunnel_client.rs around line 893+
if err_msg.contains("The script completed but did not return anything") {
tracing::error!(
"batch failed (script {}): got the v1.8.0 decoy/placeholder body — \
could be (1) AUTH_KEY mismatch (run a direct curl probe against \
the deployment to verify), (2) Apps Script execution timeout or \
per-100s quota tear (try lowering parallel_concurrency), \
(3) Apps Script internal hiccup (transient, retry next batch), \
or (4) ISP-side response truncation (#313 pattern, try a \
different google_ip). To distinguish (1) from the rest: set \
DIAGNOSTIC_MODE=true at the top of Code.gs + redeploy as new \
version — only AUTH_KEY mismatch returns this body in diagnostic \
mode.",
sid_short
);
}
```
This is the v1.8.2 string. v1.8.3 adds detection for the Persian quota body and the Workspace landing HTML as separate paths.
## When responding to users showing this log
The right response shape is:
1. **Acknowledge** the log line they pasted
2. **Enumerate** the 6 (or 4-5 in older versions) candidate causes briefly
3. **Identify the most likely** for their specific case using context clues:
- Single-deployment user, fresh setup → likely cause 1 (AUTH_KEY)
- Mixed success/failure on same script_id → NOT cause 1 (AUTH_KEY would fail 100%)
- "Worked yesterday, broken today" → likely cause 4 (ISP throttle) or cause 8 (account flag in progression)
- High concurrency / many devices on one deployment → likely cause 3 (quota) or cause 5 (Persian quota variant)
- Persian HTML body → cause 5 or 6
- Hetzner/Iranian VPS Full-mode user → check if VPS is actually Iranian (provider appliance is real for Iranian VPS only)
4. **Give the disambiguator**: DIAGNOSTIC_MODE flip + redeploy
5. **Give the immediate workaround** appropriate to the most-likely cause
Don't claim certainty before disambiguator data. v1.8.1 over-asserted; v1.8.3 explicitly enumerates because we learned to.
## What v1.8.x roadmap is doing about this
- **Per-script_id error-category counter** — surface in CLI/UI: "deployment AKfycbz1: 95% success, 4% timeout, 1% quota, 0% auth_mismatch over last 5 min". Lets users diagnose without flipping DIAGNOSTIC_MODE.
- **Distinct error categories in client logs** — separate AUTH_KEY mismatch / timeout / quota / ISP truncation / Persian quota / Workspace landing into 6 distinct error log lines. Currently merged.
- **AIMD per-deployment auto-throttle** — automatically lower `parallel_concurrency` for deployments that hit quota too often. Find the sustainable rate per deployment without manual tuning.
These are queued for v1.8.x batch (~2-4 weeks).