diff --git a/.docs/Recurring-Scrape-Setup.md b/.docs/Recurring-Scrape-Setup.md
index 798f4057..fc4ea0c2 100644
--- a/.docs/Recurring-Scrape-Setup.md
+++ b/.docs/Recurring-Scrape-Setup.md
@@ -322,6 +322,7 @@ Space requirements:
 - **Typical channel**: 1-10 MB per year of messages
 - **Large channels**: 50-100 MB per year
 - **Full guild**: 500 MB - several GB depending on activity
+- **Multi-year catch-up in container:** may OOM on first export; set `DCE_CONTAINER_MEMORY=8g` in `scrape.env` and use `--salvage-before-scrape` (see [Troubleshooting](Recurring-Scrape-Troubleshooting.md#channel-export-skipped-oom--aborted--killed))
 
 ## Smoke test validation
 
diff --git a/.docs/Recurring-Scrape-Troubleshooting.md b/.docs/Recurring-Scrape-Troubleshooting.md
index 22957018..63b97865 100644
--- a/.docs/Recurring-Scrape-Troubleshooting.md
+++ b/.docs/Recurring-Scrape-Troubleshooting.md
@@ -310,6 +310,89 @@ Not this:
 
 ---
 
+### Channel Export SKIPPED (OOM / Aborted / Killed)
+
+**Symptoms:** Log shows `SKIPPED` for one channel, `Aborted (core dumped)`, `Killed`, or `out of memory`; other channels in the target may still succeed.
+
+**Cause:** Large multi-year catch-up (for example KotOR `yes_general`) builds a big in-memory JSON export inside the container. Partial progress is kept under `output_dir/.dce-temp/` for salvage on the next run.
+
+**Solutions:**
+
+1. **Salvage partial temps before re-scraping** (avoids re-downloading from the archive cursor):
+   ```bash
+   ./scripts/scrape-lock-status.sh
+   ./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
+   ```
+
+2. **Raise container memory** in `scrape.env` (default `0` = no compose cap):
+   ```bash
+   # scrape.env
+   DCE_CONTAINER_MEMORY=8g
+   ```
+   Then run a channel-scoped catch-up:
+   ```bash
+   DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
+     --salvage-before-scrape \
+     --target KotOR_discord_msgs \
+     --channel 221726893064454144 \
+     --log-file logs/kotor-yes-general.log
+   ```
+
+3. **Ensure only one scrape** holds `{archive_root}/.dce-scrape.lock` (see next section).
+
+4. **Confirm host disk headroom** — merges need temporary space on the archive volume (`df -h ~/Documents`).
+
+---
+
+### Scrape Lock Already Held
+
+**Symptoms:** `Scrape lock is held` or `Another scrape is already running` when starting validation or documents scrape.
+
+**Cause:** Only one scrape should run per `archive_root`. A long validation, cron job, or a second checkout (for example Downloads vs MyBook) can hold `{archive_root}/.dce-scrape.lock`.
+
+**Solutions:**
+
+1. **Inspect lock state:**
+   ```bash
+   ./scripts/scrape-lock-status.sh
+   ```
+
+2. **Wait** for the active scrape to finish if PID is live.
+
+3. **Reclaim stale lock** after a crash (only when status shows stale/free):
+   ```bash
+   ./scripts/scrape-lock-status.sh --reclaim-stale
+   ```
+
+4. **Do not delete the lock** while a scrape is still running — twin exports can OOM-loop on the same channel.
+
+---
+
+### Partial Export Stuck in `.dce-temp`
+
+**Symptoms:** Large folder under `output_dir/.dce-temp/export..*`; archive cursor not advancing; audit excludes `.dce-temp` (expected).
+
+**Solutions:**
+
+1. **Stop any active export** writing that temp (check lock status and running `podman`/`docker` processes).
+
+2. **Salvage quiescent temps** (default skips temps modified in the last ~120s):
+   ```bash
+   ./scripts/run-documents-scrape.sh --salvage-only --target NAME [--channel ID]
+   ```
+
+3. **Force salvage of an active temp** only after confirming nothing is writing:
+   ```bash
+   DCE_SALVAGE_ACTIVE_TEMPS=1 ./scripts/run-documents-scrape.sh --salvage-only --target NAME --channel ID
+   ```
+
+4. **Truncated JSON in the archive file itself** (not `.dce-temp`):
+   ```bash
+   ./scripts/salvage-truncated-export.sh path/to/archive.json
+   ```
+
+---
+
 ### "Failed to write archive" or Permission Denied
 
 **Symptoms:** Export fails with write permission errors
diff --git a/docs/gui-zip-recurring-scrape-bridge.md b/docs/gui-zip-recurring-scrape-bridge.md
index ebd6162d..dd6a0a8a 100644
--- a/docs/gui-zip-recurring-scrape-bridge.md
+++ b/docs/gui-zip-recurring-scrape-bridge.md
@@ -44,6 +44,7 @@ After stopping a long run, merge quiescent partial exports before re-downloading
 ./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
 
 # Salvage then incremental catch-up (with audit + log)
+# For large yes_general catch-up, set DCE_CONTAINER_MEMORY=8g in scrape.env first.
 DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
   --salvage-before-scrape \
   --target KotOR_discord_msgs \
diff --git a/docs/plans/2026-06-04-064-docs-oom-lock-troubleshooting-plan.md b/docs/plans/2026-06-04-064-docs-oom-lock-troubleshooting-plan.md
new file mode 100644
index 00000000..918a3943
--- /dev/null
+++ b/docs/plans/2026-06-04-064-docs-oom-lock-troubleshooting-plan.md
@@ -0,0 +1,56 @@
+---
+title: "docs: OOM, scrape lock, and salvage troubleshooting"
+type: docs
+status: complete
+date: 2026-06-04
+origin: /lfg — plan 063 added DCE_CONTAINER_MEMORY; operator checklist and GUI bridge cover salvage/lock but Recurring-Scrape-Troubleshooting.md still lacks these runbook sections
+---
+
+# docs: OOM, scrape lock, and salvage troubleshooting
+
+## Summary
+
+Extend operator-facing docs so OOM skips, scrape-lock contention, partial `.dce-temp` salvage, and `DCE_CONTAINER_MEMORY` are documented in the troubleshooting guide and GUI bridge quick-start.
+
+## Requirements
+
+| ID | Requirement |
+|----|-------------|
+| R1 | `.docs/Recurring-Scrape-Troubleshooting.md` documents OOM/skipped channels and `DCE_CONTAINER_MEMORY` |
+| R2 | Same file documents scrape lock held, twin runs, and `--reclaim-stale` |
+| R3 | Same file documents partial `.dce-temp`, `--salvage-only`, and `--salvage-before-scrape` |
+| R4 | `docs/gui-zip-recurring-scrape-bridge.md` mentions `DCE_CONTAINER_MEMORY=8g` for yes_general catch-up |
+| R5 | `.docs/Recurring-Scrape-Setup.md` links or notes container memory for large channels |
+| R6 | `sync-gui-bridge-doc-smoke.sh` still passes; `run-all-smokes.sh` → 21/21 |
+
+## Implementation Units
+
+### U1. Troubleshooting runbook sections
+
+**Files:** `.docs/Recurring-Scrape-Troubleshooting.md`
+
+Add under Export Issues (or new Runtime section):
+- Channel SKIPPED / OOM / Aborted
+- Scrape lock already held
+- Stale partial exports under `.dce-temp`
+
+### U2. Setup and GUI bridge cross-links
+
+**Files:** `.docs/Recurring-Scrape-Setup.md`, `docs/gui-zip-recurring-scrape-bridge.md`
+
+- Setup: disk/memory note + pointer to troubleshooting
+- GUI bridge: `DCE_CONTAINER_MEMORY` in yes_general salvage block
+
+## Verification
+
+```bash
+./scripts/tests/sync-gui-bridge-doc-smoke.sh
+DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
+```
+
+## Scope Boundaries
+
+### Deferred
+
+- Live KotOR catch-up on host
+- Per-target memory in `scrape-targets.json`
diff --git a/docs/recurring-scrape-merge-readiness.md b/docs/recurring-scrape-merge-readiness.md
index 1f466943..cb370c74 100644
--- a/docs/recurring-scrape-merge-readiness.md
+++ b/docs/recurring-scrape-merge-readiness.md
@@ -154,6 +154,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
 
 **Plan 063 (2026-06-04):** Optional `DCE_CONTAINER_MEMORY` compose `mem_limit` for large channel catch-up (default 0 = unlimited).
 
+**Plan 064 (2026-06-04):** OOM, scrape-lock, and partial-temp salvage runbooks in `.docs/Recurring-Scrape-Troubleshooting.md`; GUI bridge notes `DCE_CONTAINER_MEMORY` for yes_general.
+
 **Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.
 
 ## CI note (fork PRs)