docs(scrape): add OOM, lock, and salvage troubleshooting

Document container OOM skips, scrape-lock contention, partial temp
salvage, and DCE_CONTAINER_MEMORY in the troubleshooting guide and
GUI bridge quick-start.
Copilot 2026-06-03 09:32:31 -05:00
parent 69ce1ca539
commit e9a3fea9d1
5 changed files with 143 additions and 0 deletions

@@ -322,6 +322,7 @@ Space requirements:
- **Typical channel**: 1-10 MB per year of messages
- **Large channels**: 50-100 MB per year
- **Full guild**: 500 MB - several GB depending on activity
- **Multi-year catch-up in container**: the first export may run out of memory (OOM); set `DCE_CONTAINER_MEMORY=8g` in `scrape.env` and pass `--salvage-before-scrape` (see [Troubleshooting](Recurring-Scrape-Troubleshooting.md#channel-export-skipped-oom--aborted--killed))
## Smoke test validation

@@ -310,6 +310,89 @@ Not this:
---
### Channel Export SKIPPED (OOM / Aborted / Killed)
**Symptoms:** Log shows `SKIPPED` for one channel, `Aborted (core dumped)`, `Killed`, or `out of memory`; other channels in the target may still succeed.
**Cause:** A large multi-year catch-up (for example KotOR `yes_general`) builds a large in-memory JSON export inside the container, which can exhaust the container's memory. Partial progress is kept under `output_dir/.dce-temp/` so the next run can salvage it.
**Solutions:**
1. **Salvage partial temps before re-scraping** (avoids re-downloading from the archive cursor):
```bash
./scripts/scrape-lock-status.sh
./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
```
2. **Raise container memory** in `scrape.env` (default `0` = no compose cap):
```bash
# scrape.env
DCE_CONTAINER_MEMORY=8g
```
Then run a channel-scoped catch-up:
```bash
DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
--salvage-before-scrape \
--target KotOR_discord_msgs \
--channel 221726893064454144 \
--log-file logs/kotor-yes-general.log
```
3. **Ensure only one scrape** holds `{archive_root}/.dce-scrape.lock` (see next section).
4. **Confirm host disk headroom** — merges need temporary space on the archive volume (`df -h ~/Documents`).
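As a rough sketch of the symptom check above, the following function lists log lines that contain any of the markers named under Symptoms. The name `scan_oom_markers` is hypothetical and not one of the repo's scripts, and real log lines may be worded differently than the markers assumed here.

```bash
#!/usr/bin/env bash
# Sketch only: scan_oom_markers is an illustrative helper, not part of the repo.

scan_oom_markers() {
  local log_file="$1"
  # Marker strings come from the Symptoms line above. Matching is
  # case-sensitive except for "out of memory", whose casing is assumed to vary.
  # Prints "<line number>:<line>" per match; exits non-zero when nothing matches.
  grep -n -E 'SKIPPED|Aborted \(core dumped\)|Killed|[Oo]ut of memory' "$log_file"
}
```

For example, `scan_oom_markers logs/kotor-yes-general.log` would print the matching lines with their line numbers, which makes it easier to see which channel was skipped before choosing between salvage and a memory increase.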
---
### Scrape Lock Already Held
**Symptoms:** `Scrape lock is held` or `Another scrape is already running` when starting a validation run or a documents scrape.
**Cause:** Only one scrape should run per `archive_root`. A long validation, cron job, or a second checkout (for example Downloads vs MyBook) can hold `{archive_root}/.dce-scrape.lock`.
**Solutions:**
1. **Inspect lock state:**
```bash
./scripts/scrape-lock-status.sh
```
2. **Wait** for the active scrape to finish if the reported PID is live.
3. **Reclaim a stale lock** after a crash (only when status shows stale/free):
```bash
./scripts/scrape-lock-status.sh --reclaim-stale
```
4. **Do not delete the lock** while a scrape is still running: two concurrent (twin) exports can OOM-loop on the same channel.
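The live/stale distinction used in the steps above can be illustrated with a minimal sketch. The lock file's format is not described here, so the code assumes, purely hypothetically, that its first line is the holder's PID; `lock_holder_state` is an illustrative name. `./scripts/scrape-lock-status.sh` remains the tool to trust.

```bash
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: the first line of the lock file is the holder's PID.
# The real lock format may differ; prefer ./scripts/scrape-lock-status.sh.

lock_holder_state() {
  local lock_file="$1" pid=""
  # No lock file at all: nothing is holding the lock.
  [ -e "$lock_file" ] || { echo "free"; return 0; }
  # read returns non-zero on a missing trailing newline but still sets pid.
  IFS= read -r pid < "$lock_file" || true
  case "$pid" in
    ''|*[!0-9]*) echo "unknown"; return 0 ;;  # not a plain PID
  esac
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # It also fails for processes owned by another user, so "stale" is a hint.
  if kill -0 "$pid" 2>/dev/null; then
    echo "live"
  else
    echo "stale"
  fi
}
```

Under that assumption, `lock_holder_state "$archive_root/.dce-scrape.lock"` prints `free`, `live`, `stale`, or `unknown`, where `archive_root` is a shell variable you would set to the archive root yourself.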
---
### Partial Export Stuck in `.dce-temp`
**Symptoms:** Large folder under `output_dir/.dce-temp/export.<channel_id>.*`; archive cursor not advancing; audit excludes `.dce-temp` (expected).
**Solutions:**
1. **Stop any active export** writing that temp (check lock status and running `podman`/`docker` processes).
2. **Salvage quiescent temps** (default skips temps modified in the last ~120s):
```bash
./scripts/run-documents-scrape.sh --salvage-only --target NAME [--channel ID]
```
3. **Force salvage of an active temp** only after confirming nothing is writing:
```bash
DCE_SALVAGE_ACTIVE_TEMPS=1 ./scripts/run-documents-scrape.sh --salvage-only --target NAME --channel ID
```
4. **Truncated JSON in the archive file itself** (not `.dce-temp`):
```bash
./scripts/salvage-truncated-export.sh path/to/archive.json
```
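To preview which temps would count as quiescent before running a salvage, something like the following sketch may help. It applies the ~120s threshold mentioned above to the temp's own modification time. `list_quiescent_temps` is a hypothetical helper, and it relies on GNU `stat`, which is an assumption about the host.

```bash
#!/usr/bin/env bash
# Sketch only: list_quiescent_temps is illustrative, not part of the repo.
# ASSUMPTION: GNU coreutils stat (stat -c %Y) is available.

list_quiescent_temps() {
  local output_dir="$1" channel_id="$2" min_age="${3:-120}"
  local now mtime path
  now=$(date +%s)
  for path in "$output_dir"/.dce-temp/export."$channel_id".*; do
    [ -e "$path" ] || continue      # the glob matched nothing
    # For a directory, mtime reflects changes to its direct entries only,
    # so a nested write may not show up here.
    mtime=$(stat -c %Y "$path")
    if [ $(( now - mtime )) -ge "$min_age" ]; then
      printf '%s\n' "$path"
    fi
  done
}
```

A call such as `list_quiescent_temps /path/to/output_dir 221726893064454144` prints only the temps for that channel that have not been modified within the threshold. An empty result suggests a writer may still be active.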
---
### "Failed to write archive" or Permission Denied
**Symptoms:** Export fails with write permission errors

@@ -44,6 +44,7 @@ After stopping a long run, merge quiescent partial exports before re-downloading
./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
# Salvage then incremental catch-up (with audit + log)
# For large yes_general catch-up, set DCE_CONTAINER_MEMORY=8g in scrape.env first.
DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
--salvage-before-scrape \
--target KotOR_discord_msgs \

@@ -0,0 +1,56 @@
---
title: "docs: OOM, scrape lock, and salvage troubleshooting"
type: docs
status: complete
date: 2026-06-04
origin: /lfg — plan 063 added DCE_CONTAINER_MEMORY; operator checklist and GUI bridge cover salvage/lock but Recurring-Scrape-Troubleshooting.md still lacks these runbook sections
---
# docs: OOM, scrape lock, and salvage troubleshooting
## Summary
Extend operator-facing docs so OOM skips, scrape-lock contention, partial `.dce-temp` salvage, and `DCE_CONTAINER_MEMORY` are documented in the troubleshooting guide and GUI bridge quick-start.
## Requirements
| ID | Requirement |
|----|-------------|
| R1 | `.docs/Recurring-Scrape-Troubleshooting.md` documents OOM/skipped channels and `DCE_CONTAINER_MEMORY` |
| R2 | Same file documents scrape lock held, twin runs, and `--reclaim-stale` |
| R3 | Same file documents partial `.dce-temp`, `--salvage-only`, and `--salvage-before-scrape` |
| R4 | `docs/gui-zip-recurring-scrape-bridge.md` mentions `DCE_CONTAINER_MEMORY=8g` for yes_general catch-up |
| R5 | `.docs/Recurring-Scrape-Setup.md` links or notes container memory for large channels |
| R6 | `sync-gui-bridge-doc-smoke.sh` still passes; `run-all-smokes.sh` → 21/21 |
## Implementation Units
### U1. Troubleshooting runbook sections
**Files:** `.docs/Recurring-Scrape-Troubleshooting.md`
Add under Export Issues (or a new Runtime section):
- Channel SKIPPED / OOM / Aborted
- Scrape lock already held
- Stale partial exports under `.dce-temp`
### U2. Setup and GUI bridge cross-links
**Files:** `.docs/Recurring-Scrape-Setup.md`, `docs/gui-zip-recurring-scrape-bridge.md`
- Setup: disk/memory note + pointer to troubleshooting
- GUI bridge: `DCE_CONTAINER_MEMORY` in yes_general salvage block
## Verification
```bash
./scripts/tests/sync-gui-bridge-doc-smoke.sh
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```
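Beyond the smoke scripts, requirements R1 to R5 could be spot-checked with a plain text search. The sketch below assumes that the presence of each flag or variable name in the named file is a fair proxy for "documents"; `check_doc_mentions` is a hypothetical helper, not an existing script.

```bash
#!/usr/bin/env bash
# Sketch only: check_doc_mentions is illustrative, not part of the repo.
# Each entry is "<file>:<literal string expected in that file>".

check_doc_mentions() {
  local repo_root="$1" missing=0 entry file pattern
  for entry in \
    ".docs/Recurring-Scrape-Troubleshooting.md:DCE_CONTAINER_MEMORY" \
    ".docs/Recurring-Scrape-Troubleshooting.md:--reclaim-stale" \
    ".docs/Recurring-Scrape-Troubleshooting.md:--salvage-only" \
    ".docs/Recurring-Scrape-Troubleshooting.md:--salvage-before-scrape" \
    "docs/gui-zip-recurring-scrape-bridge.md:DCE_CONTAINER_MEMORY=8g" \
    ".docs/Recurring-Scrape-Setup.md:DCE_CONTAINER_MEMORY"
  do
    file="${entry%%:*}"       # text before the first colon
    pattern="${entry#*:}"     # text after the first colon
    # -F treats the pattern literally; -e is needed because some start with "--".
    if ! grep -q -F -e "$pattern" "$repo_root/$file" 2>/dev/null; then
      echo "missing: $pattern in $file"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_doc_mentions .` from the repository root prints one line per missing mention and returns non-zero if any are absent. It does not replace the smoke tests, since it checks wording only.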
## Scope Boundaries
### Deferred
- Live KotOR catch-up on host
- Per-target memory in `scrape-targets.json`

@@ -154,6 +154,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
**Plan 063 (2026-06-04):** Optional `DCE_CONTAINER_MEMORY` compose `mem_limit` for large channel catch-up (default 0 = unlimited).
**Plan 064 (2026-06-04):** OOM, scrape-lock, and partial-temp salvage runbooks in `.docs/Recurring-Scrape-Troubleshooting.md`; GUI bridge notes `DCE_CONTAINER_MEMORY` for yes_general.
**Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.
## CI note (fork PRs)