docs(scrape): add salvage and lock operator playbook

Document scrape-lock-status, reclaim-stale, and salvage-before flags in
operator checklist, merge-readiness, and GUI bridge guide.
Copilot 2026-06-03 07:10:18 -05:00
parent e82007a2c5
commit ad5384ecc1
4 changed files with 118 additions and 7 deletions


@@ -27,14 +27,46 @@ Optional integrity tools:
```bash
./scripts/audit-archive-json.sh
./scripts/scrape-lock-status.sh # show archive-root scrape lock
./scripts/scrape-lock-status.sh --reclaim-stale # clear dead-holder lock artifacts
# ./scripts/salvage-truncated-export.sh path/to/export.json
```
### Stuck or crashed export (partial `.dce-temp`)
After stopping a long run, merge quiescent partial exports before re-downloading history:
```bash
./scripts/scrape-lock-status.sh
./scripts/scrape-lock-status.sh --reclaim-stale # when state is stale
# Merge partial temps only (no Discord)
./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
# Salvage then incremental catch-up (with audit + log)
DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
--salvage-before-scrape \
--target KotOR_discord_msgs \
--channel 221726893064454144 \
--log-file logs/kotor-yes-general-$(date -u +%Y%m%d-%H%M%S).log
```
Or direct documents scrape:
```bash
./scripts/run-documents-scrape.sh \
--salvage-before-scrape \
--target KotOR_discord_msgs \
--channel 221726893064454144
```
If a temp is still being written, stop the export first. To merge an active temp after confirming that nothing is writing to it, set `DCE_SALVAGE_ACTIVE_TEMPS=1`.
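One rough way to confirm that nothing is still writing a temp is to check how long ago it was last modified. The sketch below is illustrative only: `is_quiescent` and its 60-second default are assumptions, not part of the repo's scripts, and an old mtime does not by itself prove the exporter has exited.

```bash
# Hypothetical helper (not in the repo): succeed when FILE has not been
# modified for at least MIN_AGE seconds (default 60).
is_quiescent() {
  local file=$1 min_age=${2:-60}
  local now mtime
  now=$(date +%s)
  # GNU stat first, BSD/macOS stat as a fallback; return 2 if neither works.
  mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file" 2>/dev/null) || return 2
  [ $(( now - mtime )) -ge "$min_age" ]
}
```

For example, `is_quiescent path/to/partial.json 300` succeeds only if the file has been untouched for five minutes. Stopping the export process remains the primary check before setting `DCE_SALVAGE_ACTIVE_TEMPS=1`.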
Archives: locations are defined in `config/scrape-targets.json` (typically `~/Documents/*`, set per target via `output_dir`).
**Disk:** Free several GiB on `/home` and archive roots before large scrapes (`DCE_MIN_FREE_MB`, default 1024).
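To illustrate what a minimum-free-space gate of this kind might look like, here is a minimal sketch. The function names `free_mb` and `require_free_mb` are hypothetical, and the repo's actual check may be implemented differently; only the `DCE_MIN_FREE_MB` variable and its default of 1024 come from the text above.

```bash
# Hypothetical sketch of a free-space guard (not the repo's implementation).
free_mb() {
  # POSIX df: -P prints one line per filesystem, -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

require_free_mb() {
  local path=$1 min_mb=${2:-${DCE_MIN_FREE_MB:-1024}}
  local avail
  avail=$(free_mb "$path")
  [ -n "$avail" ] || return 2          # df failed or gave unexpected output
  if [ "$avail" -lt "$min_mb" ]; then
    echo "only ${avail} MiB free on ${path}; need ${min_mb} MiB" >&2
    return 1
  fi
  return 0
}
```

With this shape, a threshold of `0` always passes, which matches how `DCE_MIN_FREE_MB=0` is used elsewhere in these docs to bypass the gate for offline runs.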
**Validate scripts:** `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` (19 offline smokes)
**Validate scripts:** `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` (21 offline smokes)
**Podman (Fedora):** install `podman-compose` when `docker compose` cannot reach the socket; scripts auto-prefer it.


@@ -0,0 +1,39 @@
---
title: "docs: Salvage and lock operator playbook"
type: docs
status: active
date: 2026-06-04
origin: /lfg — plans 054-059 landed salvage/lock tooling; operator docs still show 19 smokes and omit catch-up playbook
---
# docs: Salvage and lock operator playbook
## Summary
Refresh operator-facing docs with scrape lock diagnostics, salvage flags, and the KotOR yes_general catch-up playbook. Sync the GUI bridge copy.
## Requirements
| ID | Requirement |
|----|-------------|
| R1 | `gui-zip-recurring-scrape-bridge.md` documents lock status, reclaim, salvage-only, salvage-before |
| R2 | `recurring-scrape-operator-checklist.md` adds stuck-channel / partial temp section |
| R3 | `recurring-scrape-merge-readiness.md` reflects 21 smokes and plans 054-059 |
| R4 | Run `sync-gui-bridge-doc.sh` when sibling GUI zip path exists |
## Implementation Units
### U1. Operator doc updates
**Files:** `docs/gui-zip-recurring-scrape-bridge.md`, `docs/recurring-scrape-operator-checklist.md`, `docs/recurring-scrape-merge-readiness.md`
### U2. GUI bridge sync
**Command:** `./scripts/sync-gui-bridge-doc.sh`
## Scope Boundaries
### Deferred
- Live KotOR catch-up execution on host
- `.docs/Recurring-Scrape-Setup.md` full rewrite


@@ -1,18 +1,18 @@
# Recurring scrape — merge readiness
## Branch status (2026-05-30)
## Branch status (2026-06-04)
| Gate | Status |
|------|--------|
| Offline smokes (`run-all-smokes.sh`) | 19/19 pass (includes abort exit 134 skip regression) |
| Offline smokes (`run-all-smokes.sh`) | 21/21 pass |
| Live proof (`run-operator-proof.sh --sync-gui --target eod_discord`) | Passed on maintainer host |
| Monthly cron (`setup-cron.sh`) | Installed (`00 04 1 * *`); dry-run preflight OK for all enabled targets |
| Upstream CI (fork PR) | `action_required` until Tyrrrz approves workflow runs |
**Merge-ready** for upstream review. Further feature work should use a new branch; avoid additional `/lfg` passes unless scope changes.
Fork branch `feat/recurring-cli-scrape` adds append-only, Docker-based incremental exports with optional monthly cron. Intended for personal archive trees under a configurable `archive_root` (for example `~/Documents/*`).
**Recent operator tooling (plans 054-059):** `salvage` subcommand, archive-root scrape lock + `scrape-lock-status.sh`, `--salvage-only` / `--salvage-before-scrape` on validation/documents/handoff/proof, lock gate before scrape, `--reclaim-stale` for dead holders.
GUI zip users: [docs/gui-zip-recurring-scrape-bridge.md](gui-zip-recurring-scrape-bridge.md).
## What ships
@@ -22,7 +22,7 @@ GUI zip users: [docs/gui-zip-recurring-scrape-bridge.md](gui-zip-recurring-scrap
- **Host:** `scripts/run-discord-scrape-host.sh`, `scripts/run-documents-scrape.sh`, `scripts/bootstrap-recurring-scrape.sh`
- **Auth:** `scrape.env`, `scripts/setup-scrape-auth.sh`, `scripts/sync-token-from-gui.sh`
- **Cron:** `scripts/setup-cron.sh` (`--interval monthly` default)
- **Integrity:** `scripts/audit-archive-json.sh`, `scripts/salvage-truncated-export.sh`, `scripts/prove-incremental-append.sh`
- **Integrity:** `scripts/audit-archive-json.sh`, `scripts/salvage-truncated-export.sh`, `scripts/prove-incremental-append.sh`, `scripts/scrape-lock-status.sh`
- **CI:** `.github/workflows/main.yml` job `recurring-scrape-smoke` runs `./scripts/run-all-smokes.sh`
## Validate before merge
@@ -63,6 +63,16 @@ Full validation with log (GUI token sync + scrape + audit):
./scripts/run-operator-validation.sh --sync-gui --target eod_discord
./scripts/run-operator-validation.sh --sync-gui --per-target --continue-on-error
./scripts/run-operator-validation.sh --dry-run
./scripts/run-operator-validation.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
./scripts/run-operator-validation.sh --salvage-before-scrape --target KotOR_discord_msgs --channel 221726893064454144
```
Lock and salvage helpers:
```bash
./scripts/scrape-lock-status.sh
./scripts/scrape-lock-status.sh --reclaim-stale
./scripts/run-documents-scrape.sh --salvage-before-scrape --target NAME [--channel ID]
```
Detail: [.docs/Recurring-Scrape-Setup.md](../.docs/Recurring-Scrape-Setup.md) · [operator checklist](recurring-scrape-operator-checklist.md) · [troubleshooting](../.docs/Recurring-Scrape-Troubleshooting.md)
@@ -121,7 +131,9 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh --sync-gui --per-target -
**Plan 045 (2026-06-04):** `audit-archive-json.sh` and `verify-documents-archives.sh` skip `*/.dce-temp/*` (in-progress partial exports). Salvage run 2026-06-03: 7 merged, 17 unchanged, 3 skipped (+5404 messages); yes_general OOM-skipped with partial temps preserved for next salvage.
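As an illustration of the `*/.dce-temp/*` skip described for plan 045, the sketch below lists archive JSON files while pruning any `.dce-temp` directory. `list_auditable_json` is a hypothetical name, and the actual audit scripts may select files differently.

```bash
# Hypothetical sketch: list *.json under ROOT, skipping in-progress
# partial exports stored in any .dce-temp directory.
list_auditable_json() {
  # -prune stops find from descending into .dce-temp; everything else
  # falls through to the -type/-name tests and is printed.
  find "$1" -path '*/.dce-temp' -prune -o -type f -name '*.json' -print
}
```

Pruning the directory, rather than filtering paths afterwards, also avoids walking large temp trees during an audit.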
**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper prefers `DISCORD_TOKEN_FILE` over inherited shell tokens. `run-all-smokes.sh` → 19/19 pass.
**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper prefers `DISCORD_TOKEN_FILE` over inherited shell tokens.
**Plans 054-059 (2026-06-04):** Salvage-only subcommand; archive-root lock with meta sidecar; operator validation/proof/handoff salvage flags; `scrape-lock-status.sh` + `--reclaim-stale`; documents scrape lock gate + `--salvage-before-scrape`. `run-all-smokes.sh` → 21/21 pass.
**KotOR / yes_general (plans 040-043):** Incremental `--after` works for all channels; most return `UNCHANGED` in seconds. `yes_general` archive last message was **2021-01-17**, so the first catch-up legitimately fetches years of history. Prior bug: OOM skip **deleted** partial temp exports, causing re-download loops. Plan 043 preserves partial temps and salvages on next run.


@@ -29,9 +29,37 @@ Installed jobs are marked `# BEGIN discord-scrape` in `crontab -l`. Logs append
```bash
./scripts/run-documents-scrape.sh --target KotOR_discord_msgs
./scripts/run-documents-scrape.sh --target KotOR_discord_msgs --channel CHANNEL_ID
./scripts/setup-cron.sh --target KotOR_discord_msgs --channel CHANNEL_ID
```
## Scrape lock and salvage
Only one scrape should run per `archive_root`. Lock file: `{archive_root}/.dce-scrape.lock`.
```bash
./scripts/scrape-lock-status.sh
./scripts/scrape-lock-status.sh --reclaim-stale # after crashed run; only when stale/free
```
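For intuition about what "stale" means here, the following sketch classifies a lock by checking whether its recorded holder is still alive. It rests on a loud assumption: that the lock file's first line is the holder's PID. The real lock format and the state names reported by `scrape-lock-status.sh` may differ, and `lock_state` is a hypothetical helper.

```bash
# ASSUMPTION: first line of the lock file is the holder's PID. This is
# illustrative and not how scrape-lock-status.sh is known to work.
lock_state() {
  local lock=$1 pid
  [ -e "$lock" ] || { echo free; return 0; }
  read -r pid < "$lock" || true        # tolerate a missing trailing newline
  case $pid in
    ''|*[!0-9]*) echo unknown; return 0 ;;   # empty or non-numeric content
  esac
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # Caveat: it also fails for live processes owned by another user.
  if kill -0 "$pid" 2>/dev/null; then
    echo held
  else
    echo stale
  fi
}
```

Under these assumptions, `lock_state "$archive_root/.dce-scrape.lock"` prints `stale` only when a lock file exists and its holder is gone, which is the situation `--reclaim-stale` is meant for.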
Salvage partial exports under `output_dir/.dce-temp/` without calling Discord:
```bash
./scripts/operator-handoff.sh --salvage-only --target NAME [--channel ID]
./scripts/run-documents-scrape.sh --salvage-only --target NAME [--channel ID]
./scripts/run-operator-validation.sh --salvage-only --target NAME [--channel ID] --log-file logs/salvage.log
```
Salvage then incremental scrape:
```bash
./scripts/run-documents-scrape.sh --salvage-before-scrape --target NAME [--channel ID]
./scripts/run-operator-validation.sh --salvage-before-scrape --target NAME [--channel ID] --log-file logs/scrape.log
./scripts/run-operator-proof.sh --salvage-before-scrape --sync-gui --target NAME
```
**KotOR yes_general** (`221726893064454144`): the first catch-up from a 2021 archive cursor can take hours and may OOM; run a salvage pass on the preserved partials before retrying. Stop duplicate validation processes (the MyBook and Downloads checkouts share the same lock).
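The one-scrape-per-`archive_root` rule in this section can be illustrated with a generic lock-file pattern. This is a sketch of a common shell technique (atomic creation via `noclobber`), not the repo's implementation; `acquire_lock` and `release_lock` are hypothetical names, and the PID-only file content is an assumption.

```bash
# ASSUMPTION: the lock is a plain file containing the owner's PID. The
# repo's real lock (with its meta sidecar) may work differently.
acquire_lock() {
  local lock=$1
  # With noclobber set, '>' fails if the file already exists, so two
  # processes cannot both create the lock. Runs in a subshell so the
  # option does not leak into the caller.
  ( set -o noclobber; echo "$$" > "$lock" ) 2>/dev/null
}

release_lock() {
  rm -f -- "$1"
}
```

Two checkouts pointing at the same `archive_root` would contend for the same path, so the second `acquire_lock` call fails until the first holder releases it.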
## GUI zip only
See [gui-zip-recurring-scrape-bridge.md](gui-zip-recurring-scrape-bridge.md), run `./scripts/sync-gui-bridge-doc.sh`, or use `../DiscordChatExporter.linux-x64/bootstrap-recurring-scrape.sh`.