Recurring scrape operator checklist
Use this after cloning or opening the source repo (DiscordChatExporter, not the GUI zip alone).
One-time setup
1. ./scripts/operator-handoff.sh (recommended): disk summary, verify-operator-ready, and documents dry-run in one step. Or run ./scripts/verify-operator-ready.sh alone for prerequisites only.
2. cp scrape.env.example scrape.env and set DISCORD_TOKEN, or run ./scripts/sync-token-from-gui.sh --force (reads GUI Settings.dat).
3. ./scripts/bootstrap-recurring-scrape.sh --dry-run: confirm every enabled target has seeded JSON under output_dir.
4. ./scripts/bootstrap-recurring-scrape.sh: verify archives, build image, preflight Discord.
5. ./scripts/run-documents-scrape.sh: first incremental append-only scrape. Or ./scripts/run-operator-proof.sh --sync-gui --target <name>: handoff + scrape + grow-only proof in one step.
6. ./scripts/prove-incremental-append.sh --target <name>: optional if you did not use run-operator-proof.sh.
7. ./scripts/audit-archive-json.sh: optional; lists invalid JSON before cron runs.
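The "grow-only" idea behind the append-only scrape can be checked by hand with a size snapshot. The sketch below is only an illustration of that idea, not the logic of prove-incremental-append.sh; the helper names snapshot_sizes and shrunk_files are hypothetical, and it assumes archives are .json files under the target's output_dir.

```shell
#!/bin/sh
# Illustrative sketch; the repository's proof script may check different things.

# snapshot_sizes DIR: print "<bytes> <path>" for every .json file under DIR.
snapshot_sizes() (
  find "$1" -type f -name '*.json' | sort | while IFS= read -r f; do
    printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done
)

# shrunk_files SNAPSHOT: print every file that is now smaller or missing.
# Empty output means every archive kept or grew its size.
shrunk_files() (
  while read -r old_size path; do
    if [ ! -f "$path" ]; then
      printf 'missing: %s\n' "$path"
    else
      new_size=$(wc -c < "$path" | tr -d ' ')
      if [ "$new_size" -lt "$old_size" ]; then
        printf 'shrank: %s\n' "$path"
      fi
    fi
  done < "$1"
)

# Possible usage (OUTPUT_DIR is a placeholder for the target's output_dir):
#   snapshot_sizes "$OUTPUT_DIR" > /tmp/before.txt
#   ./scripts/run-documents-scrape.sh --target NAME
#   shrunk_files /tmp/before.txt
```

A size comparison is a coarse check: it catches truncation but not rewritten content, so treat it as a quick sanity pass rather than a replacement for the proof script.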
Monthly automation
./scripts/setup-cron.sh --dry-run
./scripts/setup-cron.sh --skip-preflight # after bootstrap preflight already succeeded
Defaults: first day of month at 04:00. Override with --interval weekly, --at HH:MM, or --cron '0 4 1 * *'.
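To see why '0 4 1 * *' means "first day of the month at 04:00", recall that the five cron fields are minute, hour, day of month, month, and day of week. A minimal sketch of that mapping follows; monthly_cron_expr is a hypothetical helper for illustration, and setup-cron.sh may build its schedule differently.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository's scripts.
# monthly_cron_expr HH:MM -> five-field cron schedule for day 1 of every month.
monthly_cron_expr() (
  hh=${1%%:*}   # text before the colon
  mm=${1##*:}   # text after the colon
  # Drop one leading zero so "04" is printed as "4".
  hh=${hh#0}
  mm=${mm#0}
  # Field order: minute hour day-of-month month day-of-week
  printf '%s %s 1 * *\n' "${mm:-0}" "${hh:-0}"
)

monthly_cron_expr 04:00   # prints: 0 4 1 * *
```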
Installed jobs are marked # BEGIN discord-scrape in crontab -l. Logs append to logs/discord-scrape.log.
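A quick way to confirm the job is installed is to look for the marker in the crontab text. This is a small sketch, assuming only the marker string given above; has_scrape_job is a hypothetical helper name.

```shell
#!/bin/sh
# has_scrape_job: read crontab text on stdin; succeed if the marker is present.
has_scrape_job() {
  grep -qF '# BEGIN discord-scrape'
}

# Possible usage on the machine that runs cron:
#   crontab -l 2>/dev/null | has_scrape_job && echo "installed" || echo "not installed"
#   tail -n 20 logs/discord-scrape.log
```

Reading from stdin keeps the check usable on a saved copy of the crontab as well as on live crontab -l output.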
Narrow a run
./scripts/run-documents-scrape.sh --target KotOR_discord_msgs
./scripts/run-documents-scrape.sh --target KotOR_discord_msgs --channel CHANNEL_ID
./scripts/setup-cron.sh --target KotOR_discord_msgs --channel CHANNEL_ID
# After OOM partials: add --salvage-before-scrape so cron merges stale .dce-temp before scrape
Scrape lock and salvage
Only one scrape should run per archive_root. Lock file: {archive_root}/.dce-scrape.lock.
./scripts/scrape-lock-status.sh
./scripts/scrape-lock-status.sh --reclaim-stale # after crashed run; only when stale/free
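For readers unfamiliar with file-based locking, the one-scrape-per-archive_root rule can be pictured as an atomic "create only if absent" on the lock file. The sketch below uses the shell's noclobber option to show the idea; it is an assumption for illustration, and the repository's own lock handling (including how it decides a lock is stale) may differ. The names acquire_lock and release_lock are hypothetical.

```shell
#!/bin/sh
# Illustrative only; not the repository's locking code.

# acquire_lock ARCHIVE_ROOT: create the lock file, or fail if it already exists.
acquire_lock() (
  set -C   # noclobber: '>' refuses to overwrite an existing file
  echo "$$" > "$1/.dce-scrape.lock"   # record the invoking shell's PID
) 2>/dev/null

# release_lock ARCHIVE_ROOT: remove the lock file.
release_lock() {
  rm -f "$1/.dce-scrape.lock"
}

# Possible usage:
#   if acquire_lock "$ARCHIVE_ROOT"; then
#     ... run the scrape ...
#     release_lock "$ARCHIVE_ROOT"
#   else
#     echo "another scrape holds the lock" >&2
#   fi
```

A lock left behind by a crashed run is what makes a reclaim step necessary: nothing removes the file if the process dies before release.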
Salvage partial exports under output_dir/.dce-temp/ without calling Discord:
./scripts/operator-handoff.sh --salvage-only --target NAME [--channel ID]
./scripts/run-documents-scrape.sh --salvage-only --target NAME [--channel ID]
./scripts/run-operator-validation.sh --salvage-only --target NAME [--channel ID] --log-file logs/salvage.log
Salvage then incremental scrape:
./scripts/run-documents-scrape.sh --salvage-before-scrape --target NAME [--channel ID] [--log-file logs/scrape.log]
./scripts/run-operator-validation.sh --salvage-before-scrape --target NAME [--channel ID] --log-file logs/scrape.log
./scripts/run-operator-proof.sh --salvage-before-scrape --sync-gui --target NAME
# Live documents scrape auto-tees to logs/documents-scrape-<UTC>.log (or --log-file); summary at <log-basename>.summary.json
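The summary file sits next to the log, with the .log suffix swapped for .summary.json. A minimal sketch of that naming rule, assuming the log name ends in .log (behavior for other extensions is not stated here); summary_path_for is a hypothetical helper.

```shell
#!/bin/sh
# summary_path_for LOG: replace a trailing ".log" with ".summary.json".
summary_path_for() {
  printf '%s.summary.json\n' "${1%.log}"
}

summary_path_for logs/scrape.log   # prints: logs/scrape.summary.json
```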
KotOR yes_general (221726893064454144): the first catch-up after a 2021 archive cursor can take hours and may OOM; salvage the preserved partials before retrying. Stop duplicate validation processes (the MyBook and Downloads checkouts share the same lock). KotOR_discord_msgs sets container_memory: "8g" in scrape-targets.json for single-target runs; override it globally with DCE_CONTAINER_MEMORY in scrape.env if needed. Channel-scoped proof:
./scripts/run-operator-validation.sh --salvage-before-scrape \
--target KotOR_discord_msgs --channel 221726893064454144 \
--log-file logs/kotor-yes-general.log
# Also writes logs/kotor-yes-general.summary.json (machine-readable scrape totals)
# Inspect: ./scripts/print-scrape-summary.sh logs/kotor-yes-general.summary.json
./scripts/prove-incremental-append.sh \
--target KotOR_discord_msgs --channel 221726893064454144
GUI zip only
See gui-zip-recurring-scrape-bridge.md, run ./scripts/sync-gui-bridge-doc.sh, or use ../DiscordChatExporter.linux-x64/bootstrap-recurring-scrape.sh.
Validate scripts after changes:
./scripts/run-all-smokes.sh
Merge / review summary: recurring-scrape-merge-readiness.md
Full detail: .docs/Recurring-Scrape-Setup.md