DiscordChatExporter/docs/plans/2026-05-29-038-feat-scrape-logging-run-summary-plan.md
Copilot 71a443267e feat(scrape): run plan, channel ledger, and all-target proof
Log scrape plan/summary with per-file message deltas in the core script.
Host wrappers and operator entrypoints print target lists; operator-proof
defaults to all enabled targets when --target is omitted.
2026-05-29 20:34:22 -05:00

| title | type | status | date | origin |
| --- | --- | --- | --- | --- |
| feat: Scrape logging, run summary, and default-all-targets | feat | complete | 2026-05-29 | /lfg - operator scripts need explicit server/file/message visibility and sane defaults |

feat: Scrape logging, run summary, and default-all-targets

Summary

Make recurring scrape scripts print an upfront run plan (which guilds/servers, which output folders), per-channel file I/O with message deltas, and a final change summary. Operator entrypoints default to all enabled targets from config/scrape-targets.json without requiring repeated --target flags.

Problem Frame

Operators cannot tell from current logs which Discord server was scraped, which archive files were touched, or how many messages were appended vs unchanged. run-operator-proof.sh still hardcodes eod_discord. The core engine (run-discord-scrape.sh) logs channel IDs but not guild names, paths, or before/after counts in one place.

Requirements

| ID | Requirement |
| --- | --- |
| R1 | Before scrape/preflight, log config path and every selected target with name, resolved guild id/name(s), and output_dir |
| R2 | For each channel processed, log destination file path, action (CREATED, MERGED, UNCHANGED, SKIPPED), and message counts before → after (plus fetched batch size when merged) |
| R3 | After all targets complete, print a consolidated run summary with per-file deltas and totals |
| R4 | run-documents-scrape.sh and host wrapper print the same run-plan header before invoking the container |
| R5 | run-operator-proof.sh defaults to all enabled targets (loop handoff → scrape → prove) when --target is omitted |
| R6 | Offline smokes pass; scrape smoke asserts summary markers exist |
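
The action labels in R2 could be derived from a channel's before/after state roughly as follows. This is a minimal sketch, not the script's actual logic: the helper names `classify_channel_action` and `format_channel_line`, their arguments, the log line layout, and the rule that CREATED means "no archive file existed before the run" are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a channel's state to an R2 action label.
# Args: existed (0|1), skipped (0|1), message count before, message count after.
classify_channel_action() {
  local existed="$1" skipped="$2" before="$3" after="$4"
  if (( skipped )); then
    echo "SKIPPED"
  elif (( ! existed )); then
    echo "CREATED"      # assumption: no archive file existed before this run
  elif (( after > before )); then
    echo "MERGED"
  else
    echo "UNCHANGED"
  fi
}

# Hypothetical log line layout: action, path, counts before -> after,
# plus the fetched batch size when the action is MERGED.
format_channel_line() {
  local path="$1" action="$2" before="$3" after="$4" fetched="${5:-}"
  local line="[$action] $path messages: $before -> $after"
  if [[ "$action" == "MERGED" && -n "$fetched" ]]; then
    line+=" (fetched $fetched)"
  fi
  echo "$line"
}
```

Keeping classification separate from formatting means the same label can feed both the per-channel line and the final summary.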

Key Technical Decisions

  • KTD1: Ledger in run-discord-scrape.sh: Keep summary state in bash arrays inside the core script rather than a new shared library — host wrappers only need jq-based target listing; the container owns channel-level detail.
  • KTD2: Guild labels from cache + export metadata: Resolve guild names from load_guild_cache at target start; enrich per-channel lines from export JSON when available.
  • KTD3: No behavior change to merge semantics: Logging only; append-only merge and skip rules stay unchanged.
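
For KTD2, a cache-backed name lookup might look like the sketch below. The plan does not describe how load_guild_cache stores its data, so the associative array `GUILD_NAME_CACHE`, the `unknown-guild` fallback, and the helper `describe_guild` are assumptions; only the name guild_name_for_id comes from the plan. Associative arrays need bash 4 or newer.

```bash
#!/usr/bin/env bash
# ASSUMPTION: the guild cache is held in memory as an id -> name map.
declare -A GUILD_NAME_CACHE=()

guild_name_for_id() {
  local id="$1"
  # Fall back to a placeholder so log lines stay well-formed on a cache miss.
  echo "${GUILD_NAME_CACHE[$id]:-unknown-guild}"
}

# Hypothetical formatter for the "guild id/name" part of a run-plan line.
describe_guild() {
  local id="$1"
  echo "$id ($(guild_name_for_id "$id"))"
}
```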

Implementation Units

U1. Core scrape ledger and summary

Goal: Operator-visible run plan, per-channel I/O lines, and final summary in run-discord-scrape.sh.

Requirements: R1, R2, R3

Files: scripts/run-discord-scrape.sh, scripts/tests/run-discord-scrape-smoke.sh

Approach: Add SCRAPE_SUMMARY_ENTRIES, guild_name_for_id, describe_target_resolution, log_run_plan, record_channel_result, print_scrape_summary. Call from run_target_mode and scrape_target. Preflight reuses run plan header.
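
One way the ledger could be shaped, as a sketch under stated assumptions: the names SCRAPE_SUMMARY_ENTRIES, record_channel_result, and print_scrape_summary come from the approach above, but the pipe-delimited entry format, the argument order, and the output layout are illustrative guesses rather than the script's real behavior. The entry format also assumes paths never contain a `|` character.

```bash
#!/usr/bin/env bash
# Sketch of an in-script ledger (KTD1). ASSUMED entry format:
# "target|path|action|before|after", one string per processed channel.
SCRAPE_SUMMARY_ENTRIES=()

record_channel_result() {
  local target="$1" path="$2" action="$3" before="$4" after="$5"
  SCRAPE_SUMMARY_ENTRIES+=("${target}|${path}|${action}|${before}|${after}")
  # Log to stderr immediately so a partial ledger survives an early failure.
  echo "[$action] $target $path messages: $before -> $after" >&2
}

print_scrape_summary() {
  local entry target path action before after
  local files=0 added=0
  echo "Scrape run summary"
  for entry in "${SCRAPE_SUMMARY_ENTRIES[@]}"; do
    IFS='|' read -r target path action before after <<<"$entry"
    printf '  %-9s %s (%s): %d -> %d (+%d)\n' \
      "$action" "$path" "$target" "$before" "$after" "$(( after - before ))"
    files=$(( files + 1 ))
    added=$(( added + after - before ))
  done
  echo "  total: $files file(s), +$added message(s)"
}
```

Writing each channel line to stderr at record time, instead of only at the end, is what would let the error-path scenario below still show a partial ledger.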

Test scenarios:

  • Happy path: smoke run shows Scrape run plan, MERGED/CREATED/UNCHANGED lines, and Scrape run summary
  • Edge: skipped channel appears as SKIPPED in summary
  • Error path: failure before summary still leaves partial ledger in stderr

Verification: ./scripts/tests/run-discord-scrape-smoke.sh passes with grep for summary markers.
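
The marker check in the smoke test could be as small as the sketch below. `assert_summary_markers` is a hypothetical helper name, and it assumes the smoke run's output has been captured to a file; the marker strings are the ones named in the happy-path scenario.

```bash
#!/usr/bin/env bash
# Sketch: assert that captured scrape output contains the summary markers.
# assert_summary_markers is a hypothetical helper, not an existing function.
assert_summary_markers() {
  local log_file="$1" marker
  for marker in "Scrape run plan" "Scrape run summary"; do
    if ! grep -qF -- "$marker" "$log_file"; then
      echo "FAIL: missing marker: $marker" >&2
      return 1
    fi
  done
  # At least one per-channel action label should be present.
  if ! grep -qE 'CREATED|MERGED|UNCHANGED|SKIPPED' "$log_file"; then
    echo "FAIL: no per-channel action line found" >&2
    return 1
  fi
}
```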

U2. Host and documents wrapper banners

Goal: Host-side run plan before container execution.

Requirements: R4

Files: scripts/run-discord-scrape-host.sh, scripts/run-documents-scrape.sh, scripts/operator-handoff.sh

Approach: Shared helper pattern: use jq to list the enabled or selected targets with their output_dir, and print the subcommand and config paths. operator-handoff.sh lists enabled targets in its handoff header.
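
A host-side banner along these lines might look like the following sketch. The plan does not show the schema of config/scrape-targets.json, so the assumed shape (a top-level `targets` array whose items carry `name`, `enabled`, and `output_dir`) is a guess, as are the function name `print_run_plan_header` and the banner layout.

```bash
#!/usr/bin/env bash
# Sketch of a host-side run-plan banner. ASSUMED config shape (not confirmed
# by the plan): {"targets":[{"name":"...","enabled":true,"output_dir":"..."}]}
print_run_plan_header() {
  local config="$1" subcommand="$2" selected="${3:-}"
  echo "Scrape run plan"
  echo "  subcommand: $subcommand"
  echo "  config:     $config"
  # List enabled targets; when a target name is given, narrow to that one.
  jq -r --arg sel "$selected" '
    .targets[]
    | select(.enabled == true)
    | select($sel == "" or .name == $sel)
    | "  target:     \(.name) -> \(.output_dir)"
  ' "$config"
}
```

For example, `print_run_plan_header config/scrape-targets.json scrape` would list every enabled target, and passing a third argument would narrow the list to that name.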

Test scenarios:

  • Happy path: documents-scrape dry-run output includes target list
  • Integration: host smoke unchanged (no regression)

Verification: ./scripts/tests/documents-scrape-smoke.sh, ./scripts/tests/run-discord-scrape-host-smoke.sh

U3. Operator proof defaults to all enabled targets

Goal: Remove hardcoded eod_discord; loop all enabled targets when --target omitted.

Requirements: R5

Files: scripts/run-operator-proof.sh, scripts/tests/run-operator-proof-smoke.sh (if present)

Approach: When TARGET is empty, use mapfile to read the enabled target names from config, run handoff once, then run scrape and prove for each target; print a per-target summary at the end.
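
The control flow could be sketched as below. Everything here is illustrative: `list_enabled_targets`, `run_handoff`, `scrape_one`, `prove_one`, and `run_operator_proof` are hypothetical stand-ins for the real scripts, and the jq filter assumes a config shape (`targets` array with `name` and `enabled` fields) that the plan does not confirm. mapfile requires bash 4 or newer.

```bash
#!/usr/bin/env bash
# Sketch of the default-all-targets loop. All function names are hypothetical.
list_enabled_targets() {
  # ASSUMED config shape: {"targets":[{"name":"...","enabled":true}]}
  jq -r '.targets[] | select(.enabled == true) | .name' "$1"
}

# Placeholder steps; the real entrypoint would invoke the actual scripts.
run_handoff() { echo "handoff" >&2; }
scrape_one()  { echo "scrape $1" >&2; }
prove_one()   { echo "prove $1" >&2; }

run_operator_proof() {
  local config="$1" target="${2:-}"
  local -a targets=() results=()
  local name failed=0
  if [[ -n "$target" ]]; then
    targets=("$target")            # explicit --target: run that one only
  else
    mapfile -t targets < <(list_enabled_targets "$config")
  fi
  if (( ${#targets[@]} == 0 )); then
    echo "no enabled targets in $config" >&2
    return 1
  fi
  run_handoff || return 1          # handoff runs once, before the loop
  for name in "${targets[@]}"; do
    if scrape_one "$name" && prove_one "$name"; then
      results+=("$name: OK")
    else
      results+=("$name: FAILED")
      failed=1
    fi
  done
  printf '%s\n' "${results[@]}"    # per-target summary at the end
  return "$failed"
}
```

In this sketch one failing target does not stop the loop; the function reports every target and returns non-zero if any of them failed. Whether the real script should continue or abort on the first failure is a design choice the plan leaves open.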

Test scenarios:

  • Happy path: smoke with fake scripts verifies multi-target loop
  • Edge: single --target still runs one target only

Verification: operator-proof smoke or documents smoke + manual grep.

Scope Boundaries

Deferred to Follow-Up Work

  • Structured JSON run logs for machine parsing
  • Changing prove-incremental-append.sh to require optional --target

Out of scope

  • Discord API or merge algorithm changes
  • New CLI flags beyond existing --target narrowing