Log scrape plan/summary with per-file message deltas in the core script. Host wrappers and operator entrypoints print target lists; operator-proof defaults to all enabled targets when --target is omitted.
| title | type | status | date | origin |
|---|---|---|---|---|
| feat: Scrape logging, run summary, and default-all-targets | feat | complete | 2026-05-29 | /lfg — operator scripts need explicit server/file/message visibility and sane defaults |
feat: Scrape logging, run summary, and default-all-targets
Summary
Make recurring scrape scripts print an upfront run plan (which guilds/servers, which output folders), per-channel file I/O with message deltas, and a final change summary. Operator entrypoints default to all enabled targets from config/scrape-targets.json without requiring repeated --target flags.
Problem Frame
Operators cannot tell from current logs which Discord server was scraped, which archive files were touched, or how many messages were appended vs unchanged. run-operator-proof.sh still hardcodes eod_discord. The core engine (run-discord-scrape.sh) logs channel IDs but not guild names, paths, or before/after counts in one place.
Requirements
| ID | Requirement |
|---|---|
| R1 | Before scrape/preflight, log config path and every selected target with name, resolved guild id/name(s), and output_dir |
| R2 | For each channel processed, log destination file path, action (CREATED, MERGED, UNCHANGED, SKIPPED), and message counts before → after (plus fetched batch size when merged) |
| R3 | After all targets complete, print a consolidated run summary with per-file deltas and totals |
| R4 | run-documents-scrape.sh and host wrapper print the same run-plan header before invoking the container |
| R5 | run-operator-proof.sh defaults to all enabled targets (loop handoff → scrape → prove) when --target is omitted |
| R6 | Offline smokes pass; scrape smoke asserts summary markers exist |
Key Technical Decisions
- KTD1: Ledger in run-discord-scrape.sh: Keep summary state in bash arrays inside the core script rather than a new shared library. Host wrappers only need jq-based target listing; the container owns channel-level detail.
- KTD2: Guild labels from cache + export metadata: Resolve guild names from load_guild_cache at target start; enrich per-channel lines from export JSON when available.
- KTD3: No behavior change to merge semantics: Logging only; append-only merge and skip rules stay unchanged.
Implementation Units
U1. Core scrape ledger and summary
Goal: Operator-visible run plan, per-channel I/O lines, and final summary in run-discord-scrape.sh.
Requirements: R1, R2, R3
Files: scripts/run-discord-scrape.sh, scripts/tests/run-discord-scrape-smoke.sh
Approach: Add SCRAPE_SUMMARY_ENTRIES, guild_name_for_id, describe_target_resolution, log_run_plan, record_channel_result, print_scrape_summary. Call from run_target_mode and scrape_target. Preflight reuses run plan header.
Test scenarios:
- Happy path: smoke run shows "Scrape run plan", MERGED/CREATED/UNCHANGED lines, and "Scrape run summary"
- Edge: skipped channel appears as SKIPPED in summary
- Error path: failure before summary still leaves partial ledger in stderr
Verification: ./scripts/tests/run-discord-scrape-smoke.sh passes with grep for summary markers.
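The ledger approach in U1 could look roughly like the sketch below. It reuses the names SCRAPE_SUMMARY_ENTRIES, record_channel_result, and print_scrape_summary from the plan, but the argument order, the pipe-delimited entry format, and the exact log wording are assumptions made for illustration, not the contents of run-discord-scrape.sh.

```bash
# Sketch only: entry format and log wording are assumptions.
# Each ledger entry is "action|path|before|after" (paths must not contain "|").
SCRAPE_SUMMARY_ENTRIES=()

# Append one channel result to the ledger and log it to stderr.
# Args: action (CREATED|MERGED|UNCHANGED|SKIPPED), destination path,
#       message count before, message count after, fetched batch size.
record_channel_result() {
  local action=$1 path=$2 before=$3 after=$4 fetched=${5:-0}
  SCRAPE_SUMMARY_ENTRIES+=("${action}|${path}|${before}|${after}")
  printf '[scrape] %s %s: %d -> %d messages (fetched %d)\n' \
    "$action" "$path" "$before" "$after" "$fetched" >&2
}

# Print the consolidated summary: one line per file plus a total delta.
print_scrape_summary() {
  local entry action path before after total=0
  echo "Scrape run summary"
  # The ${arr[@]+...} form keeps an empty ledger safe under `set -u`.
  for entry in ${SCRAPE_SUMMARY_ENTRIES[@]+"${SCRAPE_SUMMARY_ENTRIES[@]}"}; do
    IFS='|' read -r action path before after <<<"$entry"
    printf '  %-9s %s (%+d)\n' "$action" "$path" "$(( after - before ))"
    total=$(( total + after - before ))
  done
  echo "  total new messages: ${total}"
}
```

Because per-channel lines go to stderr as they happen, a failure before print_scrape_summary still leaves the partial ledger visible, which matches the error-path scenario listed above.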
U2. Host and documents wrapper banners
Goal: Host-side run plan before container execution.
Requirements: R4
Files: scripts/run-discord-scrape-host.sh, scripts/run-documents-scrape.sh, scripts/operator-handoff.sh
Approach: Shared helper pattern: use jq to list the enabled (or selected) targets with their output_dir, and print the subcommand and config paths. operator-handoff lists enabled targets in its handoff header.
Test scenarios:
- Happy path: documents-scrape dry-run output includes target list
- Integration: host smoke unchanged (no regression)
Verification: ./scripts/tests/documents-scrape-smoke.sh, ./scripts/tests/run-discord-scrape-host-smoke.sh
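A host-side banner along the lines of U2 might be sketched as follows. The function name print_run_plan and the config shape (a top-level targets array whose entries carry name, enabled, and output_dir) are hypothetical; the real layout of config/scrape-targets.json is not shown in this plan, so the jq filter would need adjusting to match it.

```bash
# Sketch only. Hypothetical config shape:
#   {"targets":[{"name":"...","enabled":true,"output_dir":"..."}]}
# Args: config path, optional selected target name (empty means all enabled).
print_run_plan() {
  local config=$1 selected=${2:-}
  echo "Scrape run plan"
  echo "  config: ${config}"
  # Keep enabled targets; narrow to one name when a selection is given.
  jq -r --arg sel "$selected" '
    .targets[]
    | select(.enabled == true)
    | select($sel == "" or .name == $sel)
    | "  target: \(.name) -> \(.output_dir)"
  ' "$config"
}
```

Calling print_run_plan with only the config path lists every enabled target; passing a name as the second argument narrows the list to that target.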
U3. Operator proof defaults to all enabled targets
Goal: Remove hardcoded eod_discord; loop all enabled targets when --target omitted.
Requirements: R5
Files: scripts/run-operator-proof.sh, scripts/tests/run-operator-proof-smoke.sh (if present)
Approach: When TARGET is empty, read the enabled target names from config with mapfile, run handoff once, then run scrape and prove for each target; print a per-target summary at the end.
Test scenarios:
- Happy path: smoke with fake scripts verifies multi-target loop
- Edge: single --target still runs one target only
Verification: operator-proof smoke or documents smoke + manual grep.
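The default-to-all-targets behavior in U3 can be sketched as below. The helper names (list_enabled_targets, resolve_targets, for_each_target) and the config shape (a top-level targets array with name and enabled fields) are assumptions for illustration; only the use of mapfile and the "empty TARGET means all enabled targets" rule come from the plan.

```bash
# Sketch only. Hypothetical config shape:
#   {"targets":[{"name":"...","enabled":true}]}
# Prints one enabled target name per line.
list_enabled_targets() {
  jq -r '.targets[] | select(.enabled == true) | .name' "$1"
}

# Fill the global TARGETS array.
# Args: config path, optional explicit target (the --target value).
resolve_targets() {
  local config=$1 target=${2:-}
  if [[ -n $target ]]; then
    TARGETS=("$target")   # explicit --target: run exactly one target
  else
    # No --target: default to every enabled target (requires bash 4+).
    mapfile -t TARGETS < <(list_enabled_targets "$config")
  fi
}

# Run the given command once per resolved target, appending the target name.
for_each_target() {
  local t
  for t in ${TARGETS[@]+"${TARGETS[@]}"}; do
    "$@" "$t"
  done
}
```

A caller would run resolve_targets once, perform the handoff step once, and then call for_each_target with the scrape and prove steps, so a single --target still runs exactly one iteration.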
Scope Boundaries
Deferred to Follow-Up Work
- Structured JSON run logs for machine parsing
- Changing prove-incremental-append.sh to require optional --target
Out of scope
- Discord API or merge algorithm changes
- New CLI flags beyond existing --target narrowing