DiscordChatExporter/docs/plans/2026-05-29-038-feat-scrape-logging-run-summary-plan.md
Copilot 71a443267e feat(scrape): run plan, channel ledger, and all-target proof
Log scrape plan/summary with per-file message deltas in the core script.
Host wrappers and operator entrypoints print target lists; operator-proof
defaults to all enabled targets when --target is omitted.
2026-05-29 20:34:22 -05:00


---
title: "feat: Scrape logging, run summary, and default-all-targets"
type: feat
status: complete
date: 2026-05-29
origin: /lfg — operator scripts need explicit server/file/message visibility and sane defaults
---
# feat: Scrape logging, run summary, and default-all-targets
## Summary
Make recurring scrape scripts print an upfront run plan (which guilds/servers, which output folders), per-channel file I/O with message deltas, and a final change summary. Operator entrypoints default to all enabled targets from `config/scrape-targets.json` without requiring repeated `--target` flags.
## Problem Frame
Operators cannot tell from current logs which Discord server was scraped, which archive files were touched, or how many messages were appended vs unchanged. `run-operator-proof.sh` still hardcodes `eod_discord`. The core engine (`run-discord-scrape.sh`) logs channel IDs but not guild names, paths, or before/after counts in one place.
## Requirements
| ID | Requirement |
|----|-------------|
| R1 | Before scrape/preflight, log config path and every selected target with `name`, resolved guild id/name(s), and `output_dir` |
| R2 | For each channel processed, log destination file path, action (`CREATED`, `MERGED`, `UNCHANGED`, `SKIPPED`), and message counts before → after (plus fetched batch size when merged) |
| R3 | After all targets complete, print a consolidated run summary with per-file deltas and totals |
| R4 | `run-documents-scrape.sh` and host wrapper print the same run-plan header before invoking the container |
| R5 | `run-operator-proof.sh` defaults to all enabled targets (loop handoff → scrape → prove) when `--target` is omitted |
| R6 | Offline smokes pass; scrape smoke asserts summary markers exist |
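
The action labels in R2 can be derived from the before/after counts. The following is a minimal sketch, not the script's actual code: `classify_channel_action` and `format_channel_line` are hypothetical helper names, the line layout is illustrative, and the rules (CREATED when the file did not previously exist, MERGED only when the count grows, SKIPPED decided by the caller) are assumptions about how the labels map to counts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names, rules, and line format are assumptions.

# Args: existed_before (0|1), count_before, count_after, skipped (0|1, optional)
classify_channel_action() {
  local existed="$1" before="$2" after="$3" skipped="${4:-0}"
  if [ "$skipped" = "1" ]; then
    echo "SKIPPED"
  elif [ "$existed" = "0" ]; then
    echo "CREATED"
  elif [ "$after" -gt "$before" ]; then
    echo "MERGED"
  else
    echo "UNCHANGED"
  fi
}

# Args: path, action, count_before, count_after, fetched batch size (optional)
# The fetched size is shown only for MERGED, as R2 requests.
format_channel_line() {
  local path="$1" action="$2" before="$3" after="$4" fetched="${5:-}"
  if [ "$action" = "MERGED" ] && [ -n "$fetched" ]; then
    printf '[%s] %s messages: %s -> %s (fetched %s)\n' \
      "$action" "$path" "$before" "$after" "$fetched"
  else
    printf '[%s] %s messages: %s -> %s\n' "$action" "$path" "$before" "$after"
  fi
}
```

Keeping classification separate from formatting lets the same label feed both the per-channel line and the final summary.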
## Key Technical Decisions
- **KTD1: Ledger in `run-discord-scrape.sh`:** Keep summary state in bash arrays inside the core script rather than a new shared library — host wrappers only need jq-based target listing; the container owns channel-level detail.
- **KTD2: Guild labels from cache + export metadata:** Resolve guild names from `load_guild_cache` at target start; enrich per-channel lines from export JSON when available.
- **KTD3: No behavior change to merge semantics:** Logging only; append-only merge and skip rules stay unchanged.
## Implementation Units
### U1. Core scrape ledger and summary
**Goal:** Operator-visible run plan, per-channel I/O lines, and final summary in `run-discord-scrape.sh`.
**Requirements:** R1, R2, R3
**Files:** `scripts/run-discord-scrape.sh`, `scripts/tests/run-discord-scrape-smoke.sh`
**Approach:** Add `SCRAPE_SUMMARY_ENTRIES`, `guild_name_for_id`, `describe_target_resolution`, `log_run_plan`, `record_channel_result`, `print_scrape_summary`. Call from `run_target_mode` and `scrape_target`. Preflight reuses run plan header.
**Test scenarios:**
- Happy path: smoke run shows `Scrape run plan`, `MERGED`/`CREATED`/`UNCHANGED` lines, and `Scrape run summary`
- Edge: skipped channel appears as `SKIPPED` in summary
- Error path: failure before summary still leaves partial ledger in stderr
**Verification:** `./scripts/tests/run-discord-scrape-smoke.sh` passes with grep for summary markers.
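
A bash-array ledger of the kind this unit describes might look like the sketch below. The function and array names come from the approach above, but the bodies, the `action|before|after|path` entry format, and the output wording other than the `Scrape run summary` marker are assumptions. Per-channel lines go to stderr so that a failure before the summary still leaves a partial ledger there.

```shell
#!/usr/bin/env bash
# Sketch only: entry format and function bodies are assumptions.
SCRAPE_SUMMARY_ENTRIES=()

# Append one ledger entry and echo the per-channel line to stderr.
# The path is stored last so a '|' inside it cannot shift the other fields.
record_channel_result() {
  local action="$1" path="$2" before="$3" after="$4"
  SCRAPE_SUMMARY_ENTRIES+=("${action}|${before}|${after}|${path}")
  echo "[${action}] ${path} messages: ${before} -> ${after}" >&2
}

# Print per-file deltas and totals from the ledger.
print_scrape_summary() {
  local entry action before after path
  local total_added=0
  echo "Scrape run summary"
  for entry in "${SCRAPE_SUMMARY_ENTRIES[@]}"; do
    IFS='|' read -r action before after path <<<"$entry"
    echo "  ${action} ${path} (+$(( after - before )))"
    total_added=$(( total_added + after - before ))
  done
  echo "  files: ${#SCRAPE_SUMMARY_ENTRIES[@]}, messages added: ${total_added}"
}
```

A smoke test can then grep the captured output for the `Scrape run summary` marker and the totals line.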
### U2. Host and documents wrapper banners
**Goal:** Host-side run plan before container execution.
**Requirements:** R4
**Files:** `scripts/run-discord-scrape-host.sh`, `scripts/run-documents-scrape.sh`, `scripts/operator-handoff.sh`
**Approach:** Shared helper pattern: use jq to list the enabled/selected targets with their `output_dir`; print the subcommand and config paths. `operator-handoff` lists enabled targets in the handoff header.
**Test scenarios:**
- Happy path: documents-scrape dry-run output includes target list
- Integration: host smoke unchanged (no regression)
**Verification:** `./scripts/tests/documents-scrape-smoke.sh`, `./scripts/tests/run-discord-scrape-host-smoke.sh`
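
The jq-based target listing could be sketched as follows. The plan does not show the schema of `config/scrape-targets.json`, so the shape used here (a top-level `targets` array with `name`, `enabled`, and `output_dir` fields) is an assumption, and `print_target_banner` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumed config shape (not confirmed by the plan):
#   { "targets": [ { "name": "...", "enabled": true, "output_dir": "..." } ] }
# print_target_banner is a hypothetical helper name.

# Args: config path, optional single target name to narrow the list
print_target_banner() {
  local config="$1" selected="${2:-}"
  echo "Scrape run plan"
  echo "  config: ${config}"
  jq -r --arg selected "$selected" '
    .targets[]
    | select(.enabled == true)
    | select($selected == "" or .name == $selected)
    | "  target: \(.name) -> \(.output_dir)"
  ' "$config"
}
```

Passing the selection through `--arg` keeps the target name out of the jq program text, so a name with quotes or spaces cannot alter the filter.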
### U3. Operator proof defaults to all enabled targets
**Goal:** Remove hardcoded `eod_discord`; loop all enabled targets when `--target` omitted.
**Requirements:** R5
**Files:** `scripts/run-operator-proof.sh`, `scripts/tests/run-operator-proof-smoke.sh` (if present)
**Approach:** When `TARGET` is empty, `mapfile` the enabled names from config, run handoff once, then scrape+prove per target; print a per-target summary at the end.
**Test scenarios:**
- Happy path: smoke with fake scripts verifies multi-target loop
- Edge: single `--target` still runs one target only
**Verification:** operator-proof smoke or documents smoke + manual grep.
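
The default-to-all behaviour might be sketched like this. All function names (`list_enabled_targets`, `resolve_targets`, `run_handoff`, `scrape_and_prove`, `run_all_targets`) are hypothetical, the two stubs stand in for the real handoff and scrape+prove steps, and the config shape (a top-level `targets` array with `name` and `enabled`) is an assumption. `mapfile` requires bash 4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout; config shape is an assumption:
#   { "targets": [ { "name": "...", "enabled": true } ] }
list_enabled_targets() {
  jq -r '.targets[] | select(.enabled == true) | .name' "$1"
}

# Stubs standing in for the real handoff and scrape+prove steps.
run_handoff() { echo "handoff"; }
scrape_and_prove() { echo "scrape+prove: $1"; }

# Fill RESOLVED_TARGETS with the single --target value, or with every
# enabled target when none was given.
resolve_targets() {
  local config="$1" target="${2:-}"
  if [[ -n "$target" ]]; then
    RESOLVED_TARGETS=("$target")
  else
    mapfile -t RESOLVED_TARGETS < <(list_enabled_targets "$config")
  fi
}

# Run handoff once, then scrape+prove per target, then a per-target summary.
run_all_targets() {
  local config="$1" target="${2:-}" name failed=0
  local results=()
  resolve_targets "$config" "$target"
  run_handoff
  for name in "${RESOLVED_TARGETS[@]}"; do
    if scrape_and_prove "$name"; then
      results+=("OK ${name}")
    else
      results+=("FAILED ${name}")
      failed=1
    fi
  done
  printf '%s\n' "${results[@]}"
  return "$failed"
}
```

Collecting results and continuing after a failure is one possible design; stopping at the first failed target is equally plausible, and the plan does not say which the script does.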
## Scope Boundaries
### Deferred to Follow-Up Work
- Structured JSON run logs for machine parsing
- Changing `prove-incremental-append.sh` to make `--target` optional
### Out of scope
- Discord API or merge algorithm changes
- New CLI flags beyond existing `--target` narrowing