feat(scrape): per-target JSON summaries in multi-target loops

Operator validation (--per-target) and multi-target operator proof now pass
--summary-file to each scrape, so every target gets its own
operator-*-<target>-<UTC>.summary.json.
Copilot 2026-06-03 11:08:44 -05:00
parent 8c36fdbdda
commit c8ed19d26b
7 changed files with 141 additions and 13 deletions

@ -0,0 +1,63 @@
---
title: "feat: Per-target JSON summaries in multi-target loops"
type: feat
status: complete
date: 2026-06-04
origin: /lfg — plan 075 deferred per-target separate summary files in validation/proof loops
---
# feat: Per-target JSON summaries in multi-target loops
## Summary
When operator validation runs `--per-target` (all enabled targets) or operator proof scrapes multiple targets, pass `--summary-file` per target so each scrape writes `logs/<prefix>-<target>-<UTC>.summary.json` instead of overwriting a single combined path.
## Problem Frame
Plans 070-075 auto-export JSON summaries for single-target and documents-scrape runs. Multi-target loops still set one global `DCE_RUN_SUMMARY_FILE` tied to the teed log basename, so only the last target's summary survives on disk, and recovery from the combined log cannot tell the targets apart.
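
The overwrite can be reproduced in isolation. The sketch below is illustrative only: `fake_scrape` and the target names `alpha` and `beta` are hypothetical stand-ins rather than anything in the repository, and a real scrape writes a much richer summary.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for one scrape: writes a one-line JSON summary.
fake_scrape() {
  local summary_file=$1 target=$2
  printf '{"target":"%s"}\n' "$target" > "$summary_file"
}

demo_dir=$(mktemp -d)
targets=(alpha beta)   # illustrative target names

# Shared path (behaviour before this plan): every scrape overwrites one file.
shared="$demo_dir/operator-validation.summary.json"
for t in "${targets[@]}"; do
  fake_scrape "$shared" "$t"
done
shared_result=$(cat "$shared")

# Per-target paths (this plan): one file per scrape, nothing is lost.
for t in "${targets[@]}"; do
  fake_scrape "$demo_dir/operator-validation-$t.summary.json" "$t"
done
per_target_files=("$demo_dir"/operator-validation-*.summary.json)

printf 'shared file holds: %s\n' "$shared_result"          # {"target":"beta"}
printf 'per-target files:  %d\n' "${#per_target_files[@]}"  # 2
```

With the shared path only the last target's summary remains; with per-target paths both files survive the loop.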
## Requirements
| ID | Requirement |
|----|-------------|
| R1 | `scripts/lib/scrape-summary-json.sh` exposes `per_target_summary_file LOG_DIR PREFIX TARGET` with sanitized target slug |
| R2 | `run-operator-validation.sh --per-target` (no `--target`) skips global `DCE_RUN_SUMMARY_FILE`; each live scrape passes `--summary-file` |
| R3 | Validation logs `Per-target JSON summary: <path>` before each live scrape |
| R4 | `run-operator-proof.sh` with 2+ targets uses per-target `--summary-file`; single-target keeps log-basename summary |
| R5 | Proof logs per-target summary path in the target loop when exporting JSON |
| R6 | End-of-run log recovery is skipped in per-target mode (each scrape writes its summary file directly) |
| R7 | `scrape-summary-json-smoke.sh` asserts helper output shape |
| R8 | `run-operator-validation-smoke.sh` multi-target dry-run still passes; an optional fake-docker live per-target run asserts two distinct `--summary-file` paths in subprocess output |
| R9 | `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` → 23/23 |
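
To make R1 and R7 concrete, the sketch below restates the helper added to `scripts/lib/scrape-summary-json.sh` in this commit and prints the paths it produces. The `logs` directory argument and the standalone script framing are assumptions for illustration; the timestamp portion varies with the time of the call.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Restated from the helper added in this commit.
sanitize_target_slug() {
  local raw=$1
  # Replace every character outside [A-Za-z0-9._-] with an underscore.
  printf '%s' "$raw" | sed 's/[^A-Za-z0-9._-]/_/g'
}

per_target_summary_file() {
  local log_dir=$1 prefix=$2 target=$3 slug
  # Refuse empty arguments so callers never get a malformed path.
  [[ -n "$log_dir" && -n "$prefix" && -n "$target" ]] || return 1
  slug=$(sanitize_target_slug "$target")
  printf '%s/%s-%s-%s.summary.json' "$log_dir" "$prefix" "$slug" "$(date -u +%Y%m%dT%H%M%SZ)"
}

plain=$(per_target_summary_file logs operator-validation 'KotOR_discord_msgs')
odd=$(per_target_summary_file logs operator-proof 'weird name!')
printf '%s\n%s\n' "$plain" "$odd"

# An empty target is rejected with a non-zero status and no output.
if per_target_summary_file logs operator-proof '' >/dev/null; then
  printf 'unexpected success\n' >&2
  exit 1
fi
```

The second path contains the slug `weird_name_`, because the space and the exclamation mark are each replaced by an underscore.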
## Implementation Units
### U1. Shared helper
**Files:** `scripts/lib/scrape-summary-json.sh`, `scripts/tests/scrape-summary-json-smoke.sh`
### U2. Operator validation
**Files:** `scripts/run-operator-validation.sh`
### U3. Operator proof
**Files:** `scripts/run-operator-proof.sh`
### U4. Docs
**Files:** `docs/recurring-scrape-merge-readiness.md`, `docs/recurring-scrape-operator-checklist.md`
## Verification
```bash
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```
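
After a live multi-target run, the per-target files can also be checked by hand. The function below is a hypothetical checker, not part of the repository; it assumes the target names need no sanitization and that no target name is a hyphen-delimited prefix of another.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical checker, not part of the repository.
# Usage: check_per_target_summaries LOG_DIR PREFIX TARGET...
# Returns non-zero if any target lacks a non-empty summary file.
check_per_target_summaries() {
  local log_dir=$1 prefix=$2
  shift 2
  local target file found missing=0
  for target in "$@"; do
    found=0
    # An unmatched glob stays literal, fails the -s test, and counts as missing.
    for file in "$log_dir/$prefix-$target-"*.summary.json; do
      if [[ -s "$file" ]]; then
        found=1
        break
      fi
    done
    if (( found == 0 )); then
      printf 'missing summary for target: %s\n' "$target" >&2
      missing=$((missing + 1))
    fi
  done
  (( missing == 0 ))
}

# Example against a throwaway directory with one illustrative summary file.
demo_dir=$(mktemp -d)
printf '{}\n' > "$demo_dir/operator-validation-alpha-20260604T000000Z.summary.json"
if check_per_target_summaries "$demo_dir" operator-validation alpha; then
  printf 'all summaries present\n'
fi
```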
## Scope Boundaries
### Deferred
- Live KotOR catch-up on host
- Tee full documents-scrape stdout to persistent log
- Refresh PR #1538 body with plans 070-076 stamps

@ -180,6 +180,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
**Plan 075 (2026-06-04):** `run-documents-scrape.sh` auto-writes `logs/documents-scrape-<UTC>.summary.json` on live scrapes.
**Plan 076 (2026-06-04):** Multi-target validation (`--per-target`) and proof loops write separate `logs/operator-*-<target>-<UTC>.summary.json` per scrape.
**Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.
## CI note (fork PRs)

@ -56,7 +56,8 @@ Salvage then incremental scrape:
./scripts/run-documents-scrape.sh --salvage-before-scrape --target NAME [--channel ID]
./scripts/run-operator-validation.sh --salvage-before-scrape --target NAME [--channel ID] --log-file logs/scrape.log
./scripts/run-operator-proof.sh --salvage-before-scrape --sync-gui --target NAME
# When scraping, also writes logs/operator-proof-<UTC>.summary.json beside the proof log
# When scraping one target, also writes logs/operator-proof-<UTC>.summary.json beside the proof log
# All enabled targets: each gets logs/operator-proof-<target>-<UTC>.summary.json
```
**KotOR yes_general** (`221726893064454144`): first catch-up after a 2021 archive cursor can take hours and may OOM; salvage preserved partials before retrying. Stop duplicate validation processes (MyBook vs Downloads checkouts share the same lock). `KotOR_discord_msgs` sets `container_memory: "8g"` in `scrape-targets.json` for single-target runs; override globally with `DCE_CONTAINER_MEMORY` in `scrape.env` if needed. Channel-scoped proof:

@ -33,3 +33,19 @@ recover_json_summary_if_missing() {
[[ -s "$dest_file" ]] && return 1
extract_json_summary_from_log "$run_log" "$dest_file"
}
sanitize_target_slug() {
local raw=$1
printf '%s' "$raw" | sed 's/[^A-Za-z0-9._-]/_/g'
}
per_target_summary_file() {
local log_dir=$1
local prefix=$2
local target=$3
local slug
[[ -n "$log_dir" && -n "$prefix" && -n "$target" ]] || return 1
slug=$(sanitize_target_slug "$target")
printf '%s/%s-%s-%s.summary.json' "$log_dir" "$prefix" "$slug" "$(date -u +%Y%m%dT%H%M%SZ)"
}

@ -9,9 +9,11 @@ HANDOFF="$REPO_ROOT/scripts/operator-handoff.sh"
DOCUMENTS="$REPO_ROOT/scripts/run-documents-scrape.sh"
PROVE="$REPO_ROOT/scripts/prove-incremental-append.sh"
SYNC_GUI="$REPO_ROOT/scripts/sync-token-from-gui.sh"
LOG_DIR="$REPO_ROOT/logs"
LOG_DIR="${DCE_LOG_DIR:-$REPO_ROOT/logs}"
# shellcheck source=lib/scrape-run-plan.sh
source "$SCRIPT_DIR/lib/scrape-run-plan.sh"
# shellcheck source=lib/scrape-summary-json.sh
source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
TARGET=""
SYNC_GUI_FLAG=0
@ -36,8 +38,9 @@ When --target is omitted, all enabled targets in the config are processed.
--salvage-before-scrape Merge stale .dce-temp exports before incremental scrape
--log-file PATH Append output to this file (default: logs/operator-proof-UTC.log)
Logs append to logs/operator-proof-<timestamp>.log (or --log-file). When scraping, also writes
<log-basename>.summary.json unless DCE_RUN_SUMMARY_FILE is already set.
Logs append to logs/operator-proof-<timestamp>.log (or --log-file). When scraping one target, also writes
<log-basename>.summary.json unless DCE_RUN_SUMMARY_FILE is already set. Multiple targets each get
logs/operator-proof-<target>-<UTC>.summary.json.
EOF
}
@ -121,11 +124,17 @@ main() {
fi
local export_json_summary=0
local per_target_summaries=0
if ((${#targets[@]} > 1)); then
per_target_summaries=1
fi
if (( DRY_RUN == 0 && SALVAGE_ONLY == 0 )); then
export_json_summary=1
export DCE_RUN_SUMMARY_JSON=1
if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
export DCE_RUN_SUMMARY_FILE="${log_file%.log}.summary.json"
if (( per_target_summaries == 0 )); then
if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
export DCE_RUN_SUMMARY_FILE="${log_file%.log}.summary.json"
fi
fi
fi
@ -140,7 +149,11 @@ main() {
printf 'config: %s\n' "$CONFIG_PATH"
print_scrape_config_plan "$CONFIG_PATH" "Operator proof" "${targets[@]}"
if (( export_json_summary )); then
printf 'JSON summary file: %s\n' "${DCE_RUN_SUMMARY_FILE:-}"
if (( per_target_summaries )); then
printf 'JSON summaries: per-target under %s\n' "$(dirname "$log_file")"
else
printf 'JSON summary file: %s\n' "${DCE_RUN_SUMMARY_FILE:-}"
fi
fi
printf 'started: %s\n\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
@ -168,6 +181,12 @@ main() {
printf '\n--- Target: %s ---\n' "$name"
local -a scrape_args=(--config "$CONFIG_PATH" --target "$name")
scrape_args+=("${CHANNEL_ARGS[@]}")
if (( export_json_summary && per_target_summaries )); then
local summary_file
summary_file=$(per_target_summary_file "$(dirname "$log_file")" operator-proof "$name")
printf 'JSON summary file: %s\n' "$summary_file"
scrape_args+=(--summary-file "$summary_file")
fi
if (( SALVAGE_BEFORE )); then
if ! "$DOCUMENTS" "${scrape_args[@]}" --salvage-only; then
failed=$((failed + 1))
@ -189,7 +208,7 @@ main() {
(( failed == 0 )) || exit 1
} 2>&1 | tee "$log_file"
if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
if (( export_json_summary )) && (( per_target_summaries == 0 )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
# shellcheck source=lib/scrape-summary-json.sh
source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
if recover_json_summary_if_missing "$log_file" "$DCE_RUN_SUMMARY_FILE"; then

@ -13,6 +13,8 @@ AUDIT_JSON="$REPO_ROOT/scripts/audit-archive-json.sh"
LOCK_STATUS="$REPO_ROOT/scripts/scrape-lock-status.sh"
# shellcheck source=lib/scrape-lock.sh
source "$SCRIPT_DIR/lib/scrape-lock.sh"
# shellcheck source=lib/scrape-summary-json.sh
source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
DRY_RUN=0
SKIP_SCRAPE=0
@ -167,7 +169,10 @@ scrape_per_target() {
continue
fi
fi
if ! run_step "run-documents-scrape ($name)" "$DOCUMENTS_SCRAPE" "${per_args[@]}"; then
local summary_file
summary_file=$(per_target_summary_file "$LOG_DIR" operator-validation "$name")
log_step "Per-target JSON summary: $summary_file"
if ! run_step "run-documents-scrape ($name)" "$DOCUMENTS_SCRAPE" "${per_args[@]}" --summary-file "$summary_file"; then
log_step "Per-target failed: $name (scrape)"
failures=$((failures + 1))
if (( CONTINUE_ON_ERROR == 0 )); then
@ -269,11 +274,17 @@ main() {
fi
local export_json_summary=0
local per_target_summaries=0
if (( PER_TARGET )) && [[ -z "$TARGET" ]]; then
per_target_summaries=1
fi
if (( DRY_RUN == 0 && SKIP_SCRAPE == 0 && SALVAGE_ONLY == 0 )); then
export_json_summary=1
export DCE_RUN_SUMMARY_JSON=1
if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
export DCE_RUN_SUMMARY_FILE="${LOG_FILE%.log}.summary.json"
if (( per_target_summaries == 0 )); then
if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
export DCE_RUN_SUMMARY_FILE="${LOG_FILE%.log}.summary.json"
fi
fi
fi
@ -291,7 +302,11 @@ main() {
log_step "Enabled targets: $(enabled_targets | paste -sd, -)"
fi
if (( export_json_summary )); then
log_step "JSON summary file: ${DCE_RUN_SUMMARY_FILE:-}"
if (( per_target_summaries )); then
log_step "JSON summaries: per-target under $LOG_DIR"
else
log_step "JSON summary file: ${DCE_RUN_SUMMARY_FILE:-}"
fi
fi
if (( SYNC_GUI_FLAG )); then
run_step "sync-token-from-gui" "$SYNC_GUI" --force || failures=$((failures + 1))
@ -326,7 +341,7 @@ main() {
} 2>&1 | tee -a "$LOG_FILE"
local pipeline_status=${PIPESTATUS[0]}
if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
if (( export_json_summary )) && (( per_target_summaries == 0 )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
# shellcheck source=lib/scrape-summary-json.sh
source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
if recover_json_summary_if_missing "$LOG_FILE" "$DCE_RUN_SUMMARY_FILE"; then

@ -77,4 +77,16 @@ if extract_json_summary_from_log "$TMP_DIR/bad.log" "$OUT_FILE" 2>/dev/null; the
exit 1
fi
path=$(per_target_summary_file "$TMP_DIR" operator-validation 'KotOR_discord_msgs')
[[ "$path" == "$TMP_DIR/operator-validation-KotOR_discord_msgs-"*.summary.json ]] || {
printf 'ERROR: unexpected per_target_summary_file path: %s\n' "$path" >&2
exit 1
}
slug_path=$(per_target_summary_file "$TMP_DIR" operator-proof 'weird name!')
[[ "$slug_path" == "$TMP_DIR/operator-proof-weird_name_-"*.summary.json ]] || {
printf 'ERROR: expected sanitized slug in path: %s\n' "$slug_path" >&2
exit 1
}
printf 'scrape-summary-json-smoke: ok\n'