mirror of
https://github.com/Tyrrrz/DiscordChatExporter.git
synced 2026-06-09 15:52:37 -06:00
feat(scrape): per-target JSON summaries in multi-target loops
Validation --per-target and multi-target proof now pass --summary-file on each scrape, so every target gets its own operator-*-<target>-<UTC>.summary.json file.
parent 8c36fdbdda
commit c8ed19d26b
@@ -0,0 +1,63 @@
---
title: "feat: Per-target JSON summaries in multi-target loops"
type: feat
status: complete
date: 2026-06-04
origin: /lfg — plan 075 deferred per-target separate summary files in validation/proof loops
---

# feat: Per-target JSON summaries in multi-target loops

## Summary

When operator validation runs `--per-target` (all enabled targets) or operator proof scrapes multiple targets, pass `--summary-file` per target so each scrape writes `logs/<prefix>-<target>-<UTC>.summary.json` instead of overwriting a single combined path.

## Problem Frame

Plans 070–075 auto-export JSON summaries for single-target and documents-scrape runs. Multi-target loops still set one global `DCE_RUN_SUMMARY_FILE` tied to the teed log basename, so only the last target's scrape survives on disk, and recovery from the combined log cannot disambiguate targets.
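
The overwrite described above is easy to reproduce in miniature. In the sketch below, `write_summary` is a hypothetical stand-in for a scrape that writes its JSON summary, and `alpha` and `beta` are made-up target names; it only illustrates the failure mode and is not the project's code.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a scrape that writes a JSON summary to a path.
write_summary() {
  local target=$1 dest=$2
  printf '{"target":"%s"}\n' "$target" > "$dest"
}

demo_dir=$(mktemp -d)

# One shared path: the second write replaces the first.
shared="$demo_dir/operator-proof.summary.json"
for target in alpha beta; do
  write_summary "$target" "$shared"
done
shared_content=$(cat "$shared")

# One path per target: both summaries survive.
for target in alpha beta; do
  write_summary "$target" "$demo_dir/operator-proof-$target.summary.json"
done
per_target_count=$(find "$demo_dir" -name 'operator-proof-*.summary.json' | wc -l | tr -d ' ')

printf 'shared file holds: %s\n' "$shared_content"
printf 'per-target files: %s\n' "$per_target_count"

rm -rf "$demo_dir"
```

With the shared path, only the summary from the last loop iteration remains; with one path per target, each scrape leaves its own file.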

## Requirements

| ID | Requirement |
|----|-------------|
| R1 | `scripts/lib/scrape-summary-json.sh` exposes `per_target_summary_file LOG_DIR PREFIX TARGET` with sanitized target slug |
| R2 | `run-operator-validation.sh --per-target` (no `--target`) skips global `DCE_RUN_SUMMARY_FILE`; each live scrape passes `--summary-file` |
| R3 | Validation logs `Per-target JSON summary: <path>` before each live scrape |
| R4 | `run-operator-proof.sh` with 2+ targets uses per-target `--summary-file`; single-target keeps log-basename summary |
| R5 | Proof logs per-target summary path in the target loop when exporting JSON |
| R6 | End-of-run log recovery is skipped in per-target mode (the scrape writes the files directly) |
| R7 | `scrape-summary-json-smoke.sh` asserts helper output shape |
| R8 | `run-operator-validation-smoke.sh` multi-target dry-run still passes; optional fake-docker live per-target asserts two distinct `--summary-file` paths in subprocess output |
| R9 | `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` → 23/23 |
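
The slug rule behind R1 can be tried in isolation. The sketch below reuses the `sed` expression that this commit adds to `scripts/lib/scrape-summary-json.sh` and applies it to a few sample names. `slug_of` is a hypothetical wrapper name, and `team/archive` and `team_archive` are made-up targets chosen only to show the behavior.

```bash
#!/usr/bin/env bash
set -euo pipefail

# slug_of is a hypothetical name; the expression mirrors sanitize_target_slug.
slug_of() {
  printf '%s' "$1" | sed 's/[^A-Za-z0-9._-]/_/g'
}

plain=$(slug_of 'KotOR_discord_msgs')   # already safe, left unchanged
spaced=$(slug_of 'weird name!')         # space and '!' are replaced
slashed=$(slug_of 'team/archive')       # '/' is replaced
underscored=$(slug_of 'team_archive')   # already safe, left unchanged

printf '%s\n%s\n%s\n%s\n' "$plain" "$spaced" "$slashed" "$underscored"
```

Every character outside `A-Za-z0-9._-` becomes `_`, so a target name cannot introduce a path separator into the summary path. The mapping is not one-to-one: `team/archive` and `team_archive` produce the same slug, so if two such targets were scraped within the same second their summary paths would coincide.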

## Implementation Units

### U1. Shared helper

**Files:** `scripts/lib/scrape-summary-json.sh`, `scripts/tests/scrape-summary-json-smoke.sh`

### U2. Operator validation

**Files:** `scripts/run-operator-validation.sh`

### U3. Operator proof

**Files:** `scripts/run-operator-proof.sh`

### U4. Docs

**Files:** `docs/recurring-scrape-merge-readiness.md`, `docs/recurring-scrape-operator-checklist.md`

## Verification

```bash
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```

## Scope Boundaries

### Deferred

- Live KotOR catch-up on host
- Tee full documents-scrape stdout to persistent log
- Refresh PR #1538 body with plans 070–076 stamps

@@ -180,6 +180,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \

 **Plan 075 (2026-06-04):** `run-documents-scrape.sh` auto-writes `logs/documents-scrape-<UTC>.summary.json` on live scrapes.

+**Plan 076 (2026-06-04):** Multi-target validation (`--per-target`) and proof loops write separate `logs/operator-*-<target>-<UTC>.summary.json` per scrape.
+
 **Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.

 ## CI note (fork PRs)

@@ -56,7 +56,8 @@ Salvage then incremental scrape:
 ./scripts/run-documents-scrape.sh --salvage-before-scrape --target NAME [--channel ID]
 ./scripts/run-operator-validation.sh --salvage-before-scrape --target NAME [--channel ID] --log-file logs/scrape.log
 ./scripts/run-operator-proof.sh --salvage-before-scrape --sync-gui --target NAME
-# When scraping, also writes logs/operator-proof-<UTC>.summary.json beside the proof log
+# When scraping one target, also writes logs/operator-proof-<UTC>.summary.json beside the proof log
+# All enabled targets: each gets logs/operator-proof-<target>-<UTC>.summary.json
 ```

 **KotOR yes_general** (`221726893064454144`): first catch-up after a 2021 archive cursor can take hours and may OOM; salvage preserved partials before retrying. Stop duplicate validation processes (MyBook vs Downloads checkouts share the same lock). `KotOR_discord_msgs` sets `container_memory: "8g"` in `scrape-targets.json` for single-target runs; override globally with `DCE_CONTAINER_MEMORY` in `scrape.env` if needed. Channel-scoped proof:

@@ -33,3 +33,19 @@ recover_json_summary_if_missing() {
   [[ -s "$dest_file" ]] && return 1
   extract_json_summary_from_log "$run_log" "$dest_file"
 }
+
+sanitize_target_slug() {
+  local raw=$1
+  printf '%s' "$raw" | sed 's/[^A-Za-z0-9._-]/_/g'
+}
+
+per_target_summary_file() {
+  local log_dir=$1
+  local prefix=$2
+  local target=$3
+  local slug
+
+  [[ -n "$log_dir" && -n "$prefix" && -n "$target" ]] || return 1
+  slug=$(sanitize_target_slug "$target")
+  printf '%s/%s-%s-%s.summary.json' "$log_dir" "$prefix" "$slug" "$(date -u +%Y%m%dT%H%M%SZ)"
+}

@@ -9,9 +9,11 @@ HANDOFF="$REPO_ROOT/scripts/operator-handoff.sh"
 DOCUMENTS="$REPO_ROOT/scripts/run-documents-scrape.sh"
 PROVE="$REPO_ROOT/scripts/prove-incremental-append.sh"
 SYNC_GUI="$REPO_ROOT/scripts/sync-token-from-gui.sh"
-LOG_DIR="$REPO_ROOT/logs"
+LOG_DIR="${DCE_LOG_DIR:-$REPO_ROOT/logs}"
 # shellcheck source=lib/scrape-run-plan.sh
 source "$SCRIPT_DIR/lib/scrape-run-plan.sh"
+# shellcheck source=lib/scrape-summary-json.sh
+source "$SCRIPT_DIR/lib/scrape-summary-json.sh"

 TARGET=""
 SYNC_GUI_FLAG=0
@@ -36,8 +38,9 @@ When --target is omitted, all enabled targets in the config are processed.
 --salvage-before-scrape Merge stale .dce-temp exports before incremental scrape
 --log-file PATH Append output to this file (default: logs/operator-proof-UTC.log)

-Logs append to logs/operator-proof-<timestamp>.log (or --log-file). When scraping, also writes
-<log-basename>.summary.json unless DCE_RUN_SUMMARY_FILE is already set.
+Logs append to logs/operator-proof-<timestamp>.log (or --log-file). When scraping one target, also writes
+<log-basename>.summary.json unless DCE_RUN_SUMMARY_FILE is already set. Multiple targets each get
+logs/operator-proof-<target>-<UTC>.summary.json.
 EOF
 }

@@ -121,11 +124,17 @@ main() {
   fi

   local export_json_summary=0
+  local per_target_summaries=0
+  if ((${#targets[@]} > 1)); then
+    per_target_summaries=1
+  fi
   if (( DRY_RUN == 0 && SALVAGE_ONLY == 0 )); then
     export_json_summary=1
     export DCE_RUN_SUMMARY_JSON=1
-    if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
-      export DCE_RUN_SUMMARY_FILE="${log_file%.log}.summary.json"
+    if (( per_target_summaries == 0 )); then
+      if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+        export DCE_RUN_SUMMARY_FILE="${log_file%.log}.summary.json"
+      fi
     fi
   fi

@@ -140,7 +149,11 @@ main() {
   printf 'config: %s\n' "$CONFIG_PATH"
   print_scrape_config_plan "$CONFIG_PATH" "Operator proof" "${targets[@]}"
   if (( export_json_summary )); then
-    printf 'JSON summary file: %s\n' "${DCE_RUN_SUMMARY_FILE:-}"
+    if (( per_target_summaries )); then
+      printf 'JSON summaries: per-target under %s\n' "$(dirname "$log_file")"
+    else
+      printf 'JSON summary file: %s\n' "${DCE_RUN_SUMMARY_FILE:-}"
+    fi
   fi
   printf 'started: %s\n\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

@@ -168,6 +181,12 @@ main() {
     printf '\n--- Target: %s ---\n' "$name"
     local -a scrape_args=(--config "$CONFIG_PATH" --target "$name")
     scrape_args+=("${CHANNEL_ARGS[@]}")
+    if (( export_json_summary && per_target_summaries )); then
+      local summary_file
+      summary_file=$(per_target_summary_file "$(dirname "$log_file")" operator-proof "$name")
+      printf 'JSON summary file: %s\n' "$summary_file"
+      scrape_args+=(--summary-file "$summary_file")
+    fi
     if (( SALVAGE_BEFORE )); then
       if ! "$DOCUMENTS" "${scrape_args[@]}" --salvage-only; then
         failed=$((failed + 1))
@@ -189,7 +208,7 @@ main() {
     (( failed == 0 )) || exit 1
   } 2>&1 | tee "$log_file"

-  if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+  if (( export_json_summary )) && (( per_target_summaries == 0 )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
     # shellcheck source=lib/scrape-summary-json.sh
     source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
     if recover_json_summary_if_missing "$log_file" "$DCE_RUN_SUMMARY_FILE"; then

@@ -13,6 +13,8 @@ AUDIT_JSON="$REPO_ROOT/scripts/audit-archive-json.sh"
 LOCK_STATUS="$REPO_ROOT/scripts/scrape-lock-status.sh"
 # shellcheck source=lib/scrape-lock.sh
 source "$SCRIPT_DIR/lib/scrape-lock.sh"
+# shellcheck source=lib/scrape-summary-json.sh
+source "$SCRIPT_DIR/lib/scrape-summary-json.sh"

 DRY_RUN=0
 SKIP_SCRAPE=0
@@ -167,7 +169,10 @@ scrape_per_target() {
         continue
       fi
     fi
-    if ! run_step "run-documents-scrape ($name)" "$DOCUMENTS_SCRAPE" "${per_args[@]}"; then
+    local summary_file
+    summary_file=$(per_target_summary_file "$LOG_DIR" operator-validation "$name")
+    log_step "Per-target JSON summary: $summary_file"
+    if ! run_step "run-documents-scrape ($name)" "$DOCUMENTS_SCRAPE" "${per_args[@]}" --summary-file "$summary_file"; then
       log_step "Per-target failed: $name (scrape)"
       failures=$((failures + 1))
       if (( CONTINUE_ON_ERROR == 0 )); then
@@ -269,11 +274,17 @@ main() {
   fi

   local export_json_summary=0
+  local per_target_summaries=0
+  if (( PER_TARGET )) && [[ -z "$TARGET" ]]; then
+    per_target_summaries=1
+  fi
   if (( DRY_RUN == 0 && SKIP_SCRAPE == 0 && SALVAGE_ONLY == 0 )); then
     export_json_summary=1
     export DCE_RUN_SUMMARY_JSON=1
-    if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
-      export DCE_RUN_SUMMARY_FILE="${LOG_FILE%.log}.summary.json"
+    if (( per_target_summaries == 0 )); then
+      if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+        export DCE_RUN_SUMMARY_FILE="${LOG_FILE%.log}.summary.json"
+      fi
     fi
   fi

@@ -291,7 +302,11 @@ main() {
     log_step "Enabled targets: $(enabled_targets | paste -sd, -)"
   fi
   if (( export_json_summary )); then
-    log_step "JSON summary file: ${DCE_RUN_SUMMARY_FILE:-}"
+    if (( per_target_summaries )); then
+      log_step "JSON summaries: per-target under $LOG_DIR"
+    else
+      log_step "JSON summary file: ${DCE_RUN_SUMMARY_FILE:-}"
+    fi
   fi
   if (( SYNC_GUI_FLAG )); then
     run_step "sync-token-from-gui" "$SYNC_GUI" --force || failures=$((failures + 1))
@@ -326,7 +341,7 @@ main() {
   } 2>&1 | tee -a "$LOG_FILE"
   local pipeline_status=${PIPESTATUS[0]}

-  if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+  if (( export_json_summary )) && (( per_target_summaries == 0 )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
     # shellcheck source=lib/scrape-summary-json.sh
     source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
     if recover_json_summary_if_missing "$LOG_FILE" "$DCE_RUN_SUMMARY_FILE"; then

@@ -77,4 +77,16 @@ if extract_json_summary_from_log "$TMP_DIR/bad.log" "$OUT_FILE" 2>/dev/null; the
   exit 1
 fi

+path=$(per_target_summary_file "$TMP_DIR" operator-validation 'KotOR_discord_msgs')
+[[ "$path" == "$TMP_DIR/operator-validation-KotOR_discord_msgs-"*.summary.json ]] || {
+  printf 'ERROR: unexpected per_target_summary_file path: %s\n' "$path" >&2
+  exit 1
+}
+
+slug_path=$(per_target_summary_file "$TMP_DIR" operator-proof 'weird name!')
+[[ "$slug_path" == "$TMP_DIR/operator-proof-weird_name_-"*.summary.json ]] || {
+  printf 'ERROR: expected sanitized slug in path: %s\n' "$slug_path" >&2
+  exit 1
+}
+
 printf 'scrape-summary-json-smoke: ok\n'