mirror of
https://github.com/Tyrrrz/DiscordChatExporter.git
synced 2026-06-10 00:02:37 -06:00
feat(scrape): per-target JSON summaries in multi-target loops
Validation --per-target and multi-target proof now pass --summary-file on each scrape, so each target gets its own operator-*-<target>-<UTC>.summary.json file.
parent 8c36fdbdda
commit c8ed19d26b
@@ -0,0 +1,63 @@
---
title: "feat: Per-target JSON summaries in multi-target loops"
type: feat
status: complete
date: 2026-06-04
origin: /lfg - plan 075 deferred per-target separate summary files in validation/proof loops
---

# feat: Per-target JSON summaries in multi-target loops

## Summary

When operator validation runs with `--per-target` (all enabled targets), or operator proof scrapes multiple targets, each scrape is passed its own `--summary-file` so that it writes `logs/<prefix>-<target>-<UTC>.summary.json` instead of overwriting a single combined path.
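
A minimal sketch of the naming scheme described above, assuming a helper shaped like the `per_target_summary_file` function this plan introduces. The `build_summary_path` name, the `demo-logs` directory, and the second target name are hypothetical and exist only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the plan's per_target_summary_file helper.
build_summary_path() {
  local log_dir=$1 prefix=$2 target=$3
  local slug
  # Replace anything outside [A-Za-z0-9._-] so the target is filename-safe.
  slug=$(printf '%s' "$target" | sed 's/[^A-Za-z0-9._-]/_/g')
  printf '%s/%s-%s-%s.summary.json' \
    "$log_dir" "$prefix" "$slug" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Each target gets its own path, so no scrape overwrites another's summary.
for target in KotOR_discord_msgs 'second target'; do
  summary_file=$(build_summary_path demo-logs operator-validation "$target")
  printf -- '--summary-file %s\n' "$summary_file"
done
```

The timestamp has one-second resolution, so uniqueness in this sketch comes from the target slug rather than from the time component.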

## Problem Frame

Plans 070–075 auto-export JSON summaries for single-target and documents-scrape runs. Multi-target loops still set one global `DCE_RUN_SUMMARY_FILE` tied to the teed log basename, so only the last target's scrape survives on disk, and recovery from the combined log cannot disambiguate targets.
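
The overwrite described above can be reproduced with a toy loop. This is only an illustration: `fake_scrape`, the target names, and the file names are hypothetical and stand in for the real scrape script writing its summary.

```bash
#!/usr/bin/env bash
work_dir=$(mktemp -d)
trap 'rm -rf "$work_dir"' EXIT

# Hypothetical stand-in for a scrape that writes its JSON summary to $2.
fake_scrape() {
  printf '{"target":"%s"}\n' "$1" > "$2"
}

# One shared path: each iteration overwrites the previous summary.
for target in alpha beta; do
  fake_scrape "$target" "$work_dir/combined.summary.json"
done
shared_count=$(( $(find "$work_dir" -name 'combined*.summary.json' | wc -l) ))

# One path per target: both summaries survive.
for target in alpha beta; do
  fake_scrape "$target" "$work_dir/run-$target.summary.json"
done
per_target_count=$(( $(find "$work_dir" -name 'run-*.summary.json' | wc -l) ))

printf 'shared path: %d file(s), last writer: %s\n' \
  "$shared_count" "$(cat "$work_dir/combined.summary.json")"
printf 'per-target paths: %d file(s)\n' "$per_target_count"
```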

## Requirements

| ID | Requirement |
|----|-------------|
| R1 | `scripts/lib/scrape-summary-json.sh` exposes `per_target_summary_file LOG_DIR PREFIX TARGET` with sanitized target slug |
| R2 | `run-operator-validation.sh --per-target` (no `--target`) skips global `DCE_RUN_SUMMARY_FILE`; each live scrape passes `--summary-file` |
| R3 | Validation logs `Per-target JSON summary: <path>` before each live scrape |
| R4 | `run-operator-proof.sh` with 2+ targets uses per-target `--summary-file`; single-target keeps log-basename summary |
| R5 | Proof logs per-target summary path in the target loop when exporting JSON |
| R6 | End-of-run log recovery skipped when per-target mode (files written directly by scrape) |
| R7 | `scrape-summary-json-smoke.sh` asserts helper output shape |
| R8 | `run-operator-validation-smoke.sh` multi-target dry-run still passes; optional fake-docker live per-target asserts two distinct `--summary-file` paths in subprocess output |
| R9 | `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` → 23/23 |
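
The optional live check in R8 could look roughly like the sketch below. The captured output is hypothetical sample text with made-up paths; a real smoke would record the fake-docker subprocess output instead of hard-coding it.

```bash
#!/usr/bin/env bash
# Hypothetical captured subprocess output; a real smoke would record this
# from the fake-docker run rather than hard-code it.
captured_output='run-documents-scrape --target alpha --summary-file logs/operator-validation-alpha-20260604T000000Z.summary.json
run-documents-scrape --target beta --summary-file logs/operator-validation-beta-20260604T000001Z.summary.json'

# Pull out every --summary-file argument and count the distinct paths.
distinct_paths=$(( $(printf '%s\n' "$captured_output" \
  | grep -oE -- '--summary-file [^ ]+' \
  | sort -u \
  | wc -l) ))

if (( distinct_paths == 2 )); then
  printf 'ok: two distinct --summary-file paths\n'
else
  printf 'ERROR: expected 2 distinct paths, got %s\n' "$distinct_paths" >&2
  exit 1
fi
```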

## Implementation Units

### U1. Shared helper

**Files:** `scripts/lib/scrape-summary-json.sh`, `scripts/tests/scrape-summary-json-smoke.sh`

### U2. Operator validation

**Files:** `scripts/run-operator-validation.sh`

### U3. Operator proof

**Files:** `scripts/run-operator-proof.sh`

### U4. Docs

**Files:** `docs/recurring-scrape-merge-readiness.md`, `docs/recurring-scrape-operator-checklist.md`

## Verification

```bash
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```

## Scope Boundaries

### Deferred

- Live KotOR catch-up on host
- Tee full documents-scrape stdout to persistent log
- Refresh PR #1538 body with plans 070–076 stamps

@@ -180,6 +180,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
 
 **Plan 075 (2026-06-04):** `run-documents-scrape.sh` auto-writes `logs/documents-scrape-<UTC>.summary.json` on live scrapes.
 
+**Plan 076 (2026-06-04):** Multi-target validation (`--per-target`) and proof loops write separate `logs/operator-*-<target>-<UTC>.summary.json` per scrape.
+
 **Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.
 
 ## CI note (fork PRs)

@@ -56,7 +56,8 @@ Salvage then incremental scrape:
 ./scripts/run-documents-scrape.sh --salvage-before-scrape --target NAME [--channel ID]
 ./scripts/run-operator-validation.sh --salvage-before-scrape --target NAME [--channel ID] --log-file logs/scrape.log
 ./scripts/run-operator-proof.sh --salvage-before-scrape --sync-gui --target NAME
-# When scraping, also writes logs/operator-proof-<UTC>.summary.json beside the proof log
+# When scraping one target, also writes logs/operator-proof-<UTC>.summary.json beside the proof log
+# All enabled targets: each gets logs/operator-proof-<target>-<UTC>.summary.json
 ```
 
 **KotOR yes_general** (`221726893064454144`): first catch-up after a 2021 archive cursor can take hours and may OOM; salvage preserved partials before retrying. Stop duplicate validation processes (MyBook vs Downloads checkouts share the same lock). `KotOR_discord_msgs` sets `container_memory: "8g"` in `scrape-targets.json` for single-target runs; override globally with `DCE_CONTAINER_MEMORY` in `scrape.env` if needed. Channel-scoped proof:

@@ -33,3 +33,19 @@ recover_json_summary_if_missing() {
   [[ -s "$dest_file" ]] && return 1
   extract_json_summary_from_log "$run_log" "$dest_file"
 }
+
+sanitize_target_slug() {
+  local raw=$1
+  printf '%s' "$raw" | sed 's/[^A-Za-z0-9._-]/_/g'
+}
+
+per_target_summary_file() {
+  local log_dir=$1
+  local prefix=$2
+  local target=$3
+  local slug
+
+  [[ -n "$log_dir" && -n "$prefix" && -n "$target" ]] || return 1
+  slug=$(sanitize_target_slug "$target")
+  printf '%s/%s-%s-%s.summary.json' "$log_dir" "$prefix" "$slug" "$(date -u +%Y%m%dT%H%M%SZ)"
+}

@@ -9,9 +9,11 @@ HANDOFF="$REPO_ROOT/scripts/operator-handoff.sh"
 DOCUMENTS="$REPO_ROOT/scripts/run-documents-scrape.sh"
 PROVE="$REPO_ROOT/scripts/prove-incremental-append.sh"
 SYNC_GUI="$REPO_ROOT/scripts/sync-token-from-gui.sh"
-LOG_DIR="$REPO_ROOT/logs"
+LOG_DIR="${DCE_LOG_DIR:-$REPO_ROOT/logs}"
 # shellcheck source=lib/scrape-run-plan.sh
 source "$SCRIPT_DIR/lib/scrape-run-plan.sh"
+# shellcheck source=lib/scrape-summary-json.sh
+source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
 
 TARGET=""
 SYNC_GUI_FLAG=0
@@ -36,8 +38,9 @@ When --target is omitted, all enabled targets in the config are processed.
   --salvage-before-scrape Merge stale .dce-temp exports before incremental scrape
   --log-file PATH Append output to this file (default: logs/operator-proof-UTC.log)
 
-Logs append to logs/operator-proof-<timestamp>.log (or --log-file). When scraping, also writes
-<log-basename>.summary.json unless DCE_RUN_SUMMARY_FILE is already set.
+Logs append to logs/operator-proof-<timestamp>.log (or --log-file). When scraping one target, also writes
+<log-basename>.summary.json unless DCE_RUN_SUMMARY_FILE is already set. Multiple targets each get
+logs/operator-proof-<target>-<UTC>.summary.json.
 EOF
 }
 
@@ -121,13 +124,19 @@ main() {
   fi
 
   local export_json_summary=0
+  local per_target_summaries=0
+  if ((${#targets[@]} > 1)); then
+    per_target_summaries=1
+  fi
   if (( DRY_RUN == 0 && SALVAGE_ONLY == 0 )); then
     export_json_summary=1
     export DCE_RUN_SUMMARY_JSON=1
+    if (( per_target_summaries == 0 )); then
       if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
         export DCE_RUN_SUMMARY_FILE="${log_file%.log}.summary.json"
       fi
     fi
+  fi
 
   local failed=0 succeeded=0 name
 
@@ -140,8 +149,12 @@ main() {
     printf 'config: %s\n' "$CONFIG_PATH"
     print_scrape_config_plan "$CONFIG_PATH" "Operator proof" "${targets[@]}"
     if (( export_json_summary )); then
+      if (( per_target_summaries )); then
+        printf 'JSON summaries: per-target under %s\n' "$(dirname "$log_file")"
+      else
         printf 'JSON summary file: %s\n' "${DCE_RUN_SUMMARY_FILE:-}"
       fi
+    fi
     printf 'started: %s\n\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
 
     if (( SYNC_GUI_FLAG == 1 )); then
@@ -168,6 +181,12 @@ main() {
       printf '\n--- Target: %s ---\n' "$name"
       local -a scrape_args=(--config "$CONFIG_PATH" --target "$name")
       scrape_args+=("${CHANNEL_ARGS[@]}")
+      if (( export_json_summary && per_target_summaries )); then
+        local summary_file
+        summary_file=$(per_target_summary_file "$(dirname "$log_file")" operator-proof "$name")
+        printf 'JSON summary file: %s\n' "$summary_file"
+        scrape_args+=(--summary-file "$summary_file")
+      fi
       if (( SALVAGE_BEFORE )); then
         if ! "$DOCUMENTS" "${scrape_args[@]}" --salvage-only; then
           failed=$((failed + 1))
@@ -189,7 +208,7 @@ main() {
     (( failed == 0 )) || exit 1
   } 2>&1 | tee "$log_file"
 
-  if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+  if (( export_json_summary )) && (( per_target_summaries == 0 )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
     # shellcheck source=lib/scrape-summary-json.sh
     source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
     if recover_json_summary_if_missing "$log_file" "$DCE_RUN_SUMMARY_FILE"; then

@@ -13,6 +13,8 @@ AUDIT_JSON="$REPO_ROOT/scripts/audit-archive-json.sh"
 LOCK_STATUS="$REPO_ROOT/scripts/scrape-lock-status.sh"
 # shellcheck source=lib/scrape-lock.sh
 source "$SCRIPT_DIR/lib/scrape-lock.sh"
+# shellcheck source=lib/scrape-summary-json.sh
+source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
 
 DRY_RUN=0
 SKIP_SCRAPE=0
@@ -167,7 +169,10 @@ scrape_per_target() {
         continue
       fi
     fi
-    if ! run_step "run-documents-scrape ($name)" "$DOCUMENTS_SCRAPE" "${per_args[@]}"; then
+    local summary_file
+    summary_file=$(per_target_summary_file "$LOG_DIR" operator-validation "$name")
+    log_step "Per-target JSON summary: $summary_file"
+    if ! run_step "run-documents-scrape ($name)" "$DOCUMENTS_SCRAPE" "${per_args[@]}" --summary-file "$summary_file"; then
       log_step "Per-target failed: $name (scrape)"
       failures=$((failures + 1))
       if (( CONTINUE_ON_ERROR == 0 )); then
@@ -269,13 +274,19 @@ main() {
   fi
 
   local export_json_summary=0
+  local per_target_summaries=0
+  if (( PER_TARGET )) && [[ -z "$TARGET" ]]; then
+    per_target_summaries=1
+  fi
   if (( DRY_RUN == 0 && SKIP_SCRAPE == 0 && SALVAGE_ONLY == 0 )); then
     export_json_summary=1
     export DCE_RUN_SUMMARY_JSON=1
+    if (( per_target_summaries == 0 )); then
       if [[ -z "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
         export DCE_RUN_SUMMARY_FILE="${LOG_FILE%.log}.summary.json"
       fi
     fi
+  fi
 
   local failures=0
 
@@ -291,8 +302,12 @@ main() {
       log_step "Enabled targets: $(enabled_targets | paste -sd, -)"
     fi
     if (( export_json_summary )); then
+      if (( per_target_summaries )); then
+        log_step "JSON summaries: per-target under $LOG_DIR"
+      else
         log_step "JSON summary file: ${DCE_RUN_SUMMARY_FILE:-}"
       fi
+    fi
     if (( SYNC_GUI_FLAG )); then
       run_step "sync-token-from-gui" "$SYNC_GUI" --force || failures=$((failures + 1))
     fi
@@ -326,7 +341,7 @@ main() {
   } 2>&1 | tee -a "$LOG_FILE"
   local pipeline_status=${PIPESTATUS[0]}
 
-  if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+  if (( export_json_summary )) && (( per_target_summaries == 0 )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
     # shellcheck source=lib/scrape-summary-json.sh
     source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
     if recover_json_summary_if_missing "$LOG_FILE" "$DCE_RUN_SUMMARY_FILE"; then

@@ -77,4 +77,16 @@ if extract_json_summary_from_log "$TMP_DIR/bad.log" "$OUT_FILE" 2>/dev/null; the
   exit 1
 fi
 
+path=$(per_target_summary_file "$TMP_DIR" operator-validation 'KotOR_discord_msgs')
+[[ "$path" == "$TMP_DIR/operator-validation-KotOR_discord_msgs-"*.summary.json ]] || {
+  printf 'ERROR: unexpected per_target_summary_file path: %s\n' "$path" >&2
+  exit 1
+}
+
+slug_path=$(per_target_summary_file "$TMP_DIR" operator-proof 'weird name!')
+[[ "$slug_path" == "$TMP_DIR/operator-proof-weird_name_-"*.summary.json ]] || {
+  printf 'ERROR: expected sanitized slug in path: %s\n' "$slug_path" >&2
+  exit 1
+}
+
 printf 'scrape-summary-json-smoke: ok\n'