feat(scrape): recover JSON summary from teed validation log

When DCE_RUN_SUMMARY_FILE is missing after operator validation, extract
the last DCE_JSON_SUMMARY line from the log. Refresh KotOR operator docs.
Copilot 2026-06-03 10:25:23 -05:00
parent 5cfb2ed144
commit fcea842fe3
5 changed files with 194 additions and 3 deletions


@ -0,0 +1,88 @@
---
title: "feat: Recover JSON scrape summary from tee log"
type: feat
status: complete
date: 2026-06-04
origin: /lfg — plan 070 deferred auto-extract when DCE_RUN_SUMMARY_FILE write fails inside container
---
# feat: Recover JSON scrape summary from tee log
## Summary
When operator validation enables JSON summary export but the container does not write `DCE_RUN_SUMMARY_FILE`, recover the summary from the last `DCE_JSON_SUMMARY:` line in the teed validation log.
## Problem Frame
Plan 070 mounted `logs/` and mapped summary paths for compose runs. File writes can still fail (permissions, missing mount on ad-hoc runs, partial compose failures). The scrape script always logs `DCE_JSON_SUMMARY:` when `DCE_RUN_SUMMARY_JSON=1`, and operator validation tees all output to `--log-file`. A host-side fallback avoids losing machine-readable totals.
## Requirements
| ID | Requirement |
|----|-------------|
| R1 | Shared helper extracts compact JSON after the last `DCE_JSON_SUMMARY:` prefix in a log file |
| R2 | Helper validates JSON with `jq` and writes pretty-printed output to the destination path |
| R3 | `run-operator-validation.sh` invokes fallback when JSON export enabled and summary file missing or empty after tee completes |
| R4 | Recovery success logs `JSON summary recovered from log:` with the file path |
| R5 | Offline smoke covers extract-from-log without live Discord |
| R6 | `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` → 21/21 |
## Key Technical Decisions
- **Last line wins:** Multiple scrapes in one validation run may emit several summaries; use the last `DCE_JSON_SUMMARY:` line (most recent scrape totals).
- **No overwrite:** Only recover when destination file is missing or zero-length; do not replace an existing valid file.
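Both decisions can be sketched in a few lines of Bash. This is a minimal illustration, not the helper itself: the temp paths, the `recovery_decision` function, and the log lines are hypothetical fixtures invented for the example.

```bash
#!/usr/bin/env bash
# Hypothetical fixture: one validation log that contains two scrape summaries.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/dce-decisions-demo.XXXXXX")
demo_log="$demo_dir/validation.log"
demo_dest="$demo_dir/validation.summary.json"

printf '%s\n' \
  '[t1] DCE_JSON_SUMMARY: {"version":1,"totals":{"merged":1}}' \
  '[t2] unrelated output' \
  '[t3] DCE_JSON_SUMMARY: {"version":1,"totals":{"merged":9}}' >"$demo_log"

# Last line wins: keep only the final marker line in the log.
last_line=$(grep 'DCE_JSON_SUMMARY:' "$demo_log" | tail -n 1)

# No overwrite: recover only when the destination is missing or zero-length.
recovery_decision() {
  if [[ -s "$1" ]]; then
    printf 'keep-existing\n'
  else
    printf 'recover\n'
  fi
}

decision_when_missing=$(recovery_decision "$demo_dest")
printf '{"version":1}\n' >"$demo_dest"
decision_when_present=$(recovery_decision "$demo_dest")

printf '%s\n' "$last_line" "$decision_when_missing" "$decision_when_present"
rm -rf "$demo_dir"
```

The `-s` test treats a missing file and an empty file the same way, which is why a single check covers both halves of the no-overwrite rule.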
## Implementation Units
### U1. Extract helper library
**Goal:** Reusable log-line recovery for JSON summaries.
**Files:** `scripts/lib/scrape-summary-json.sh`, `scripts/tests/scrape-summary-json-smoke.sh`
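**Approach:** `extract_json_summary_from_log(source_log dest_file)` greps for `DCE_JSON_SUMMARY:`, takes the last match, strips everything up to and including the prefix, validates the payload with `jq -e .`, and writes `jq .` output to dest. Returns 0 on success; returns 1 when the log is missing or unreadable, `jq` is unavailable, no marker line exists, or the payload is not valid JSON.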
**Test scenarios:**
- Happy path: log with `[timestamp] DCE_JSON_SUMMARY: {"version":1,...}` → valid pretty JSON file
- No marker line → returns 1, dest unchanged
- Invalid JSON after prefix → returns 1
**Verification:** smoke script passes standalone.
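The "dest unchanged" expectation in the scenarios above could be asserted by comparing checksums around a call that is expected to fail. The sketch below is one possible shape, not part of this commit: `assert_fails_and_dest_unchanged` and both `*_stub` functions are hypothetical names, and the stubs stand in for the real extractor so the example runs on its own.

```bash
#!/usr/bin/env bash
# Hypothetical assertion: run a command that is expected to fail and confirm
# the destination file has the same checksum before and after the call.
assert_fails_and_dest_unchanged() {
  local dest=$1
  shift
  local before after
  before=$(cksum <"$dest") || return 1
  if "$@" 2>/dev/null; then
    return 1 # the command unexpectedly succeeded
  fi
  after=$(cksum <"$dest") || return 1
  [[ "$before" == "$after" ]]
}

# Stand-ins for the real extractor, used only to exercise the assertion.
failing_extract_stub() { return 1; }
clobbering_extract_stub() { : >"$2"; return 1; }

demo_dest=$(mktemp "${TMPDIR:-/tmp}/dce-dest-demo.XXXXXX")
printf '{"version":1}\n' >"$demo_dest"

# A failure that leaves the destination alone satisfies the assertion.
if assert_fails_and_dest_unchanged "$demo_dest" failing_extract_stub in.log "$demo_dest"; then
  clean_result=pass
else
  clean_result=fail
fi

# A failure that truncates the destination is caught by the checksum compare.
if assert_fails_and_dest_unchanged "$demo_dest" clobbering_extract_stub in.log "$demo_dest"; then
  clobber_result=pass
else
  clobber_result=fail
fi

printf 'clean=%s clobber=%s\n' "$clean_result" "$clobber_result"
rm -f "$demo_dest"
```

In a smoke script the same assertion would wrap the real helper, for example `assert_fails_and_dest_unchanged "$OUT_FILE" extract_json_summary_from_log "$TMP_DIR/empty.log" "$OUT_FILE"`.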
### U2. Wire into operator validation
**Goal:** Auto-recover after teed validation when file write failed.
**Dependencies:** U1
**Files:** `scripts/run-operator-validation.sh`, `scripts/tests/run-operator-validation-smoke.sh`
**Approach:** Source the lib after tee completes; if `export_json_summary` is enabled, `DCE_RUN_SUMMARY_FILE` is set, and the file fails the `-s` test (missing or empty), call the extract helper and log the recovery on success.
**Test scenarios:**
- Smoke simulates log-only summary (write fake log + call helper path, or dry-run skip unchanged)
- Existing dry-run smoke still asserts no JSON summary path logged
**Verification:** operator-validation smoke passes.
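Recovery has to run after the teed pipeline finishes without losing the validation exit status. A minimal sketch of that pattern follows; `run_validation_stub` and the temp log path are hypothetical, and the stub fails on purpose to show which status is preserved.

```bash
#!/usr/bin/env bash
demo_log=$(mktemp "${TMPDIR:-/tmp}/dce-pipestatus-demo.XXXXXX")

# Hypothetical stand-in for the validation body: emits a summary line,
# writes to stderr, then fails with status 3.
run_validation_stub() {
  printf 'DCE_JSON_SUMMARY: {"version":1}\n'
  printf 'simulated failure\n' >&2
  return 3
}

# stdout and stderr are merged and teed; tee output is discarded here only
# to keep the demo quiet.
{ run_validation_stub; } 2>&1 | tee -a "$demo_log" >/dev/null

# PIPESTATUS must be copied immediately: the next command overwrites it.
statuses=("${PIPESTATUS[@]}")
producer_status=${statuses[0]}
tee_status=${statuses[1]}

printf 'producer=%s tee=%s\n' "$producer_status" "$tee_status"
```

Because `$?` after a pipeline reflects the last command (`tee`) unless `pipefail` is set, reading `PIPESTATUS[0]` is what lets the script do post-processing on the log and still exit with the producer's status.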
### U3. Docs stamp
**Goal:** Record plan 071 in merge-readiness.
**Files:** `docs/recurring-scrape-merge-readiness.md`
**Approach:** Add Plan 071 bullet; refresh stale KotOR block (lines 147-153) to cite per-target `container_memory: "8g"` and channel-scoped validation with `.summary.json`.
## Verification
```bash
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```
## Scope Boundaries
### Deferred
- Live KotOR catch-up on host
- Host runner post-scrape recovery when stdout is not teed to a file
- Merging multiple per-target summaries into one JSON artifact


@ -144,12 +144,14 @@ docker compose build # or podman-compose build
```bash
DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh --target KotOR_discord_msgs
```
Large `yes_general` may still skip without a higher container cap; `KotOR_discord_msgs` sets `container_memory: "8g"` in `scrape-targets.json` for single-target runs (override globally with `DCE_CONTAINER_MEMORY` in `scrape.env`):
```bash
DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
  --salvage-before-scrape --target KotOR_discord_msgs \
  --channel 221726893064454144 \
  --log-file logs/kotor-yes-general.log
# Also writes logs/kotor-yes-general.summary.json (or recovers from log if file write fails)
```
**Plan 063 (2026-06-04):** Optional `DCE_CONTAINER_MEMORY` compose `mem_limit` for large channel catch-up (default 0 = unlimited).
@ -168,6 +170,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
**Plan 070 (2026-06-04):** Compose mounts `logs/` at `/logs`; host runner passthrough; operator-validation auto-writes `*.summary.json` beside `--log-file`.
**Plan 071 (2026-06-04):** When summary file write fails, operator validation recovers JSON from the last `DCE_JSON_SUMMARY:` line in the teed log.
**Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.
## CI note (fork PRs)


@ -0,0 +1,26 @@
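#!/usr/bin/env bash
# Recover machine-readable scrape summaries from teed operator logs.

# Usage: extract_json_summary_from_log SOURCE_LOG DEST_FILE
# Writes the last DCE_JSON_SUMMARY payload in SOURCE_LOG to DEST_FILE as
# pretty-printed JSON. Returns 0 on success, non-zero otherwise.
extract_json_summary_from_log() {
  # Default to empty so callers running under `set -u` get a clean return 1.
  local source_log=${1:-}
  local dest_file=${2:-}
  local line json_payload

  [[ -n "$source_log" && -n "$dest_file" ]] || return 1
  [[ -f "$source_log" && -r "$source_log" ]] || return 1
  command -v jq >/dev/null 2>&1 || return 1

  # Last marker line wins when a run emits several summaries.
  line=$(grep 'DCE_JSON_SUMMARY:' "$source_log" | tail -n 1) || return 1
  [[ -n "$line" ]] || return 1

  # Drop everything up to and including the marker prefix.
  json_payload=${line#*DCE_JSON_SUMMARY: }
  [[ -n "$json_payload" ]] || return 1

  # Validate before touching the destination so a bad payload leaves it unchanged.
  if ! jq -e . >/dev/null 2>&1 <<<"$json_payload"; then
    return 1
  fi

  mkdir -p "$(dirname "$dest_file")" || return 1
  jq . <<<"$json_payload" >"$dest_file"
}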
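
The prefix strip in the helper relies on Bash's shortest-match `${var#pattern}` expansion. The sketch below shows how that expansion behaves on a timestamped line and on a line that lacks the space after the colon; both sample lines are invented for illustration.

```bash
#!/usr/bin/env bash
marker='DCE_JSON_SUMMARY: '

# Typical teed line: timestamp, marker, then compact JSON.
stamped='[2026-06-04T12:01:00Z] DCE_JSON_SUMMARY: {"version":1}'
payload=${stamped#*"$marker"}

# Assumed edge case: no space after the colon, so the pattern does not match
# and the expansion returns the line unchanged.
unspaced='[2026-06-04T12:01:00Z] DCE_JSON_SUMMARY:{"version":1}'
leftover=${unspaced#*"$marker"}

printf 'payload=%s\n' "$payload"
printf 'leftover=%s\n' "$leftover"
```

In the second case the grep still matches the line, but the unstripped text is not a JSON document, so the helper's `jq -e .` validation should reject it and return 1 rather than write a bad file.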


@ -326,6 +326,16 @@ main() {
  } 2>&1 | tee -a "$LOG_FILE"
  local pipeline_status=${PIPESTATUS[0]}

  if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
    if [[ ! -s "${DCE_RUN_SUMMARY_FILE}" ]]; then
      # shellcheck source=lib/scrape-summary-json.sh
      source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
      if extract_json_summary_from_log "$LOG_FILE" "$DCE_RUN_SUMMARY_FILE"; then
        printf 'JSON summary recovered from log: %s\n' "$DCE_RUN_SUMMARY_FILE"
      fi
    fi
  fi

  printf 'Log: %s\n' "$LOG_FILE"
  exit "$pipeline_status"
}


@ -0,0 +1,63 @@
#!/usr/bin/env bash
set -Eeuo pipefail

REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd -P)
# shellcheck source=../lib/scrape-summary-json.sh
source "$REPO_ROOT/scripts/lib/scrape-summary-json.sh"

TMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/dce-summary-json-smoke.XXXXXX")
trap 'rm -rf "$TMP_DIR"' EXIT

LOG_FILE="$TMP_DIR/scrape.log"
OUT_FILE="$TMP_DIR/recovered.summary.json"

# Happy path: one marker line yields a valid summary file.
cat >"$LOG_FILE" <<'LOG'
[2026-06-04T12:00:00Z] scrape started
[2026-06-04T12:01:00Z] DCE_JSON_SUMMARY: {"version":1,"totals":{"created":0,"merged":1,"unchanged":2,"skipped":0,"skipped_oom":0,"messages_appended":3}}
LOG

extract_json_summary_from_log "$LOG_FILE" "$OUT_FILE" || {
  printf 'ERROR: expected extract to succeed on valid marker line\n' >&2
  exit 1
}
[[ -s "$OUT_FILE" ]] || {
  printf 'ERROR: recovered summary file missing\n' >&2
  exit 1
}
jq -e '.totals.merged == 1 and .totals.messages_appended == 3' "$OUT_FILE" >/dev/null || {
  printf 'ERROR: recovered JSON content mismatch\n' >&2
  exit 1
}

# Last line wins: a later marker line replaces the earlier totals.
printf '[2026-06-04T12:02:00Z] DCE_JSON_SUMMARY: {"version":1,"totals":{"merged":9}}\n' >>"$LOG_FILE"
extract_json_summary_from_log "$LOG_FILE" "$OUT_FILE" || {
  printf 'ERROR: expected second extract to succeed\n' >&2
  exit 1
}
jq -e '.totals.merged == 9' "$OUT_FILE" >/dev/null || {
  printf 'ERROR: expected last DCE_JSON_SUMMARY line to win\n' >&2
  exit 1
}

# Failure: the source log does not exist.
if extract_json_summary_from_log "$TMP_DIR/missing.log" "$OUT_FILE" 2>/dev/null; then
  printf 'ERROR: extract should fail on missing log\n' >&2
  exit 1
fi

# Failure: the log has no marker line.
printf '[2026-06-04T12:03:00Z] no summary here\n' >"$TMP_DIR/empty.log"
if extract_json_summary_from_log "$TMP_DIR/empty.log" "$OUT_FILE" 2>/dev/null; then
  printf 'ERROR: extract should fail when marker absent\n' >&2
  exit 1
fi

# Failure: the payload after the marker is not valid JSON.
printf '[2026-06-04T12:04:00Z] DCE_JSON_SUMMARY: not-json\n' >"$TMP_DIR/bad.log"
if extract_json_summary_from_log "$TMP_DIR/bad.log" "$OUT_FILE" 2>/dev/null; then
  printf 'ERROR: extract should fail on invalid JSON\n' >&2
  exit 1
fi

printf 'scrape-summary-json-smoke: ok\n'