mirror of
https://github.com/Tyrrrz/DiscordChatExporter.git
synced 2026-06-10 00:02:37 -06:00
feat(scrape): recover JSON summary from teed validation log
When DCE_RUN_SUMMARY_FILE is missing after operator validation, extract the last DCE_JSON_SUMMARY line from the log. Refresh KotOR operator docs.
parent 5cfb2ed144
commit fcea842fe3
@@ -0,0 +1,88 @@
---
title: "feat: Recover JSON scrape summary from tee log"
type: feat
status: complete
date: 2026-06-04
origin: /lfg — plan 070 deferred auto-extract when DCE_RUN_SUMMARY_FILE write fails inside container
---

# feat: Recover JSON scrape summary from tee log

## Summary

When operator validation enables JSON summary export but the container does not write `DCE_RUN_SUMMARY_FILE`, recover the summary from the last `DCE_JSON_SUMMARY:` line in the teed validation log.

## Problem Frame

Plan 070 mounted `logs/` and mapped summary paths for compose runs. File writes can still fail (permissions, missing mount on ad-hoc runs, partial compose failures). The scrape script always logs `DCE_JSON_SUMMARY:` when `DCE_RUN_SUMMARY_JSON=1`, and operator validation tees all output to `--log-file`. A host-side fallback avoids losing machine-readable totals.

## Requirements

| ID | Requirement |
|----|-------------|
| R1 | Shared helper extracts compact JSON after the last `DCE_JSON_SUMMARY:` prefix in a log file |
| R2 | Helper validates JSON with `jq` and writes pretty-printed output to the destination path |
| R3 | `run-operator-validation.sh` invokes fallback when JSON export enabled and summary file missing or empty after tee completes |
| R4 | Recovery success logs `JSON summary recovered from log:` with the file path |
| R5 | Offline smoke covers extract-from-log without live Discord |
| R6 | `DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh` → 21/21 |

## Key Technical Decisions

- **Last line wins:** Multiple scrapes in one validation run may emit several summaries; use the last `DCE_JSON_SUMMARY:` line (most recent scrape totals).
- **No overwrite:** Only recover when destination file is missing or zero-length; do not replace an existing valid file.

## Implementation Units

### U1. Extract helper library

**Goal:** Reusable log-line recovery for JSON summaries.

**Files:** `scripts/lib/scrape-summary-json.sh`, `scripts/tests/scrape-summary-json-smoke.sh`

**Approach:** `extract_json_summary_from_log(source_log dest_file)` greps for `DCE_JSON_SUMMARY:`, takes the last match, strips prefix, `jq .` to dest. Return 0 on success, 1 when no line or invalid JSON.

**Test scenarios:**
- Happy path: log with `[timestamp] DCE_JSON_SUMMARY: {"version":1,...}` → valid pretty JSON file
- No marker line → returns 1, dest unchanged
- Invalid JSON after prefix → returns 1

**Verification:** smoke script passes standalone.

### U2. Wire into operator validation

**Goal:** Auto-recover after teed validation when file write failed.

**Dependencies:** U1

**Files:** `scripts/run-operator-validation.sh`, `scripts/tests/run-operator-validation-smoke.sh`

**Approach:** Source lib after tee; if `export_json_summary` and `DCE_RUN_SUMMARY_FILE` set and file not `-s`, call extract; log recovery on success.

**Test scenarios:**
- Smoke simulates log-only summary (write fake log + call helper path, or dry-run skip unchanged)
- Existing dry-run smoke still asserts no JSON summary path logged

**Verification:** operator-validation smoke passes.

### U3. Docs stamp

**Goal:** Record plan 071 in merge-readiness.

**Files:** `docs/recurring-scrape-merge-readiness.md`

**Approach:** Add Plan 071 bullet; refresh stale KotOR block (lines 147–153) to cite per-target `container_memory: "8g"` and channel-scoped validation with `.summary.json`.

## Verification

```bash
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```

## Scope Boundaries

### Deferred

- Live KotOR catch-up on host
- Host runner post-scrape recovery when stdout is not teed to a file
- Merging multiple per-target summaries into one JSON artifact
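
The "No overwrite" decision in the plan above comes down to a file-size test. The following is a minimal sketch of that guard, not code from the commit: `summary_needs_recovery` and the temporary file names are hypothetical, and it assumes Bash's `[[ -s file ]]`, which is true only for an existing file larger than zero bytes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): succeeds when the summary
# file is missing or zero-length, i.e. when recovery may write it.
summary_needs_recovery() {
  local summary_file=$1
  # -s is true only for an existing file with size > 0, so negate it.
  [[ ! -s "$summary_file" ]]
}

demo_dir=$(mktemp -d)

# Case 1: file does not exist, so recovery is allowed.
summary_needs_recovery "$demo_dir/missing.summary.json" && missing_result=recover || missing_result=keep

# Case 2: file exists but is empty, so recovery is allowed.
: >"$demo_dir/empty.summary.json"
summary_needs_recovery "$demo_dir/empty.summary.json" && empty_result=recover || empty_result=keep

# Case 3: file already has content, so it is left alone.
printf '{"version":1}\n' >"$demo_dir/present.summary.json"
summary_needs_recovery "$demo_dir/present.summary.json" && present_result=recover || present_result=keep

printf 'missing=%s empty=%s present=%s\n' "$missing_result" "$empty_result" "$present_result"
rm -rf "$demo_dir"
```

Run as written, this should print `missing=recover empty=recover present=keep`. Note that the test only checks size; it does not check whether an existing non-empty file contains valid JSON.
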
@@ -144,12 +144,14 @@ docker compose build # or podman-compose build
 DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh --target KotOR_discord_msgs
 ```

-Large `yes_general` may still skip without a higher container cap; set `DCE_CONTAINER_MEMORY=8g` in `scrape.env` and export that channel separately:
+Large `yes_general` may still skip without a higher container cap; `KotOR_discord_msgs` sets `container_memory: "8g"` in `scrape-targets.json` for single-target runs (override globally with `DCE_CONTAINER_MEMORY` in `scrape.env`):

 ```bash
-# scrape.env: DCE_CONTAINER_MEMORY=8g
 DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \
-  --salvage-before-scrape --target KotOR_discord_msgs --channel 221726893064454144
+  --salvage-before-scrape --target KotOR_discord_msgs \
+  --channel 221726893064454144 \
+  --log-file logs/kotor-yes-general.log
+# Also writes logs/kotor-yes-general.summary.json (or recovers from log if file write fails)
 ```

 **Plan 063 (2026-06-04):** Optional `DCE_CONTAINER_MEMORY` compose `mem_limit` for large channel catch-up (default 0 = unlimited).
@@ -168,6 +170,8 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh \

 **Plan 070 (2026-06-04):** Compose mounts `logs/` at `/logs`; host runner passthrough; operator-validation auto-writes `*.summary.json` beside `--log-file`.

+**Plan 071 (2026-06-04):** When summary file write fails, operator validation recovers JSON from the last `DCE_JSON_SUMMARY:` line in the teed log.
+
 **Disk:** ~65 GiB free on `/home` (2026-05-30); large channel merges still need headroom.

 ## CI note (fork PRs)
26
scripts/lib/scrape-summary-json.sh
Normal file
@@ -0,0 +1,26 @@
#!/usr/bin/env bash

# Recover machine-readable scrape summaries from teed operator logs.

extract_json_summary_from_log() {
  local source_log=$1
  local dest_file=$2
  local line json_payload

  [[ -n "$source_log" && -n "$dest_file" ]] || return 1
  [[ -f "$source_log" && -r "$source_log" ]] || return 1
  command -v jq >/dev/null 2>&1 || return 1

  line=$(grep 'DCE_JSON_SUMMARY:' "$source_log" | tail -1) || return 1
  [[ -n "$line" ]] || return 1

  json_payload=${line#*DCE_JSON_SUMMARY: }
  [[ -n "$json_payload" ]] || return 1

  if ! jq -e . >/dev/null 2>&1 <<<"$json_payload"; then
    return 1
  fi

  mkdir -p "$(dirname "$dest_file")"
  jq . <<<"$json_payload" >"$dest_file"
}
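
The two steps at the core of the helper above, picking the last marker line and stripping everything up to the prefix, can be tried without `jq`. This is a sketch under assumptions: the log lines are invented sample data, and `demo_log`, `last_line`, and `payload` are illustrative names. It uses `tail -n 1`, the POSIX spelling of the `tail -1` in the helper.

```bash
#!/usr/bin/env bash
# Illustrative log content; timestamps and totals are made up for this sketch.
demo_log=$(mktemp)
cat >"$demo_log" <<'LOG'
[2026-06-04T12:00:00Z] scrape started
[2026-06-04T12:01:00Z] DCE_JSON_SUMMARY: {"version":1,"totals":{"merged":1}}
[2026-06-04T12:02:00Z] DCE_JSON_SUMMARY: {"version":1,"totals":{"merged":9}}
LOG

# grep keeps every marker line; tail -n 1 keeps only the most recent one.
last_line=$(grep 'DCE_JSON_SUMMARY:' "$demo_log" | tail -n 1)

# ${var#pattern} removes the shortest leading match of the pattern, so
# everything up to and including the first "DCE_JSON_SUMMARY: " is dropped.
payload=${last_line#*DCE_JSON_SUMMARY: }

printf '%s\n' "$payload"
rm -f "$demo_log"
```

This should print `{"version":1,"totals":{"merged":9}}`, the payload of the later of the two marker lines. If the marker were followed by no space, the pattern would not match and the line would pass through unchanged, which the helper then rejects at the `jq` validation step.
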
@@ -326,6 +326,16 @@ main() {
   } 2>&1 | tee -a "$LOG_FILE"
   local pipeline_status=${PIPESTATUS[0]}

+  if (( export_json_summary )) && [[ -n "${DCE_RUN_SUMMARY_FILE:-}" ]]; then
+    if [[ ! -s "${DCE_RUN_SUMMARY_FILE}" ]]; then
+      # shellcheck source=lib/scrape-summary-json.sh
+      source "$SCRIPT_DIR/lib/scrape-summary-json.sh"
+      if extract_json_summary_from_log "$LOG_FILE" "$DCE_RUN_SUMMARY_FILE"; then
+        printf 'JSON summary recovered from log: %s\n' "$DCE_RUN_SUMMARY_FILE"
+      fi
+    fi
+  fi
+
   printf 'Log: %s\n' "$LOG_FILE"
   exit "$pipeline_status"
 }
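
The hunk above keeps `${PIPESTATUS[0]}` rather than `$?` because, in a pipeline that ends in `tee`, `$?` normally reports the status of `tee`. Below is a small sketch of the difference, assuming Bash with `pipefail` not enabled; `run_body` and its exit code 3 are hypothetical stand-ins for the validation body.

```bash
#!/usr/bin/env bash
demo_log=$(mktemp)

# Hypothetical stand-in for the validation body: prints a line, then fails with 3.
run_body() {
  printf 'validation output\n'
  return 3
}

# Without PIPESTATUS: $? is the status of the last command in the pipeline (tee).
run_body 2>&1 | tee -a "$demo_log" >/dev/null
plain_status=$?

# With PIPESTATUS: index 0 is the status of the first command (run_body).
# It has to be read right away, because the next command overwrites the array.
run_body 2>&1 | tee -a "$demo_log" >/dev/null
pipeline_status=${PIPESTATUS[0]}

printf 'plain=%s pipeline=%s\n' "$plain_status" "$pipeline_status"
rm -f "$demo_log"
```

This should print `plain=0 pipeline=3`: the failure is visible only through `PIPESTATUS`. That is why the recovery block can run after the pipeline without affecting the exit code the script finally reports.
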
63
scripts/tests/scrape-summary-json-smoke.sh
Executable file
@@ -0,0 +1,63 @@
#!/usr/bin/env bash

set -Eeuo pipefail

REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd -P)
# shellcheck source=../lib/scrape-summary-json.sh
source "$REPO_ROOT/scripts/lib/scrape-summary-json.sh"

TMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/dce-summary-json-smoke.XXXXXX")
trap 'rm -rf "$TMP_DIR"' EXIT

LOG_FILE="$TMP_DIR/scrape.log"
OUT_FILE="$TMP_DIR/recovered.summary.json"

cat >"$LOG_FILE" <<'LOG'
[2026-06-04T12:00:00Z] scrape started
[2026-06-04T12:01:00Z] DCE_JSON_SUMMARY: {"version":1,"totals":{"created":0,"merged":1,"unchanged":2,"skipped":0,"skipped_oom":0,"messages_appended":3}}
LOG

extract_json_summary_from_log "$LOG_FILE" "$OUT_FILE" || {
  printf 'ERROR: expected extract to succeed on valid marker line\n' >&2
  exit 1
}

[[ -s "$OUT_FILE" ]] || {
  printf 'ERROR: recovered summary file missing\n' >&2
  exit 1
}

jq -e '.totals.merged == 1 and .totals.messages_appended == 3' "$OUT_FILE" >/dev/null || {
  printf 'ERROR: recovered JSON content mismatch\n' >&2
  exit 1
}

printf '[2026-06-04T12:02:00Z] DCE_JSON_SUMMARY: {"version":1,"totals":{"merged":9}}\n' >>"$LOG_FILE"
extract_json_summary_from_log "$LOG_FILE" "$OUT_FILE" || {
  printf 'ERROR: expected second extract to succeed\n' >&2
  exit 1
}

jq -e '.totals.merged == 9' "$OUT_FILE" >/dev/null || {
  printf 'ERROR: expected last DCE_JSON_SUMMARY line to win\n' >&2
  exit 1
}

if extract_json_summary_from_log "$TMP_DIR/missing.log" "$OUT_FILE" 2>/dev/null; then
  printf 'ERROR: extract should fail on missing log\n' >&2
  exit 1
fi

printf '[2026-06-04T12:03:00Z] no summary here\n' >"$TMP_DIR/empty.log"
if extract_json_summary_from_log "$TMP_DIR/empty.log" "$OUT_FILE" 2>/dev/null; then
  printf 'ERROR: extract should fail when marker absent\n' >&2
  exit 1
fi

printf '[2026-06-04T12:04:00Z] DCE_JSON_SUMMARY: not-json\n' >"$TMP_DIR/bad.log"
if extract_json_summary_from_log "$TMP_DIR/bad.log" "$OUT_FILE" 2>/dev/null; then
  printf 'ERROR: extract should fail on invalid JSON\n' >&2
  exit 1
fi

printf 'scrape-summary-json-smoke: ok\n'