From 928c0ef682e7737849267ade4a2bba3b7c0a0819 Mon Sep 17 00:00:00 2001
From: Copilot
Date: Wed, 3 Jun 2026 05:59:54 -0500
Subject: [PATCH] fix(audit): exclude .dce-temp partial exports from JSON audit

Operator validation failed when yes_general OOM left truncated exports
under .dce-temp. Audit and archive verification now skip in-progress temps;
smoke covers the partial-temp case. KotOR audit passes with temps present.
---
 ...-04-045-fix-audit-exclude-dce-temp-plan.md | 38 +++++++++++++++++++
 docs/recurring-scrape-merge-readiness.md      |  8 +++-
 scripts/audit-archive-json.sh                 |  2 +-
 scripts/tests/audit-archive-json-smoke.sh     |  3 ++
 scripts/verify-documents-archives.sh          |  4 +-
 5 files changed, 50 insertions(+), 5 deletions(-)
 create mode 100644 docs/plans/2026-06-04-045-fix-audit-exclude-dce-temp-plan.md

diff --git a/docs/plans/2026-06-04-045-fix-audit-exclude-dce-temp-plan.md b/docs/plans/2026-06-04-045-fix-audit-exclude-dce-temp-plan.md
new file mode 100644
index 00000000..035ccaa5
--- /dev/null
+++ b/docs/plans/2026-06-04-045-fix-audit-exclude-dce-temp-plan.md
@@ -0,0 +1,38 @@
+---
+title: "fix: Exclude .dce-temp from archive JSON audit"
+type: fix
+status: complete
+date: 2026-06-04
+origin: /lfg — KotOR validation audit fails on in-progress partial exports in .dce-temp
+---
+
+# fix: Exclude .dce-temp from archive JSON audit
+
+## Problem
+
+`audit-archive-json.sh` scans every `*.json` under a target output dir. Partial/incomplete exports in `.dce-temp/export.*` (truncated mid-catch-up) fail `jq empty`, so operator validation reports audit failure even when archive files are valid.
+
+Observed: `INVALID .../KotOR_discord_msgs/.dce-temp/export.221726893064454144.kbMFiP/export.json` after yes_general OOM skip preserved partial temp (plan 043).
+
+## Requirements
+
+| ID | Requirement |
+|----|-------------|
+| R1 | `audit-archive-json.sh` skips files under `*/.dce-temp/*` (same as `.dce-meta`) |
+| R2 | `verify-documents-archives.sh` uses consistent exclusion if it scans JSON |
+| R3 | `audit-archive-json-smoke.sh` covers a fixture partial under `.dce-temp` that must not fail audit |
+| R4 | `run-all-smokes.sh` passes (19/19) |
+| R5 | KotOR audit passes while partial temp exists; update `docs/recurring-scrape-merge-readiness.md` |
+
+## Verification
+
+```bash
+./scripts/tests/audit-archive-json-smoke.sh
+DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
+./scripts/audit-archive-json.sh --config config/scrape-targets.json --target KotOR_discord_msgs
+```
+
+## Out of scope
+
+- Completing yes_general multi-hour catch-up inside LFG
+- Container memory tuning
diff --git a/docs/recurring-scrape-merge-readiness.md b/docs/recurring-scrape-merge-readiness.md
index 67ad256f..e70a15c5 100644
--- a/docs/recurring-scrape-merge-readiness.md
+++ b/docs/recurring-scrape-merge-readiness.md
@@ -111,9 +111,13 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh --sync-gui --per-target -
 | expanded_kotor_discord | pass | pass | validation-resume |
 | eod_discord | pass | pass | validation-resume |
 | DS_Discord_msgs | pass | pass | validation-resume; some channels forbidden |
-| KotOR_discord_msgs | **in progress** | — | plan 044 validation started 2026-06-04 (`logs/kotor-validation-20260604.log`); `yes_general` catch-up + preserve-partial smoke |
+| KotOR_discord_msgs | **scrape pass / audit pass*** | pass* | plan 045: audit excludes `.dce-temp` partials; yes_general catch-up in progress with preserved partial temps (~23–29 MiB) |
 
-**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper always writes explicit compose env from `DISCORD_TOKEN_FILE` (fixes auth-retry when shell exports a stale `DISCORD_TOKEN`). `run-all-smokes.sh` → 19/19 pass.
+\* Audit failed before plan 045 because truncated partial exports under `.dce-temp/` were scanned as archives. After fix, audit passes while partial temps exist.
+
+**Plan 045 (2026-06-04):** `audit-archive-json.sh` and `verify-documents-archives.sh` skip `*/.dce-temp/*` (in-progress partial exports). Salvage run 2026-06-03: 7 merged, 17 unchanged, 3 skipped (+5404 messages); yes_general OOM-skipped with partial temps preserved for next salvage.
+
+**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper prefers `DISCORD_TOKEN_FILE` over inherited shell tokens. `run-all-smokes.sh` → 19/19 pass.
 
 **KotOR / yes_general (plan 040–043):** Incremental `--after` works for all channels; most return `UNCHANGED` in seconds. `yes_general` archive last message was **2021-01-17** — the first catch-up legitimately fetches years of history. Prior bug: OOM skip **deleted** partial temp exports, causing re-download loops. Plan 043 preserves partial temps and salvages on next run.
 
diff --git a/scripts/audit-archive-json.sh b/scripts/audit-archive-json.sh
index 191815ab..454dcbe2 100755
--- a/scripts/audit-archive-json.sh
+++ b/scripts/audit-archive-json.sh
@@ -35,7 +35,7 @@ audit_dir() {
     fi
     printf 'INVALID\t%s\n' "$file_path"
     FAILURES=$((FAILURES + 1))
-  done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' -print0 2>/dev/null)
+  done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null)
 }
 
 main() {
diff --git a/scripts/tests/audit-archive-json-smoke.sh b/scripts/tests/audit-archive-json-smoke.sh
index 255d4202..f8bac9c3 100755
--- a/scripts/tests/audit-archive-json-smoke.sh
+++ b/scripts/tests/audit-archive-json-smoke.sh
@@ -21,6 +21,9 @@ JSON
 
 printf '{"messages":[\n' >"$ARCHIVE_ROOT/bad/truncated [222].json"
 
+mkdir -p "$ARCHIVE_ROOT/good/.dce-temp/export.111.PARTIAL"
+printf '{"messages":[\n' >"$ARCHIVE_ROOT/good/.dce-temp/export.111.PARTIAL/export.json"
+
 cat >"$CONFIG_PATH" </dev/null | wc -l | tr -d ' '
+  find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' 2>/dev/null | wc -l | tr -d ' '
 }
 
 count_seeded_channel_ids() {
@@ -42,7 +42,7 @@ count_seeded_channel_ids() {
     if [[ "$file_name" =~ \[([0-9]{16,22})\]\.json$ ]]; then
       printf '%s\n' "${BASH_REMATCH[1]}"
     fi
-  done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' -print0 2>/dev/null) | sort -u | wc -l | tr -d ' '
+  done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null) | sort -u | wc -l | tr -d ' '
 }
 
 count_channel_map_entries() {
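
Reviewer note: the `find` exclusion added above can be tried in isolation. The following is a minimal sketch, not code from this repository: `list_audit_candidates` and the `demo_root` fixture are hypothetical names, and the sketch only lists the files an audit would inspect instead of running `jq empty` on them.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): print the JSON files an
# audit would inspect, skipping metadata (.dce-meta) and in-progress
# partial exports (.dce-temp), using the same predicates as the patch.
list_audit_candidates() {
  local output_dir="$1"
  find "$output_dir" -type f -name '*.json' \
    ! -path '*/.dce-meta/*' \
    ! -path '*/.dce-temp/*' \
    -print 2>/dev/null | sort
}

# Throwaway fixture: one complete archive file and one truncated partial,
# mirroring the layout used by the smoke test in this patch.
demo_root="$(mktemp -d)"
trap 'rm -rf "$demo_root"' EXIT
mkdir -p "$demo_root/good/.dce-temp/export.111.PARTIAL"
printf '{"messages":[]}\n' >"$demo_root/good/general [111].json"
printf '{"messages":[\n' >"$demo_root/good/.dce-temp/export.111.PARTIAL/export.json"

# Only the archive file should be listed; the partial under .dce-temp is skipped.
list_audit_candidates "$demo_root"
```

Because `-path` patterns match against the whole path and `*` also matches `/`, the single pattern `*/.dce-temp/*` covers partials at any depth below the temp directory, which is why the patch does not need a separate `-prune` branch.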