fix(audit): exclude .dce-temp partial exports from JSON audit

Operator validation failed when yes_general OOM left truncated exports
under .dce-temp. Audit and archive verification now skip in-progress temps;
smoke covers the partial-temp case. KotOR audit passes with temps present.
Copilot 2026-06-03 05:59:54 -05:00
parent 8b54b6a498
commit 928c0ef682
5 changed files with 50 additions and 5 deletions
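
For readers skimming the diff, the exclusion described in the commit message can be sketched in isolation. This is a minimal illustration, not code from the repository: `list_archive_json` and the fixture file names are hypothetical, and only the two `! -path` filters mirror the `find` calls changed in this commit.

```shell
set -eu

# Hypothetical helper (not in the repo): list archive JSON files under a
# directory, skipping metadata and in-progress temp exports.
list_archive_json() {
  find "$1" -type f -name '*.json' \
    ! -path '*/.dce-meta/*' \
    ! -path '*/.dce-temp/*' \
    -print
}

# Throwaway fixture: one complete archive and one truncated partial export.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/good/.dce-temp/export.111.PARTIAL"
printf '{"messages":[]}\n' >"$demo_root/good/general [111].json"
printf '{"messages":[\n' >"$demo_root/good/.dce-temp/export.111.PARTIAL/export.json"

# Only the complete archive is listed; the partial under .dce-temp is skipped.
list_archive_json "$demo_root"
```

Because the truncated file never reaches the validator, a later salvage run can still pick it up from `.dce-temp` without the audit reporting it as `INVALID` in the meantime.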

@@ -0,0 +1,38 @@
---
title: "fix: Exclude .dce-temp from archive JSON audit"
type: fix
status: complete
date: 2026-06-04
origin: /lfg — KotOR validation audit fails on in-progress partial exports in .dce-temp
---
# fix: Exclude .dce-temp from archive JSON audit
## Problem
`audit-archive-json.sh` scans every `*.json` under a target output dir. Partial/incomplete exports in `.dce-temp/export.*` (truncated mid-catch-up) fail `jq empty`, so operator validation reports audit failure even when archive files are valid.
Observed: `INVALID .../KotOR_discord_msgs/.dce-temp/export.221726893064454144.kbMFiP/export.json` after yes_general OOM skip preserved partial temp (plan 043).
## Requirements
| ID | Requirement |
|----|-------------|
| R1 | `audit-archive-json.sh` skips files under `*/.dce-temp/*` (same as `.dce-meta`) |
| R2 | `verify-documents-archives.sh` uses consistent exclusion if it scans JSON |
| R3 | `audit-archive-json-smoke.sh` covers a fixture partial under `.dce-temp` that must not fail audit |
| R4 | `run-all-smokes.sh` passes (19/19) |
| R5 | KotOR audit passes while partial temp exists; update `docs/recurring-scrape-merge-readiness.md` |
## Verification
```bash
./scripts/tests/audit-archive-json-smoke.sh
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
./scripts/audit-archive-json.sh --config config/scrape-targets.json --target KotOR_discord_msgs
```
## Out of scope
- Completing yes_general multi-hour catch-up inside LFG
- Container memory tuning

@@ -111,9 +111,13 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh --sync-gui --per-target -
| expanded_kotor_discord | pass | pass | validation-resume |
| eod_discord | pass | pass | validation-resume |
| DS_Discord_msgs | pass | pass | validation-resume; some channels forbidden |
-| KotOR_discord_msgs | **in progress** | — | plan 044 validation started 2026-06-04 (`logs/kotor-validation-20260604.log`); `yes_general` catch-up + preserve-partial smoke |
+| KotOR_discord_msgs | **scrape pass / audit pass*** | pass* | plan 045: audit excludes `.dce-temp` partials; yes_general catch-up in progress with preserved partial temps (~2329 MiB) |
-**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper always writes explicit compose env from `DISCORD_TOKEN_FILE` (fixes auth-retry when shell exports a stale `DISCORD_TOKEN`). `run-all-smokes.sh` → 19/19 pass.
+\* Audit failed before plan 045 because truncated partial exports under `.dce-temp/` were scanned as archives. After fix, audit passes while partial temps exist.
+**Plan 045 (2026-06-04):** `audit-archive-json.sh` and `verify-documents-archives.sh` skip `*/.dce-temp/*` (in-progress partial exports). Salvage run 2026-06-03: 7 merged, 17 unchanged, 3 skipped (+5404 messages); yes_general OOM-skipped with partial temps preserved for next salvage.
+**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper prefers `DISCORD_TOKEN_FILE` over inherited shell tokens. `run-all-smokes.sh` → 19/19 pass.
**KotOR / yes_general (plan 040-043):** Incremental `--after` works for all channels; most return `UNCHANGED` in seconds. `yes_general` archive last message was **2021-01-17** — the first catch-up legitimately fetches years of history. Prior bug: OOM skip **deleted** partial temp exports, causing re-download loops. Plan 043 preserves partial temps and salvages on next run.

@@ -35,7 +35,7 @@ audit_dir() {
fi
printf 'INVALID\t%s\n' "$file_path"
FAILURES=$((FAILURES + 1))
-done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' -print0 2>/dev/null)
+done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null)
}
main() {

@@ -21,6 +21,9 @@ JSON
printf '{"messages":[\n' >"$ARCHIVE_ROOT/bad/truncated [222].json"
+mkdir -p "$ARCHIVE_ROOT/good/.dce-temp/export.111.PARTIAL"
+printf '{"messages":[\n' >"$ARCHIVE_ROOT/good/.dce-temp/export.111.PARTIAL/export.json"
cat >"$CONFIG_PATH" <<JSON
{
"archive_root": "$ARCHIVE_ROOT",

@@ -28,7 +28,7 @@ require_command() {
count_archive_json() {
local output_dir=$1
-find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' 2>/dev/null | wc -l | tr -d ' '
+find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' 2>/dev/null | wc -l | tr -d ' '
}
count_seeded_channel_ids() {
@@ -42,7 +42,7 @@ count_seeded_channel_ids() {
if [[ "$file_name" =~ \[([0-9]{16,22})\]\.json$ ]]; then
printf '%s\n' "${BASH_REMATCH[1]}"
fi
-done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' -print0 2>/dev/null) | sort -u | wc -l | tr -d ' '
+done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null) | sort -u | wc -l | tr -d ' '
}
count_channel_map_entries() {