mirror of https://github.com/Tyrrrz/DiscordChatExporter.git, synced 2026-06-09 15:52:37 -06:00
fix(audit): exclude .dce-temp partial exports from JSON audit
Operator validation failed when yes_general OOM left truncated exports under .dce-temp. Audit and archive verification now skip in-progress temps; smoke covers the partial-temp case. KotOR audit passes with temps present.
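The exclusion this commit describes can be tried in isolation. The sketch below is only an illustration, not the repository's script: the fixture layout, the file names, and the `list_archive_json` helper are hypothetical, and only the two `! -path` predicates mirror the patched `find` calls.

```bash
#!/usr/bin/env bash
set -eu

# Throwaway fixture; every name here is illustrative, not taken from the repo.
demo_root=$(mktemp -d)
trap 'rm -rf "$demo_root"' EXIT

mkdir -p "$demo_root/.dce-meta" "$demo_root/.dce-temp/export.111.PARTIAL"
printf '{"messages":[]}\n' >"$demo_root/general [111111111111111111].json"
printf '{}\n' >"$demo_root/.dce-meta/state.json"
# Truncated export, like one left behind by an interrupted catch-up.
printf '{"messages":[\n' >"$demo_root/.dce-temp/export.111.PARTIAL/export.json"

# Hypothetical helper: list archive JSON, skipping metadata and in-progress temps.
list_archive_json() {
  find "$1" -type f -name '*.json' \
    ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' 2>/dev/null
}

archive_count=$(list_archive_json "$demo_root" | wc -l | tr -d ' ')
printf 'archive JSON files seen by the audit: %s\n' "$archive_count"
```

Of the three JSON files in the fixture, only the top-level archive is listed, so the truncated temp never reaches a validator.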
parent 8b54b6a498
commit 928c0ef682
38
docs/plans/2026-06-04-045-fix-audit-exclude-dce-temp-plan.md
Normal file
@@ -0,0 +1,38 @@
+---
+title: "fix: Exclude .dce-temp from archive JSON audit"
+type: fix
+status: complete
+date: 2026-06-04
+origin: /lfg — KotOR validation audit fails on in-progress partial exports in .dce-temp
+---
+
+# fix: Exclude .dce-temp from archive JSON audit
+
+## Problem
+
+`audit-archive-json.sh` scans every `*.json` under a target output dir. Partial/incomplete exports in `.dce-temp/export.*` (truncated mid-catch-up) fail `jq empty`, so operator validation reports audit failure even when archive files are valid.
+
+Observed: `INVALID .../KotOR_discord_msgs/.dce-temp/export.221726893064454144.kbMFiP/export.json` after yes_general OOM skip preserved partial temp (plan 043).
+
+## Requirements
+
+| ID | Requirement |
+|----|-------------|
+| R1 | `audit-archive-json.sh` skips files under `*/.dce-temp/*` (same as `.dce-meta`) |
+| R2 | `verify-documents-archives.sh` uses consistent exclusion if it scans JSON |
+| R3 | `audit-archive-json-smoke.sh` covers a fixture partial under `.dce-temp` that must not fail audit |
+| R4 | `run-all-smokes.sh` passes (19/19) |
+| R5 | KotOR audit passes while partial temp exists; update `docs/recurring-scrape-merge-readiness.md` |
+
+## Verification
+
+```bash
+./scripts/tests/audit-archive-json-smoke.sh
+DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
+./scripts/audit-archive-json.sh --config config/scrape-targets.json --target KotOR_discord_msgs
+```
+
+## Out of scope
+
+- Completing yes_general multi-hour catch-up inside LFG
+- Container memory tuning

@@ -111,9 +111,13 @@ DCE_MIN_FREE_MB=0 ./scripts/run-operator-validation.sh --sync-gui --per-target -
 | expanded_kotor_discord | pass | pass | validation-resume |
 | eod_discord | pass | pass | validation-resume |
 | DS_Discord_msgs | pass | pass | validation-resume; some channels forbidden |
-| KotOR_discord_msgs | **in progress** | — | plan 044 validation started 2026-06-04 (`logs/kotor-validation-20260604.log`); `yes_general` catch-up + preserve-partial smoke |
+| KotOR_discord_msgs | **scrape pass / audit pass*** | pass* | plan 045: audit excludes `.dce-temp` partials; yes_general catch-up in progress with preserved partial temps (~23–29 MiB) |

-**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper always writes explicit compose env from `DISCORD_TOKEN_FILE` (fixes auth-retry when shell exports a stale `DISCORD_TOKEN`). `run-all-smokes.sh` → 19/19 pass.
+\* Audit failed before plan 045 because truncated partial exports under `.dce-temp/` were scanned as archives. After fix, audit passes while partial temps exist.
+
+**Plan 045 (2026-06-04):** `audit-archive-json.sh` and `verify-documents-archives.sh` skip `*/.dce-temp/*` (in-progress partial exports). Salvage run 2026-06-03: 7 merged, 17 unchanged, 3 skipped (+5404 messages); yes_general OOM-skipped with partial temps preserved for next salvage.
+
+**Plan 044 (2026-06-04):** Offline smoke asserts partial temp preserved on OOM skip (channel 134). Host wrapper prefers `DISCORD_TOKEN_FILE` over inherited shell tokens. `run-all-smokes.sh` → 19/19 pass.

 **KotOR / yes_general (plan 040–043):** Incremental `--after` works for all channels; most return `UNCHANGED` in seconds. `yes_general` archive last message was **2021-01-17** — the first catch-up legitimately fetches years of history. Prior bug: OOM skip **deleted** partial temp exports, causing re-download loops. Plan 043 preserves partial temps and salvages on next run.

@@ -35,7 +35,7 @@ audit_dir() {
 fi
 printf 'INVALID\t%s\n' "$file_path"
 FAILURES=$((FAILURES + 1))
-done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' -print0 2>/dev/null)
+done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null)
 }

 main() {

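For readers who only see the tail of `audit_dir()` in the hunk above, here is a rough, self-contained sketch of the loop shape it implies: a NUL-delimited `find` fed through process substitution, so the `FAILURES` counter is updated in the calling shell rather than a subshell. This is an assumption-laden reconstruction, not the script itself. `audit_dir_sketch` and the fixture names are hypothetical, and `looks_complete` is a deliberately crude stand-in for `jq empty` so that the example runs without `jq`.

```bash
#!/usr/bin/env bash
set -euo pipefail

FAILURES=0

# Crude stand-in for `jq empty` (assumption: a complete export ends with "}").
looks_complete() {
  local last_char
  last_char=$(tr -d '[:space:]' <"$1" | tail -c 1)
  [[ "$last_char" == '}' ]]
}

# Hypothetical reconstruction of the loop around the patched find call.
audit_dir_sketch() {
  local output_dir=$1 file_path
  while IFS= read -r -d '' file_path; do
    if looks_complete "$file_path"; then
      continue
    fi
    printf 'INVALID\t%s\n' "$file_path"
    FAILURES=$((FAILURES + 1))
  done < <(find "$output_dir" -type f -name '*.json' \
    ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null)
}

# Illustrative fixture: one valid archive with a partial temp, one truncated archive.
demo_root=$(mktemp -d)
trap 'rm -rf "$demo_root"' EXIT
mkdir -p "$demo_root/good/.dce-temp/export.111.PARTIAL" "$demo_root/bad"
printf '{"messages":[]}\n' >"$demo_root/good/ok [111].json"
printf '{"messages":[\n' >"$demo_root/good/.dce-temp/export.111.PARTIAL/export.json"
printf '{"messages":[\n' >"$demo_root/bad/truncated [222].json"

audit_dir_sketch "$demo_root/good"
good_failures=$FAILURES
FAILURES=0
audit_dir_sketch "$demo_root/bad" >/dev/null
bad_failures=$FAILURES
printf 'good=%s bad=%s\n' "$good_failures" "$bad_failures"
```

The `good` directory reports no failures even though it holds a truncated file under `.dce-temp`, while the truncated file directly under `bad` is still counted.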
@@ -21,6 +21,9 @@ JSON

 printf '{"messages":[\n' >"$ARCHIVE_ROOT/bad/truncated [222].json"

+mkdir -p "$ARCHIVE_ROOT/good/.dce-temp/export.111.PARTIAL"
+printf '{"messages":[\n' >"$ARCHIVE_ROOT/good/.dce-temp/export.111.PARTIAL/export.json"
+
 cat >"$CONFIG_PATH" <<JSON
 {
 "archive_root": "$ARCHIVE_ROOT",

@@ -28,7 +28,7 @@ require_command() {

 count_archive_json() {
 local output_dir=$1
-find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' 2>/dev/null | wc -l | tr -d ' '
+find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' 2>/dev/null | wc -l | tr -d ' '
 }

 count_seeded_channel_ids() {
@@ -42,7 +42,7 @@ count_seeded_channel_ids() {
 if [[ "$file_name" =~ \[([0-9]{16,22})\]\.json$ ]]; then
 printf '%s\n' "${BASH_REMATCH[1]}"
 fi
-done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' -print0 2>/dev/null) | sort -u | wc -l | tr -d ' '
+done < <(find "$output_dir" -type f -name '*.json' ! -path '*/.dce-meta/*' ! -path '*/.dce-temp/*' -print0 2>/dev/null) | sort -u | wc -l | tr -d ' '
 }

 count_channel_map_entries() {
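The channel-ID match in the last hunk can be exercised on its own. The following is a small sketch that reuses the same regular expression. The `channel_id_from_name` wrapper and the sample file names are hypothetical and are not part of the commit.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: print the bracketed channel ID from an archive file
# name, or print nothing when the name carries no 16-22 digit ID.
channel_id_from_name() {
  local file_name=$1
  if [[ "$file_name" =~ \[([0-9]{16,22})\]\.json$ ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

channel_id_from_name 'general [123456789012345678].json'
channel_id_from_name 'truncated [222].json'   # ID too short, prints nothing
channel_id_from_name 'export.json'            # no bracketed ID, prints nothing
```

A temp file named `export.json` would not match this pattern anyway, so for `count_seeded_channel_ids` the new `! -path '*/.dce-temp/*'` predicate appears to serve mainly to keep all three `find` calls consistent.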