mirror of
https://github.com/Tyrrrz/DiscordChatExporter.git
synced 2026-06-09 15:52:37 -06:00
fix(scrape): salvage stale temp exports before re-downloading
When a previous export crashes (OOM, abort, kill), the partially downloaded temp export under .dce-temp/ was orphaned. Subsequent runs started the incremental from the archive's last message ID, re-downloading everything the failed run had already fetched.

Now scrape_target() checks for orphaned temp exports before each channel export, salvages truncated JSON (same marker-based repair as salvage-truncated-export.sh), merges recovered messages into the archive, and cleans up stale temp dirs. The incremental then starts from the truly latest message.

Adds salvage-stale smoke test with truncated fixture.
parent 87284816d0
commit c13c4167be

@ -0,0 +1,91 @@
---
title: "fix: Salvage stale temp exports before re-downloading"
type: fix
status: active
date: 2026-06-03
origin: /lfg - yes_general re-downloads 514 MB because prior aborted run's temp data is never recovered
---

# fix: Salvage stale temp exports before re-downloading

## Problem

`scrape_target()` always creates a fresh temp directory and starts `export_channel_incremental` from `last_message_id(archive)`. When a previous run crashes (OOM, abort, kill), the partially downloaded temp export is left orphaned under `.dce-temp/export.<channel_id>.*` and is never cleaned up or reused.

For `yes_general` (channel `221726893064454144`):

- Archive last message: **2021-01-17** (`800354246440648745`), 264K messages, 312 MB
- Stale temp export from May 29: **514 MB** of truncated JSON (messages 2021→mid-2026)
- Every re-run downloads all of those messages again from scratch

The `salvage-truncated-export.sh` script already exists but is never called automatically.
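
The pairing of the archive's last message ID with its date can be checked directly, because Discord IDs are snowflakes that embed a creation timestamp. The following is a minimal sketch, assuming the standard snowflake layout (milliseconds since 2015-01-01 UTC stored above bit 22); the helper name `snowflake_to_datetime` is hypothetical and is not part of the scrape scripts.

```python
from datetime import datetime, timezone

# Discord epoch: 2015-01-01T00:00:00Z, expressed in Unix milliseconds.
DISCORD_EPOCH_MS = 1420070400000


def snowflake_to_datetime(snowflake: str) -> datetime:
    """Decode the creation time embedded in a Discord snowflake ID.

    Hypothetical helper for illustration only.
    """
    ms = (int(snowflake) >> 22) + DISCORD_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# The archive's last message ID quoted in the Problem section.
last_archived = snowflake_to_datetime("800354246440648745")
print(last_archived.date().isoformat())  # 2021-01-17
```

Everything newer than that instant is what an incremental starting from this ID has to fetch, which is why losing the stale temp export is so costly for this channel.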

## Requirements

| ID | Requirement |
|----|-------------|
| R1 | Before exporting a channel, `scrape_target` checks for orphaned temp dirs matching `.dce-temp/export.<channel_id>.*` |
| R2 | If an orphaned temp export contains truncated JSON, salvage it to valid JSON using the same logic as `salvage-truncated-export.sh` |
| R3 | If salvage succeeds, merge the recovered messages into the archive (same `merge_exports` + `commit_merged_export` path) |
| R4 | Clean up stale temp dirs after salvage (success or failure) |
| R5 | After salvage-merge, `last_message_id` returns the advanced ID so the incremental only fetches truly new messages |
| R6 | If salvage fails (no safe truncation point can be found), delete the stale temp and proceed normally with a full incremental |
| R7 | The existing 19 smoke tests and the new salvage smoke test pass |
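
R1 amounts to a directory scan scoped to one channel. The sketch below is an illustration in Python rather than the shell used by the scripts; `find_stale_exports` is a hypothetical name, and only the `.dce-temp/export.<channel_id>.*` pattern comes from the requirement itself.

```python
import tempfile
from pathlib import Path
from typing import List


def find_stale_exports(output_dir: Path, channel_id: str) -> List[Path]:
    """Return orphaned temp dirs for one channel, sorted by name.

    Hypothetical helper; mirrors the `.dce-temp/export.<channel_id>.*`
    pattern from R1.
    """
    temp_root = output_dir / ".dce-temp"
    if not temp_root.is_dir():
        return []
    # The dot after the channel ID keeps "11" from matching "111".
    pattern = f"export.{channel_id}.*"
    return sorted(p for p in temp_root.glob(pattern) if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".dce-temp" / "export.111.STALE").mkdir(parents=True)
    (root / ".dce-temp" / "export.134.OTHER").mkdir(parents=True)
    print([p.name for p in find_stale_exports(root, "111")])  # ['export.111.STALE']
```

Because the scan runs before the new temp directory for the current run is created, anything it finds for the channel can be treated as left over from an earlier run.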

## Files

- `scripts/run-discord-scrape.sh`: add `salvage_stale_temp_exports()`, called at the top of the per-channel loop in `scrape_target()`
- `scripts/tests/run-discord-scrape-smoke.sh`: add `salvage-stale` smoke: seed a truncated temp export, run scrape, verify messages are merged and `--after` advances

## Implementation

### `salvage_stale_temp_exports()`

```
salvage_stale_temp_exports(output_dir, channel_id):
  glob = output_dir/.dce-temp/export.<channel_id>.*/export.json
  for each stale_export matching glob:
    if jq empty succeeds → already valid JSON
    else → run inline python salvage (same as salvage-truncated-export.sh)
      if salvage fails → rm -rf stale_dir, continue

    validate channel identity
    if archive exists:
      merge_exports(archive, stale_export, temp_merged)
      commit_merged_export(archive, temp_merged)
      log "SALVAGED" with message counts
    else:
      mv stale_export → archive destination
    rm -rf stale_dir
```

The function is called in `scrape_target()` before line 963 (`after_id=$(last_message_id ...)`), so the salvaged data is already in the archive when `--after` is computed.
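
The marker-based repair can be sketched on its own. This is an illustration under stated assumptions, not the script's implementation: it assumes the export is pretty-printed with 2-space indentation, so every top-level message object starts on its own line at 4 spaces, and `salvage_truncated` is a hypothetical name. It cuts the file after the last complete message, closes the array and object, and then corrects `messageCount`.

```python
import json
from typing import Optional

# Assumption: 2-space pretty-printing, so top-level messages start at 4 spaces.
MARKER = b"},\n    {"
SUFFIX = b'\n  ],\n  "messageCount": 0\n}'


def salvage_truncated(data: bytes) -> Optional[bytes]:
    """Cut a truncated export after its last complete message and close it.

    Hypothetical helper. Returns None when no safe truncation point exists.
    """
    try:
        json.loads(data)
        return data  # already valid JSON, nothing to repair
    except ValueError:
        pass
    idx = data.rfind(MARKER)
    if idx < 0:
        return None
    try:
        repaired = json.loads(data[: idx + 1] + SUFFIX)
    except ValueError:
        return None
    repaired["messageCount"] = len(repaired["messages"])
    return json.dumps(repaired, indent=2).encode()


truncated = (
    b'{\n  "channel": {\n    "id": "111"\n  },\n  "messages": [\n'
    b'    {\n      "id": "3"\n    },\n'
    b'    {\n      "id": "4",\n      "content": "trun'
)
fixed = salvage_truncated(truncated)
print(json.loads(fixed)["messageCount"])  # 1
```

Matching on the exact indentation is what keeps the cut at a top-level message boundary: objects nested deeper inside a message sit at a larger indent and do not match the marker.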

### Smoke test

1. Seed archive with 2 messages (existing fixture)
2. Create a fake stale `.dce-temp/export.<channel_id>.STALE/export.json` with truncated JSON containing message id "3"
3. Run scrape in append mode
4. Verify archive has 3+ messages (salvaged + incremental)
5. Verify stale temp dir is cleaned up
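
The effect the smoke test checks for (two archived messages plus salvaged message "3" yielding an archive of three, with the resume point advanced) can be modeled in a few lines. This is a sketch under assumptions: `merge_messages` and `last_id` are hypothetical stand-ins for what `merge_exports` and `last_message_id` are expected to achieve, and the real functions may differ in how they order messages or resolve duplicates.

```python
from typing import Optional


def merge_messages(archive: dict, recovered: dict) -> dict:
    """Union two exports' messages by ID, ordered by numeric ID.

    Hypothetical stand-in for the merge step, not the script's code.
    """
    by_id = {m["id"]: m for m in archive["messages"]}
    for message in recovered["messages"]:
        by_id.setdefault(message["id"], message)  # archive copy wins on conflict
    merged = dict(archive)
    merged["messages"] = sorted(by_id.values(), key=lambda m: int(m["id"]))
    merged["messageCount"] = len(merged["messages"])
    return merged


def last_id(export: dict) -> Optional[str]:
    """Return the highest message ID, which would feed `--after`."""
    ids = [m["id"] for m in export.get("messages", [])]
    # Compare as integers: as strings, "9" would sort above "10".
    return max(ids, key=int) if ids else None


archive = {"messages": [{"id": "1"}, {"id": "2"}]}
salvaged = {"messages": [{"id": "2"}, {"id": "3"}]}
merged = merge_messages(archive, salvaged)
print(merged["messageCount"], last_id(merged))  # 3 3
```

Keying on the message ID makes the merge idempotent, so salvaging several stale temp exports for the same channel one after another cannot duplicate messages.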

## Test scenarios

| Scenario | Expected |
|----------|----------|
| Stale temp with truncated JSON → salvageable | Messages merged, temp cleaned, `--after` advances |
| Stale temp with unsalvageable data (too short) | Temp deleted, normal incremental proceeds |
| Stale temp with valid JSON (complete export) | Merged directly, temp cleaned |
| No stale temps | Normal behavior, no change |
| Multiple stale temps for same channel | All salvaged in order, then normal incremental |

## Verification

```bash
./scripts/tests/run-discord-scrape-smoke.sh
DCE_MIN_FREE_MB=0 ./scripts/run-all-smokes.sh
```

## Out of scope

- Configurable skip-vs-retry for OOM channels (separate concern)
- Increasing container memory limits

@ -549,6 +549,96 @@ message_count() {
  jq -r '(.messages | length) // 0' "$export_path"
}

salvage_truncated_json() {
  local export_path=$1
  if jq empty "$export_path" >/dev/null 2>&1; then
    return 0
  fi
  command -v python3 >/dev/null 2>&1 || return 1
  python3 - "$export_path" <<'PY' || return 1
import sys
from pathlib import Path

path = Path(sys.argv[1])
data = path.read_bytes()
marker = b"},\n    {"
idx = data.rfind(marker)
if idx < 0:
    sys.exit(1)

truncated = data[: idx + 1]
suffix = b'\n  ],\n  "messageCount": 0\n}'
path.write_bytes(truncated + suffix)
PY
  jq empty "$export_path" >/dev/null 2>&1 || return 1
  local temp_file
  temp_file=$(mktemp "${TMPDIR:-/tmp}/dce-salvage-fix.XXXXXX.json")
  if jq '.messageCount = (.messages | length)' "$export_path" >"$temp_file" 2>/dev/null; then
    mv -f "$temp_file" "$export_path"
  else
    rm -f "$temp_file"
  fi
}

salvage_stale_temp_exports() {
  local output_dir=$1
  local channel_id=$2
  local destination_path=$3

  local stale_dirs stale_dir stale_export salvage_merged
  mapfile -t stale_dirs < <(
    find "$output_dir/.dce-temp" -maxdepth 1 -type d -name "export.${channel_id}.*" 2>/dev/null || true
  )

  (( ${#stale_dirs[@]} > 0 )) || return 0

  for stale_dir in "${stale_dirs[@]}"; do
    stale_export="$stale_dir/export.json"
    [[ -f "$stale_export" ]] || { rm -rf "$stale_dir"; continue; }
    [[ -s "$stale_export" ]] || { rm -rf "$stale_dir"; continue; }

    if ! salvage_truncated_json "$stale_export"; then
      log " Stale temp export unsalvageable, discarding: $stale_dir"
      rm -rf "$stale_dir"
      continue
    fi

    local stale_channel_id
    stale_channel_id=$(channel_id_from_export "$stale_export" 2>/dev/null) || true
    if [[ -n "$stale_channel_id" && "$stale_channel_id" != "$channel_id" ]]; then
      log " Stale temp export wrong channel ($stale_channel_id != $channel_id), discarding: $stale_dir"
      rm -rf "$stale_dir"
      continue
    fi

    local salvage_count
    salvage_count=$(message_count "$stale_export")
    if (( salvage_count == 0 )); then
      rm -rf "$stale_dir"
      continue
    fi

    if [[ -n "$destination_path" && -f "$destination_path" ]]; then
      salvage_merged="$stale_dir/merged.json"
      if merge_exports "$destination_path" "$stale_export" "$salvage_merged" && [[ -s "$salvage_merged" ]]; then
        if jq empty "$salvage_merged" >/dev/null 2>&1; then
          local before_count after_count
          before_count=$(message_count "$destination_path")
          commit_merged_export "$destination_path" "$salvage_merged"
          after_count=$(message_count "$destination_path")
          log " SALVAGED $destination_path (+$((after_count - before_count)) messages from stale temp, $before_count → $after_count)"
        fi
      fi
    elif [[ -n "$destination_path" ]]; then
      mkdir -p "$(dirname "$destination_path")"
      cp "$stale_export" "$destination_path"
      log " SALVAGED $destination_path (${salvage_count} messages from stale temp, new archive)"
    fi

    rm -rf "$stale_dir"
  done
}

is_skippable_channel_export_failure() {
  local log_file=$1
  grep -qiE \

@ -960,8 +1050,13 @@ scrape_target() {
       guild_label=$(guild_label_from_export "$destination_path")
     fi
 
-    after_id=$(last_message_id "$destination_path")
     mkdir -p "$output_dir/.dce-temp"
+    salvage_stale_temp_exports "$output_dir" "$channel_id" "$destination_path"
+
+    if [[ -n "$destination_path" && -f "$destination_path" ]]; then
+      before_count=$(message_count "$destination_path")
+    fi
+    after_id=$(last_message_id "$destination_path")
     temp_dir=$(mktemp -d "$output_dir/.dce-temp/export.${channel_id}.XXXXXX")
     temp_export="$temp_dir/export.json"
     temp_merged="$temp_dir/merged.json"

@ -134,6 +134,14 @@ cat >"$CONFIG_PATH" <<JSON
      "channel_ids": ["111", "134"],
      "guild_ids": [],
      "guild_name_patterns": []
    },
    {
      "name": "salvage-stale",
      "kind": "guild",
      "output_dir": "$ARCHIVE_ROOT/salvage-stale",
      "channel_ids": ["111"],
      "guild_ids": [],
      "guild_name_patterns": []
    }
  ]
}

@ -379,6 +387,22 @@ SKIP_ABORT_DEST="$ARCHIVE_ROOT/skip-abort/$DEFAULT_FILE_NAME"
[[ ! -e "$ARCHIVE_ROOT/skip-abort/channels/134.json" ]] || { echo "unexpected fallback file for skipped abort channel" >&2; exit 1; }
grep -q 'SKIPPED.*134' "$SKIP_ABORT_LOG" || { echo "expected SKIPPED line for abort channel 134" >&2; exit 1; }

# Salvage stale temp export smoke
mkdir -p "$ARCHIVE_ROOT/salvage-stale"
cp "$FIXTURE_DIR/append-existing.json" "$ARCHIVE_ROOT/salvage-stale/$DEFAULT_FILE_NAME"
mkdir -p "$ARCHIVE_ROOT/salvage-stale/.dce-meta"
printf '{\"111\":\"%s\"}\n' "$ARCHIVE_ROOT/salvage-stale/$DEFAULT_FILE_NAME" >"$ARCHIVE_ROOT/salvage-stale/.dce-meta/channel-map.json"
mkdir -p "$ARCHIVE_ROOT/salvage-stale/.dce-temp/export.111.STALE"
cp "$FIXTURE_DIR/salvage-truncated.json" "$ARCHIVE_ROOT/salvage-stale/.dce-temp/export.111.STALE/export.json"
SALVAGE_LOG="$TMP_DIR/salvage-stale.log"
run_wrapper salvage-stale append 2>"$SALVAGE_LOG"
SALVAGE_DEST="$ARCHIVE_ROOT/salvage-stale/$DEFAULT_FILE_NAME"
SALVAGE_COUNT=$(jq -r '.messages | length' "$SALVAGE_DEST")
(( SALVAGE_COUNT >= 3 )) || { echo "expected salvage-stale archive to have at least 3 messages (got $SALVAGE_COUNT)" >&2; exit 1; }
jq -e '.messages[] | select(.id == "3")' "$SALVAGE_DEST" >/dev/null || { echo "expected salvaged message id 3 in archive" >&2; exit 1; }
[[ ! -d "$ARCHIVE_ROOT/salvage-stale/.dce-temp/export.111.STALE" ]] || { echo "expected stale temp dir cleaned up after salvage" >&2; exit 1; }
grep -q 'SALVAGED' "$SALVAGE_LOG" || { echo "expected SALVAGED line in salvage log" >&2; exit 1; }

# shellcheck disable=SC1091
source "$REPO_ROOT/scripts/run-discord-scrape.sh"
SHRINK_EXISTING="$TMP_DIR/shrink-existing.json"

20 scripts/tests/test-fixtures/salvage-truncated.json Normal file
@ -0,0 +1,20 @@
{
  "guild": {
    "id": "222",
    "name": "Fixture Guild"
  },
  "channel": {
    "id": "111",
    "name": "fixture-room",
    "category": "Testing Grounds"
  },
  "messages": [
    {
      "id": "3",
      "timestamp": "2026-01-03T00:00:00Z",
      "content": "third"
    },
    {
      "id": "4",
      "timestamp": "2026-01-04T00:00:00Z",
      "content": "fourth - this message is trun