Commit graph

12 commits

Author SHA1 Message Date
Copilot ee62078f5b fix(scrape): skip SIGTERM/SIGINT export aborts like OOM
Stopping validation with kill/Ctrl+C returned exit 143/130 and failed
the whole target instead of SKIPPED + preserve partial. Added smoke for
exit 143; gitignore .dce-scrape.lock.
2026-06-03 06:06:15 -05:00
Copilot 8b54b6a498 test(scrape): preserve-partial smoke; fix host token-file precedence
Add offline regression for OOM skip preserving partial export temps.
Host wrapper now prefers DISCORD_TOKEN_FILE over inherited shell tokens
and always writes explicit compose env for auth-retry. All 19 smokes pass.
2026-06-03 05:52:39 -05:00
Copilot c13c4167be fix(scrape): salvage stale temp exports before re-downloading
When a previous export crashes (OOM, abort, kill), the partially-
downloaded temp export under .dce-temp/ was orphaned. Subsequent
runs started the incremental from the archive's last message ID,
re-downloading everything the failed run had already fetched.

Now scrape_target() checks for orphaned temp exports before each
channel export, salvages truncated JSON (same marker-based repair
as salvage-truncated-export.sh), merges recovered messages into
the archive, and cleans up stale temp dirs. The incremental then
starts from the truly latest message.

Adds salvage-stale smoke test with truncated fixture.
2026-06-03 01:11:28 -05:00
Copilot 87284816d0 test(scrape): add abort exit 134 skip smoke; plan 041 closure
Extend run-discord-scrape-smoke with skip-abort target so OOM/abort
channel skip from plan 040 has offline regression coverage. Update
merge-readiness for 2026-05-30 and KotOR validation retry in progress.
2026-06-03 00:57:11 -05:00
Copilot 71a443267e feat(scrape): run plan, channel ledger, and all-target proof
Log scrape plan/summary with per-file message deltas in the core script.
Host wrappers and operator entrypoints print target lists; operator-proof
defaults to all enabled targets when --target is omitted.
2026-05-29 20:34:22 -05:00
Boden 1e35761dbb test(scrape): lock mixed-length snowflake cursor selection
Add cursor-mixed-length smoke where string max_by would pick the wrong
--after value; padded sort_by in last_message_id already picks the max.
2026-05-29 16:33:00 -05:00
Boden 57d472f8e8 fix(scrape): auth discovery, skip forbidden channels, mount host script
Discover Discord tokens from env, token files, GUI Settings.dat, and desktop
leveldb; bind-mount the host scrape script so container preflight uses
partition/--after cursors; skip inaccessible channels without aborting targets;
fix set -e and busybox mktemp for incremental exports under ~/Documents.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 14:57:16 -05:00
Boden 8c14dbbf45 fix(scrape): append safely under Documents with flexible auth
Bootstrap channel-map entries from existing archive filenames, reject merges
that would shrink large JSON exports, accept exported DISCORD_TOKEN when
scrape.env is missing, and disable the duplicate OpenKotOR target folder.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 00:58:03 -05:00
Boden df499568d9 fix: harden recurring scrape scripts from review residuals
Use max message ID for incremental exports, validate custom cron
expressions, drop eval from host/preflight paths, restrict reauth to
executable repo scripts, and run smoke tests in CI.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 00:08:22 -05:00
Boden d66b9dab63 feat(validation): comprehensive recurring scraper validation suite and documentation
IMPLEMENTATION UNITS (U1-U6):

U1: Append-only merge test coverage
- Enhanced run-discord-scrape-smoke.sh with additional test scenarios
- Created append-partial-write.json and append-concurrent-conflict.json fixtures
- Added assertions for message sorting, deduplication, and idempotency
- All 10 merge scenarios validated

U2: Error handling validation
- Created error-path-smoke.sh with 6 error scenario tests
- Added test configs for invalid paths, missing files, bad JSON
- Verified fail-closed behavior on all error paths
- No silent data loss on any failure

U3: Cron idempotency and lifecycle
- Created cron-idempotency-smoke.sh with full lifecycle testing
- Created fixture crontab with unrelated entries (preservation test)
- Verified idempotent install, update, and remove operations
- Confirmed dry-run and entry preservation

U4: Preflight and end-to-end setup
- Created end-to-end-preflight-smoke.sh with 10 validation tests
- Verified preflight is read-only and gates cron installation
- Confirmed host-retry auth flow (commit 090884f)
- Added preflight validation section to Scheduling-Linux.md

U5: Documentation completion
- Updated Readme.md with recurring-scraper link
- Created Recurring-Scrape-Setup.md (6300+ chars comprehensive guide)
- Created Recurring-Scrape-Troubleshooting.md (9200+ chars with 30+ scenarios)
- Enhanced .docs/Scheduling-Linux.md with preflight section
- All documented behavior matches implementation

U6: Production-readiness checklist
- Created docs/recurring-scrape-production-checklist.md
- Compiled all validation results (33+ scenarios across U1-U5)
- Documented test execution commands for re-validation
- Provided deployment notes and monitoring guidance
- Clear sign-off criteria established

ARTIFACTS:
- 4 new smoke test scripts (1000+ lines total)
- 4 new fixtures and test configs
- 3 new documentation files (15500+ chars)
- 2 updated documentation files
- 1 validation checklist tracking document
- All tests passing

SAFETY GUARANTEES VERIFIED:
 No silent data loss on any error path
 Fail-closed behavior throughout
 Archive updates are append-only and idempotent
 Cron installation is idempotent
 Unrelated cron entries preserved
 Preflight is read-only
 Token validated before operations
 Path traversal prevented

STATUS: Production Ready
All 6 implementation units complete and validated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-27 12:57:32 -05:00
Your Name 07151924cf fix(review): apply autofix feedback
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 20:30:37 -05:00
Your Name 43f5fa3b71 Add recurring CLI scrape automation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 17:04:07 -05:00