DiscordChatExporter/docs/plans/2026-05-27-004-fix-recurring-scrape-residual-hardening-plan.md
Boden df499568d9 fix: harden recurring scrape scripts from review residuals
Use max message ID for incremental exports, validate custom cron
expressions, drop eval from host/preflight paths, restrict reauth to
executable repo scripts, and run smoke tests in CI.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 00:08:22 -05:00

70 lines
2.8 KiB
Markdown

---
title: fix: Harden recurring scrape scripts from code review residuals
type: fix
status: completed
date: 2026-05-27
origin: PR #1538 residual review findings
---
# fix: Harden recurring scrape scripts from code review residuals
## Summary
Address actionable security and correctness findings deferred from the finalization validation pass on `feat/recurring-cli-scrape`. Scope is limited to script-layer hardening and CI wiring—no C# exporter changes.
## Problem Frame
The recurring scrape automation is validated by smoke tests but still has review-flagged risks: arbitrary command execution via `eval`/`bash -lc`, unvalidated custom cron expressions, incorrect incremental cursor selection, and smoke tests not running in CI.
## Requirements
| ID | Requirement | Files |
|----|-------------|-------|
| U1 | Incremental exports must use the highest message snowflake ID, not the last array element | `scripts/run-discord-scrape.sh`, smoke fixture + assertion |
| U2 | `--cron` expressions must be validated (five fields, safe charset) before crontab install | `scripts/setup-cron.sh`, smoke test |
| U3 | Host runner and setup preflight must invoke compose without `eval` | `scripts/run-discord-scrape-host.sh`, `scripts/setup-cron.sh` |
| U4 | `DCE_REAUTH_COMMAND` must be an executable script under the repo root (no `bash -lc` arbitrary strings) | `scripts/run-discord-scrape-host.sh` |
| U5 | Recurring scrape smoke suite runs in GitHub Actions on PR/push | `.github/workflows/main.yml` |
## Implementation Units
### U1 — max message ID cursor
- Change `last_message_id` to `max_by(.id)` with empty-array guard.
- Add fixture where max ID is not the last message; extend fake CLI to log/assert `--after` value.
### U2 — cron validation
- Add `validate_cron_expression` called when `CRON_EXPRESSION` is set.
- Reject invalid field counts and shell metacharacters.
### U3 — remove eval
- Refactor `compose_run_command` to populate a bash array via nameref; execute with `"${args[@]}"`.
- Refactor `build_target_args` / `run_preflight` similarly.
### U4 — reauth script allowlist
- Resolve and require absolute path under `$REPO_ROOT`, executable file.
- Execute directly instead of `bash -lc`.
### U5 — CI smoke job
- Add `recurring-scrape-smoke` job: install `jq`, run all `scripts/tests/*-smoke.sh` scripts.
## Test Scenarios
- Smoke: unordered archive still passes `--after` with max ID.
- Smoke: invalid `--cron` exits non-zero before crontab mutation.
- CI: new job passes on clean branch.
## Out of Scope
- Changing upstream PR approval for fork workflows.
- Replacing `%q` cron job line construction (values are operator-controlled paths).
## Success Criteria
- All existing smoke scripts pass locally.
- Residual P0/P1 script findings from review are resolved or explicitly superseded by this plan.