mirror of
https://github.com/Tyrrrz/DiscordChatExporter.git
synced 2026-06-10 00:02:37 -06:00
fix: harden recurring scrape scripts from review residuals
Use max message ID for incremental exports, validate custom cron expressions, drop eval from host/preflight paths, restrict reauth to executable repo scripts, and run smoke tests in CI. Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
parent
ebc153868f
commit
df499568d9
24
.github/workflows/main.yml
vendored
24
.github/workflows/main.yml
vendored
|
|
@ -50,6 +50,30 @@ jobs:
|
|||
if: ${{ failure() && steps.verify.outcome == 'failure' }}
|
||||
run: echo "::error title=Bad formatting::Formatting issues detected. Please build the solution locally to fix them."
|
||||
|
||||
recurring-scrape-smoke:
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 10
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
|
||||
|
||||
- name: Install jq
|
||||
run: sudo apt-get update && sudo apt-get install -y jq
|
||||
|
||||
- name: Run recurring scrape smoke tests
|
||||
run: |
|
||||
chmod +x scripts/*.sh scripts/tests/*.sh
|
||||
./scripts/tests/run-discord-scrape-smoke.sh
|
||||
./scripts/tests/error-path-smoke.sh
|
||||
./scripts/tests/cron-idempotency-smoke.sh
|
||||
./scripts/tests/end-to-end-preflight-smoke.sh
|
||||
./scripts/tests/setup-cron-smoke.sh
|
||||
./scripts/tests/run-discord-scrape-host-smoke.sh
|
||||
|
||||
test:
|
||||
# Tests need access to secrets, so we can't run them against PRs because of limited trust
|
||||
if: ${{ github.event_name != 'pull_request' }}
|
||||
|
|
|
|||
|
|
@ -0,0 +1,69 @@
|
|||
---
|
||||
title: fix: Harden recurring scrape scripts from code review residuals
|
||||
type: fix
|
||||
status: completed
|
||||
date: 2026-05-27
|
||||
origin: PR #1538 residual review findings
|
||||
---
|
||||
|
||||
# fix: Harden recurring scrape scripts from code review residuals
|
||||
|
||||
## Summary
|
||||
|
||||
Address actionable security and correctness findings deferred from the finalization validation pass on `feat/recurring-cli-scrape`. Scope is limited to script-layer hardening and CI wiring—no C# exporter changes.
|
||||
|
||||
## Problem Frame
|
||||
|
||||
The recurring scrape automation is validated by smoke tests but still has review-flagged risks: arbitrary command execution via `eval`/`bash -lc`, unvalidated custom cron expressions, incorrect incremental cursor selection, and smoke tests not running in CI.
|
||||
|
||||
## Requirements
|
||||
|
||||
| ID | Requirement | Files |
|
||||
|----|-------------|-------|
|
||||
| U1 | Incremental exports must use the highest message snowflake ID, not the last array element | `scripts/run-discord-scrape.sh`, smoke fixture + assertion |
|
||||
| U2 | `--cron` expressions must be validated (five fields, safe charset) before crontab install | `scripts/setup-cron.sh`, smoke test |
|
||||
| U3 | Host runner and setup preflight must invoke compose without `eval` | `scripts/run-discord-scrape-host.sh`, `scripts/setup-cron.sh` |
|
||||
| U4 | `DCE_REAUTH_COMMAND` must be an executable script under the repo root (no `bash -lc` arbitrary strings) | `scripts/run-discord-scrape-host.sh` |
|
||||
| U5 | Recurring scrape smoke suite runs in GitHub Actions on PR/push | `.github/workflows/main.yml` |
|
||||
|
||||
## Implementation Units
|
||||
|
||||
### U1 — max message ID cursor
|
||||
|
||||
- Change `last_message_id` to `max_by(.id)` with empty-array guard.
|
||||
- Add fixture where max ID is not the last message; extend fake CLI to log/assert `--after` value.
|
||||
|
||||
### U2 — cron validation
|
||||
|
||||
- Add `validate_cron_expression` called when `CRON_EXPRESSION` is set.
|
||||
- Reject invalid field counts and shell metacharacters.
|
||||
|
||||
### U3 — remove eval
|
||||
|
||||
- Refactor `compose_run_command` to populate a bash array via nameref; execute with `"${args[@]}"`.
|
||||
- Refactor `build_target_args` / `run_preflight` similarly.
|
||||
|
||||
### U4 — reauth script allowlist
|
||||
|
||||
- Resolve and require absolute path under `$REPO_ROOT`, executable file.
|
||||
- Execute directly instead of `bash -lc`.
|
||||
|
||||
### U5 — CI smoke job
|
||||
|
||||
- Add `recurring-scrape-smoke` job: install `jq`, run all `scripts/tests/*-smoke.sh` scripts.
|
||||
|
||||
## Test Scenarios
|
||||
|
||||
- Smoke: unordered archive still passes `--after` with max ID.
|
||||
- Smoke: invalid `--cron` exits non-zero before crontab mutation.
|
||||
- CI: new job passes on clean branch.
|
||||
|
||||
## Out of Scope
|
||||
|
||||
- Changing upstream PR approval for fork workflows.
|
||||
- Replacing `%q` cron job line construction (values are operator-controlled paths).
|
||||
|
||||
## Success Criteria
|
||||
|
||||
- All existing smoke scripts pass locally.
|
||||
- Residual P0/P1 script findings from review are resolved or explicitly superseded by this plan.
|
||||
|
|
@ -3,7 +3,8 @@ DISCORD_TOKEN=
|
|||
# Optional: file whose first line contains DISCORD_TOKEN. Useful for token rotation without editing this env file.
|
||||
DISCORD_TOKEN_FILE=
|
||||
# Optional (manual runs only): command to refresh Discord auth/session before one retry.
|
||||
# Example: DCE_REAUTH_COMMAND="agent-browser --headed open https://discord.com/channels/@me"
|
||||
# Optional absolute path to an executable reauth script under the repository root.
|
||||
# Example: DCE_REAUTH_COMMAND="/path/to/repo/scripts/reauth-discord.sh"
|
||||
DCE_REAUTH_COMMAND=
|
||||
TZ=UTC
|
||||
|
||||
|
|
|
|||
|
|
@ -29,7 +29,7 @@ Options:
|
|||
Environment:
|
||||
DISCORD_TOKEN Direct token value (highest precedence after refresh).
|
||||
DISCORD_TOKEN_FILE Optional path to a file containing the Discord token.
|
||||
DCE_REAUTH_COMMAND Optional command to run for interactive/manual reauth when auth fails.
|
||||
DCE_REAUTH_COMMAND Optional absolute path to an executable reauth script under the repo root.
|
||||
EOF
|
||||
}
|
||||
|
||||
|
|
@ -92,13 +92,14 @@ ensure_token_present() {
|
|||
[[ -n "${DISCORD_TOKEN:-}" ]] || die "DISCORD_TOKEN is not set. Set DISCORD_TOKEN or DISCORD_TOKEN_FILE in $ENV_FILE."
|
||||
}
|
||||
|
||||
compose_run_command() {
|
||||
local subcommand=$1
|
||||
shift
|
||||
local -a command_parts
|
||||
compose_run_args() {
|
||||
local -n _out=$1
|
||||
local subcommand=$2
|
||||
shift 2
|
||||
|
||||
_out=()
|
||||
if [[ -n "$COMPOSE_BIN" ]]; then
|
||||
command_parts=(
|
||||
_out=(
|
||||
"$COMPOSE_BIN"
|
||||
--env-file "$ENV_FILE"
|
||||
-f "$COMPOSE_FILE"
|
||||
|
|
@ -109,7 +110,7 @@ compose_run_command() {
|
|||
"$subcommand"
|
||||
)
|
||||
elif (( DOCKER_BIN_OVERRIDDEN == 0 )) && command -v docker-compose >/dev/null 2>&1; then
|
||||
command_parts=(
|
||||
_out=(
|
||||
docker-compose
|
||||
--env-file "$ENV_FILE"
|
||||
-f "$COMPOSE_FILE"
|
||||
|
|
@ -120,7 +121,7 @@ compose_run_command() {
|
|||
"$subcommand"
|
||||
)
|
||||
else
|
||||
command_parts=(
|
||||
_out=(
|
||||
"$DOCKER_BIN"
|
||||
compose
|
||||
--env-file "$ENV_FILE"
|
||||
|
|
@ -133,8 +134,26 @@ compose_run_command() {
|
|||
)
|
||||
fi
|
||||
|
||||
command_parts+=("$@")
|
||||
printf '%q ' "${command_parts[@]}"
|
||||
_out+=("$@")
|
||||
}
|
||||
|
||||
resolve_reauth_command() {
|
||||
local candidate=$1
|
||||
local resolved_dir resolved_path
|
||||
|
||||
[[ -n "$candidate" ]] || return 1
|
||||
[[ "$candidate" == /* ]] || die "DCE_REAUTH_COMMAND must be an absolute path to an executable script under the repository."
|
||||
|
||||
resolved_dir=$(cd "$(dirname "$candidate")" && pwd -P)
|
||||
resolved_path="$resolved_dir/$(basename "$candidate")"
|
||||
[[ -f "$resolved_path" ]] || die "DCE_REAUTH_COMMAND does not exist: $candidate"
|
||||
[[ -x "$resolved_path" ]] || die "DCE_REAUTH_COMMAND is not executable: $candidate"
|
||||
case "$resolved_path" in
|
||||
"$REPO_ROOT"/*) ;;
|
||||
*) die "DCE_REAUTH_COMMAND must be a script inside the repository root." ;;
|
||||
esac
|
||||
|
||||
printf '%s\n' "$resolved_path"
|
||||
}
|
||||
|
||||
is_discord_auth_failure() {
|
||||
|
|
@ -145,22 +164,26 @@ is_discord_auth_failure() {
|
|||
}
|
||||
|
||||
try_interactive_reauth() {
|
||||
local reauth_script
|
||||
|
||||
[[ -n "$REAUTH_COMMAND" ]] || return 1
|
||||
[[ -t 0 && -t 1 ]] || return 1
|
||||
reauth_script=$(resolve_reauth_command "$REAUTH_COMMAND")
|
||||
printf 'Auth failed; running DCE_REAUTH_COMMAND...\n' >&2
|
||||
bash -lc "$REAUTH_COMMAND"
|
||||
"$reauth_script"
|
||||
}
|
||||
|
||||
run_subcommand_with_retry() {
|
||||
local subcommand=$1
|
||||
shift
|
||||
local run_command output_file
|
||||
local -a run_args=()
|
||||
local output_file
|
||||
|
||||
ensure_token_present
|
||||
output_file=$(mktemp "${TMPDIR:-/tmp}/dce-host-run.XXXXXX.log")
|
||||
|
||||
run_command=$(compose_run_command "$subcommand" "$@")
|
||||
if eval "$run_command" >"$output_file" 2>&1; then
|
||||
compose_run_args run_args "$subcommand" "$@"
|
||||
if "${run_args[@]}" >"$output_file" 2>&1; then
|
||||
cat "$output_file"
|
||||
rm -f "$output_file"
|
||||
return 0
|
||||
|
|
@ -178,7 +201,7 @@ run_subcommand_with_retry() {
|
|||
try_interactive_reauth || true
|
||||
ensure_token_present
|
||||
|
||||
if eval "$run_command" >"$output_file" 2>&1; then
|
||||
if "${run_args[@]}" >"$output_file" 2>&1; then
|
||||
cat "$output_file"
|
||||
rm -f "$output_file"
|
||||
return 0
|
||||
|
|
|
|||
|
|
@ -331,7 +331,10 @@ last_message_id() {
|
|||
local export_path=$1
|
||||
|
||||
[[ -f "$export_path" ]] || return 0
|
||||
jq -r '(.messages | last | .id) // empty' "$export_path"
|
||||
jq -r '
|
||||
(.messages // [])
|
||||
| if length == 0 then empty else (max_by(.id) | .id) end
|
||||
' "$export_path"
|
||||
}
|
||||
|
||||
message_count() {
|
||||
|
|
|
|||
|
|
@ -97,25 +97,37 @@ strip_existing_job() {
|
|||
' <<<"$existing_crontab"
|
||||
}
|
||||
|
||||
build_target_args() {
|
||||
local -a command_parts
|
||||
validate_cron_expression() {
|
||||
local expr=$1
|
||||
local -a fields=()
|
||||
local field
|
||||
|
||||
read -r -a fields <<<"$expr"
|
||||
((${#fields[@]} == 5)) || die "--cron must contain exactly five fields (minute hour day month weekday)."
|
||||
|
||||
for field in "${fields[@]}"; do
|
||||
[[ -n "$field" ]] || die "Empty field in --cron expression."
|
||||
[[ "$field" =~ ^[0-9*,/-]+$ ]] || die "Invalid cron field '$field' in --cron expression."
|
||||
done
|
||||
}
|
||||
|
||||
append_target_args() {
|
||||
local -n _out=$1
|
||||
|
||||
local target
|
||||
for target in "${TARGETS[@]}"; do
|
||||
command_parts+=(--target "$target")
|
||||
_out+=(--target "$target")
|
||||
done
|
||||
|
||||
local guild_id
|
||||
for guild_id in "${GUILDS[@]}"; do
|
||||
command_parts+=(--guild "$guild_id")
|
||||
_out+=(--guild "$guild_id")
|
||||
done
|
||||
|
||||
local channel_id
|
||||
for channel_id in "${CHANNELS[@]}"; do
|
||||
command_parts+=(--channel "$channel_id")
|
||||
_out+=(--channel "$channel_id")
|
||||
done
|
||||
|
||||
printf '%q ' "${command_parts[@]}"
|
||||
}
|
||||
|
||||
ensure_target_directories() {
|
||||
|
|
@ -162,12 +174,17 @@ validate_targets() {
|
|||
}
|
||||
|
||||
run_preflight() {
|
||||
local preflight_command target_args
|
||||
local -a preflight_args=()
|
||||
|
||||
[[ -f "$ENV_FILE" ]] || die "Missing env file: $ENV_FILE"
|
||||
target_args=$(build_target_args)
|
||||
preflight_command="$(printf '%q ' "$HOST_RUNNER") --env-file $(printf '%q' "$ENV_FILE") --compose-file $(printf '%q' "$COMPOSE_FILE") preflight ${target_args}"
|
||||
eval "$preflight_command"
|
||||
preflight_args=(
|
||||
"$HOST_RUNNER"
|
||||
--env-file "$ENV_FILE"
|
||||
--compose-file "$COMPOSE_FILE"
|
||||
preflight
|
||||
)
|
||||
append_target_args preflight_args
|
||||
"${preflight_args[@]}"
|
||||
}
|
||||
|
||||
main() {
|
||||
|
|
@ -263,6 +280,7 @@ main() {
|
|||
|
||||
local cron_line
|
||||
if [[ -n "$CRON_EXPRESSION" ]]; then
|
||||
validate_cron_expression "$CRON_EXPRESSION"
|
||||
cron_line=$CRON_EXPRESSION
|
||||
else
|
||||
cron_line=$(cron_from_schedule "$INTERVAL" "$RUN_AT")
|
||||
|
|
@ -270,7 +288,8 @@ main() {
|
|||
|
||||
local begin_marker="# BEGIN ${JOB_NAME}"
|
||||
local end_marker="# END ${JOB_NAME}"
|
||||
local current_crontab cleaned_crontab scrape_command target_args job_line lock_prefix
|
||||
local current_crontab cleaned_crontab scrape_command job_line lock_prefix
|
||||
local -a scrape_args=()
|
||||
current_crontab=$("$CRONTAB_BIN" -l 2>/dev/null || true)
|
||||
cleaned_crontab=$(strip_existing_job "$current_crontab" "$begin_marker" "$end_marker")
|
||||
|
||||
|
|
@ -291,8 +310,14 @@ main() {
|
|||
run_preflight
|
||||
fi
|
||||
|
||||
target_args=$(build_target_args)
|
||||
scrape_command="$(printf '%q ' "$HOST_RUNNER") --env-file $(printf '%q' "$ENV_FILE") --compose-file $(printf '%q' "$COMPOSE_FILE") scrape ${target_args}"
|
||||
scrape_args=(
|
||||
"$HOST_RUNNER"
|
||||
--env-file "$ENV_FILE"
|
||||
--compose-file "$COMPOSE_FILE"
|
||||
scrape
|
||||
)
|
||||
append_target_args scrape_args
|
||||
scrape_command=$(printf '%q ' "${scrape_args[@]}")
|
||||
if command -v flock >/dev/null 2>&1; then
|
||||
lock_prefix=$(printf '%q ' "$(command -v flock)" "-n" "/tmp/${JOB_NAME}.lock")
|
||||
else
|
||||
|
|
|
|||
|
|
@ -94,6 +94,14 @@ cat >"$CONFIG_PATH" <<JSON
|
|||
"channel_ids": ["111"],
|
||||
"guild_ids": [],
|
||||
"guild_name_patterns": []
|
||||
},
|
||||
{
|
||||
"name": "cursor-max-id",
|
||||
"kind": "guild",
|
||||
"output_dir": "$ARCHIVE_ROOT/cursor-max-id",
|
||||
"channel_ids": ["111"],
|
||||
"guild_ids": [],
|
||||
"guild_name_patterns": []
|
||||
}
|
||||
]
|
||||
}
|
||||
|
|
@ -111,13 +119,22 @@ shift || true
|
|||
case "$subcommand" in
|
||||
export)
|
||||
output=""
|
||||
after=""
|
||||
while (($#)); do
|
||||
case "$1" in
|
||||
--output)
|
||||
output=$2
|
||||
shift 2
|
||||
;;
|
||||
--channel|--format|--after)
|
||||
--after)
|
||||
after=$2
|
||||
if [[ -n "${FAKE_DCE_EXPECT_AFTER:-}" && "$after" != "${FAKE_DCE_EXPECT_AFTER}" ]]; then
|
||||
echo "unexpected --after value: $after (expected ${FAKE_DCE_EXPECT_AFTER})" >&2
|
||||
exit 1
|
||||
fi
|
||||
shift 2
|
||||
;;
|
||||
--channel|--format)
|
||||
shift 2
|
||||
;;
|
||||
*)
|
||||
|
|
@ -129,6 +146,7 @@ case "$subcommand" in
|
|||
case "$mode" in
|
||||
initial) cp "$fixture_dir/append-existing.json" "$output" ;;
|
||||
append) cp "$fixture_dir/append-incremental.json" "$output" ;;
|
||||
append-after-high-id) cp "$fixture_dir/append-after-high-id.json" "$output" ;;
|
||||
partial-write) cp "$fixture_dir/append-partial-write.json" "$output" ;;
|
||||
concurrent-conflict) cp "$fixture_dir/append-concurrent-conflict.json" "$output" ;;
|
||||
wrong-channel) cp "$fixture_dir/wrong-channel.json" "$output" ;;
|
||||
|
|
@ -153,6 +171,7 @@ run_wrapper() {
|
|||
DCE_FALLBACK_CONFIG="$CONFIG_PATH" \
|
||||
FAKE_DCE_FIXTURE_DIR="$FIXTURE_DIR" \
|
||||
FAKE_DCE_MODE="$mode" \
|
||||
FAKE_DCE_EXPECT_AFTER="${FAKE_DCE_EXPECT_AFTER:-}" \
|
||||
"$REPO_ROOT/scripts/run-discord-scrape.sh" scrape --target "$target_name"
|
||||
}
|
||||
|
||||
|
|
@ -262,5 +281,11 @@ IDEMPOTENT_CHECKSUM_2=$(sha256sum "$IDEMPOTENT_DEST" | awk '{print $1}')
|
|||
[[ "$(jq -r '.channel.id' "$DEST")" == "111" ]] || { echo "expected channel id to be preserved after merge" >&2; exit 1; }
|
||||
[[ "$(jq -r '.messages[0] | has("id") and has("timestamp") and has("content")' "$DEST")" == "true" ]] || { echo "expected message structure to be complete after merge" >&2; exit 1; }
|
||||
|
||||
mkdir -p "$ARCHIVE_ROOT/cursor-max-id"
|
||||
cp "$FIXTURE_DIR/append-unordered-cursor.json" "$ARCHIVE_ROOT/cursor-max-id/$DEFAULT_FILE_NAME"
|
||||
FAKE_DCE_EXPECT_AFTER=999 run_wrapper cursor-max-id append-after-high-id
|
||||
CURSOR_DEST="$ARCHIVE_ROOT/cursor-max-id/$DEFAULT_FILE_NAME"
|
||||
[[ "$(jq -r '.messages | length' "$CURSOR_DEST")" == "4" ]] || { echo "expected cursor-max-id archive to contain four messages" >&2; exit 1; }
|
||||
|
||||
echo "U1: append-only merge test coverage passed"
|
||||
|
||||
|
|
|
|||
|
|
@ -129,4 +129,12 @@ cmp -s "$CRONTAB_FILE" "$TMP_DIR/crontab-before-preflight-fail.txt" || {
|
|||
exit 1
|
||||
}
|
||||
|
||||
if run_setup --cron "0 2 * * *; touch /tmp/evil" --skip-preflight 2>/dev/null; then
|
||||
echo "expected invalid --cron expression to fail validation" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
preview_custom_cron=$(run_setup --cron "15 03 * * 0" --skip-preflight --dry-run)
|
||||
grep -q '^15 03 \* \* 0 ' <<<"$preview_custom_cron" || { echo "expected validated custom cron in dry-run output" >&2; exit 1; }
|
||||
|
||||
echo "setup-cron smoke test passed"
|
||||
|
|
|
|||
23
scripts/tests/test-fixtures/append-after-high-id.json
Normal file
23
scripts/tests/test-fixtures/append-after-high-id.json
Normal file
|
|
@ -0,0 +1,23 @@
|
|||
{
|
||||
"guild": {
|
||||
"id": "222",
|
||||
"name": "Fixture Guild"
|
||||
},
|
||||
"channel": {
|
||||
"id": "111",
|
||||
"name": "fixture-room",
|
||||
"category": "Testing Grounds"
|
||||
},
|
||||
"messages": [
|
||||
{
|
||||
"id": "1000",
|
||||
"timestamp": "2026-01-03T00:00:00Z",
|
||||
"content": "after-high-id"
|
||||
}
|
||||
],
|
||||
"dateRange": {
|
||||
"after": "999",
|
||||
"before": null
|
||||
},
|
||||
"exportedAt": "2026-01-03T00:00:00Z"
|
||||
}
|
||||
33
scripts/tests/test-fixtures/append-unordered-cursor.json
Normal file
33
scripts/tests/test-fixtures/append-unordered-cursor.json
Normal file
|
|
@ -0,0 +1,33 @@
|
|||
{
|
||||
"guild": {
|
||||
"id": "222",
|
||||
"name": "Fixture Guild"
|
||||
},
|
||||
"channel": {
|
||||
"id": "111",
|
||||
"name": "fixture-room",
|
||||
"category": "Testing Grounds"
|
||||
},
|
||||
"messages": [
|
||||
{
|
||||
"id": "1",
|
||||
"timestamp": "2026-01-01T00:00:00Z",
|
||||
"content": "first"
|
||||
},
|
||||
{
|
||||
"id": "999",
|
||||
"timestamp": "2026-01-01T12:00:00Z",
|
||||
"content": "high-id-not-last"
|
||||
},
|
||||
{
|
||||
"id": "2",
|
||||
"timestamp": "2026-01-02T00:00:00Z",
|
||||
"content": "second"
|
||||
}
|
||||
],
|
||||
"dateRange": {
|
||||
"after": null,
|
||||
"before": null
|
||||
},
|
||||
"exportedAt": "2026-01-02T00:00:00Z"
|
||||
}
|
||||
Loading…
Reference in a new issue