DiscordChatExporter/.docs/Recurring-Scrape-Setup.md
Copilot 3d65c0e8e5 feat(scrape): cron opt-in salvage-before-scrape
setup-cron.sh forwards --salvage-before-scrape to documents scrape for
operators recovering from OOM partials on scheduled runs.
2026-06-03 11:35:50 -05:00

Recurring Discord Scrape Automation - Setup Guide

This guide walks you through setting up automated recurring Discord exports using the built-in wrapper scripts.

Prerequisites

  • Linux or macOS with bash and cron
  • Docker or Podman installed
  • A Discord bot token or user token with access to the channels you want to export
  • Read/write access to a directory for archive storage

Quick Start

Fastest path: ./scripts/operator-handoff.sh (disk + verify + archive dry-run), then ./scripts/bootstrap-recurring-scrape.sh without --dry-run when ready (see operator checklist).

One-target live proof: ./scripts/run-operator-proof.sh --sync-gui --target eod_discord (handoff → scrape → grow-only check; use --dry-run for handoff only).

Disk space: verify-operator-ready and scrape entrypoints fail below 1 GiB free by default (DCE_MIN_FREE_MB, default 1024). Large channel JSON merges need extra headroom.
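The disk-space gate described above can be approximated in a few lines of shell. This is a minimal sketch, not the repository's implementation: the function name has_min_free_space and the df-based measurement are assumptions; only DCE_MIN_FREE_MB and its default of 1024 come from this guide.

```shell
# Hypothetical helper: succeed only if DIR has at least DCE_MIN_FREE_MB MiB free.
# DCE_MIN_FREE_MB (default 1024) is taken from the guide; everything else is assumed.
has_min_free_space() {
  local dir="$1"
  local min_mb="${DCE_MIN_FREE_MB:-1024}"
  local free_kb
  # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
  # column 4 of the second line is the available space.
  free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }') || return 2
  [ -n "$free_kb" ] || return 2
  # Convert KiB to MiB and compare against the threshold.
  [ $(( free_kb / 1024 )) -ge "$min_mb" ]
}
```

For example, `has_min_free_space ~/Documents || echo "not enough free space"` would warn before a large merge; raising DCE_MIN_FREE_MB adds headroom.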

Append-only contract (read first)

  • Each target writes under its configured output_dir (for example ~/Documents/KotOR_discord_msgs/).
  • Existing files named Guild - Category - Channel [channel_id].json are discovered automatically and updated in place.
  • On the first run against an existing archive tree, the wrapper bootstraps output_dir/.dce-meta/channel-map.json from those filenames so it never creates a parallel export file.
  • Incremental exports use DiscordChatExporter --after with the highest existing message id, then merge new messages by id.
  • A merge that would reduce message count is rejected; the on-disk archive is left unchanged.
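The last two rules of the contract can be modelled with a simplified sketch. It operates on plain text files holding one numeric message id per line rather than on the real JSON exports, and every function name here is hypothetical; the actual wrapper may work quite differently.

```shell
# highest_id FILE: print the largest message id (the value an incremental
# export would pass to --after).
highest_id() {
  sort -n "$1" | tail -n 1
}

# merge_ids CURRENT NEW: print the union of both id lists, deduplicated by id.
merge_ids() {
  sort -n -u "$1" "$2"
}

# count_ids FILE: number of non-empty lines.
count_ids() {
  awk 'NF { n++ } END { print n + 0 }' "$1"
}

# commit_if_not_smaller CURRENT CANDIDATE: replace CURRENT with CANDIDATE only
# when the message count does not shrink; otherwise leave CURRENT untouched.
commit_if_not_smaller() {
  local current="$1" candidate="$2" before after
  before=$(count_ids "$current")
  after=$(count_ids "$candidate")
  if [ "$after" -lt "$before" ]; then
    echo "rejected: merge would reduce message count ($before -> $after)" >&2
    return 1
  fi
  mv "$candidate" "$current"
}
```

The guard compares counts before touching the archive, so a rejected merge leaves the existing file byte-for-byte unchanged.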

1. Configure Your Targets

Create or edit config/scrape-targets.json with your channel selections:

{
  "archive_root": "/home/user/discord-archives",
  "defaults": {
    "include_threads": "all",
    "include_voice_channels": false
  },
  "targets": [
    {
      "name": "my-servers",
      "kind": "guild",
      "output_dir": "/home/user/discord-archives/my-servers",
      "guild_ids": ["123456789"],
      "channel_ids": [],
      "guild_name_patterns": []
    }
  ]
}

Key fields:

  • archive_root: Parent directory for all exports (used for validation and path safety)
  • output_dir: Specific directory for each target (must be under archive_root)
  • guild_ids: Explicit Discord guild IDs (especially important for bot tokens)
  • channel_ids: Specific channels to export (leave empty to export all accessible)
  • guild_name_patterns: Regex patterns to match guild names (not used by bot tokens)
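The path-safety rule (output_dir must be under archive_root) can be illustrated with a purely lexical check. This is a sketch under stated assumptions: is_under_root is a hypothetical name, a directory equal to the root is treated as inside it, and symlinks or ".." segments are not resolved, so a real validator would need to canonicalize paths first.

```shell
# is_under_root ROOT DIR: succeed when DIR is ROOT itself or a path below it.
# Lexical comparison only; does not resolve symlinks or ".." segments.
is_under_root() {
  local root="${1%/}" dir="${2%/}"
  case "$dir/" in
    "$root"/*) return 0 ;;   # the quoted root is literal; only the trailing * is a glob
    *)         return 1 ;;
  esac
}
```

Appending a slash before matching keeps sibling directories such as discord-archives-old from being mistaken for children of discord-archives.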

2. Set Your Discord Token

Cron requires scrape.env. Exporting DISCORD_TOKEN in your shell works for one-off runs, but cron jobs run in a minimal environment that does not inherit your shell's variables, so scheduled runs need a persisted env file.


Either copy the environment template:

cp scrape.env.example scrape.env
# Edit scrape.env and set DISCORD_TOKEN=your-token-here
# OR set DISCORD_TOKEN_FILE=/path/to/token/file for automatic token rotation

Or export the token directly in your shell (the host wrapper accepts this when scrape.env is absent):

export DISCORD_TOKEN="your-token-here"
# optional: export DISCORD_TOKEN_FILE=/path/to/token/file

When no explicit token is set, the host wrapper runs scripts/discover-discord-token.sh, which tries (in order): DISCORD_TOKEN / DISCORD_TOKEN_FILE, optional ~/.config/discord-scrape/token, DiscordChatExporter GUI Settings.dat (via scripts/read-dce-gui-token.sh when DISCORDCHATEXPORTER_SETTINGS_PATH or a sibling Settings.dat next to the CLI binary is present), then Discord desktop leveldb token candidates (longest match wins).
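The first steps of that discovery order can be sketched as a simple fallback chain. This is not the contents of scripts/discover-discord-token.sh: the function name resolve_token and the TOKEN_FALLBACK_FILE override are hypothetical, the relative order of DISCORD_TOKEN and DISCORD_TOKEN_FILE is an assumption, and the Settings.dat and leveldb steps are omitted.

```shell
# resolve_token: print the first token found, or fail if none is available.
# Assumed order: DISCORD_TOKEN, then DISCORD_TOKEN_FILE, then the optional
# per-user token file.
resolve_token() {
  # TOKEN_FALLBACK_FILE is a hypothetical override used to make the sketch testable.
  local fallback="${TOKEN_FALLBACK_FILE:-${HOME:-}/.config/discord-scrape/token}"
  local file
  if [ -n "${DISCORD_TOKEN:-}" ]; then
    printf '%s\n' "$DISCORD_TOKEN"
    return 0
  fi
  for file in "${DISCORD_TOKEN_FILE:-}" "$fallback"; do
    if [ -n "$file" ] && [ -r "$file" ] && [ -s "$file" ]; then
      # Strip whitespace so a trailing newline in the file does not leak into the token.
      tr -d '[:space:]' < "$file"
      echo
      return 0
    fi
  done
  return 1
}
```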

To materialize scrape.env from exported credentials (mode 600, no manual editing):

export DISCORD_TOKEN="your-token-here"
./scripts/setup-scrape-auth.sh
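Writing a credentials file with mode 600 usually comes down to setting a restrictive umask before the file is created. The sketch below shows that pattern; write_scrape_env is a hypothetical function and is not a copy of setup-scrape-auth.sh, which may write additional variables.

```shell
# write_scrape_env TARGET: write DISCORD_TOKEN to TARGET, readable by the owner only.
write_scrape_env() {
  local target="$1"
  if [ -z "${DISCORD_TOKEN:-}" ]; then
    echo "DISCORD_TOKEN is not set" >&2
    return 1
  fi
  (
    umask 077   # files created in this subshell get mode 0600
    printf 'DISCORD_TOKEN=%s\n' "$DISCORD_TOKEN" > "$target"
  ) && chmod 600 "$target"   # also tighten a file that already existed
}
```

Running the write in a subshell keeps the umask change from leaking into the rest of the session.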

2b. Verify existing ~/Documents archives

Before the first incremental run, confirm each enabled target points at the correct on-disk server folder and already has seeded channel JSON exports (the scraper appends in place and bootstraps .dce-meta/channel-map.json from these files):

./scripts/verify-documents-archives.sh --config config/scrape-targets.json

Each enabled target should show a non-zero JSON count and SEEDED channel IDs under your configured output_dir values (see archive_root in config/scrape-targets.json).

One-command workflow (verify → preflight → incremental scrape):

export DISCORD_TOKEN="your-token"   # or place token in ~/.config/discord-scrape/token
./scripts/run-documents-scrape.sh
./scripts/run-documents-scrape.sh --target KotOR_discord_msgs   # single server
./scripts/run-documents-scrape.sh --dry-run                     # archives only, no Discord

After a scrape, prove archives only grew in place:

./scripts/prove-incremental-append.sh --target KotOR_discord_msgs

3. Run Preflight Validation

Before installing cron, validate your setup:

export DISCORD_TOKEN="your-token"
./scripts/run-discord-scrape-host.sh preflight --config config/scrape-targets.json

Or run inside the container workflow directly:

./scripts/run-discord-scrape.sh preflight --config config/scrape-targets.json

This will:

  • Check token validity
  • Verify all configured targets are accessible
  • Show which channels will be scraped
  • Confirm archive directories are writable
  • Make NO changes to archives or cron

4. Install the Cron Job

Once preflight passes, install the recurring export:

./scripts/setup-cron.sh --config config/scrape-targets.json

This creates a managed cron entry that runs monthly (default). The entry can be updated or removed later.

For post-OOM catch-up (for example, the KotOR yes_general channel), add --salvage-before-scrape so that each scheduled run merges stale .dce-temp exports before the incremental scrape:

./scripts/setup-cron.sh --config config/scrape-targets.json \
  --target KotOR_discord_msgs --channel 221726893064454144 --salvage-before-scrape

5. Verify Installation

Check that the cron job was installed:

crontab -l | grep discord-scrape
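A managed entry that can be updated without duplicating itself is commonly implemented by tagging the line with a marker, dropping any previously tagged line, and appending the new one. The sketch below models that on crontab text read from stdin; the marker string and the function name are hypothetical, and setup-cron.sh may manage its entry differently.

```shell
# upsert_managed_entry MARKER ENTRY: read a crontab on stdin, remove every line
# containing MARKER, then append ENTRY tagged with MARKER. Unrelated lines pass
# through untouched, so re-running is idempotent.
upsert_managed_entry() {
  local marker="$1" entry="$2"
  # grep -v exits 1 when nothing is printed (empty crontab); that is not an error here.
  grep -v -F -- "$marker" || true
  # The trailing "# ..." is handed to the shell with the command, which ignores it as a comment.
  printf '%s %s\n' "$entry" "$marker"
}
```

Under these assumptions, an installer could run `crontab -l 2>/dev/null | upsert_managed_entry '# discord-scrape (managed)' "$line" | crontab -` to replace the old schedule in one step.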

Customizing the Schedule

The default is monthly. Customize it with:

# Run every day at 2 AM
./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily" --at "2:00"

# Run every Sunday at noon (weekly uses Sunday; time is HH:MM only)
./scripts/setup-cron.sh --config config/scrape-targets.json --interval weekly --at "12:00"

# Custom cron expression (every 6 hours)
./scripts/setup-cron.sh --config config/scrape-targets.json --cron "0 */6 * * *"
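To show how an --interval/--at pair relates to a raw cron expression, here is a small translation sketch. The function cron_expr is hypothetical; weekly mapping to Sunday comes from the comment above, while running monthly jobs on day 1 and defaulting to 00:00 are assumptions about what the installer does.

```shell
# cron_expr INTERVAL [HH:MM]: print a five-field cron expression.
cron_expr() {
  local interval="$1"
  local at="${2:-00:00}"   # assumed default time
  local hour minute
  # Accept H:MM or HH:MM only.
  case "$at" in
    [0-9]:[0-5][0-9] | [0-2][0-9]:[0-5][0-9]) ;;
    *) echo "invalid time: $at" >&2; return 1 ;;
  esac
  hour="${at%%:*}"
  minute="${at##*:}"
  # Drop one leading zero so "02" becomes "2" and "00" becomes "0".
  hour="${hour#0}";     hour="${hour:-0}"
  minute="${minute#0}"; minute="${minute:-0}"
  case "$interval" in
    daily)   printf '%s %s * * *\n' "$minute" "$hour" ;;
    weekly)  printf '%s %s * * 0\n' "$minute" "$hour" ;;   # 0 = Sunday
    monthly) printf '%s %s 1 * *\n' "$minute" "$hour" ;;   # assumption: first day of the month
    *) echo "unknown interval: $interval" >&2; return 1 ;;
  esac
}
```

For anything these three intervals cannot express, such as every 6 hours, the --cron form remains the escape hatch.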

Token Rotation

If using DISCORD_TOKEN_FILE, the host wrapper can automatically reload your token on each run:

# Protect your token file
chmod 600 /path/to/token/file

# Configure in scrape.env
DISCORD_TOKEN_FILE=/path/to/token/file
# For rootless podman:
DCE_USERNS_MODE=keep-id

On each scheduled run, if the export fails with a 401 or 403 error, the wrapper:

  1. Reloads the token file
  2. Retries the export once
  3. Logs the result

This lets you rotate the token by updating the token file alone; no change to cron or scrape.env is needed.
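The reload-and-retry-once behaviour can be sketched as follows. This is an illustration rather than the wrapper's code: the function names are hypothetical, and the exit status 77 is an invented stand-in for however the real wrapper detects an HTTP 401 or 403 from the exporter.

```shell
# Hypothetical exit status the export command uses to signal HTTP 401/403.
AUTH_FAILURE_STATUS=77

# load_token: read DISCORD_TOKEN from DISCORD_TOKEN_FILE, stripping whitespace.
load_token() {
  DISCORD_TOKEN=$(tr -d '[:space:]' < "$DISCORD_TOKEN_FILE") && export DISCORD_TOKEN
}

# run_with_token_reload CMD [ARGS...]: run CMD; on an auth failure, re-read the
# token file and retry exactly once, then log and return the final status.
run_with_token_reload() {
  local status
  load_token || return 1
  "$@"
  status=$?
  if [ "$status" -eq "$AUTH_FAILURE_STATUS" ]; then
    echo "auth failure: reloading token file and retrying once" >&2
    load_token || return 1
    "$@"
    status=$?
  fi
  echo "export finished with status $status" >&2
  return "$status"
}
```

Limiting the retry to a single attempt avoids hammering the API with a token that is genuinely revoked.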

Archive Layout

After the first export, your archive directory will contain:

archive_root/
├── .dce-meta/
│   ├── channel-map.json        # Channel ID to file mappings
│   └── locks/                  # Per-target locks (during active runs)
├── my-servers/
│   ├── .dce-meta/
│   │   └── channel-map.json
│   ├── Guild Name - Category - Channel [123456].json
│   ├── Another Guild - General [789012].json
│   └── ...
└── ...

Existing exports are updated in place: new messages are appended and deduplicated by message ID. See the Append-only contract section at the top of this guide.
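Because the channel ID is embedded in each filename as a bracketed suffix, a channel map can in principle be bootstrapped from names alone. The sketch below extracts the ID with shell parameter expansion; channel_id_from_filename is a hypothetical helper, and the real bootstrap logic may parse names differently.

```shell
# channel_id_from_filename PATH: print the numeric id from a name shaped like
# "Guild - Category - Channel [123456].json"; fail for anything else.
channel_id_from_filename() {
  local name="${1##*/}"   # drop any leading directories
  local id
  case "$name" in
    *' ['*'].json') ;;    # must end in " [something].json"
    *) return 1 ;;
  esac
  id="${name##*\[}"       # keep the text after the last "["
  id="${id%\].json}"      # drop the trailing "].json"
  case "$id" in
    '' | *[!0-9]*) return 1 ;;   # ids are digits only
  esac
  printf '%s\n' "$id"
}
```

Matching on the last bracket pair means guild or channel names that themselves contain "[" do not confuse the lookup.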

Troubleshooting

For common issues and solutions, see Recurring-Scrape-Troubleshooting.md.

Advanced Configuration

Bot Tokens vs User Tokens

Bot tokens cannot enumerate guilds or DMs, so you must provide explicit IDs:

{
  "name": "bot-scraped",
  "kind": "guild",
  "guild_ids": ["123456789", "987654321"],
  "channel_ids": ["111222333"]
}

User tokens can auto-discover guilds and channels (leave both arrays empty), but automating a user account violates Discord's Terms of Service:

{
  "name": "user-scraped",
  "kind": "guild",
  "guild_ids": [],
  "channel_ids": []
}

Disabling Targets

Temporarily disable a target without removing it:

{
  "name": "disabled-target",
  "enabled": false,
  "kind": "guild",
  ...
}

SELinux and Rootless Podman

For SELinux:

# Label mounts for relabeling (already in docker-compose.yml)
DCE_MOUNT_OPTIONS=z

For rootless podman:

# Keep mounted dirs writable as your user
DCE_USERNS_MODE=keep-id
DCE_UID=$(id -u)
DCE_GID=$(id -g)

Managing Cron

View Current Schedule

crontab -l

Update Schedule

Re-run setup with new parameters; the old entry is replaced:

./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily"

Dry-run (Preview Changes)

./scripts/setup-cron.sh --config config/scrape-targets.json --dry-run

Remove Cron Entry

./scripts/setup-cron.sh --remove

Monitoring Exports

Check logs from your last run:

# Primary log file (default from setup-cron.sh)
tail -f logs/discord-scrape.log

# Machine-readable totals beside the cron log
./scripts/print-scrape-summary.sh logs/discord-scrape.summary.json

# Recent cron execution (system log)
sudo grep discord-scrape /var/log/syslog  # Debian/Ubuntu
sudo grep discord-scrape /var/log/cron    # CentOS/RHEL

# Container build/run issues
docker compose logs -f

After a scheduled run, confirm archives grew in place:

./scripts/prove-incremental-append.sh --target KotOR_discord_msgs

Performance Considerations

  • First export of a channel can be slow (API rate-limited)
  • Incremental updates are much faster (only new messages)
  • Large channels (100k+ messages) may take several minutes
  • Rate limiting: Discord's API enforces strict rate limits; repeated failures, especially HTTP 429 responses, may indicate you've hit them

Space requirements:

  • Typical channel: 1-10 MB per year of messages
  • Large channels: 50-100 MB per year
  • Full guild: 500 MB - several GB depending on activity
  • Multi-year catch-up in container: may OOM on first export; set DCE_CONTAINER_MEMORY=8g in scrape.env and use --salvage-before-scrape (see Troubleshooting)

Smoke test validation

Run the full offline suite from the repo root (requires jq). 23 offline smokes run by default; add --include-container for a 24th local-only check:

./scripts/run-all-smokes.sh

With Docker/Podman, include the container smoke:

./scripts/run-all-smokes.sh --include-container

Archive integrity helpers (not smokes; run against live ~/Documents trees):

./scripts/audit-archive-json.sh --target KotOR_discord_msgs
./scripts/salvage-truncated-export.sh path/to/export.json   # truncated JSON only
./scripts/prove-incremental-append.sh --target NAME        # live grow-only proof (needs token)

| Script | CI (recurring-scrape-smoke) | Notes |
| --- | --- | --- |
| archive-disk-space-smoke.sh | yes | Disk preflight / DCE_MIN_FREE_MB |
| audit-archive-json-smoke.sh | yes | Invalid JSON detection |
| bootstrap-recurring-scrape-smoke.sh | yes | Bootstrap dry-run |
| cron-idempotency-smoke.sh | yes | Cron installer idempotency |
| documents-scrape-smoke.sh | yes | Unified Documents workflow + lock gate |
| end-to-end-preflight-smoke.sh | yes | Preflight wiring |
| error-path-smoke.sh | yes | Failure paths |
| gh-approve-pr-runs-smoke.sh | yes | Fork PR workflow helper |
| operator-handoff-smoke.sh | yes | Operator handoff dry-run |
| print-scrape-summary-smoke.sh | yes | JSON summary pretty-print CLI |
| prove-incremental-append-smoke.sh | yes | Offline prove snapshot/compare |
| run-discord-scrape-host-lock-smoke.sh | yes | Archive-root scrape lock |
| run-discord-scrape-host-smoke.sh | yes | Host wrapper |
| run-discord-scrape-smoke.sh | yes | Append-only merge coverage |
| run-operator-proof-smoke.sh | yes | Scrape + prove dry-run |
| run-operator-validation-smoke.sh | yes | Validation runner dry-run |
| scrape-here-smoke.sh | yes | Workspace bridge launcher |
| scrape-lock-status-smoke.sh | yes | Lock status + stale reclaim |
| scrape-summary-json-smoke.sh | yes | Log marker extract + per-target path helper |
| setup-cron-smoke.sh | yes | Cron setup dry-run |
| sync-gui-bridge-doc-smoke.sh | yes | GUI bridge doc sync |
| verify-documents-auth-smoke.sh | yes | Archive verify + auth bootstrap |
| verify-operator-ready-smoke.sh | yes | Host prerequisite checks |
| container-smoke.sh | no (local) | Docker build + help / list-targets; use --include-container |

GitHub Actions runs ./scripts/run-all-smokes.sh via .github/workflows/main.yml job recurring-scrape-smoke.

Next Steps