mirror of https://github.com/Tyrrrz/DiscordChatExporter.git synced 2026-06-10 00:02:37 -06:00



Recurring Discord Scrape Automation - Setup Guide

This guide walks you through setting up automated recurring Discord exports using the built-in wrapper scripts.

Prerequisites

Linux or macOS with bash and cron
Docker or Podman installed
A Discord bot token or user token with access to the channels you want to export
Read/write access to a directory for archive storage

Quick Start

Fastest path: ./scripts/operator-handoff.sh (disk + verify + archive dry-run), then ./scripts/bootstrap-recurring-scrape.sh without --dry-run when ready (see operator checklist).

One-target live proof: ./scripts/run-operator-proof.sh --sync-gui --target eod_discord (handoff → scrape → grow-only check; use --dry-run for handoff only).

Disk space: verify-operator-ready and scrape entrypoints fail below 1 GiB free by default (DCE_MIN_FREE_MB, default 1024). Large channel JSON merges need extra headroom.
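The free-space gate described above can be approximated as follows. This is a minimal sketch, not the project's actual check: the function names free_mb and require_free_space are hypothetical, and only DCE_MIN_FREE_MB and its 1024 default come from this guide.

```shell
# Sketch of a free-space gate. free_mb and require_free_space are
# hypothetical helper names, not functions from the repo scripts.

# Print the free space, in MiB, of the filesystem that holds path $1.
free_mb() {
  # -P forces one POSIX-format line per filesystem, -k reports 1 KiB blocks;
  # column 4 of the second line is "Available".
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed only if path $1 has at least DCE_MIN_FREE_MB (default 1024) MiB free.
require_free_space() {
  min_mb="${DCE_MIN_FREE_MB:-1024}"
  have_mb="$(free_mb "$1")"
  if [ "$have_mb" -lt "$min_mb" ]; then
    echo "only ${have_mb} MiB free under $1; need ${min_mb} MiB" >&2
    return 1
  fi
}
```

Calling require_free_space on the archive directory before a scrape would stop the run early instead of failing halfway through a large JSON merge.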

Append-only contract (read first)

Each target writes under its configured output_dir (for example ~/Documents/KotOR_discord_msgs/).
Existing files named Guild - Category - Channel [channel_id].json are discovered automatically and updated in place.
On the first run against an existing archive tree, the wrapper bootstraps output_dir/.dce-meta/channel-map.json from those filenames so it never creates a parallel export file.
Incremental exports use DiscordChatExporter --after with the highest existing message id, then merge new messages by id.
A merge that would reduce message count is rejected; the on-disk archive is left unchanged.
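The rules above can be sketched in a few lines of shell. Treat this as an illustration under stated assumptions rather than the wrapper's code: all three function names are hypothetical, and the sketch assumes message ids are plain digit strings without leading zeros.

```shell
# Hypothetical helpers illustrating the append-only contract.
# None of these names come from the repo scripts.

# Print the channel id embedded in "Guild - Category - Channel [id].json".
# Fails if the name does not end in "[digits].json".
channel_id_from_filename() {
  id="$(basename "$1" | sed -n 's/.*\[\([0-9][0-9]*\)]\.json$/\1/p')"
  [ -n "$id" ] && printf '%s\n' "$id"
}

# Succeed if message id $1 is numerically greater than $2.
# Ids are compared as digit strings (length first, then byte order), so the
# result does not depend on the shell's integer width.
snowflake_gt() {
  [ "${#1}" -gt "${#2}" ] && return 0
  [ "${#1}" -lt "${#2}" ] && return 1
  # Same length: byte order equals numeric order for digit strings.
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | tail -n 1)" = "$1" ]
}

# Succeed only if going from $1 messages to $2 messages does not shrink
# the archive; a caller would skip the write when this fails.
merge_allowed() {
  [ "$2" -ge "$1" ]
}
```

For example, channel_id_from_filename "KotOR - general - yes_general [221726893064454144].json" prints 221726893064454144, and merge_allowed 1200 1180 fails, which is the case where the on-disk archive would be left unchanged.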

1. Configure Your Targets

Create or edit config/scrape-targets.json with your channel selections:

{
  "archive_root": "/home/user/discord-archives",
  "defaults": {
    "include_threads": "all",
    "include_voice_channels": false
  },
  "targets": [
    {
      "name": "my-servers",
      "kind": "guild",
      "output_dir": "/home/user/discord-archives/my-servers",
      "guild_ids": ["123456789"],
      "channel_ids": [],
      "guild_name_patterns": []
    }
  ]
}

Key fields:

archive_root: Parent directory for all exports (used for validation and path safety)
output_dir: Specific directory for each target (must be under archive_root)
guild_ids: Explicit Discord guild IDs (especially important for bot tokens)
channel_ids: Specific channels to export (leave empty to export all accessible)
guild_name_patterns: Regex patterns to match guild names (not used by bot tokens)
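The requirement that output_dir sits under archive_root can be checked with a purely lexical test like the one below. This is a sketch only: is_under_root is a hypothetical name, it does not resolve symlinks, and the repo's own validation may work differently.

```shell
# Hypothetical path-safety check; not taken from the repo scripts.
# Succeed if directory $2 is the same as, or nested under, root $1.
is_under_root() {
  root="${1%/}"   # drop one trailing slash so "/a/" and "/a" compare equal
  dir="${2%/}"
  case "/$dir/" in
    */../*) return 1 ;;   # refuse parent-directory traversal outright
  esac
  case "$dir/" in
    "$root"/*) return 0 ;;   # quoted, so glob characters in root are literal
    *) return 1 ;;
  esac
}
```

With the sample config above, is_under_root /home/user/discord-archives /home/user/discord-archives/my-servers succeeds, while a sibling such as /home/user/discord-archives-old is rejected because the comparison includes the trailing slash.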

2. Set Your Discord Token

Cron requires scrape.env. Manually exporting DISCORD_TOKEN works for one-off runs, but scheduled jobs run in a minimal environment and need a persisted env file.

Either copy the environment template:

cp scrape.env.example scrape.env
# Edit scrape.env and set DISCORD_TOKEN=your-token-here
# OR set DISCORD_TOKEN_FILE=/path/to/token/file for automatic token rotation

Or export the token directly in your shell (the host wrapper accepts this when scrape.env is absent):

export DISCORD_TOKEN="your-token-here"
# optional: export DISCORD_TOKEN_FILE=/path/to/token/file

When no explicit token is set, the host wrapper runs scripts/discover-discord-token.sh, which tries the following sources in order:

DISCORD_TOKEN / DISCORD_TOKEN_FILE
~/.config/discord-scrape/token (optional)
DiscordChatExporter GUI Settings.dat, read via scripts/read-dce-gui-token.sh when DISCORDCHATEXPORTER_SETTINGS_PATH is set or a Settings.dat sits next to the CLI binary
Discord desktop leveldb token candidates (longest match wins)
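A simplified sketch of that lookup order is shown here. It is not the code in scripts/discover-discord-token.sh: discover_token is a hypothetical name, and the sketch covers only the environment variable, the token file, and ~/.config/discord-scrape/token, leaving out the Settings.dat and leveldb steps.

```shell
# Hypothetical, simplified version of the token lookup order.
# Prints the first token found; fails if none of the sources yields one.
discover_token() {
  # 1. An explicit token in the environment wins.
  if [ -n "${DISCORD_TOKEN:-}" ]; then
    printf '%s\n' "$DISCORD_TOKEN"
    return 0
  fi
  # 2. DISCORD_TOKEN_FILE, then 3. the optional per-user token file.
  for candidate in "${DISCORD_TOKEN_FILE:-}" "${HOME}/.config/discord-scrape/token"; do
    if [ -n "$candidate" ] && [ -r "$candidate" ]; then
      # Strip whitespace, including the trailing newline, from the stored token.
      tr -d '[:space:]' < "$candidate"
      echo
      return 0
    fi
  done
  echo "no Discord token found" >&2
  return 1
}
```

Stopping at the first hit keeps the precedence easy to reason about: an exported DISCORD_TOKEN always overrides anything stored on disk.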

To materialize scrape.env from exported credentials (mode 600, no manual editing):

export DISCORD_TOKEN="your-token-here"
./scripts/setup-scrape-auth.sh
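For readers who want to see what "mode 600, no manual editing" amounts to, here is a minimal sketch. write_scrape_env is a hypothetical function, not the contents of setup-scrape-auth.sh, and it writes only the DISCORD_TOKEN line.

```shell
# Hypothetical sketch of writing an owner-only env file.
# Writes DISCORD_TOKEN to the file named by $1 with mode 600.
write_scrape_env() {
  env_file="$1"
  if [ -z "${DISCORD_TOKEN:-}" ]; then
    echo "DISCORD_TOKEN is not set" >&2
    return 1
  fi
  (
    umask 077   # files created in this subshell start out as 0600
    printf 'DISCORD_TOKEN=%s\n' "$DISCORD_TOKEN" > "$env_file"
  ) || return 1
  chmod 600 "$env_file"   # also tightens a file that already existed
}
```

Setting the umask before the file is created avoids a short window in which the token would be readable by other users.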

2b. Verify existing ~/Documents archives

Before the first incremental run, confirm each enabled target points at the correct on-disk server folder and already has seeded channel JSON exports (the scraper appends in place and bootstraps .dce-meta/channel-map.json from these files):

./scripts/verify-documents-archives.sh --config config/scrape-targets.json

Each enabled target should show a non-zero JSON count and SEEDED channel IDs under your configured output_dir values (see archive_root in config/scrape-targets.json).

One-command workflow (verify → preflight → incremental scrape):

export DISCORD_TOKEN="your-token"   # or place token in ~/.config/discord-scrape/token
./scripts/run-documents-scrape.sh
./scripts/run-documents-scrape.sh --target KotOR_discord_msgs   # single server
./scripts/run-documents-scrape.sh --dry-run                     # archives only, no Discord

After a scrape, prove archives only grew in place:

./scripts/prove-incremental-append.sh --target KotOR_discord_msgs

3. Run Preflight Validation

Before installing cron, validate your setup:

export DISCORD_TOKEN="your-token"
./scripts/run-discord-scrape-host.sh preflight --config config/scrape-targets.json

Or run inside the container workflow directly:

./scripts/run-discord-scrape.sh preflight --config config/scrape-targets.json

This will:

Check token validity
Verify all configured targets are accessible
Show which channels will be scraped
Confirm archive directories are writable
Make NO changes to archives or cron

4. Install the Cron Job

Once preflight passes, install the recurring export:

./scripts/setup-cron.sh --config config/scrape-targets.json

This creates a managed cron entry that runs monthly (default). The entry can be updated or removed later.

For post-OOM catch-up (for example, the KotOR yes_general channel), add --salvage-before-scrape so that each run merges stale .dce-temp exports before the incremental scrape:

./scripts/setup-cron.sh --config config/scrape-targets.json \
  --target KotOR_discord_msgs --channel 221726893064454144 --salvage-before-scrape

5. Verify Installation

Check that the cron job was installed:

crontab -l | grep discord-scrape

Customizing the Schedule

The default is monthly. Customize it with:

# Run every day at 2 AM
./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily" --at "02:00"

# Run every Sunday at noon (weekly uses Sunday; time is HH:MM only)
./scripts/setup-cron.sh --config config/scrape-targets.json --interval weekly --at "12:00"

# Custom cron expression (every 6 hours)
./scripts/setup-cron.sh --config config/scrape-targets.json --cron "0 */6 * * *"
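To make the relationship between the interval options and a cron expression concrete, the sketch below maps one to the other. It is an illustration, not setup-cron.sh itself: cron_expr is a hypothetical name, and the choice of day 1 for "monthly" is an assumption (the guide only states that weekly means Sunday).

```shell
# Hypothetical mapping from an interval name and an HH:MM time to the five
# cron fields (minute hour day-of-month month day-of-week).
cron_expr() {
  interval="$1"
  at="${2:-00:00}"
  case "$at" in
    [0-9]:[0-5][0-9] | [01][0-9]:[0-5][0-9] | 2[0-3]:[0-5][0-9]) ;;
    *) echo "time must be HH:MM" >&2; return 1 ;;
  esac
  hour="${at%%:*}"
  minute="${at##*:}"
  # Remove one leading zero, then add 0 so "00" and "0" both become 0
  # and values such as "08" are never parsed as octal.
  hour=$(( ${hour#0} + 0 ))
  minute=$(( ${minute#0} + 0 ))
  case "$interval" in
    daily)   printf '%s %s * * *\n' "$minute" "$hour" ;;
    weekly)  printf '%s %s * * 0\n' "$minute" "$hour" ;;   # 0 = Sunday
    monthly) printf '%s %s 1 * *\n' "$minute" "$hour" ;;   # assumed: day 1
    *) echo "unknown interval: $interval" >&2; return 1 ;;
  esac
}
```

Under these assumptions, cron_expr daily 02:00 prints "0 2 * * *" and cron_expr weekly 12:00 prints "0 12 * * 0". Anything the named intervals cannot express, such as every 6 hours, needs the --cron form.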

Token Rotation

If using DISCORD_TOKEN_FILE, the host wrapper can automatically reload your token on each run:

# Protect your token file
chmod 600 /path/to/token/file

# Configure in scrape.env
DISCORD_TOKEN_FILE=/path/to/token/file
DCE_USERNS_MODE=keep-id  # for rootless podman

On each scheduled run, if the export fails with a 401 or 403 error, the wrapper:

Reloads the token file
Retries the export once
Logs the result

This lets a scheduled run pick up a rotated token without manual intervention.
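The reload-and-retry-once behaviour can be sketched as below. The names are hypothetical: run_export is a stand-in that the caller must define (returning 0 on success and printing the HTTP status on failure), and the real wrapper may detect auth failures differently.

```shell
# Hypothetical sketch of "reload the token file, retry once".
# run_export must be defined by the caller: exit 0 on success, otherwise
# print the HTTP status code (for example 401) and exit non-zero.
export_with_token_reload() {
  status="$(run_export)" && return 0
  case "$status" in
    401 | 403)
      # Re-read the token file, then try exactly one more time.
      DISCORD_TOKEN="$(tr -d '[:space:]' < "$DISCORD_TOKEN_FILE")" || return 1
      export DISCORD_TOKEN
      echo "auth failure ($status); token reloaded, retrying once" >&2
      run_export > /dev/null
      ;;
    *)
      # Other failures are not retried; a fresh token would not help.
      return 1
      ;;
  esac
}
```

Limiting the retry to a single attempt avoids hammering the API with a token that is revoked rather than merely stale.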

Archive Layout

After first export, your archive directory will contain:

archive_root/
├── .dce-meta/
│   ├── channel-map.json        # Channel ID to file mappings
│   └── locks/                  # Per-target locks (during active runs)
├── my-servers/
│   ├── .dce-meta/
│   │   └── channel-map.json
│   ├── Guild Name - Category - Channel [123456].json
│   ├── Another Guild - General [789012].json
│   └── ...
└── ...

Existing exports are updated in place: new messages are appended and deduplicated by message ID. See Append-only contract at the top of this guide.

Troubleshooting

For common issues and solutions, see Recurring-Scrape-Troubleshooting.md.

Advanced Configuration

Bot Tokens vs User Tokens

Bot tokens cannot enumerate guilds or DMs, so you must provide explicit IDs:

{
  "name": "bot-scraped",
  "kind": "guild",
  "guild_ids": ["123456789", "987654321"],
  "channel_ids": ["111222333"]
}

User tokens can auto-discover guilds and channels, but automating a user account violates Discord's Terms of Service. Leave both ID lists empty to auto-discover:

{
  "name": "user-scraped",
  "kind": "guild",
  "guild_ids": [],
  "channel_ids": []
}

Disabling Targets

Temporarily disable a target without removing it:

{
  "name": "disabled-target",
  "enabled": false,
  "kind": "guild",
  ...
}

SELinux and Rootless Podman

For SELinux:

# Relabel bind mounts for SELinux (already in docker-compose.yml)
DCE_MOUNT_OPTIONS=z

For rootless podman:

# Keep mounted dirs writable as your user
DCE_USERNS_MODE=keep-id
DCE_UID=$(id -u)
DCE_GID=$(id -g)

Managing Cron

View Current Schedule

crontab -l

Update Schedule

Re-run setup with new parameters (old entry replaced):

./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily"
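Replacing the old entry rather than appending a second one is what makes re-running setup safe. One common way to get that behaviour is to tag the managed line with a marker comment, as in this sketch. The marker text and the function name are hypothetical and are not taken from setup-cron.sh.

```shell
# Hypothetical sketch of an idempotent crontab update.
MARKER='# managed-by: discord-scrape'

# Read a crontab on stdin, drop any previously managed line, and append the
# new entry $1 tagged with the marker. Unrelated entries pass through.
replace_managed_entry() {
  grep -vF "$MARKER" || true   # grep exits 1 when no line survives; not an error
  printf '%s %s\n' "$1" "$MARKER"
}
```

In practice the function would sit in a pipeline such as crontab -l 2>/dev/null | replace_managed_entry "0 2 * * * /path/to/wrapper" | crontab -, where /path/to/wrapper is a placeholder. Running it twice with the same entry leaves the crontab unchanged.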

Dry-run (Preview Changes)

./scripts/setup-cron.sh --config config/scrape-targets.json --dry-run

Remove Cron Entry

./scripts/setup-cron.sh --remove

Monitoring Exports

Check logs from your last run:

# Primary log file (default from setup-cron.sh)
tail -f logs/discord-scrape.log

# Machine-readable totals beside the cron log
./scripts/print-scrape-summary.sh logs/discord-scrape.summary.json

# Recent cron execution (system log)
sudo grep discord-scrape /var/log/syslog  # Debian/Ubuntu
sudo grep discord-scrape /var/log/cron    # CentOS/RHEL

# Container build/run issues
docker compose logs -f

After a scheduled run, confirm archives grew in place:

./scripts/prove-incremental-append.sh --target KotOR_discord_msgs

Performance Considerations

First export of a channel can be slow (API rate-limited)
Incremental updates are much faster (only new messages)
Large channels (100k+ messages) may take several minutes
Rate limiting: Discord's API has strict per-user limits; repeated failures may indicate you've hit them

Space requirements:

Typical channel: 1-10 MB per year of messages
Large channels: 50-100 MB per year
Full guild: 500 MB - several GB depending on activity
Multi-year catch-up in container: may OOM on first export; set DCE_CONTAINER_MEMORY=8g in scrape.env and use --salvage-before-scrape (see Troubleshooting)

Smoke test validation

Run the full offline suite from the repo root (requires jq). 23 offline smokes run by default; add --include-container for a 24th local-only check:

./scripts/run-all-smokes.sh

With Docker/Podman, include the container smoke:

./scripts/run-all-smokes.sh --include-container

Archive integrity helpers (not smokes; run against live ~/Documents trees):

./scripts/audit-archive-json.sh --target KotOR_discord_msgs
./scripts/salvage-truncated-export.sh path/to/export.json   # truncated JSON only
./scripts/prove-incremental-append.sh --target NAME        # live grow-only proof (needs token)

Script	CI (`recurring-scrape-smoke`)	Notes
`archive-disk-space-smoke.sh`	yes	Disk preflight / `DCE_MIN_FREE_MB`
`audit-archive-json-smoke.sh`	yes	Invalid JSON detection
`bootstrap-recurring-scrape-smoke.sh`	yes	Bootstrap dry-run
`cron-idempotency-smoke.sh`	yes	Cron installer idempotency
`documents-scrape-smoke.sh`	yes	Unified Documents workflow + lock gate
`end-to-end-preflight-smoke.sh`	yes	Preflight wiring
`error-path-smoke.sh`	yes	Failure paths
`gh-approve-pr-runs-smoke.sh`	yes	Fork PR workflow helper
`operator-handoff-smoke.sh`	yes	Operator handoff dry-run
`print-scrape-summary-smoke.sh`	yes	JSON summary pretty-print CLI
`prove-incremental-append-smoke.sh`	yes	Offline prove snapshot/compare
`run-discord-scrape-host-lock-smoke.sh`	yes	Archive-root scrape lock
`run-discord-scrape-host-smoke.sh`	yes	Host wrapper
`run-discord-scrape-smoke.sh`	yes	Append-only merge coverage
`run-operator-proof-smoke.sh`	yes	Scrape + prove dry-run
`run-operator-validation-smoke.sh`	yes	Validation runner dry-run
`scrape-here-smoke.sh`	yes	Workspace bridge launcher
`scrape-lock-status-smoke.sh`	yes	Lock status + stale reclaim
`scrape-summary-json-smoke.sh`	yes	Log marker extract + per-target path helper
`setup-cron-smoke.sh`	yes	Cron setup dry-run
`sync-gui-bridge-doc-smoke.sh`	yes	GUI bridge doc sync
`verify-documents-auth-smoke.sh`	yes	Archive verify + auth bootstrap
`verify-operator-ready-smoke.sh`	yes	Host prerequisite checks
`container-smoke.sh`	no (local)	Docker build + `help` / `list-targets`; use `--include-container`

GitHub Actions runs ./scripts/run-all-smokes.sh via .github/workflows/main.yml job recurring-scrape-smoke.
