DiscordChatExporter/.docs/Recurring-Scrape-Setup.md
Boden d66b9dab63 feat(validation): comprehensive recurring scraper validation suite and documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-27 12:57:32 -05:00


Recurring Discord Scrape Automation - Setup Guide

This guide walks you through setting up automated recurring Discord exports using the built-in wrapper scripts.

Prerequisites

  • Linux or macOS with bash and cron
  • Docker or Podman installed
  • A Discord bot token or user token with access to the channels you want to export
  • Read/write access to a directory for archive storage

Quick Start

1. Configure Your Targets

Create or edit config/scrape-targets.json with your channel selections:

{
  "archive_root": "/home/user/discord-archives",
  "defaults": {
    "include_threads": "all",
    "include_voice_channels": false
  },
  "targets": [
    {
      "name": "my-servers",
      "kind": "guild",
      "output_dir": "/home/user/discord-archives/my-servers",
      "guild_ids": ["123456789"],
      "channel_ids": [],
      "guild_name_patterns": []
    }
  ]
}

Key fields:

  • archive_root: Parent directory for all exports (used for validation and path safety)
  • output_dir: Specific directory for each target (must be under archive_root)
  • guild_ids: Explicit Discord guild IDs (especially important for bot tokens)
  • channel_ids: Specific channels to export (leave empty to export all accessible)
  • guild_name_patterns: Regex patterns to match guild names (not used by bot tokens)
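
The path-safety rule for output_dir can be illustrated with a small shell check. This is a minimal sketch, not the wrapper's actual validation: is_under_root is a hypothetical helper, and rejecting any path that contains a ".." component (instead of resolving it) is an assumption made to keep the check simple.

```shell
# Hypothetical helper: succeeds when $2 is $1 itself or a path below it.
is_under_root() {
  local root="${1%/}" path="${2%/}"
  # Only absolute paths are accepted.
  case "$path" in
    /*) ;;
    *) return 1 ;;
  esac
  # Refuse ".." components outright rather than trying to resolve them.
  case "/$path/" in
    */../*) return 1 ;;
  esac
  # Match the root exactly, or the root followed by a "/" separator,
  # so that "/archives-other" is not mistaken for a child of "/archives".
  [ "$path" = "$root" ] || [ "${path#"$root"/}" != "$path" ]
}
```

With the example config above, is_under_root /home/user/discord-archives /home/user/discord-archives/my-servers succeeds, while a sibling such as /home/user/discord-archives-other fails.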

2. Set Your Discord Token

Copy the environment template and add your token:

cp scrape.env.example scrape.env
# Edit scrape.env and set DISCORD_TOKEN=your-token-here
# OR set DISCORD_TOKEN_FILE=/path/to/token/file for automatic token rotation

3. Run Preflight Validation

Before installing cron, validate your setup:

export DISCORD_TOKEN="your-token"
./scripts/run-discord-scrape.sh preflight --config config/scrape-targets.json

This will:

  • Check token validity
  • Verify all configured targets are accessible
  • Show which channels will be scraped
  • Confirm archive directories are writable
  • Make NO changes to archives or cron
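
To illustrate how a writability check can stay read-only, the sketch below tests the nearest existing ancestor of a directory instead of creating anything. can_write_dir is a hypothetical helper written for this guide; the preflight command's real checks may differ.

```shell
# Hypothetical helper: succeeds if $1 exists and is writable, or could be
# created because its nearest existing ancestor is a writable directory.
# Nothing is created or modified.
can_write_dir() {
  local dir="$1"
  # Walk up until we reach something that already exists.
  while [ ! -e "$dir" ] && [ "$dir" != "/" ] && [ "$dir" != "." ]; do
    dir=$(dirname -- "$dir")
  done
  # The ancestor must be a directory we can write to and traverse.
  [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]
}
```

For example, can_write_dir /home/user/discord-archives/my-servers succeeds before the first export as long as /home/user/discord-archives (or the closest existing parent) is writable.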

4. Install the Cron Job

Once preflight passes, install the recurring export:

./scripts/setup-cron.sh --config config/scrape-targets.json

This creates a managed cron entry that runs monthly (default). The entry can be updated or removed later.
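
One common way to keep a cron entry "managed" is to wrap it in marker comments, so that re-running the installer replaces only that block and leaves every other line alone. The sketch below shows the idea as a pure text filter. The marker strings and the function names are hypothetical and are not taken from setup-cron.sh.

```shell
# Hypothetical markers delimiting the managed block.
BEGIN_MARK="# BEGIN discord-scrape (managed)"
END_MARK="# END discord-scrape (managed)"

# Copy stdin to stdout, dropping any existing managed block.
strip_managed_block() {
  awk -v b="$BEGIN_MARK" -v e="$END_MARK" '
    $0 == b { skip = 1; next }
    $0 == e { skip = 0; next }
    !skip   { print }
  '
}

# Read a crontab on stdin and print it with exactly one managed block.
# $1: five-field schedule, $2: command to run.
render_crontab() {
  strip_managed_block
  printf '%s\n%s %s\n%s\n' "$BEGIN_MARK" "$1" "$2" "$END_MARK"
}
```

A filter like this could be applied as crontab -l 2>/dev/null | render_crontab "0 3 1 * *" "/path/to/job" | crontab -, where the schedule and command are placeholders. Running it twice yields the same crontab, which is the idempotency property described above.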

5. Verify Installation

Check that the cron job was installed:

crontab -l | grep discord-scrape

Customizing the Schedule

The default is monthly. Customize it with:

# Run every day at 2 AM
./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily" --at "2:00"

# Run every Sunday at noon
./scripts/setup-cron.sh --config config/scrape-targets.json --interval "weekly" --at "sun 12:00"

# Custom cron expression (every 6 hours)
./scripts/setup-cron.sh --config config/scrape-targets.json --cron "0 */6 * * *"
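
For reference, "every day at 2 AM" and "every Sunday at noon" correspond to the standard five-field cron expressions 0 2 * * * and 0 12 * * 0. The sketch below shows one way such a translation could work. to_cron is a hypothetical helper and is not taken from setup-cron.sh, whose actual parsing of --interval and --at may differ.

```shell
# Hypothetical helper: print a cron expression for a daily or weekly schedule.
# $1: "daily" or "weekly"; $2: "H:MM" (daily) or "<day> H:MM" (weekly).
to_cron() {
  local interval="$1" at="$2" dow="*" time="$2" hour minute
  case "$interval" in
    daily) ;;
    weekly)
      # Map the three-letter day name to cron's day-of-week number.
      case "${at%% *}" in
        sun) dow=0 ;; mon) dow=1 ;; tue) dow=2 ;; wed) dow=3 ;;
        thu) dow=4 ;; fri) dow=5 ;; sat) dow=6 ;;
        *) return 1 ;;
      esac
      time="${at#* }"
      ;;
    *) return 1 ;;
  esac
  # Accept only H:MM or HH:MM.
  case "$time" in
    [0-9]:[0-9][0-9] | [0-9][0-9]:[0-9][0-9]) ;;
    *) return 1 ;;
  esac
  hour="${time%%:*}"; minute="${time#*:}"
  # Drop one leading zero so values such as "09" compare as plain integers.
  hour="${hour#0}"; minute="${minute#0}"
  hour="${hour:-0}"
  [ "$hour" -le 23 ] && [ "$minute" -le 59 ] || return 1
  printf '%s %s * * %s\n' "$minute" "$hour" "$dow"
}

to_cron daily "2:00"        # prints: 0 2 * * *
to_cron weekly "sun 12:00"  # prints: 0 12 * * 0
```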

Token Rotation

If using DISCORD_TOKEN_FILE, the host wrapper can automatically reload your token on each run:

# Protect your token file
chmod 600 /path/to/token/file

# Configure in scrape.env
DISCORD_TOKEN_FILE=/path/to/token/file
DCE_USERNS_MODE=keep-id  # for rootless podman

On each scheduled run, if the export fails with a 401 or 403 error, the wrapper:

  1. Reloads the token file
  2. Retries the export once
  3. Logs the result

This lets a rotated token take effect on the next run without manual intervention, provided the token file has been updated with the new token.
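
The reload-and-retry-once flow above can be sketched in a few lines of shell. This is an illustration only: load_token and run_with_auth_retry are hypothetical names, and the convention that the export command exits with status 41 on a 401 or 403 response is an assumption, not documented wrapper behavior.

```shell
# Assumed convention (hypothetical): the export command exits with this
# status when Discord answers 401 or 403.
AUTH_FAILURE_CODE=41

# Read the first line of DISCORD_TOKEN_FILE into DISCORD_TOKEN.
load_token() {
  [ -r "$DISCORD_TOKEN_FILE" ] || return 1
  local token=""
  # read fails on a file without a trailing newline but still fills $token.
  IFS= read -r token < "$DISCORD_TOKEN_FILE" || true
  [ -n "$token" ] || return 1
  DISCORD_TOKEN="$token"
  export DISCORD_TOKEN
}

# Run the given command; on an auth failure, reload the token and retry once.
run_with_auth_retry() {
  local status=0
  "$@" || status=$?
  if [ "$status" -eq "$AUTH_FAILURE_CODE" ]; then
    echo "auth failure: reloading token file and retrying once" >&2
    load_token || return 1
    status=0
    "$@" || status=$?
  fi
  echo "export finished with status $status" >&2
  return "$status"
}
```

Retrying only once keeps a permanently revoked token from causing a retry loop on every scheduled run.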

Archive Layout

After the first export, your archive directory will contain:

archive_root/
├── .dce-meta/
│   ├── channel-map.json        # Channel ID to file mappings
│   └── locks/                  # Per-target locks (during active runs)
├── my-servers/
│   ├── .dce-meta/
│   │   └── channel-map.json
│   ├── Guild Name - Category - Channel [123456].json
│   ├── Another Guild - General [789012].json
│   └── ...
└── ...

Existing exports are updated in place: new messages are appended and deduplicated by message ID.
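
The append-and-deduplicate step can be pictured as the following jq-based sketch. It is not the wrapper's implementation. It assumes that jq is installed, that each archive file has a top-level "messages" array, and that every message carries a string "id"; merge_messages and merge_into_archive are hypothetical names. Other top-level fields are copied from the existing archive unchanged, so any derived counts would need separate handling.

```shell
# Print the merge of two export files to stdout.
# $1: existing archive, $2: newly exported file.
merge_messages() {
  jq -s '
    .[0] + {
      messages: (
        map(.messages) | add
        | unique_by(.id)                  # one message per ID
        | sort_by([(.id | length), .id])  # numeric order for decimal ID strings
      )
    }
  ' "$1" "$2"
}

# Replace the archive only if the merge succeeded, so a failed run
# (for example, malformed JSON) leaves the existing file untouched.
merge_into_archive() {
  local archive="$1" fresh="$2" tmp
  tmp=$(mktemp "${archive}.XXXXXX") || return 1
  if merge_messages "$archive" "$fresh" > "$tmp"; then
    mv -- "$tmp" "$archive"
  else
    rm -f -- "$tmp"
    return 1
  fi
}
```

Sorting by length before value avoids converting large IDs to floating-point numbers, and merging the same export twice produces the same archive.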

Troubleshooting

For common issues and solutions, see Recurring-Scrape-Troubleshooting.md.

Advanced Configuration

Bot Tokens vs User Tokens

Bot tokens cannot enumerate guilds or DMs, so you must provide explicit IDs:

{
  "name": "bot-scraped",
  "kind": "guild",
  "guild_ids": ["123456789", "987654321"],
  "channel_ids": ["111222333"]
}

User tokens can auto-discover guilds and channels, but automating a user account is against Discord's Terms of Service. Leave both ID lists empty to enable auto-discovery:

{
  "name": "user-scraped",
  "kind": "guild",
  "guild_ids": [],
  "channel_ids": []
}

Disabling Targets

Temporarily disable a target without removing it:

{
  "name": "disabled-target",
  "enabled": false,
  "kind": "guild",
  ...
}

SELinux and Rootless Podman

For SELinux:

# Label mounts for relabeling (already in docker-compose.yml)
DCE_MOUNT_OPTIONS=z

For rootless podman:

# Keep mounted dirs writable as your user
DCE_USERNS_MODE=keep-id
DCE_UID=$(id -u)
DCE_GID=$(id -g)

Managing Cron

View Current Schedule

crontab -l

Update Schedule

Re-run setup with the new parameters; the old managed entry is replaced:

./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily"

Dry-run (Preview Changes)

./scripts/setup-cron.sh --config config/scrape-targets.json --dry-run

Remove Cron Entry

./scripts/setup-cron.sh --remove

Monitoring Exports

Check logs from your last run:

# Recent cron execution
sudo grep discord-scrape /var/log/syslog  # Debian/Ubuntu
sudo grep discord-scrape /var/log/cron    # CentOS/RHEL

# Or check via Docker logs if using containers
docker-compose logs -f

Performance Considerations

  • First export of a channel can be slow (API rate-limited)
  • Incremental updates are much faster (only new messages)
  • Large channels (100k+ messages) may take several minutes
  • Rate limiting: Discord's API has strict per-user limits; repeated failures may indicate you've hit them

Space requirements:

  • Typical channel: 1-10 MB per year of messages
  • Large channels: 50-100 MB per year
  • Full guild: 500 MB - several GB depending on activity
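
As a rough worked example of these figures, the sketch below multiplies a channel count by the per-year range for typical channels. The inputs (20 channels, 3 years) are made-up illustration values, and estimate_mb is a hypothetical helper.

```shell
# Hypothetical helper: channels * MB per channel per year * years.
estimate_mb() {
  echo $(( $1 * $2 * $3 ))
}

low=$(estimate_mb 20 1 3)    # 20 typical channels at 1 MB/year for 3 years
high=$(estimate_mb 20 10 3)  # same channels at 10 MB/year
echo "Estimated archive size: ${low}-${high} MB"  # prints: Estimated archive size: 60-600 MB
```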

Next Steps