DiscordChatExporter/.docs/Recurring-Scrape-Troubleshooting.md
Copilot 27e133f692 docs(scrape): sync KotOR wrapper across GUI bridge docs (plan 084)
GUI bridge and troubleshooting lead with run-kotor-yes-general-catchup.sh;
merge-readiness HEAD updated; bridge sync smoke asserts wrapper and 24/24 gate.
2026-06-03 12:02:03 -05:00

Recurring Discord Scrape Automation - Troubleshooting Guide

This guide covers common issues and their solutions.

Setup Issues

"Required file not found" Error

Symptoms: Setup fails with "Required file not found: /path/to/config.json"

Solutions:

  1. Verify config file exists: ls -la config/scrape-targets.json
  2. Check file permissions: chmod 644 config/scrape-targets.json
  3. Use an absolute path in the setup command (quoted, so paths with spaces survive): ./scripts/setup-cron.sh --config "$(pwd)/config/scrape-targets.json"

"Invalid JSON config" Error

Symptoms: Setup fails with "Invalid JSON config: ..."

Solutions:

  1. Validate JSON syntax: jq empty config/scrape-targets.json
  2. Common mistakes:
    • Trailing commas in arrays/objects
    • Unquoted keys
    • Missing closing braces
  3. Use an online JSON validator if needed
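
Trailing commas are often the hardest of these mistakes to spot by eye. As a rough pre-check, the sketch below scans a file line by line and prints the line number of each comma that sits directly before a closing brace or bracket. The function name find_trailing_commas is hypothetical (it is not part of the repository scripts), and the scan is only a heuristic: it does not understand commas inside string values, so jq empty remains the authoritative check.

```shell
# Hypothetical helper, not part of the repository scripts.
# Prints the line number of every comma that is followed (on the same line
# or on the next non-blank line) by a closing } or ].
find_trailing_commas() {
  awk '
    /^[ \t]*$/ { next }                          # ignore blank lines
    {
      if (prev_comma && $0 ~ /^[ \t]*[]}]/)      # closer right after a comma line
        print prev_nr
      if ($0 ~ /,[ \t]*[]}]/)                    # comma and closer on one line
        print NR
      prev_comma = ($0 ~ /,[ \t]*$/)
      prev_nr = NR
    }
  ' "$1"
}
```

Usage: find_trailing_commas config/scrape-targets.json prints nothing when no suspicious commas are found.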

"DISCORD_TOKEN must be set" Error

Symptoms: Preflight or scrape fails with token error

Solutions:

  1. Set token in current session:

    export DISCORD_TOKEN="your-token-here"
    ./scripts/run-discord-scrape.sh preflight
    
  2. Or set in scrape.env and source it:

    source scrape.env
    ./scripts/run-discord-scrape.sh preflight
    
  3. Or use DISCORD_TOKEN_FILE for file-based tokens:

    export DISCORD_TOKEN_FILE="/path/to/token/file"
    chmod 600 /path/to/token/file
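
To see which of the options above is in effect before running preflight, a small check along these lines can help. This is a sketch only: token_source is a hypothetical name, and the precedence shown (DISCORD_TOKEN wins over DISCORD_TOKEN_FILE) is an assumption, not confirmed behavior of run-discord-scrape.sh.

```shell
# Hypothetical helper: reports where a token would come from.
# Assumption: DISCORD_TOKEN takes precedence over DISCORD_TOKEN_FILE.
token_source() {
  if [ -n "${DISCORD_TOKEN:-}" ]; then
    echo env
  elif [ -n "${DISCORD_TOKEN_FILE:-}" ] && [ -r "$DISCORD_TOKEN_FILE" ] && [ -s "$DISCORD_TOKEN_FILE" ]; then
    echo file                 # file exists, is readable, and is not empty
  else
    echo none
    return 1
  fi
}
```

The function prints env, file, or none and never prints the token itself, so its output is safe to paste into a bug report.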
    

"Target output_dir is outside archive_root" Error

Symptoms: Setup fails with path validation error

Solution: Update the config so that every output_dir sits under archive_root, as target1 does here:

{
  "archive_root": "/home/user/discord-archives",
  "targets": [
    {
      "output_dir": "/home/user/discord-archives/target1"
    }
  ]
}

Not this, where output_dir is outside archive_root:

{
  "archive_root": "/home/user/discord-archives",
  "targets": [
    {
      "output_dir": "/tmp/exports"
    }
  ]
}
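
The containment rule can be checked by hand before running setup. The following is a minimal sketch, assuming both values are absolute paths; is_under_root is a hypothetical name, and the real validation in the setup script may differ (for example, it may resolve symlinks, which this string comparison does not).

```shell
# Hypothetical helper: succeeds only when $2 is a strict descendant of $1.
# Assumes absolute paths; compares strings and does not resolve symlinks.
is_under_root() {
  root=${1%/}
  candidate=${2%/}
  case "$candidate" in
    */../*|*/..) return 1 ;;   # refuse ".." segments that could escape the root
    "$root"/*)   return 0 ;;   # the slash stops /a/archives-old matching /a/archives
    *)           return 1 ;;
  esac
}
```

For example, is_under_root /home/user/discord-archives /tmp/exports fails, matching the rejected config above.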

Authentication Issues

"Guild discovery failed" Error

Symptoms: Preflight or scrape fails with guild discovery message

Causes:

  • Using a bot token (cannot enumerate guilds)
  • Invalid token
  • Token lacks required permissions

Solutions:

  1. For bot tokens: Provide explicit guild and channel IDs:

    {
      "name": "my-target",
      "guild_ids": ["123456789"],
      "channel_ids": ["111222333"]
    }
    
  2. For user tokens: Ensure the token is valid:

    • Obtain a fresh token from a logged-in Discord client session (the Discord Developer Portal issues bot tokens only)
    • Test token validity: DISCORD_TOKEN=xxx ./scripts/run-discord-scrape.sh list-targets
  3. Check permissions:

    • Bot needs at least "Read Messages/View Channels" and "Read Message History"
    • User token needs access to the target guilds/channels

"Export ... belongs to channel XXX, expected YYY" Error

Symptoms: Scrape fails when updating an existing archive

Cause: Archive's embedded channel ID doesn't match the configured channel

Solutions:

  1. Update config to match archive:

    • Check the existing archive file for the correct channel ID
    • Update channel_ids in config
  2. Or move the archive:

    mv archive/old-location.json archive/target1/
    
  3. Or update the channel mapping manually:

    jq '.["111"] = "path/to/archive.json"' archive/.dce-meta/channel-map.json > tmp.json && mv tmp.json archive/.dce-meta/channel-map.json
    

Cron Schedule Issues

Cron Job Not Running

Symptoms: Cron job installed but exports aren't happening

Diagnostic steps:

  1. Verify the cron job is installed:

    crontab -l | grep discord-scrape
    
  2. Check if cron daemon is running:

    sudo systemctl status cron   # the service is named crond on RHEL/CentOS/Fedora
    # or on macOS:
    sudo launchctl list | grep cron
    
  3. Check system logs:

    # Debian/Ubuntu
    sudo grep CRON /var/log/syslog
    # RHEL/CentOS
    sudo grep discord-scrape /var/log/cron
    
    # macOS
    log stream --predicate 'eventMessage contains[c] "cron"'
    
  4. Test the script manually:

    source scrape.env
    bash scripts/run-discord-scrape-host.sh scrape
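
One thing a plain grep over the crontab cannot tell you is whether the matching line is commented out. A small filter such as the one below covers that case. It is a sketch: has_active_job is a hypothetical name, and the match text discord-scrape is taken from the grep in step 1.

```shell
# Hypothetical helper: reads a crontab listing on stdin and succeeds only if
# an uncommented line contains the given text.
has_active_job() {
  grep -v '^[[:space:]]*#' | grep -F -q -- "$1"
}
```

Usage: crontab -l | has_active_job discord-scrape && echo "job is active".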
    

"No such file or directory" in Cron Logs

Symptoms: Cron log shows script not found even though it exists

Causes:

  • Crontab entry uses a relative path
  • Repository directory moved or was renamed after the cron job was installed
  • Script permissions changed

Solutions:

  1. Re-install cron with absolute paths:

    cd /path/to/DiscordChatExporter
    ./scripts/setup-cron.sh --config "$(pwd)/config/scrape-targets.json"
    
  2. Ensure script is executable:

    chmod +x scripts/run-discord-scrape-host.sh
    chmod +x scripts/run-discord-scrape.sh
    chmod +x scripts/setup-cron.sh
    

Cron Jobs Running at Wrong Time

Symptoms: Export runs at unexpected times

Solutions:

  1. Check timezone setting:

    date  # System time
    timedatectl  # System timezone
    
  2. Verify crontab schedule:

    crontab -l
    
  3. Update schedule:

    ./scripts/setup-cron.sh --interval "daily" --at "2:00"
    
  4. Validate cron expression at crontab.guru
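
When comparing the crontab -l output against the intended schedule, it helps to know what a daily time looks like in standard five-field cron syntax (minute, hour, day of month, month, day of week). The sketch below converts an HH:MM value into that form. daily_cron_expr is a hypothetical helper, and whether setup-cron.sh writes exactly this expression for --interval "daily" is an assumption.

```shell
# Hypothetical helper: converts HH:MM into a daily five-field cron expression.
daily_cron_expr() {
  case "$1" in
    *:*) ;;
    *) return 1 ;;                                   # no colon, not a time
  esac
  hour=${1%%:*}
  min=${1#*:}
  case "$hour" in ''|*[!0-9]*) return 1 ;; esac      # digits only
  case "$min"  in ''|*[!0-9]*) return 1 ;; esac
  # Normalize leading zeros (02:05 becomes minute 5, hour 2); keep a lone 0.
  while [ "${#hour}" -gt 1 ] && [ "${hour#0}" != "$hour" ]; do hour=${hour#0}; done
  while [ "${#min}" -gt 1 ] && [ "${min#0}" != "$min" ]; do min=${min#0}; done
  [ "$hour" -le 23 ] && [ "$min" -le 59 ] || return 1
  printf '%s %s * * *\n' "$min" "$hour"
}
```

For the --at "2:00" example above, daily_cron_expr 2:00 prints 0 2 * * *, which cron evaluates in the system's local timezone.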


Export Issues

Exports Complete but Produce Empty Files

Symptoms: Archive files created but contain minimal/no messages

Solutions:

  1. Verify channels are accessible:

    export DISCORD_TOKEN="your-token"
    ./scripts/run-discord-scrape.sh preflight
    
  2. Check channel permissions:

    • Ensure token has "Read Message History"
    • Verify channel is not archived/deleted
  3. Manual test export:

    ./scripts/run-discord-scrape.sh scrape --target target-name
    

"Archive is not valid JSON" Error

Symptoms: Existing archive file becomes corrupted

Solutions:

  1. Audit all archives for a target:

    ./scripts/audit-archive-json.sh --target target-name
    
  2. Validate one file:

    jq empty archive-file.json
    
  3. Truncated export (parse error mid-message): salvage drops the incomplete tail and keeps earlier messages. A timestamped .bak.* backup is created first:

    ./scripts/salvage-truncated-export.sh path/to/export.json
    
  4. If corrupted beyond salvage, restore from backup (if available)

  5. If no backup, move the archive aside and re-export:

    mv archive-file.json archive-file.json.bak
    ./scripts/run-discord-scrape.sh scrape --target target-name
    

Incremental Exports Are Too Slow

Symptoms: Each scheduled export takes several minutes

Solutions:

  1. Check API rate limiting:

    • Discord limits API calls per user
    • Too many frequent exports can trigger rate limiting
    • Increase interval between exports: --interval "weekly"
  2. Reduce scope:

    • Export only recent messages: configure an after date for the export
    • Split large channels into separate targets
  3. Check system resources:

    • Disk I/O bottleneck: iostat -x 1
    • CPU usage: top
    • Memory: free -h

Channel Export SKIPPED (OOM / Aborted / Killed)

Symptoms: Log shows SKIPPED for one channel together with Aborted (core dumped), Killed, or out of memory; other channels in the target may still succeed.

Cause: Large multi-year catch-up (for example KotOR yes_general) builds a big in-memory JSON export inside the container. Partial progress is kept under output_dir/.dce-temp/ for salvage on the next run.

Solutions:

  1. Salvage partial temps before re-scraping (avoids re-downloading from the archive cursor):

    ./scripts/scrape-lock-status.sh
    ./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
    
  2. Raise container memory in scrape.env if needed (default 0 = no compose cap; KotOR_discord_msgs already sets container_memory: "8g" for single-target runs):

    # scrape.env — optional global override
    DCE_CONTAINER_MEMORY=8g
    

    Then run the one-command catch-up:

    ./scripts/run-kotor-yes-general-catchup.sh
    # Inspect totals: ./scripts/print-scrape-summary.sh logs/kotor-yes-general.summary.json
    
  3. Ensure only one scrape holds {archive_root}/.dce-scrape.lock (see next section).

  4. Confirm host disk headroom — merges need temporary space on the archive volume (df -h ~/Documents).
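
The disk headroom check can be scripted so it runs before a long catch-up. This is a minimal sketch: has_free_kb is a hypothetical name, and the threshold is a placeholder, since this guide does not state how much temporary space a merge needs.

```shell
# Hypothetical helper: succeeds when the filesystem holding $1 has at least
# $2 KiB available. Uses the POSIX df -P output format (one line per filesystem).
has_free_kb() {
  avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

Usage, with an assumed 20 GiB threshold: has_free_kb ~/Documents $((20 * 1024 * 1024)) || echo "low disk space on the archive volume".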


Scrape Lock Already Held

Symptoms: "Scrape lock is held" or "Another scrape is already running" appears when starting a validation or documents scrape.

Cause: Only one scrape should run per archive_root. A long validation, cron job, or a second checkout (for example Downloads vs MyBook) can hold {archive_root}/.dce-scrape.lock.

Solutions:

  1. Inspect lock state:

    ./scripts/scrape-lock-status.sh
    
  2. Wait for the active scrape to finish if PID is live.

  3. Reclaim stale lock after a crash (only when status shows stale/free):

    ./scripts/scrape-lock-status.sh --reclaim-stale
    
  4. Do not delete the lock while a scrape is still running — twin exports can OOM-loop on the same channel.
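
If you want to confirm independently that the PID reported for the lock is still alive, the standard kill -0 probe works on Linux and macOS. The sketch below assumes you already have the PID from the status output; pid_is_live is a hypothetical name, and the lock file format itself is not described here.

```shell
# Hypothetical helper: succeeds when a process with the given PID exists and
# can be signalled by the current user. kill -0 delivers no signal.
# Note: a live process owned by another user also reports failure here.
pid_is_live() {
  case "$1" in ''|*[!0-9]*) return 1 ;; esac   # reject empty or non-numeric input
  kill -0 "$1" 2>/dev/null
}
```

Usage: pid_is_live 12345 && echo "scrape still running, do not reclaim".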


Partial Export Stuck in .dce-temp

Symptoms: Large folder under output_dir/.dce-temp/export.<channel_id>.*; archive cursor not advancing; audit excludes .dce-temp (expected).

Solutions:

  1. Stop any active export writing that temp (check lock status and running podman/docker processes).

  2. Salvage quiescent temps (default skips temps modified in the last ~120s):

    ./scripts/run-documents-scrape.sh --salvage-only --target NAME [--channel ID]
    
  3. Force salvage of an active temp only after confirming nothing is writing:

    DCE_SALVAGE_ACTIVE_TEMPS=1 ./scripts/run-documents-scrape.sh --salvage-only --target NAME --channel ID
    
  4. Truncated JSON in the archive file itself (not .dce-temp):

    ./scripts/salvage-truncated-export.sh path/to/archive.json
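
Before forcing a salvage, you can check for yourself whether a temp directory has gone quiet. The sketch below treats a directory as quiescent when nothing in it was modified in the last two minutes, mirroring the ~120s default mentioned above. is_quiescent is a hypothetical name, and how the salvage script actually measures activity is an assumption.

```shell
# Hypothetical helper: succeeds when nothing under $1 (including $1 itself)
# was modified in the last 2 minutes.
# Assumes a find that supports -mmin (GNU, BSD, and most BusyBox builds do).
is_quiescent() {
  [ -e "$1" ] || return 1
  [ -z "$(find "$1" -mmin -2 2>/dev/null | head -n 1)" ]
}
```

Usage: is_quiescent "output_dir/.dce-temp" || echo "still being written, do not force salvage".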
    

"Failed to write archive" or Permission Denied

Symptoms: Export fails with write permission errors

Solutions:

  1. Check directory permissions:

    ls -la archive/target-name/
    chmod 755 archive/target-name/
    chmod 644 archive/target-name/*.json
    
  2. If using Docker/Podman, set user mode:

    # For rootless podman
    export DCE_USERNS_MODE=keep-id
    export DCE_UID=$(id -u)
    export DCE_GID=$(id -g)
    
  3. Check SELinux (if enabled):

    getenforce
    # If "Enforcing", add `:z` to mount options:
    # docker-compose.yml should already have this
    

Docker/Container Issues

"Failed to build image" Error

Symptoms: Docker build fails during setup

Solutions:

  1. Verify Docker is running:

    docker ps
    docker version
    
  2. Check disk space:

    docker system df
    
  3. Clean up and retry:

    # Warning: prune -a deletes all stopped containers and all unused images
    docker system prune -a
    docker-compose build --no-cache
    
  4. If using Podman:

    podman system prune -a
    podman-compose build --no-cache
    

"Cannot connect to Docker daemon" Error

Symptoms: Setup fails to reach Docker

Solutions:

  1. For Docker:

    sudo systemctl start docker
    sudo usermod -aG docker $USER
    newgrp docker
    
  2. For Podman (rootless):

    # Rootless Podman has no daemon; the user-level API socket is what compose tools connect to
    systemctl --user enable --now podman.socket
    

Authorization / Token Refresh Issues

Host Retry Auth Flow Not Working

Symptoms: Export fails with 401/403 errors even with DISCORD_TOKEN_FILE set

Solutions:

  1. Verify token file is readable (without printing the token to the terminal):

    test -r "$DISCORD_TOKEN_FILE" && echo "readable"
    
  2. Ensure proper permissions:

    chmod 600 "$DISCORD_TOKEN_FILE"
    
  3. Check token is fresh:

    • Tokens can expire
    • Obtain a new token (bot tokens are reset in the Discord Developer Portal; user tokens come from a logged-in client session)
    • Update the token file
  4. Verify host wrapper is being used:

    crontab -l | grep run-discord-scrape-host
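
The token file checks in this section can be combined into one function that never echoes the token. This is a sketch under assumptions: file_mode and token_file_ok are hypothetical names, and accepting modes 600 and 400 is a choice made here, not a rule enforced by the scripts.

```shell
# Hypothetical helpers, not part of the repository scripts.
# Prints the octal permission bits: GNU/BusyBox stat first, then BSD/macOS stat.
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null
}

# Succeeds when the token file is readable, non-empty, and owner-only.
token_file_ok() {
  [ -r "$1" ] && [ -s "$1" ] || return 1
  case "$(file_mode "$1")" in
    600|400) return 0 ;;
    *)       return 1 ;;
  esac
}
```

Usage: token_file_ok "$DISCORD_TOKEN_FILE" || echo "fix token file permissions or contents".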
    

Getting Help

If you're still stuck:

  1. Check existing issues: https://github.com/Tyrrrz/DiscordChatExporter/issues

  2. Run preflight in verbose mode:

    # Trace every command the script runs (the trace may include the token; redact it before sharing)
    bash -x ./scripts/run-discord-scrape.sh preflight
    
  3. Check logs:

    # Docker logs
    docker-compose logs --tail 50
    
    # Cron logs (on Linux)
    sudo journalctl -u cron --since "1 hour ago"
    
  4. Collect error details for reporting issues:

    • Config (sanitize token)
    • Full error message
    • OS/Docker version
    • Steps to reproduce
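
Sanitizing by hand is easy to get wrong when the token appears in several places in a log. A filter like the one below replaces every literal occurrence of the current DISCORD_TOKEN value on stdin. It is a sketch: redact_token is a hypothetical name, and it only catches the exact token string, not encoded or truncated copies.

```shell
# Hypothetical helper: copies stdin to stdout, replacing every literal
# occurrence of $DISCORD_TOKEN with [REDACTED]. Passes input through
# unchanged when DISCORD_TOKEN is unset or empty.
redact_token() {
  if [ -z "${DISCORD_TOKEN:-}" ]; then
    cat
    return
  fi
  TOKEN="$DISCORD_TOKEN" awk '
    BEGIN { tok = ENVIRON["TOKEN"]; n = length(tok) }
    {
      out = ""
      while ((i = index($0, tok)) > 0) {     # index() matches literally, not as a regex
        out = out substr($0, 1, i - 1) "[REDACTED]"
        $0 = substr($0, i + n)
      }
      print out $0
    }'
}
```

Usage: bash -x ./scripts/run-discord-scrape.sh preflight 2>&1 | redact_token > preflight.log.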