DiscordChatExporter/.docs/Recurring-Scrape-Troubleshooting.md

# Recurring Discord Scrape Automation - Troubleshooting Guide

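This guide covers common issues with the recurring scrape automation and their solutions, grouped by setup, authentication, cron scheduling, export, container, and token-refresh problems.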

## Setup Issues

### "Required file not found" Error

**Symptoms:** Setup fails with "Required file not found: /path/to/config.json"

**Solutions:**
1. Verify that the config file exists: `ls -la config/scrape-targets.json`
2. Make sure the file is readable: `chmod 644 config/scrape-targets.json`
3. Pass an absolute path to the setup command (quoted, so paths containing spaces survive): `./scripts/setup-cron.sh --config "$(pwd)/config/scrape-targets.json"`

---

### "Invalid JSON config" Error

**Symptoms:** Setup fails with "Invalid JSON config: ..."

**Solutions:**
1. Validate JSON syntax: `jq empty config/scrape-targets.json`
2. Common mistakes:
   - Trailing commas in arrays/objects
   - Unquoted keys
   - Missing closing braces
3. Use an online JSON validator if needed
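
The validation step can also be wrapped in a small helper that runs before setup. This is a minimal sketch, not part of the repository's scripts: `validate_json` is a hypothetical name, and the fallback to `python3 -m json.tool` when `jq` is missing is an assumption about what is installed on the host.

```bash
# Hypothetical helper (not one of the repository's scripts).
# Exit status: 0 = valid JSON, 1 = invalid or unreadable, 2 = no validator found.
validate_json() {
  local file="$1"
  if command -v jq >/dev/null 2>&1; then
    jq empty "$file" >/dev/null 2>&1 || return 1
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$file" >/dev/null 2>&1 || return 1
  else
    return 2
  fi
}

# Usage: validate_json config/scrape-targets.json || echo "fix the config first" >&2
```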

---

### "DISCORD_TOKEN must be set" Error

**Symptoms:** Preflight or scrape fails with token error

**Solutions:**
1. Set token in current session:
   ```bash
   export DISCORD_TOKEN="your-token-here"
   ./scripts/run-discord-scrape.sh preflight
   ```

2. Or set it in `scrape.env` and source that file:
   ```bash
   source scrape.env
   ./scripts/run-discord-scrape.sh preflight
   ```

3. Or use DISCORD_TOKEN_FILE for file-based tokens:
   ```bash
   export DISCORD_TOKEN_FILE="/path/to/token/file"
   chmod 600 /path/to/token/file
   ```
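
To see which of these sources is in effect without echoing the secret, something like the following sketch may help. `token_source` is a hypothetical helper, and the precedence it applies (`DISCORD_TOKEN` before `DISCORD_TOKEN_FILE`) is an assumption; the real scripts may resolve the two differently.

```bash
# Hypothetical helper: reports where a token would come from, never its value.
# Assumption: DISCORD_TOKEN wins over DISCORD_TOKEN_FILE.
token_source() {
  if [ -n "${DISCORD_TOKEN:-}" ]; then
    echo "env"
  elif [ -n "${DISCORD_TOKEN_FILE:-}" ] && [ -r "$DISCORD_TOKEN_FILE" ] && [ -s "$DISCORD_TOKEN_FILE" ]; then
    echo "file"   # file is set, readable, and non-empty
  else
    echo "none"
    return 1
  fi
}
```

If `scrape.env` holds plain `KEY=value` lines without `export`, sourcing it sets shell variables but does not pass them to child processes. Wrapping the `source` call in `set -a` and `set +a` exports them.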

---

### "Target output_dir is outside archive_root" Error

**Symptoms:** Setup fails with path validation error

**Solution:** Update the config so that each target's `output_dir` is under `archive_root`. In this example, `target1` sits inside `/home/user/discord-archives`:

```json
{
  "archive_root": "/home/user/discord-archives",
  "targets": [
    {
      "output_dir": "/home/user/discord-archives/target1"
    }
  ]
}
```

Not this, where `/tmp/exports` is outside `archive_root`:
```json
{
  "archive_root": "/home/user/discord-archives",
  "targets": [
    {
      "output_dir": "/tmp/exports"
    }
  ]
}
```
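
The containment rule can be checked by hand before running setup. The sketch below is a hypothetical, purely lexical check: it does not resolve symlinks or `..` segments, and treating `archive_root` itself as acceptable is an assumption, so the real validation may differ.

```bash
# Hypothetical check: succeeds when $2 is $1 itself or a path below it.
is_under_root() {
  local root dir
  root="${1%/}"   # drop one trailing slash so "/a/" and "/a" compare equal
  dir="${2%/}"
  case "$dir" in
    "$root"|"$root"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: is_under_root /home/user/discord-archives /tmp/exports || echo "outside archive_root" >&2
```

Matching on `"$root"/*` rather than a bare prefix keeps a sibling such as `/home/user/discord-archives-old` from being accepted.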

---

## Authentication Issues

### "Guild discovery failed" Error

**Symptoms:** Preflight or scrape fails with guild discovery message

**Causes:**
- Using a bot token (cannot enumerate guilds)
- Invalid token
- Token lacks required permissions

**Solutions:**

1. **For bot tokens:** Provide explicit guild and channel IDs:
   ```json
   {
     "name": "my-target",
     "guild_ids": ["123456789"],
     "channel_ids": ["111222333"]
   }
   ```

2. **For user tokens:** Ensure the token is valid:
   - Obtain a fresh token from a logged-in Discord client session (the Discord Developer Portal issues bot tokens, not user tokens)
   - Test token validity: `DISCORD_TOKEN=xxx ./scripts/run-discord-scrape.sh list-targets`

3. **Check permissions:**
   - Bot needs at least "Read Messages/View Channels" and "Read Message History"
   - User token needs access to the target guilds/channels

---

### "Export ... belongs to channel XXX, expected YYY" Error

**Symptoms:** Scrape fails when updating an existing archive

**Cause:** Archive's embedded channel ID doesn't match the configured channel

**Solutions:**

1. **Update config to match archive:**
   - Check the existing archive file for the correct channel ID
   - Update channel_ids in config

2. **Or move the archive:**
   ```bash
   mv archive/old-location.json archive/target1/
   ```

3. **Or update the channel mapping manually:**
   ```bash
   jq '.["111"] = "path/to/archive.json"' archive/.dce-meta/channel-map.json > tmp.json && mv tmp.json archive/.dce-meta/channel-map.json
   ```

---

## Cron Schedule Issues

### Cron Job Not Running

**Symptoms:** Cron job installed but exports aren't happening

**Diagnostic steps:**

1. Verify that the cron job is installed:
   ```bash
   crontab -l | grep discord-scrape
   ```

2. Check if cron daemon is running:
   ```bash
   sudo systemctl status cron   # the unit is named "crond" on RHEL/Fedora
   # or on macOS:
   sudo launchctl list | grep cron
   ```

3. Check system logs:
   ```bash
   # Linux
   sudo grep CRON /var/log/syslog
   # or
   sudo grep discord-scrape /var/log/cron

   # macOS
   log stream --predicate 'eventMessage contains[c] "cron"'
   ```

4. Test the script manually:
   ```bash
   source scrape.env
   bash scripts/run-discord-scrape-host.sh scrape
   ```
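
A plain `grep discord-scrape` also matches entries that have been commented out. The following sketch filters those first; `has_scrape_entry` is a hypothetical helper name, and the `discord-scrape` marker is taken from the check in step 1.

```bash
# Hypothetical filter: succeeds if stdin contains an active (non-commented)
# crontab line that mentions "discord-scrape".
has_scrape_entry() {
  grep -v '^[[:space:]]*#' | grep -q 'discord-scrape'
}

# Usage: crontab -l 2>/dev/null | has_scrape_entry && echo "job installed"
```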

---

### "No such file or directory" in Cron Logs

**Symptoms:** Cron log shows script not found even though it exists

**Causes:**
- The crontab entry uses relative paths
- The checkout directory moved after the cron job was installed
- The scripts are no longer executable

**Solutions:**

1. Re-install cron with absolute paths:
   ```bash
   cd /path/to/DiscordChatExporter
   ./scripts/setup-cron.sh --config "$(pwd)/config/scrape-targets.json"
   ```

2. Ensure the scripts are executable:
   ```bash
   chmod +x scripts/run-discord-scrape-host.sh
   chmod +x scripts/run-discord-scrape.sh
   chmod +x scripts/setup-cron.sh
   ```
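
Cron typically starts jobs from the user's home directory with a minimal environment, which is why relative paths that work in an interactive shell fail there. A hypothetical pre-check for any script path that ends up in the crontab could look like this:

```bash
# Hypothetical pre-check: the path must be absolute and the file executable.
# Exit status: 0 = ok, 1 = relative path, 2 = missing or not executable.
check_cron_script() {
  case "$1" in
    /*) ;;   # absolute path: continue to the executable check
    *)  echo "relative path: $1" >&2; return 1 ;;
  esac
  if [ ! -x "$1" ]; then
    echo "missing or not executable: $1" >&2
    return 2
  fi
}

# Usage: check_cron_script "$(pwd)/scripts/run-discord-scrape-host.sh"
```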

---

### Cron Jobs Running at Wrong Time

**Symptoms:** Export runs at unexpected times

**Solutions:**

1. Check timezone setting:
   ```bash
   date  # System time
   timedatectl  # System timezone
   ```

2. Verify crontab schedule:
   ```bash
   crontab -l
   ```

3. Update schedule:
   ```bash
   ./scripts/setup-cron.sh --interval "daily" --at "2:00"
   ```

4. Validate cron expression at [crontab.guru](https://crontab.guru)
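
When comparing the `crontab -l` output with the intended schedule, it helps to know what a daily time looks like as a five-field cron expression. The sketch below is illustrative only: `daily_cron_expr` is a hypothetical function and is not taken from `setup-cron.sh`, whose actual mapping may differ. Standard cron interprets these fields in the system's local time zone.

```bash
# Hypothetical illustration: map a daily "H:MM" or "HH:MM" time to a cron
# expression of the form "minute hour * * *".
daily_cron_expr() {
  local hour minute
  case "$1" in
    [0-9]:[0-9][0-9]|[0-9][0-9]:[0-9][0-9]) ;;
    *) echo "expected H:MM or HH:MM, got: $1" >&2; return 1 ;;
  esac
  hour="${1%%:*}"
  minute="${1##*:}"
  # Normalize "08" to "8" and "00" to "0" for readable output.
  hour="${hour#0}"; minute="${minute#0}"
  hour="${hour:-0}"; minute="${minute:-0}"
  if [ "$hour" -gt 23 ] || [ "$minute" -gt 59 ]; then
    echo "invalid time: $1" >&2
    return 1
  fi
  echo "$minute $hour * * *"
}

daily_cron_expr "2:00"   # prints: 0 2 * * *
```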

---

## Export Issues

### Exports Complete but Produce Empty Files

**Symptoms:** Archive files created but contain minimal/no messages

**Solutions:**

1. Verify channels are accessible:
   ```bash
   export DISCORD_TOKEN="your-token"
   ./scripts/run-discord-scrape.sh preflight
   ```

2. Check channel permissions:
   - Ensure token has "Read Message History"
   - Verify channel is not archived/deleted

3. Manual test export:
   ```bash
   ./scripts/run-discord-scrape.sh scrape --target target-name
   ```

---

### "Archive is not valid JSON" Error

**Symptoms:** An existing archive file no longer parses as JSON (corrupted or truncated)

**Solutions:**

1. **Audit all archives for a target:**
   ```bash
   ./scripts/audit-archive-json.sh --target target-name
   ```

2. **Validate one file:**
   ```bash
   jq empty archive-file.json
   ```

3. **Truncated export (parse error mid-message):** salvage drops the incomplete tail and keeps earlier messages. A timestamped `.bak.*` backup is created first:
   ```bash
   ./scripts/salvage-truncated-export.sh path/to/export.json
   ```

4. **If corrupted beyond salvage, restore from backup** (if available)

5. **If no backup, move the archive aside and re-export:**
   ```bash
   mv archive-file.json archive-file.json.bak
   ./scripts/run-discord-scrape.sh scrape --target target-name
   ```
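
A fixed `.bak` name overwrites any earlier copy that was moved aside the same way. The following sketch uses a timestamped name instead; `move_aside` is a hypothetical helper, and the `.bak.<timestamp>` naming is an assumption modeled on the `.bak.*` backups mentioned in step 3.

```bash
# Hypothetical helper: move a corrupted archive aside under a timestamped
# name, refuse to overwrite an existing file, and print the new path.
move_aside() {
  local src="$1" dest
  dest="$src.bak.$(date +%Y%m%d-%H%M%S)"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite $dest" >&2
    return 1
  fi
  mv -- "$src" "$dest" && echo "$dest"
}

# Usage: move_aside archive-file.json
```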

---

### Incremental Exports Are Too Slow

**Symptoms:** Each scheduled export takes several minutes

**Solutions:**

1. **Check API rate limiting:**
   - Discord limits API calls per user
   - Too many frequent exports can trigger rate limiting
   - Increase interval between exports: `--interval "weekly"`

2. **Reduce scope:**
   - Export only recent messages: configure `after` date in export
   - Split large channels into separate targets

3. **Check system resources:**
   - Disk I/O bottleneck: `iostat -x 1`
   - CPU usage: `top`
   - Memory: `free -h`

---

### Channel Export SKIPPED (OOM / Aborted / Killed)

**Symptoms:** Log shows `SKIPPED` for one channel, `Aborted (core dumped)`, `Killed`, or `out of memory`; other channels in the target may still succeed.

**Cause:** A large multi-year catch-up (for example KotOR `yes_general`) builds a large JSON export in memory inside the container and can exhaust the memory available to it. Partial progress is kept under `output_dir/.dce-temp/` for salvage on the next run.

**Solutions:**

1. **Salvage partial temps before re-scraping** (so that messages already downloaded past the archive cursor are not fetched again):
   ```bash
   ./scripts/scrape-lock-status.sh
   ./scripts/operator-handoff.sh --salvage-only --target KotOR_discord_msgs --channel 221726893064454144
   ```

2. **Raise container memory** in `scrape.env` if needed (default `0` = no compose cap; `KotOR_discord_msgs` already sets `container_memory: "8g"` for single-target runs):
   ```bash
   # scrape.env — optional global override
   DCE_CONTAINER_MEMORY=8g
   ```
   Then run the one-command catch-up:
   ```bash
   ./scripts/run-kotor-yes-general-catchup.sh
   # Inspect totals: ./scripts/print-scrape-summary.sh logs/kotor-yes-general.summary.json
   ```

3. **Ensure only one scrape** holds `{archive_root}/.dce-scrape.lock` (see next section).

4. **Confirm host disk headroom** — merges need temporary space on the archive volume (`df -h ~/Documents`).
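
The headroom check can be scripted for use before a long catch-up. This is a minimal sketch: `free_kb` is a hypothetical helper, and the 10 GiB threshold in the usage line is arbitrary rather than a documented requirement.

```bash
# Hypothetical helper: print the free space, in KiB, on the filesystem that
# holds the given path. Relies on the POSIX output format of `df -Pk`, where
# the fourth column of the second line is the available space.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Usage (threshold is arbitrary):
# [ "$(free_kb ~/Documents)" -ge $((10 * 1024 * 1024)) ] || echo "low disk space" >&2
```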

---

### Scrape Lock Already Held

**Symptoms:** `Scrape lock is held` or `Another scrape is already running` when starting validation or documents scrape.

**Cause:** Only one scrape should run per `archive_root`. A long validation, cron job, or a second checkout (for example Downloads vs MyBook) can hold `{archive_root}/.dce-scrape.lock`.

**Solutions:**

1. **Inspect lock state:**
   ```bash
   ./scripts/scrape-lock-status.sh
   ```

2. **Wait** for the active scrape to finish if PID is live.

3. **Reclaim stale lock** after a crash (only when status shows stale/free):
   ```bash
   ./scripts/scrape-lock-status.sh --reclaim-stale
   ```

4. **Do not delete the lock** while a scrape is still running — twin exports can OOM-loop on the same channel.
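
For reference, a "live" PID can be checked with `kill -0`, which sends no signal and only reports whether the process exists and can be signalled by the current user. The sketch below is a hypothetical illustration. How the lock records its owner is not shown in this guide, so `scrape-lock-status.sh` remains the tool for real decisions.

```bash
# Hypothetical illustration of a liveness check for a PID.
# Exit status: 0 = live, 1 = not running (or not signalable), 2 = not a PID.
pid_is_live() {
  case "$1" in
    ''|*[!0-9]*) return 2 ;;
  esac
  kill -0 "$1" 2>/dev/null
}

# Usage: pid_is_live "$$" && echo "this shell is running"
```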

---

### Partial Export Stuck in `.dce-temp`

**Symptoms:** Large folder under `output_dir/.dce-temp/export.<channel_id>.*`; archive cursor not advancing; audit excludes `.dce-temp` (expected).

**Solutions:**

1. **Stop any active export** writing that temp (check lock status and running `podman`/`docker` processes).

2. **Salvage quiescent temps** (default skips temps modified in the last ~120s):
   ```bash
   ./scripts/run-documents-scrape.sh --salvage-only --target NAME [--channel ID]
   ```

3. **Force salvage of an active temp** only after confirming nothing is writing:
   ```bash
   DCE_SALVAGE_ACTIVE_TEMPS=1 ./scripts/run-documents-scrape.sh --salvage-only --target NAME --channel ID
   ```

4. **Truncated JSON in the archive file itself** (not `.dce-temp`):
   ```bash
   ./scripts/salvage-truncated-export.sh path/to/archive.json
   ```
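
Before salvaging, it can be useful to see which temps look quiescent. The sketch below lists temp entries in which nothing has been modified for about two minutes, approximating the ~120s window from step 2. `quiescent_temps` is a hypothetical helper, it assumes the `output_dir/.dce-temp/export.<channel_id>.*` layout, and it relies on `find -mmin`, which GNU and BSD `find` support but POSIX does not require.

```bash
# Hypothetical helper: print temp exports under $1/.dce-temp that contain no
# file or directory modified within the last two minutes.
quiescent_temps() {
  local entry
  for entry in "$1"/.dce-temp/export.*; do
    [ -e "$entry" ] || continue   # the glob matched nothing
    # An entry counts as active if it, or anything inside it, changed recently.
    if [ -z "$(find "$entry" -mmin -2 2>/dev/null | head -n 1)" ]; then
      echo "$entry"
    fi
  done
}

# Usage: quiescent_temps /path/to/output_dir
```

Checking the contents and not only the folder matters because appending to a file inside a directory does not update the directory's own modification time.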

---

### "Failed to write archive" or Permission Denied

**Symptoms:** Export fails with write permission errors

**Solutions:**

1. **Check directory permissions:**
   ```bash
   ls -la archive/target-name/
   chmod 755 archive/target-name/
   chmod 644 archive/target-name/*.json
   ```

2. **If using Docker/Podman, set user mode:**
   ```bash
   # For rootless podman
   export DCE_USERNS_MODE=keep-id
   export DCE_UID=$(id -u)
   export DCE_GID=$(id -g)
   ```

3. **Check SELinux (if enabled):**
   ```bash
   getenforce
   # If "Enforcing", add `:z` to mount options:
   # docker-compose.yml should already have this
   ```

---

## Docker/Container Issues

### "Failed to build image" Error

**Symptoms:** Docker build fails during setup

**Solutions:**

1. **Verify Docker is running:**
   ```bash
   docker ps
   docker version
   ```

2. **Check disk space:**
   ```bash
   docker system df
   ```

3. **Clean up and retry:**
   ```bash
   # Warning: removes ALL unused images, stopped containers, unused networks, and build cache
   docker system prune -a
   docker-compose build --no-cache
   ```

4. **If using Podman:**
   ```bash
   podman system prune -a
   podman-compose build --no-cache
   ```

---

### "Cannot connect to Docker daemon" Error

**Symptoms:** Setup fails to reach Docker

**Solutions:**

1. **For Docker:**
   ```bash
   sudo systemctl start docker
   sudo usermod -aG docker "$USER"
   newgrp docker
   ```

2. **For Podman (rootless):**
   ```bash
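   # Podman has no long-running daemon; Docker-compatible clients connect through its API socket
   systemctl --user start podman.socket
   systemctl --user enable podman.socket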
   ```

---

## Authorization / Token Refresh Issues

### Host Retry Auth Flow Not Working

**Symptoms:** Export fails with 401/403 errors even with DISCORD_TOKEN_FILE set

**Solutions:**

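1. **Verify token file is readable** (without printing the token to the terminal):
   ```bash
   [ -r "$DISCORD_TOKEN_FILE" ] && [ -s "$DISCORD_TOKEN_FILE" ] && echo "token file OK"
   ```

2. **Ensure proper permissions:**
   ```bash
   chmod 600 "$DISCORD_TOKEN_FILE"
   ```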

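3. **Check token is fresh:**
   - Tokens can be invalidated, for example by a manual reset or, for user tokens, a password change
   - Generate a new bot token from the Discord Developer Portal, or obtain a fresh user token from a logged-in client session
   - Update the token file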

4. **Verify host wrapper is being used:**
   ```bash
   crontab -l | grep run-discord-scrape-host
   ```
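
The file checks above can be combined into one test that never prints the token. This is a minimal sketch: `token_file_private` is a hypothetical helper, and reading the permission columns from `ls -ld` is just one portable way to confirm that the file carries no group or other permission bits, as after `chmod 600`.

```bash
# Hypothetical helper: succeeds when the token file is readable, non-empty,
# and has no group/other permission bits set.
token_file_private() {
  local file="$1" mode
  [ -r "$file" ] || return 1
  [ -s "$file" ] || return 1
  mode=$(ls -ld "$file" | cut -c5-10)   # group and other permission columns
  [ "$mode" = "------" ]
}

# Usage: token_file_private "$DISCORD_TOKEN_FILE" || echo "check the token file" >&2
```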

---

## Getting Help

If you're still stuck:

1. **Check existing issues:** https://github.com/Tyrrrz/DiscordChatExporter/issues
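2. **Run preflight with shell tracing:**
   ```bash
   # Trace each command the script runs (a `set -x` in your own shell would not carry into the script).
   # The trace can include the token, so redact it before sharing.
   bash -x ./scripts/run-discord-scrape.sh preflight
   ```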

3. **Check logs:**
   ```bash
   # Docker logs
   docker-compose logs --tail 50

   # Cron logs (on Linux)
   sudo journalctl -u cron --since "1 hour ago"
   ```

4. **Collect error details** for reporting issues:
   - Config (sanitize token)
   - Full error message
   - OS/Docker version
   - Steps to reproduce
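
When sanitizing env files or logs for a report, a small filter can blank out token assignments. The sketch below is a hypothetical helper that covers only the `DISCORD_TOKEN=...` pattern, so the output should still be reviewed by eye before posting.

```bash
# Hypothetical filter: replace the value of any DISCORD_TOKEN=... assignment
# with REDACTED. Other variables, including DISCORD_TOKEN_FILE, are untouched.
redact_token() {
  sed -E 's/(DISCORD_TOKEN=)[^[:space:]]+/\1REDACTED/g'
}

# Usage: redact_token < scrape.env
```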