mirror of
https://github.com/Tyrrrz/DiscordChatExporter.git
synced 2026-06-10 00:02:37 -06:00
feat(scrape): harden preflight and cron config for Documents archives
Preflight probes skip forbidden channels when seeded archives exist. Cron installer passes container config path and supports --config override. Compose and docs align with append-only ~/Documents scrape workflow.
This commit is contained in:
parent
d415f9246b
commit
90bd9da143
|
|
@ -11,6 +11,14 @@ This guide walks you through setting up automated recurring Discord exports usin
|
|||
|
||||
## Quick Start
|
||||
|
||||
**Append-only contract (read first)**
|
||||
|
||||
- Each target writes under its configured `output_dir` (for example `~/Documents/KotOR_discord_msgs/`).
|
||||
- Existing files named `Guild - Category - Channel [channel_id].json` are discovered automatically and updated in place.
|
||||
- On the first run against an existing archive tree, the wrapper bootstraps `output_dir/.dce-meta/channel-map.json` from those filenames so it never creates a parallel export file.
|
||||
- Incremental exports use DiscordChatExporter `--after` with the highest existing message id, then merge new messages by id.
|
||||
- A merge that would reduce message count is rejected; the on-disk archive is left unchanged.
|
||||
|
||||
### 1. Configure Your Targets
|
||||
|
||||
Create or edit `config/scrape-targets.json` with your channel selections:
|
||||
|
|
@ -44,6 +52,8 @@ Create or edit `config/scrape-targets.json` with your channel selections:
|
|||
|
||||
### 2. Set Your Discord Token
|
||||
|
||||
**Cron requires `scrape.env`.** Manual `export DISCORD_TOKEN` works for one-off runs, but scheduled jobs run in a minimal environment and need a persisted env file.
|
||||
|
||||
Either copy the environment template:
|
||||
|
||||
```bash
|
||||
|
|
@ -76,7 +86,7 @@ Before the first incremental run, confirm each enabled target points at the corr
|
|||
./scripts/verify-documents-archives.sh --config config/scrape-targets.json
|
||||
```
|
||||
|
||||
Each enabled target should show a non-zero **JSON** count and **SEEDED** channel IDs under `/home/brunner56/Documents/<server>/`.
|
||||
Each enabled target should show a non-zero **JSON** count and **SEEDED** channel IDs under your configured `output_dir` values (see `archive_root` in `config/scrape-targets.json`).
|
||||
|
||||
**One-command workflow** (verify → preflight → incremental scrape):
|
||||
|
||||
|
|
@ -141,8 +151,8 @@ The default is monthly. Customize it with:
|
|||
# Run every day at 2 AM
|
||||
./scripts/setup-cron.sh --config config/scrape-targets.json --interval "daily" --at "2:00"
|
||||
|
||||
# Run every Sunday at noon
|
||||
./scripts/setup-cron.sh --config config/scrape-targets.json --interval "weekly" --at "sun 12:00"
|
||||
# Run every Sunday at noon (weekly uses Sunday; time is HH:MM only)
|
||||
./scripts/setup-cron.sh --config config/scrape-targets.json --interval weekly --at "12:00"
|
||||
|
||||
# Custom cron expression (every 6 hours)
|
||||
./scripts/setup-cron.sh --config config/scrape-targets.json --cron "0 */6 * * *"
|
||||
|
|
@ -186,15 +196,7 @@ archive_root/
|
|||
└── ...
|
||||
```
|
||||
|
||||
Existing exports are updated in-place with new messages appended and deduplicated by message ID.
|
||||
|
||||
**In-place append contract**
|
||||
|
||||
- Each target writes under its configured `output_dir` (for example `~/Documents/KotOR_discord_msgs/`).
|
||||
- Existing files named `Guild - Category - Channel [channel_id].json` are discovered automatically and updated in place.
|
||||
- On the first run against an existing archive tree, the wrapper bootstraps `output_dir/.dce-meta/channel-map.json` from those filenames so it never creates a parallel export file.
|
||||
- Incremental exports use DiscordChatExporter `--after` with the highest existing message id, then merge new messages by id.
|
||||
- A merge that would reduce message count is rejected; the on-disk archive is left unchanged.
|
||||
Existing exports are updated in-place with new messages appended and deduplicated by message ID. See **Append-only contract** at the top of this guide.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
|
|
@ -286,12 +288,21 @@ Re-run setup with new parameters (old entry replaced):
|
|||
Check logs from your last run:
|
||||
|
||||
```bash
|
||||
# Recent cron execution
|
||||
# Primary log file (default from setup-cron.sh)
|
||||
tail -f logs/discord-scrape.log
|
||||
|
||||
# Recent cron execution (system log)
|
||||
sudo grep discord-scrape /var/log/syslog # Debian/Ubuntu
|
||||
sudo grep discord-scrape /var/log/cron # CentOS/RHEL
|
||||
|
||||
# Or check via Docker logs if using containers
|
||||
docker-compose logs -f
|
||||
# Container build/run issues
|
||||
docker compose logs -f
|
||||
```
|
||||
|
||||
After a scheduled run, confirm archives grew in place:
|
||||
|
||||
```bash
|
||||
./scripts/prove-incremental-append.sh --target KotOR_discord_msgs
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
|
|
|||
|
|
@ -82,5 +82,6 @@ To learn more about the war and how you can help, [click here](https://tyrrrz.me
|
|||
## See also
|
||||
|
||||
- [**Recurring Exports**](.docs/Recurring-Scrape-Setup.md) — automated scheduled exports using cron (Linux/macOS)
|
||||
- [**Documented solutions**](docs/solutions/) — searchable learnings (append-only scrape, Docker/cron workflow); YAML frontmatter: `module`, `tags`, `problem_type`
|
||||
- [**Chat Analytics**](https://github.com/mlomb/chat-analytics) — solution for analyzing chat patterns of Discord users, using exports produced by **DiscordChatExporter**.
|
||||
- [**DiscordChatExporter-frontend**](https://github.com/slatinsky/DiscordChatExporter-frontend) — convenient viewer for exports produced by **DiscordChatExporter**.
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
---
|
||||
name: Recurring Discord scrape automation
|
||||
last_updated: 2026-05-25
|
||||
last_updated: 2026-05-29
|
||||
---
|
||||
|
||||
# Recurring Discord scrape automation Strategy
|
||||
|
|
|
|||
|
|
@ -8,11 +8,14 @@ services:
|
|||
user: "${DCE_UID:-1000}:${DCE_GID:-1000}"
|
||||
userns_mode: "${DCE_USERNS_MODE:-}"
|
||||
working_dir: /workspace
|
||||
env_file:
|
||||
- path: scrape.env
|
||||
required: false
|
||||
environment:
|
||||
DISCORD_TOKEN: ${DISCORD_TOKEN:?Set DISCORD_TOKEN in scrape.env or your shell environment.}
|
||||
DISCORD_TOKEN: ${DISCORD_TOKEN:-}
|
||||
TZ: ${TZ:-UTC}
|
||||
volumes:
|
||||
- ./config:/config:ro,z
|
||||
- ./scripts/run-discord-scrape.sh:/opt/dce-scheduler/run-discord-scrape.sh:ro,z
|
||||
- /home/brunner56/Documents:/home/brunner56/Documents:z
|
||||
- ${DCE_ARCHIVE_ROOT:-/home/brunner56/Documents}:${DCE_ARCHIVE_ROOT:-/home/brunner56/Documents}:z
|
||||
command: ["help"]
|
||||
|
|
|
|||
|
|
@ -102,3 +102,10 @@ Operators rely on `run-documents-scrape.sh`, `verify-documents-archives.sh`, and
|
|||
- All ten smoke scripts exit 0.
|
||||
|
||||
**Verification:** Single shell loop over `scripts/tests/*.sh`.
|
||||
|
||||
---
|
||||
|
||||
### Delta Update (2026-05-29)
|
||||
- **Landed:** Source-built Docker + compose + `setup-cron.sh` (monthly default); append-only merge; custom `~/Documents/*` targets; compound solution doc; preflight skips forbidden channels when seeded archives exist; `--config` on `setup-cron.sh`; compose `DCE_ARCHIVE_ROOT` + optional `scrape.env` for builds; operator doc fixes (append contract, weekly schedule, monitoring log path).
|
||||
- **Partial:** Live grow-only proof on all enabled targets not run in this pass; some channels remain forbidden under current token.
|
||||
- **Next:** `prove-incremental-append.sh` per enabled target; consider `container-smoke.sh` in CI when Docker is available on runners.
|
||||
|
|
|
|||
|
|
@ -0,0 +1,67 @@
|
|||
---
|
||||
title: feat: Documents recurring scrape verification and operator closure
|
||||
type: feat
|
||||
status: completed
|
||||
date: 2026-05-29
|
||||
origin: LFG — Docker/cron append-only Discord scrape for ~/Documents archive folders
|
||||
---
|
||||
|
||||
# feat: Documents recurring scrape verification and operator closure
|
||||
|
||||
## Summary
|
||||
|
||||
Close the recurring Discord scrape vertical slice: source-built Docker image, compose mounts for `config/scrape-targets.json` and `/home/brunner56/Documents` archives, append-only JSON merge in `scripts/run-discord-scrape.sh`, monthly cron via `scripts/setup-cron.sh`, and runtime proof (preflight + incremental scrape on at least one enabled target).
|
||||
|
||||
## Problem Frame
|
||||
|
||||
Operators need monthly (configurable) incremental exports into existing `~/Documents/*_discord*` folders without re-downloading full history or overwriting archives when Discord deletes messages server-side. Infrastructure exists on `feat/recurring-cli-scrape`; this pass validates end-to-end behavior and documents the operator path.
|
||||
|
||||
## Requirements
|
||||
|
||||
| ID | Requirement |
|
||||
|----|-------------|
|
||||
| R1 | `Dockerfile` builds `DiscordChatExporter.Cli` from source; compose mounts config, scripts, and `archive_root` |
|
||||
| R2 | `config/scrape-targets.json` maps user Documents folders; empty `channel_ids` exports all accessible channels per target |
|
||||
| R3 | `run-discord-scrape.sh` uses `--after` + merge-by-id; rejects shrink merges |
|
||||
| R4 | `setup-cron.sh` defaults to monthly schedule; supports `--target`, `--guild`, `--channel`, `--interval`, `--cron` |
|
||||
| R5 | `scrape.env` (gitignored) supplies token for compose; never commit secrets |
|
||||
| R6 | Preflight and one-target scrape succeed against live Discord API |
|
||||
| R7 | Smoke tests pass; operator docs list validation commands |
|
||||
|
||||
## Scope Boundaries
|
||||
|
||||
- No changes to upstream C# merge API (wrapper-only append).
|
||||
- Do not enable `discord_dms` without user token.
|
||||
- Token stays in `scrape.env` only.
|
||||
|
||||
## Implementation Units
|
||||
|
||||
### U1. Harden bootstrap and compose paths
|
||||
|
||||
**Requirements:** R1, R2
|
||||
|
||||
**Files:** `scripts/run-discord-scrape.sh`, `docker-compose.yml`, `Dockerfile`
|
||||
|
||||
**Test scenarios:** Archive seed files bootstrap channel-map; compose bind-mount resolves host Documents path.
|
||||
|
||||
### U2. Cron installer and docs alignment
|
||||
|
||||
**Requirements:** R4, R7
|
||||
|
||||
**Files:** `scripts/setup-cron.sh`, `.docs/Recurring-Scrape-Setup.md`, `Readme.md`
|
||||
|
||||
**Test scenarios:** `setup-cron.sh --dry-run` emits monthly block; `--remove` idempotent.
|
||||
|
||||
### U3. Runtime verification
|
||||
|
||||
**Requirements:** R5, R6
|
||||
|
||||
**Commands:** `docker compose build`, `run-discord-scrape-host.sh preflight`, scrape `--target` with smallest enabled archive.
|
||||
|
||||
**Test scenarios:** Message count non-decreasing after scrape; logs show `--after` when archive non-empty.
|
||||
|
||||
## Verification Ladder
|
||||
|
||||
1. `bash -n` on changed shell scripts
|
||||
2. `scripts/tests/setup-cron-smoke.sh`, `run-discord-scrape-smoke.sh`
|
||||
3. `docker compose build` + preflight + single-target scrape
|
||||
|
|
@ -645,26 +645,16 @@ resolve_target_channels() {
|
|||
fi
|
||||
}
|
||||
|
||||
preflight_target() {
|
||||
local target_json=$1
|
||||
local defaults_json=$2
|
||||
local target_name output_dir
|
||||
local probe_channel_id probe_dir probe_output
|
||||
local -a channel_ids
|
||||
preflight_probe_channel() {
|
||||
local probe_channel_id=$1
|
||||
local output_dir=$2
|
||||
local probe_dir probe_output probe_log
|
||||
local -a probe_command after_id probe_destination
|
||||
local probe_status=0
|
||||
|
||||
target_name=$(jq -r '.name' <<<"$target_json")
|
||||
output_dir=$(jq -r '.output_dir' <<<"$target_json")
|
||||
bootstrap_channel_map_from_archives "$output_dir"
|
||||
|
||||
mapfile -t channel_ids < <(resolve_target_channels "$target_json" "$defaults_json")
|
||||
if (( ${#channel_ids[@]} == 0 )); then
|
||||
die "Target '$target_name' resolved no channels during preflight."
|
||||
fi
|
||||
|
||||
probe_channel_id="${channel_ids[0]}"
|
||||
probe_dir=$(mktemp -d "${TMPDIR:-/tmp}/dce-preflight.${probe_channel_id}.XXXXXX")
|
||||
probe_output="$probe_dir/probe.json"
|
||||
local -a probe_command after_id probe_destination
|
||||
probe_log=$(mktemp "${TMPDIR:-/tmp}/dce-preflight-log.${probe_channel_id}.XXXXXX")
|
||||
|
||||
probe_destination=$(resolve_destination_path "$output_dir" "$probe_channel_id")
|
||||
after_id=""
|
||||
|
|
@ -683,13 +673,75 @@ preflight_target() {
|
|||
probe_command+=(--after "$after_id")
|
||||
fi
|
||||
|
||||
if ! "${probe_command[@]}"; then
|
||||
set +e
|
||||
"${probe_command[@]}" >"$probe_log" 2>&1
|
||||
probe_status=$?
|
||||
set -e
|
||||
|
||||
if (( probe_status == 0 )); then
|
||||
rm -f "$probe_log"
|
||||
rm -rf "$probe_dir"
|
||||
die "Target '$target_name' failed authenticated preflight on channel '$probe_channel_id'."
|
||||
return 0
|
||||
fi
|
||||
|
||||
if is_skippable_channel_export_failure "$probe_log"; then
|
||||
log "Preflight probe skipped channel $probe_channel_id (forbidden or inaccessible)."
|
||||
cat "$probe_log" >&2
|
||||
rm -f "$probe_log"
|
||||
rm -rf "$probe_dir"
|
||||
return 2
|
||||
fi
|
||||
|
||||
cat "$probe_log" >&2
|
||||
rm -f "$probe_log"
|
||||
rm -rf "$probe_dir"
|
||||
log "Preflight ok for target '$target_name': ${#channel_ids[@]} channel(s) resolved for $output_dir."
|
||||
return 1
|
||||
}
|
||||
|
||||
preflight_target() {
|
||||
local target_json=$1
|
||||
local defaults_json=$2
|
||||
local target_name output_dir
|
||||
local probe_channel_id
|
||||
local -a channel_ids seeded_channel_ids
|
||||
local probe_status=0
|
||||
local skipped_channels=0
|
||||
local probed_channels=0
|
||||
|
||||
target_name=$(jq -r '.name' <<<"$target_json")
|
||||
output_dir=$(jq -r '.output_dir' <<<"$target_json")
|
||||
bootstrap_channel_map_from_archives "$output_dir"
|
||||
|
||||
mapfile -t channel_ids < <(resolve_target_channels "$target_json" "$defaults_json")
|
||||
if (( ${#channel_ids[@]} == 0 )); then
|
||||
die "Target '$target_name' resolved no channels during preflight."
|
||||
fi
|
||||
|
||||
for probe_channel_id in "${channel_ids[@]}"; do
|
||||
probed_channels=$((probed_channels + 1))
|
||||
preflight_probe_channel "$probe_channel_id" "$output_dir" || probe_status=$?
|
||||
case "$probe_status" in
|
||||
0)
|
||||
log "Preflight ok for target '$target_name': ${#channel_ids[@]} channel(s) resolved for $output_dir."
|
||||
return 0
|
||||
;;
|
||||
2)
|
||||
skipped_channels=$((skipped_channels + 1))
|
||||
probe_status=0
|
||||
;;
|
||||
*)
|
||||
die "Target '$target_name' failed authenticated preflight on channel '$probe_channel_id'."
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
mapfile -t seeded_channel_ids < <(load_archive_seed_channel_ids "$output_dir" | sort -u)
|
||||
if (( skipped_channels == probed_channels && ${#seeded_channel_ids[@]} > 0 )); then
|
||||
log "Preflight ok for target '$target_name' with warning: all ${#channel_ids[@]} resolved channel(s) are inaccessible, but ${#seeded_channel_ids[@]} seeded archive(s) exist under $output_dir."
|
||||
return 0
|
||||
fi
|
||||
|
||||
die "Target '$target_name' failed preflight: every resolved channel is inaccessible and no seeded archives exist under $output_dir."
|
||||
}
|
||||
|
||||
scrape_target() {
|
||||
|
|
|
|||
|
|
@ -45,6 +45,7 @@ Options:
|
|||
--cron EXPR Use an explicit five-field cron expression instead of --interval/--at.
|
||||
--job-name NAME Marker name for the installed cron block. Default: discord-scrape
|
||||
--log-file PATH Cron log file. Default: $LOG_FILE
|
||||
--config PATH Scrape targets JSON. Default: $CONFIG_FILE
|
||||
--env-file PATH Compose env file. Default: $ENV_FILE
|
||||
--skip-preflight Install the cron job without running the authenticated container preflight.
|
||||
--dry-run Print the cron block instead of installing it.
|
||||
|
|
@ -130,6 +131,22 @@ append_target_args() {
|
|||
done
|
||||
}
|
||||
|
||||
container_config_path() {
|
||||
local config_path=$1
|
||||
|
||||
if [[ "$config_path" == "$REPO_ROOT/config/"* ]]; then
|
||||
printf '/config/%s\n' "$(basename "$config_path")"
|
||||
return 0
|
||||
fi
|
||||
|
||||
if [[ "$config_path" == config/* ]]; then
|
||||
printf '/config/%s\n' "${config_path#config/}"
|
||||
return 0
|
||||
fi
|
||||
|
||||
printf '%s\n' "$config_path"
|
||||
}
|
||||
|
||||
ensure_target_directories() {
|
||||
local selected_targets_json archive_root output_dir
|
||||
|
||||
|
|
@ -182,6 +199,7 @@ run_preflight() {
|
|||
--env-file "$ENV_FILE"
|
||||
--compose-file "$COMPOSE_FILE"
|
||||
preflight
|
||||
--config "$(container_config_path "$CONFIG_FILE")"
|
||||
)
|
||||
append_target_args preflight_args
|
||||
"${preflight_args[@]}"
|
||||
|
|
@ -230,6 +248,11 @@ main() {
|
|||
LOG_FILE=$2
|
||||
shift 2
|
||||
;;
|
||||
--config)
|
||||
[[ $# -ge 2 ]] || die "Missing value for --config."
|
||||
CONFIG_FILE=$2
|
||||
shift 2
|
||||
;;
|
||||
--env-file)
|
||||
[[ $# -ge 2 ]] || die "Missing value for --env-file."
|
||||
ENV_FILE=$2
|
||||
|
|
@ -315,6 +338,7 @@ main() {
|
|||
--env-file "$ENV_FILE"
|
||||
--compose-file "$COMPOSE_FILE"
|
||||
scrape
|
||||
--config "$(container_config_path "$CONFIG_FILE")"
|
||||
)
|
||||
append_target_args scrape_args
|
||||
scrape_command=$(printf '%q ' "${scrape_args[@]}")
|
||||
|
|
|
|||
Loading…
Reference in a new issue