Add recurring CLI scrape automation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Your Name 2026-05-24 17:04:07 -05:00
parent 48b52041a5
commit 43f5fa3b71
16 changed files with 1299 additions and 0 deletions


@ -77,3 +77,22 @@ $ docker run -it --rm -v $PWD/data:/out --user $(id -u):$(id -g) tyrrrz/discordc
DiscordChatExporter CLI accepts the `DISCORD_TOKEN` environment variable as a fallback for the `--token` option. You can set this variable either with the `--env` Docker option or with a combination of the `--env-file` Docker option and a `.env` file.
Please refer to the [Docker documentation](https://docs.docker.com/engine/reference/commandline/run/#set-environment-variables--e---env---env-file) for more information.
## Source-built recurring scraper
This repo also ships a local recurring wrapper around the CLI for source-built automation:
- `Dockerfile` builds `DiscordChatExporter.Cli` from source.
- `docker-compose.yml` runs the wrapper container.
- `scripts/run-discord-scrape.sh preflight` validates token/config/target resolution without writing archives.
- `scripts/run-discord-scrape.sh scrape` performs append-oriented JSON updates by exporting newer messages and merging them into the existing local archive instead of blindly replacing the destination file.
For the recurring flow, keep secrets in `scrape.env` (copied from `scrape.env.example`) and keep target/output mapping in `config/scrape-targets.json`.
For recurring runs, targets with `enabled: false` are skipped by default. This is the recommended way to keep unresolved archive roots in the config without blocking the rest of the schedule.
If you authenticate with a **bot token**, do not rely on guild-name or DM discovery. Discord does not expose "list my accessible guilds" or DM enumeration for bot-only REST auth, so recurring targets should be configured with explicit `guild_ids` / `channel_ids` or backed by existing archive files whose names already embed channel IDs (for example, `general [123456789012345678].json`). The `discord_dms` target should stay disabled unless you switch to a user token.
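As a rough illustration of seeding channel IDs from existing archive filenames, the sketch below pulls the bracketed ID out of a name such as `general [123456789012345678].json`. The function name `channel_id_from_archive` and the 15 to 20 digit bound are assumptions made for this example; the checked-in wrapper may parse names differently.

```shell
# Print the channel ID embedded in an archive filename, e.g.
# "general [123456789012345678].json". Returns 1 and prints nothing
# when the name does not end in "[<digits>].json".
# Assumption: IDs are 15-20 decimal digits (hypothetical bound).
channel_id_from_archive() {
  cid=$(printf '%s\n' "$1" | sed -n 's/^.*\[\([0-9]\{15,20\}\)]\.json$/\1/p')
  [ -n "$cid" ] || return 1
  printf '%s\n' "$cid"
}
```

Because the leading `.*` is greedy, a name that contains several bracketed groups yields the last one, which is the group directly before `.json`.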
`preflight` now probes one resolved channel per selected target with the source-built CLI before cron is installed. If the token cannot read that channel, setup fails closed and leaves the existing crontab untouched.
If you run the recurring flow through podman on an SELinux-enabled host, keep the bind mounts relabeled (`:z`). The checked-in `docker-compose.yml` already applies this to the recurring wrapper mounts.


@ -1,5 +1,31 @@
# Scheduling exports with Cron
## Recommended recurring wrapper
This repo now includes a source-built recurring wrapper around the CLI:
- `scripts/setup-cron.sh` installs, previews, updates, and removes one managed cron block.
- `Dockerfile` + `docker-compose.yml` build and run the CLI from source.
- `scripts/run-discord-scrape.sh preflight` validates token/config/target resolution without writing archives.
- `scripts/run-discord-scrape.sh scrape` performs append-oriented JSON updates so existing local history is retained instead of overwritten.
The recommended Linux flow is:
1. Copy `scrape.env.example` to `scrape.env` and set `DISCORD_TOKEN`.
2. Review `config/scrape-targets.json` and keep archive roots under the configured `archive_root`.
3. Run `./scripts/setup-cron.sh` for the default monthly schedule, or pass `--interval`, `--at`, or `--cron` to customize it.
4. Re-run the same script later to update the managed cron block idempotently. Use `--remove` to delete only the managed block.
The host cron schedule is authoritative for execution time. Container `TZ` only affects process/runtime timestamps.
Targets with `enabled: false` are skipped by default. Use that field for archive roots that you want to keep in the config but cannot currently resolve safely.
If you are using a **bot token**, do not depend on guild-name or DM discovery. Bot tokens cannot enumerate accessible guilds or direct messages through the Discord REST API, so recurring targets need either explicit `guild_ids` / `channel_ids` or existing archive filenames that already encode channel IDs. The recurring wrapper can seed channel selection from those archive filenames, but setup still probes one real channel per target before touching crontab state.
If any selected target fails that authenticated probe, `setup-cron.sh` stops without mutating the live crontab. In practice this means the token must already have access to every enabled target you expect cron to update.
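The fail-closed ordering described above can be sketched as a small gate: probe every selected target first, and only call the install step when all probes pass. The names `probe_all_then_install`, `probe_cmd`, and `install_cmd` are hypothetical and are not taken from `setup-cron.sh`.

```shell
# probe_all_then_install PROBE_CMD INSTALL_CMD TARGET...
# Runs PROBE_CMD once per target. INSTALL_CMD runs only if every probe
# succeeds; on the first failure nothing is installed.
probe_all_then_install() {
  probe_cmd=$1
  install_cmd=$2
  shift 2
  for target in "$@"; do
    if ! "$probe_cmd" "$target"; then
      echo "preflight failed for target '$target'; crontab left untouched" >&2
      return 1
    fi
  done
  "$install_cmd"
}
```

Keeping the install step as the last statement means there is no code path that writes cron state before the final probe has returned.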
If you are running the recurring wrapper through podman on an SELinux-enabled host, keep the bind mounts relabeled (`:z`). The checked-in `docker-compose.yml` already includes that for the recurring config and archive mounts.
## Creating the script
1. Open Terminal and create a new text file with `nano /path/to/DiscordChatExporter/cron.sh`

.gitignore vendored

@ -10,3 +10,7 @@ obj/
# Test results
TestResults/
# Local automation secrets and logs
scrape.env
logs/


@ -224,6 +224,15 @@ public class DiscordClient(
        [EnumeratorCancellation] CancellationToken cancellationToken = default
    )
    {
        if (await ResolveTokenKindAsync(cancellationToken) == TokenKind.Bot)
        {
            throw new DiscordChatExporterException(
                "Bot tokens cannot enumerate accessible servers through the Discord REST API. "
                    + "Provide explicit server or channel IDs instead of relying on guild discovery.",
                true
            );
        }
        yield return Guild.DirectMessages;
        var currentAfter = Snowflake.Zero;
@ -271,6 +280,15 @@ public class DiscordClient(
    {
        if (guildId == Guild.DirectMessages.Id)
        {
            if (await ResolveTokenKindAsync(cancellationToken) == TokenKind.Bot)
            {
                throw new DiscordChatExporterException(
                    "Bot tokens cannot access direct message channels. "
                        + "Disable DM scraping or use a user token for DM discovery.",
                    true
                );
            }
            var response = await GetJsonResponseAsync("users/@me/channels", cancellationToken);
            foreach (var channelJson in response.EnumerateArray())
                yield return Channel.Parse(channelJson);

Dockerfile Normal file

@ -0,0 +1,41 @@
FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:10.0-alpine AS build
ARG TARGETARCH
ARG VERSION=0.0.0
WORKDIR /src
COPY favicon.ico .
COPY NuGet.config .
COPY Directory.Build.props .
COPY Directory.Packages.props .
COPY DiscordChatExporter.Core DiscordChatExporter.Core
COPY DiscordChatExporter.Cli DiscordChatExporter.Cli
RUN dotnet publish DiscordChatExporter.Cli \
    -p:Version=$VERSION \
    -p:CSharpier_Bypass=true \
    --configuration Release \
    --self-contained \
    --use-current-runtime \
    --arch "$TARGETARCH" \
    --output /opt/publish
FROM --platform=$TARGETPLATFORM mcr.microsoft.com/dotnet/runtime-deps:10.0-alpine AS run
RUN apk add --no-cache bash jq icu-libs icu-data-full tzdata
ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false \
    LC_ALL=en_US.UTF-8 \
    LANG=en_US.UTF-8
WORKDIR /workspace
COPY --from=build /opt/publish /opt/app
COPY config /opt/dce-config
COPY scripts/run-discord-scrape.sh /opt/dce-scheduler/run-discord-scrape.sh
RUN chmod 755 /opt/dce-scheduler/run-discord-scrape.sh
ENTRYPOINT ["/opt/dce-scheduler/run-discord-scrape.sh"]
CMD ["help"]

config/scrape-targets.json Normal file

@ -0,0 +1,139 @@
{
"archive_root": "/home/brunner56/Documents",
"defaults": {
"include_threads": "all",
"include_voice_channels": false
},
"targets": [
{
"name": "ror_orig_discord",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/ror_orig_discord",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["ror_orig_discord", "ror orig"]
},
{
"name": "ror_new_discord",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/ror_new_discord",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["ror_new_discord", "ror new"]
},
{
"name": "Rabbit",
"enabled": false,
"kind": "guild",
"output_dir": "/home/brunner56/Documents/Rabbit",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["Rabbit"],
"disabled_reason": "No archived channel IDs were found locally. Add explicit guild_ids/channel_ids or switch to a user token for discovery."
},
{
"name": "OpenKotOR_discord_msgs",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/OpenKotOR_discord_msgs",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["OpenKotOR_discord_msgs", "OpenKotOR"]
},
{
"name": "openkotor_discord_msgs",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/openkotor_discord_msgs",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["openkotor_discord_msgs", "openkotor"]
},
{
"name": "MuseScore4",
"enabled": false,
"kind": "guild",
"output_dir": "/home/brunner56/Documents/MuseScore4",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["MuseScore4", "MuseScore"],
"disabled_reason": "No archived channel IDs were found locally. Add explicit guild_ids/channel_ids or switch to a user token for discovery."
},
{
"name": "lmms",
"enabled": false,
"kind": "guild",
"output_dir": "/home/brunner56/Documents/lmms",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["lmms"],
"disabled_reason": "No archived channel IDs were found locally. Add explicit guild_ids/channel_ids or switch to a user token for discovery."
},
{
"name": "KotOR_Speedrun_Discord",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/KotOR_Speedrun_Discord",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["KotOR_Speedrun_Discord", "KotOR Speedrun"]
},
{
"name": "KotOR_discord_msgs",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/KotOR_discord_msgs",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["KotOR_discord_msgs", "KotOR"]
},
{
"name": "holocron_toolset_discord",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/holocron_toolset_discord",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["holocron_toolset_discord", "holocron toolset"]
},
{
"name": "expanded_kotor_discord",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/expanded_kotor_discord",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["expanded_kotor_discord", "expanded kotor"]
},
{
"name": "eod_discord",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/eod_discord",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["eod_discord", "eod"]
},
{
"name": "DS_Discord_msgs",
"kind": "guild",
"output_dir": "/home/brunner56/Documents/DS_Discord_msgs",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["DS_Discord_msgs", "DS"]
},
{
"name": "discord_dms",
"enabled": false,
"kind": "dms",
"output_dir": "/home/brunner56/Documents/discord_dms",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": [],
"disabled_reason": "The current token cannot read direct messages. Enable this target only with a user token."
},
{
"name": "Cline",
"enabled": false,
"kind": "guild",
"output_dir": "/home/brunner56/Documents/Cline",
"guild_ids": [],
"channel_ids": [],
"guild_name_patterns": ["Cline"],
"disabled_reason": "No archived channel IDs were found locally. Add explicit guild_ids/channel_ids or switch to a user token for discovery."
}
]
}

docker-compose.yml Normal file

@ -0,0 +1,16 @@
services:
  discord-scraper:
    build:
      context: .
      dockerfile: Dockerfile
    image: discordchatexporter-cron:local
    init: true
    user: "${DCE_UID:-1000}:${DCE_GID:-1000}"
    working_dir: /workspace
    environment:
      DISCORD_TOKEN: ${DISCORD_TOKEN:?Set DISCORD_TOKEN in scrape.env or your shell environment.}
      TZ: ${TZ:-UTC}
    volumes:
      - ./config:/config:ro,z
      - /home/brunner56/Documents:/home/brunner56/Documents:z
    command: ["help"]


@ -0,0 +1,383 @@
---
title: "feat: Add recurring CLI scrape automation"
type: feat
status: active
date: 2026-05-24
---
# feat: Add recurring CLI scrape automation
## Summary
Add a CLI-first recurring scrape workflow that builds `DiscordChatExporter.Cli` from source, installs a host-side cron job with safe defaults, and writes exports into the configured `Documents` roots without overwriting previously archived history.
The implementation stays in the shell/Docker wrapper layer rather than changing the core exporter. The core safety guarantee is append-oriented local history preservation: incremental reruns only add newer messages and must fail closed when local state looks ambiguous or unsafe.
---
## Problem Frame
The repo now has a draft Docker/cron wrapper, but the user needs it hardened enough to run with a real token and existing archives. That means the system has to be idempotent at the cron layer, deterministic about which Discord targets map into which output roots, and strict about not corrupting local data even though the upstream CLI itself writes fresh files by default.
---
## Assumptions
*This plan was authored without synchronous user confirmation. The items below are agent inferences that fill gaps in the input — un-validated bets that should be reviewed before implementation proceeds.*
- `DiscordChatExporter.Cli/` in this repo is the source of truth for implementation; the sibling `DiscordChatExporter.Cli.linux-x64` bundle in the parent folder is only a runtime reference point.
- Named target directories should not be allowed to float across unrelated Discord servers; ambiguous guild-name matches should fail until the target is made explicit.
- `--guild` and `--channel` overrides are intended to narrow a configured target, not remap that target to unrelated Discord surfaces.
- Preserving local history is more important than refreshing edits or reactions on already archived older messages.
- Host cron time is the schedule authority; container `TZ` is only for process/runtime timestamp behavior.
---
## Requirements
- R1. Build the recurring scraper from the source repo's CLI project (`DiscordChatExporter.Cli/`) rather than relying on the downloaded binary bundle.
- R2. Use the configured custom output roots under the user's `Documents` tree and reject configuration that points outside the approved roots or reuses the same directory for multiple targets.
- R3. Install, update, preview, and remove a single managed cron block idempotently, with a monthly default schedule and CLI options for alternate cadence and target selection.
- R4. For a selected target with no explicit channel list, scrape everything discoverable for that target through the CLI discovery surface (`guilds`, `channels`, `dm`) rather than a separate config-only inventory.
- R5. Preserve previously archived local history by using incremental CLI exports and safe merge semantics instead of overwriting destination exports, even when upstream Discord history has changed or deleted messages are no longer retrievable.
- R6. Fail closed on ambiguous or unsafe state: ambiguous target resolution, invalid destination JSON, mismatched channel identity, missing token/runtime config, or failed preflight must not mutate cron state or corrupt archives.
- R7. Provide a real setup-time validation path that exercises the source-built container and authenticated CLI discovery before the cron job is installed for unattended use.
---
## Scope Boundaries
- No changes to the core C# exporter to add native append semantics; safety stays in the wrapper/runtime layer.
- No attempt to backfill edits, reaction changes, or other mutations on already archived older messages unless a future requirement explicitly asks for that behavior.
- No support for output roots outside the configured `Documents` archive area.
- No new standalone service or daemon; the recurring workflow remains host cron + one-shot container runs.
### Deferred to Follow-Up Work
- CI automation for the new shell/Docker smoke checks if the repo later decides to enforce them on every PR.
- Cross-platform scheduler parity beyond the Linux cron flow already documented in this repo.
---
## Context & Research
### Relevant Code and Patterns
- `DiscordChatExporter.Cli.dockerfile` is the upstream source-build container pattern for the CLI.
- `docker-entrypoint.sh` shows the current container's ownership and startup assumptions.
- `DiscordChatExporter.Cli/Commands/Base/DiscordCommandBase.cs` defines token handling via `DISCORD_TOKEN`.
- `DiscordChatExporter.Cli/Commands/Base/ExportCommandBase.cs` defines `--after`, output path behavior, and multi-channel export constraints.
- `DiscordChatExporter.Cli/Commands/GetGuildsCommand.cs`, `DiscordChatExporter.Cli/Commands/GetChannelsCommand.cs`, and `DiscordChatExporter.Cli/Commands/GetDirectChannelsCommand.cs` are the authoritative discovery surfaces for accessible guilds, guild channels, and DMs.
- `DiscordChatExporter.Core/Exporting/MessageExporter.cs` confirms the upstream exporter writes fresh files, so append-only behavior must live outside the core writer.
- `DiscordChatExporter.Core/Exporting/ExportRequest.cs` defines output naming and path token rules that the wrapper must either honor or constrain deliberately.
- `.docs/Docker.md` and `.docs/Scheduling-Linux.md` are the existing repo docs for container usage and cron-based scheduling.
### Institutional Learnings
- No relevant `docs/solutions/` or equivalent institutional learning artifacts were found in this repo.
### External References
- None used. Repo-local CLI, Docker, and scheduling docs were sufficient for this plan.
---
## Key Technical Decisions
- **Keep the feature in the wrapper layer:** use the existing CLI as the only export engine and build it from source, instead of modifying the core exporter or introducing a second export path. This keeps behavior aligned with upstream CLI semantics and avoids forking core export logic.
- **Prefer exact target identity over fuzzy convenience:** explicit `guild_ids` remain the strongest configuration, while name-pattern resolution must fail when it cannot resolve to one intended target safely. Wrong-server exports are worse than an operator-visible failure.
- **Treat append-only safety as a hard data-integrity contract:** incremental runs will derive a checkpoint from the existing archive, export newer messages to a temporary file, validate channel identity, merge by message ID, and replace the destination only after a complete merged file exists.
- **Constrain overrides to the selected target:** runtime `--guild` and `--channel` overrides should narrow the configured target's scope, not redirect that target's archive root to unrelated Discord surfaces.
- **Use host cron as the scheduler of record:** the cron job owns cadence, overlap protection, and idempotent installation. The container remains a one-shot worker that receives token/config/runtime context from the host environment.
- **Require setup-time preflight before mutating cron:** install should not stop at static file checks; it should prove the source-built container starts correctly and that authenticated CLI discovery works with the configured env before a recurring job is written.
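To make the override decision above concrete, here is a minimal sketch of "narrow, never remap": an override is accepted only if it is already part of the target's resolved set. The function name `narrow_to_target` and the newline-separated ID list are assumptions for illustration, not the wrapper's actual interface.

```shell
# narrow_to_target ALLOWED_IDS [OVERRIDE_ID...]
# ALLOWED_IDS is a newline-separated list of IDs resolved for the target.
# With no overrides, prints the full allowed set. With overrides, prints
# only the overrides, and fails if any override is outside the allowed set.
narrow_to_target() {
  allowed=$1
  shift
  for id in "$@"; do
    if ! printf '%s\n' "$allowed" | grep -qxF -- "$id"; then
      echo "override $id is outside the selected target's scope" >&2
      return 1
    fi
  done
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"
  else
    printf '%s\n' "$allowed"
  fi
}
```

Using `grep -x -F` keeps the comparison an exact, literal, whole-line match, so a shorter ID can never pass as a prefix of a longer one.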
---
## Open Questions
### Resolved During Planning
- **What should happen on ambiguous guild-name matches?** Fail the target instead of exporting all matches.
- **How should `--guild` and `--channel` overrides behave?** They should narrow one selected target only.
- **What is the safety bar for merge writes?** Use recoverable replacement semantics rather than in-place rewrite or best-effort append.
- **What should happen if an existing JSON file belongs to a different channel?** Hard-fail the affected target and leave the existing archive untouched.
- **Should reruns revisit old edited messages?** No; append-only snapshot preservation wins unless a later requirement changes that.
- **Which timezone governs the monthly default?** The host cron schedule does; container timezone is informational/runtime only.
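The first resolution above (fail on ambiguous guild-name matches) could look roughly like the following. This is a sketch under assumptions: candidates arrive on stdin as `id<TAB>name` lines, matching is a case-insensitive substring test, and `resolve_unique_guild` is a hypothetical name. The real wrapper may match patterns differently.

```shell
# resolve_unique_guild PATTERN
# Reads "id<TAB>name" lines on stdin and prints the single guild ID whose
# name contains PATTERN (case-insensitive). Zero or multiple matches fail.
resolve_unique_guild() {
  pattern=$1
  matches=$(awk -F '\t' -v pat="$pattern" \
    'index(tolower($2), tolower(pat)) > 0 { print $1 }')
  # Count non-empty lines; grep -c exits 1 on zero matches, hence "|| true".
  count=$(printf '%s' "$matches" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "pattern '$pattern' matched $count guilds; refusing to guess" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}
```

A zero-match result and a multi-match result take the same failure path, which keeps a renamed or duplicated server from silently landing in another target's archive root.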
### Deferred to Implementation
- **How deep should authenticated setup preflight go by default?** The implementer should choose the safest proof path after wiring the container end-to-end: prefer discovery-only validation if it gives enough confidence, but allow a disposable sandbox export if discovery alone cannot prove the scrape path.
- **How should shell smoke coverage be executed in this repo?** The implementer should decide whether the new shell checks live as standalone scripts only or are also wrapped by an existing repo test entrypoint if one emerges during execution.
---
## High-Level Technical Design
> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.*
```mermaid
flowchart TD
A[setup-cron.sh] --> B[Validate env, config, compose surface]
B --> C[Run container preflight]
C --> D{Preflight clean?}
D -- no --> E[Stop without changing crontab]
D -- yes --> F[Install or replace one managed cron block]
G[Cron fires] --> H[Acquire host-side lock]
H --> I[docker compose run discord-scraper scrape]
I --> J[Resolve target -> guilds/channels/DMs through CLI]
J --> K[Resolve destination archive file for each channel]
K --> L[Read last exported message ID]
L --> M[CLI export --after checkpoint to temp JSON]
M --> N[Validate temp + existing archive identity]
N --> O{New messages?}
O -- no --> P[Leave archive unchanged]
O -- yes --> Q[Merge by message ID in temp workspace]
Q --> R[Replace destination only after merged file is complete]
```
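The "Acquire host-side lock" step in the diagram can be illustrated with an atomic `mkdir`, which either creates the lock directory or fails if another run already holds it. This is only one possible mechanism; `flock(1)` from util-linux is a common alternative. The name `with_lock` is hypothetical.

```shell
# with_lock LOCK_DIR COMMAND [ARG...]
# Runs COMMAND while holding LOCK_DIR. If the lock is already held, the
# command is skipped and 75 is returned (mirrors EX_TEMPFAIL in sysexits.h).
with_lock() {
  lock_dir=$1
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "another run holds $lock_dir; skipping" >&2
    return 75
  fi
  status=0
  "$@" || status=$?
  rmdir "$lock_dir"
  return "$status"
}
```

One limitation of this sketch is that a run killed before `rmdir` leaves a stale lock behind, so a production script would also need a cleanup trap or a staleness check.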
---
## Implementation Units
### U1. Lock down the target and config contract
**Goal:** Make target selection deterministic and safe so each configured archive root only accepts the intended Discord surfaces.
**Requirements:** R2, R4, R6
**Dependencies:** None
**Files:**
- Modify: `config/scrape-targets.json`
- Modify: `scripts/run-discord-scrape.sh`
- Modify: `scripts/setup-cron.sh`
- Create: `scripts/tests/run-discord-scrape-smoke.sh`
- Create: `scripts/tests/test-fixtures/`
**Approach:**
- Define the supported recurring-scrape config fields explicitly and reject unsupported or unsafe values instead of silently ignoring them.
- Enforce that configured `output_dir` values stay under the approved archive root and remain unique across targets.
- Tighten target resolution so explicit IDs win, name-pattern resolution fails when it is ambiguous, and runtime overrides can only narrow one selected target.
- Normalize failure handling so invalid config or target mapping errors stop the affected setup/run path before any archive mutation.
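The `output_dir` rules in the approach above (stay under the archive root, stay unique) can be sketched as a pure string check. `validate_output_dirs` is a hypothetical helper; it takes the root and the directories as arguments rather than reading `config/scrape-targets.json`, and it does not resolve symlinks.

```shell
# validate_output_dirs ARCHIVE_ROOT OUTPUT_DIR...
# Fails if any OUTPUT_DIR is outside ARCHIVE_ROOT, contains a ".."
# component, or appears more than once.
validate_output_dirs() {
  root=${1%/}
  shift
  for dir in "$@"; do
    dir=${dir%/}
    case "$dir" in
      */../*|*/..)
        echo "output_dir must not contain '..': $dir" >&2
        return 1 ;;
    esac
    case "$dir" in
      "$root"/*) ;;
      *)
        echo "output_dir is outside $root: $dir" >&2
        return 1 ;;
    esac
  done
  # Strip trailing slashes so "/a/b" and "/a/b/" count as the same directory.
  dups=$(printf '%s\n' "$@" | sed 's:/*$::' | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate output_dir: $dups" >&2
    return 1
  fi
}
```

Running this before any cron or archive write matches the stated goal that invalid config stops the run early rather than partway through a scrape.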
**Execution note:** Start with characterization-style smoke coverage for target selection and config validation before tightening behavior in the scripts.
**Patterns to follow:**
- `DiscordChatExporter.Cli/Commands/GetGuildsCommand.cs`
- `DiscordChatExporter.Cli/Commands/GetChannelsCommand.cs`
- `DiscordChatExporter.Cli/Commands/GetDirectChannelsCommand.cs`
- `.docs/Scheduling-Linux.md`
**Test scenarios:**
- Happy path: a target with explicit `guild_ids` resolves only that guild's channels and keeps the configured archive root.
- Happy path: the DM target resolves all accessible DM channels when no explicit channel list is configured.
- Edge case: a target with zero guild-name matches fails that target cleanly without mutating unrelated targets.
- Edge case: a target with multiple guild-name matches fails closed and reports the ambiguity.
- Error path: duplicate `output_dir` values across targets are rejected before cron install or scrape execution.
- Error path: an override channel or guild outside the selected target's allowed scope is rejected.
- Integration: repeated validation runs against the same config produce the same resolved target set and do not create extra state files.
**Verification:**
- Invalid target/config state is surfaced before cron or archive writes occur.
- A selected target always maps to one predictable archive root and one intended Discord surface set.
---
### U2. Enforce append-only archive safety in the scrape wrapper
**Goal:** Make recurring runs preserve already-downloaded local history while still importing newer messages safely.
**Requirements:** R4, R5, R6
**Dependencies:** U1
**Files:**
- Modify: `scripts/run-discord-scrape.sh`
- Modify: `scripts/tests/run-discord-scrape-smoke.sh` (created in U1)
- Create: `scripts/tests/test-fixtures/append-existing.json`
- Create: `scripts/tests/test-fixtures/append-incremental.json`
- Create: `scripts/tests/test-fixtures/wrong-channel.json`
**Approach:**
- Keep the CLI export path authoritative, but derive an incremental checkpoint from the existing archive before each rerun.
- Write incremental exports and merged results in temporary locations on the same filesystem as the destination archive.
- Validate that an existing archive's embedded channel metadata matches the channel being updated before any merge occurs.
- Merge message arrays by message ID and preserve pre-existing local history when newer incremental exports omit older or deleted upstream content.
- Adopt one consistent local-state failure policy: fail the affected target, leave the destination file unchanged, and continue only unrelated targets.
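The approach above can be sketched with `jq`, which the `Dockerfile` in this commit installs. The sketch assumes the JSON archives expose `.channel.id`, a `.messages` array whose entries carry a string `.id`, and a `.messageCount` field; `last_message_id` and `merge_archive` are hypothetical names, not functions confirmed to exist in `run-discord-scrape.sh`.

```shell
# Print the highest message ID in an archive, or nothing if it is empty.
# Sorting by [length, value] orders decimal ID strings numerically
# without converting them to floating-point numbers.
last_message_id() {
  jq -r '[.messages[].id] | sort_by([length, .]) | last // empty' "$1"
}

# merge_archive DEST INCREMENTAL
# Merges INCREMENTAL into DEST by message ID. DEST is left untouched
# on any failure.
merge_archive() {
  dest=$1
  incr=$2

  # Refuse anything that is not parseable JSON with a messages array.
  for f in "$dest" "$incr"; do
    jq -e '.messages | type == "array"' "$f" >/dev/null 2>&1 || {
      echo "invalid archive JSON: $f" >&2
      return 1
    }
  done

  # Refuse to merge exports that belong to different channels.
  dest_channel=$(jq -r '.channel.id // empty' "$dest")
  incr_channel=$(jq -r '.channel.id // empty' "$incr")
  if [ -z "$dest_channel" ] || [ "$dest_channel" != "$incr_channel" ]; then
    echo "channel mismatch: '$dest_channel' vs '$incr_channel'" >&2
    return 1
  fi

  # Build the merged file next to DEST so the final mv is a rename on the
  # same filesystem rather than a copy.
  tmp=$(mktemp "$(dirname "$dest")/.merge.XXXXXX") || return 1
  if jq -s '
      .[0] as $old | .[1] as $new
      | ($old.messages + $new.messages
         | unique_by(.id)
         | sort_by([(.id | length), .id])) as $msgs
      | $old + {messages: $msgs, messageCount: ($msgs | length)}
    ' "$dest" "$incr" > "$tmp"; then
    mv -f "$tmp" "$dest"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

The value printed by `last_message_id` is what a rerun would pass to the CLI's `--after` option as its checkpoint. Note that `mktemp` creates the temporary file with restrictive permissions, so a real script may want to copy the destination's mode before the rename.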
**Patterns to follow:**
- `DiscordChatExporter.Core/Exporting/MessageExporter.cs`
- `DiscordChatExporter.Core/Exporting/ExportRequest.cs`
- `DiscordChatExporter.Cli.Tests/Specs/DateRangeSpecs.cs`
**Test scenarios:**
- Happy path: a first-time scrape creates a new JSON archive for a channel.
- Happy path: a rerun with newer messages appends only the new messages and keeps previously archived older messages intact.
- Edge case: a rerun with zero newer messages leaves the destination archive unchanged.
- Edge case: overlapping incremental data is deduplicated by message ID instead of producing duplicate messages.
- Error path: invalid destination JSON aborts the affected target without replacing the archive.
- Error path: a destination archive whose embedded channel identity does not match the requested channel aborts the affected target without replacement.
- Integration: a fixture that removes older messages from the incremental file still produces a merged archive containing the original older history.
**Verification:**
- Repeated reruns never truncate the existing `messages` history for a valid archive.
- Unsafe local state causes a visible failure without replacing the destination archive.
---
### U3. Align the source-built container and runtime preflight path
**Goal:** Make the container/runtime layer reflect the CLI-only source build, mounted archive roots, and token/config contract needed for safe recurring operation.
**Requirements:** R1, R2, R6, R7
**Dependencies:** U1
**Files:**
- Modify: `Dockerfile`
- Modify: `docker-compose.yml`
- Modify: `scrape.env.example`
- Modify: `.gitignore`
- Create: `scripts/tests/container-smoke.sh`
**Approach:**
- Keep the runtime image focused on the CLI wrapper surface and source-build `DiscordChatExporter.Cli` inside the Docker image.
- Prefer environment-based token injection over committed files or inline command flags.
- Mount only the archive area and config surface needed for recurring runs, avoiding container startup behaviors that could recursively rewrite ownership across the whole archive tree.
- Make config discovery deterministic: mounted repo config should win when present, with a built-in fallback only to keep the wrapper executable when the mount is absent.
- Add a preflight surface that proves the image, wrapper, and config visibility are working before cron is installed.
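A minimal sketch of the config-discovery and token rules above follows. The default paths mirror the `/config` bind mount in `docker-compose.yml` and the `/opt/dce-config` copy in the `Dockerfile`; the function names `resolve_config_path` and `require_token` are hypothetical.

```shell
# resolve_config_path [MOUNTED_PATH] [FALLBACK_PATH]
# Prints the mounted config when it is readable, otherwise the copy baked
# into the image. Fails when neither exists.
resolve_config_path() {
  mounted=${1:-/config/scrape-targets.json}
  fallback=${2:-/opt/dce-config/scrape-targets.json}
  if [ -r "$mounted" ]; then
    printf '%s\n' "$mounted"
  elif [ -r "$fallback" ]; then
    printf '%s\n' "$fallback"
  else
    echo "no readable scrape-targets.json (tried $mounted, $fallback)" >&2
    return 1
  fi
}

# Fail fast, before any archive access, when the token is missing or empty.
require_token() {
  if [ -z "${DISCORD_TOKEN:-}" ]; then
    echo "DISCORD_TOKEN is not set; refusing to run" >&2
    return 1
  fi
}
```

Checking the mounted path first gives the repo config priority, while the built-in copy only keeps commands such as `help` usable when the mount is absent.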
**Patterns to follow:**
- `DiscordChatExporter.Cli.dockerfile`
- `docker-entrypoint.sh`
- `.docs/Docker.md`
**Test scenarios:**
- Happy path: the image builds from source and the wrapper can render help/list configured targets successfully.
- Happy path: the container sees the mounted config and archive root expected by the recurring scripts.
- Edge case: missing `DISCORD_TOKEN` fails fast with a clear message and no archive writes.
- Error path: missing or invalid config visibility fails predictably instead of silently falling back to an unsafe runtime state.
- Integration: a preflight container run exercises source-built startup and authenticated discovery without mutating the user's existing archive roots.
**Verification:**
- The Docker/compose layer is source-built, token-driven, and consistent with the recurring wrapper contract.
- Container startup does not perform broad archive permission rewrites as a side effect.
---
### U4. Make cron installation idempotent and safe to rerun
**Goal:** Ensure setup can be run repeatedly with real credentials and evolving target selections without duplicating cron entries or masking broken runtime state.
**Requirements:** R3, R6, R7
**Dependencies:** U1, U3
**Files:**
- Modify: `scripts/setup-cron.sh`
- Create: `scripts/tests/setup-cron-smoke.sh`
- Modify: `.docs/Scheduling-Linux.md`
**Approach:**
- Keep one managed cron block identified by stable markers, preserving unrelated crontab entries.
- Treat `--dry-run`, install/update, and `--remove` as the same managed block lifecycle rather than separate code paths with drift.
- Use a host-side lock in the scheduled command to avoid overlapping runs from cron.
- Run setup-time preflight before mutating crontab state unless the caller is previewing or removing.
- Improve operator-facing output so no-op health, validation failures, and install success are distinguishable.
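The managed-block lifecycle above can be sketched as two text filters over the crontab contents. The marker strings and the function names `strip_managed_block` and `render_crontab` are assumptions for this example; the markers that `setup-cron.sh` actually writes are not shown in this commit view.

```shell
# Hypothetical markers delimiting the single managed block.
BEGIN_MARK='# >>> discord-scrape managed block >>>'
END_MARK='# <<< discord-scrape managed block <<<'

# Copy stdin to stdout, dropping the managed block (markers included).
# Unrelated crontab lines pass through unchanged.
strip_managed_block() {
  awk -v b="$BEGIN_MARK" -v e="$END_MARK" '
    $0 == b { skip = 1; next }
    $0 == e { skip = 0; next }
    !skip { print }
  '
}

# render_crontab CRON_LINE
# Reads the current crontab on stdin and prints it with exactly one
# managed block containing CRON_LINE appended at the end.
render_crontab() {
  strip_managed_block
  printf '%s\n%s\n%s\n' "$BEGIN_MARK" "$1" "$END_MARK"
}
```

Under these assumptions, install and update would both be `crontab -l 2>/dev/null | render_crontab "$line" | crontab -`, removal would pipe through `strip_managed_block` instead, and a dry run would print the rendered text without the final `crontab -`. Since rendering always strips first, repeated runs converge on a single block.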
**Patterns to follow:**
- `.docs/Scheduling-Linux.md`
- Existing `scripts/setup-cron.sh`
**Test scenarios:**
- Happy path: a default install generates one monthly cron entry at the expected managed marker block.
- Happy path: rerunning setup with new schedule or target options replaces only the managed block and preserves unrelated crontab lines.
- Edge case: `--dry-run` previews the exact managed block without touching the live crontab.
- Edge case: `--remove` deletes only the managed block and leaves unrelated entries intact.
- Error path: failed preflight leaves the crontab unchanged.
- Integration: a fixture crontab exercised through install -> reinstall -> remove remains stable and idempotent across the full lifecycle.
**Verification:**
- Repeated setup runs converge to one managed cron block.
- Cron is never mutated after a failed validation/preflight pass.
---
### U5. Publish the operator contract for the recurring scraper
**Goal:** Align repo docs with the new recurring CLI wrapper so operators know the supported config keys, safety rules, and live setup expectations.
**Requirements:** R2, R3, R5, R7
**Dependencies:** U2, U4
**Files:**
- Modify: `.docs/Docker.md`
- Modify: `.docs/Scheduling-Linux.md`
- Modify: `Readme.md`
**Approach:**
- Document that the recurring workflow is CLI-first, source-built, and configured through the wrapper-owned env/config files rather than a native DCE config surface.
- Explain the append-only contract clearly: new messages are merged in, existing local history is retained, and older edits/reactions are not backfilled.
- Document the approved archive-root rule, target selection behavior, and the host-timezone interpretation of cron schedules.
- Add a short “first authenticated setup” path so the operator can run the safe preflight and understand what success vs no-op vs failure looks like.
**Patterns to follow:**
- `.docs/Docker.md`
- `.docs/Scheduling-Linux.md`
- `.docs/Using-the-CLI.md`
**Test scenarios:**
- Test expectation: none -- documentation-only unit. Review should confirm that documented flags, defaults, and safety constraints match the implemented wrapper behavior.
**Verification:**
- The docs describe the same config keys, schedule behavior, and safety limits that the scripts actually enforce.
---
## System-Wide Impact
- **Interaction graph:** host cron, `docker compose`, the wrapper scripts, the source-built CLI binary, and the on-disk JSON archives all participate in each recurring run.
- **Error propagation:** config/setup failures should stop before crontab mutation; target-level local-state failures should stop that target without silently replacing archives; unrelated targets may continue when safe.
- **State lifecycle risks:** channel-map state, destination archives, and temporary merge files must stay coherent across repeated runs and interrupted execution.
- **API surface parity:** the recurring wrapper must stay aligned with the existing CLI discovery/export surface rather than inventing a second target-resolution contract.
- **Integration coverage:** source-built container startup, authenticated discovery, archive merge safety, and crontab lifecycle need smoke coverage beyond existing C# export tests.
- **Unchanged invariants:** the upstream CLI remains the exporter of record; this plan does not change the core C# export writers' overwrite semantics, only how recurring automation uses the CLI safely.
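The error-propagation rule above could be sketched as a loop that records one outcome per target and fails the run only at the end. This is a minimal sketch, not the wrapper's actual code: `run_target`, `run_all_targets`, and the exit-code convention (0 for success, 3 for no-op) are assumptions made for illustration.

```shell
# Hypothetical stand-in for the per-target work. Assumed exit codes:
# 0 = new messages merged, 3 = nothing new (no-op), anything else = failure.
run_target() {
  case "$1" in
    alpha) return 0 ;;
    beta) return 3 ;;
    *) return 1 ;;
  esac
}

# Run every target, print one outcome line each, and fail only at the end
# so that one broken target does not hide the results of the others.
run_all_targets() {
  local failures=0 target status
  for target in "$@"; do
    status=0
    run_target "$target" || status=$?
    case "$status" in
      0) echo "$target: success" ;;
      3) echo "$target: no-op" ;;
      *) echo "$target: failure"; failures=$((failures + 1)) ;;
    esac
  done
  [ "$failures" -eq 0 ]
}

run_all_targets alpha beta gamma || echo "one or more targets failed"
```

Keeping the no-op outcome distinct from failure is what lets a "no new messages" run end with a zero exit status.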
---
## Risks & Dependencies
| Risk | Mitigation |
|------|------------|
| Ambiguous guild-name matching routes data into the wrong archive root | Fail the target unless resolution is explicit and unique; prefer configured IDs over fuzzy matches |
| Append-only merge logic corrupts an existing archive on bad local state | Validate destination JSON and embedded channel identity before merge; replace only after a complete merged file exists |
| Cron install succeeds even though runtime auth/config is broken | Require setup-time preflight before mutating crontab state |
| Broad container ownership or mount behavior touches the archive tree unexpectedly | Keep runtime mounts narrow and avoid recursive ownership rewrites across the archive root |
| Operators misread “no new messages” as failure | Standardize logs and summaries for success, no-op, skipped target, and failure outcomes |
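The "replace only after a complete merged file exists" mitigation can be illustrated with a stage-then-rename step. The sketch below is an assumption about how such a step might look, not the wrapper's implementation; `publish_archive` is a hypothetical name, and the emptiness check stands in for the fuller JSON and channel-identity validation the table describes.

```shell
# Hypothetical helper: copy a finished candidate next to the destination,
# then rename it into place. The destination is untouched on any failure.
publish_archive() {
  local candidate=$1 dest=$2 staged
  # Refuse a missing or empty candidate before touching anything.
  [ -s "$candidate" ] || return 1
  # Stage inside the destination directory so mv is a same-filesystem rename.
  staged=$(mktemp "$(dirname "$dest")/.staged.XXXXXX") || return 1
  if cp "$candidate" "$staged" && mv -f "$staged" "$dest"; then
    return 0
  fi
  rm -f "$staged"
  return 1
}
```

Because the final step is a rename within one directory, a reader of the destination path sees either the old archive or the new one, never a partially written file.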
---
## Documentation / Operational Notes
- The recurring workflow should continue to rely on `DISCORD_TOKEN` through env-file or shell environment, not committed token material.
- The host cron schedule is the timing authority; container `TZ` should only influence runtime formatting/log interpretation.
- The sibling `DiscordChatExporter.Cli.linux-x64` bundle may be useful as a runtime comparison point during execution, but the implementation target remains the source repo's CLI project.
---
## Sources & References
- Related code: `DiscordChatExporter.Cli.dockerfile`
- Related code: `docker-entrypoint.sh`
- Related code: `DiscordChatExporter.Cli/Commands/Base/DiscordCommandBase.cs`
- Related code: `DiscordChatExporter.Cli/Commands/Base/ExportCommandBase.cs`
- Related code: `DiscordChatExporter.Cli/Commands/GetGuildsCommand.cs`
- Related code: `DiscordChatExporter.Cli/Commands/GetChannelsCommand.cs`
- Related code: `DiscordChatExporter.Cli/Commands/GetDirectChannelsCommand.cs`
- Related code: `DiscordChatExporter.Core/Exporting/MessageExporter.cs`
- Related code: `DiscordChatExporter.Core/Exporting/ExportRequest.cs`
- Related docs: `.docs/Docker.md`
- Related docs: `.docs/Scheduling-Linux.md`
- Related docs: `.docs/Using-the-CLI.md`

scrape.env.example Normal file

@ -0,0 +1,7 @@
# Copy this file to scrape.env and fill in your real values.
DISCORD_TOKEN=
TZ=UTC
# Match these to the host user that should own created files.
DCE_UID=1000
DCE_GID=1000
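
One way to turn this example into a working `scrape.env` is to substitute the invoking user's numeric IDs while leaving the other lines alone. The helper below is a hypothetical sketch (`render_scrape_env` is not part of the repo) and assumes the example file keeps the `DCE_UID=` and `DCE_GID=` keys shown above.

```shell
# Hypothetical helper: print the example env file with DCE_UID/DCE_GID replaced.
# Arguments: path to the example file, numeric UID, numeric GID.
render_scrape_env() {
  sed -e "s/^DCE_UID=.*/DCE_UID=$2/" -e "s/^DCE_GID=.*/DCE_GID=$3/" "$1"
}

# Typical use from the repo root (assumes scrape.env.example is present):
#   render_scrape_env scrape.env.example "$(id -u)" "$(id -g)" >scrape.env
#   chmod 600 scrape.env   # the file will hold DISCORD_TOKEN
```

`DISCORD_TOKEN` is deliberately left empty by this step; it still has to be filled in by hand.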

scripts/setup-cron.sh Executable file

@ -0,0 +1,353 @@
#!/usr/bin/env bash
set -Eeuo pipefail
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)
REPO_ROOT="${DCE_REPO_ROOT:-$(cd "$SCRIPT_DIR/.." && pwd -P)}"
COMPOSE_FILE="${DCE_COMPOSE_FILE:-$REPO_ROOT/docker-compose.yml}"
ENV_FILE="${DCE_ENV_FILE:-$REPO_ROOT/scrape.env}"
CONFIG_FILE="${DCE_CONFIG_FILE:-$REPO_ROOT/config/scrape-targets.json}"
LOG_FILE="${DCE_LOG_FILE:-$REPO_ROOT/logs/discord-scrape.log}"
JOB_NAME="discord-scrape"
INTERVAL="monthly"
RUN_AT="04:00"
CRON_EXPRESSION=""
DRY_RUN=0
REMOVE=0
SKIP_PREFLIGHT=0
TARGETS=()
GUILDS=()
CHANNELS=()
JQ_BIN="${DCE_JQ_BIN:-jq}"
CRONTAB_BIN="${DCE_CRONTAB_BIN:-crontab}"
DOCKER_BIN="${DCE_DOCKER_BIN:-docker}"
COMPOSE_BIN="${DCE_COMPOSE_BIN:-}"
DOCKER_BIN_OVERRIDDEN=0
if [[ -n "${DCE_DOCKER_BIN:-}" ]]; then
  DOCKER_BIN_OVERRIDDEN=1
fi

usage() {
  cat <<EOF
Usage:
  $(basename "$0") [options]

Options:
  --target NAME     Restrict the cron job to one configured target. Repeatable.
  --guild ID        Narrow a selected target to one of its allowed guild IDs. Repeatable.
  --channel ID      Narrow a selected target to one of its allowed channel IDs. Repeatable.
  --interval VALUE  monthly, weekly, or daily. Default: monthly
  --at HH:MM        Run time in 24-hour format. Default: 04:00
  --cron EXPR       Use an explicit five-field cron expression instead of --interval/--at.
  --job-name NAME   Marker name for the installed cron block. Default: discord-scrape
  --log-file PATH   Cron log file. Default: $LOG_FILE
  --env-file PATH   Compose env file. Default: $ENV_FILE
  --skip-preflight  Install the cron job without running the authenticated container preflight.
  --dry-run         Print the cron block instead of installing it.
  --remove          Remove the managed cron block and exit.
  --help            Show this help text.

Examples:
  $(basename "$0")
  $(basename "$0") --target discord_dms --interval weekly --at 02:30
  $(basename "$0") --target Cline --channel 123456789012345678 --channel 234567890123456789
EOF
}

die() {
  printf 'ERROR: %s\n' "$*" >&2
  exit 1
}

require_program() {
  local program_path=$1
  command -v "$program_path" >/dev/null 2>&1 || die "Required command '$program_path' is missing."
}

# Translate --interval/--at into a five-field cron expression.
cron_from_schedule() {
  local interval=$1
  local run_at=$2
  local hour minute
  [[ "$run_at" =~ ^([0-1][0-9]|2[0-3]):([0-5][0-9])$ ]] || die "--at must use HH:MM in 24-hour time."
  hour=${BASH_REMATCH[1]}
  minute=${BASH_REMATCH[2]}
  case "$interval" in
    monthly) printf '%s %s 1 * *' "$minute" "$hour" ;;
    weekly) printf '%s %s * * 0' "$minute" "$hour" ;;
    daily) printf '%s %s * * *' "$minute" "$hour" ;;
    *) die "Unsupported --interval '$interval'. Use monthly, weekly, or daily." ;;
  esac
}

# Print the crontab without the managed block delimited by the two marker lines.
strip_existing_job() {
  local existing_crontab=$1
  local begin_marker=$2
  local end_marker=$3
  awk -v begin="$begin_marker" -v end="$end_marker" '
    $0 == begin { skipping = 1; next }
    $0 == end { skipping = 0; next }
    !skipping { print }
  ' <<<"$existing_crontab"
}

build_compose_command() {
  local subcommand=$1
  local -a command_parts

  # Pick the compose launcher: explicit override, standalone docker-compose,
  # or the "docker compose" plugin.
  if [[ -n "$COMPOSE_BIN" ]]; then
    command_parts=("$COMPOSE_BIN")
  elif (( DOCKER_BIN_OVERRIDDEN == 0 )) && command -v docker-compose >/dev/null 2>&1; then
    command_parts=(docker-compose)
  else
    command_parts=("$DOCKER_BIN" compose)
  fi

  # The remaining arguments are identical for every launcher.
  command_parts+=(
    --env-file "$ENV_FILE"
    -f "$COMPOSE_FILE"
    run
    -T
    --rm
    discord-scraper
    "$subcommand"
  )

  local target guild_id channel_id
  for target in "${TARGETS[@]}"; do
    command_parts+=(--target "$target")
  done
  for guild_id in "${GUILDS[@]}"; do
    command_parts+=(--guild "$guild_id")
  done
  for channel_id in "${CHANNELS[@]}"; do
    command_parts+=(--channel "$channel_id")
  done

  # Emit each argument shell-quoted so the result can be passed to eval
  # or written into the cron line.
  printf '%q ' "${command_parts[@]}"
}

ensure_target_directories() {
  local selected_targets_json archive_root output_dir
  archive_root=$("$JQ_BIN" -r '.archive_root // empty' "$CONFIG_FILE")
  [[ -n "$archive_root" ]] || die "Config is missing archive_root."
  mkdir -p "$archive_root"

  # No explicit --target: create directories for every target that is not disabled.
  if (( ${#TARGETS[@]} == 0 )); then
    while IFS= read -r output_dir; do
      mkdir -p "$output_dir"
    done < <("$JQ_BIN" -r '.targets[] | select(.enabled != false) | .output_dir' "$CONFIG_FILE")
    return 0
  fi

  # Explicit --target values: create directories only for the named targets.
  selected_targets_json=$("$JQ_BIN" -cn '$ARGS.positional' --args "${TARGETS[@]}")
  while IFS= read -r output_dir; do
    mkdir -p "$output_dir"
  done < <(
    "$JQ_BIN" -r \
      --argjson selected_targets "$selected_targets_json" \
      '.targets[]
        | select(.name as $name | $selected_targets | index($name))
        | .output_dir' \
      "$CONFIG_FILE"
  )
}

validate_targets() {
  (( ${#TARGETS[@]} == 0 )) && return 0
  local requested_targets_json resolved_count
  requested_targets_json=$("$JQ_BIN" -cn '$ARGS.positional' --args "${TARGETS[@]}")
  resolved_count=$(
    "$JQ_BIN" -r \
      --argjson requested_targets "$requested_targets_json" \
      '[.targets[] | select(.name as $name | $requested_targets | index($name))] | length' \
      "$CONFIG_FILE"
  )
  [[ "$resolved_count" == "${#TARGETS[@]}" ]] || die "One or more --target values are missing from $CONFIG_FILE."
}

run_preflight() {
  local preflight_command
  [[ -f "$ENV_FILE" ]] || die "Missing env file: $ENV_FILE"
  preflight_command=$(build_compose_command preflight)
  eval "$preflight_command"
}

main() {
  while (($#)); do
    case "$1" in
      --target)
        [[ $# -ge 2 ]] || die "Missing value for --target."
        TARGETS+=("$2")
        shift 2
        ;;
      --guild)
        [[ $# -ge 2 ]] || die "Missing value for --guild."
        GUILDS+=("$2")
        shift 2
        ;;
      --channel)
        [[ $# -ge 2 ]] || die "Missing value for --channel."
        CHANNELS+=("$2")
        shift 2
        ;;
      --interval)
        [[ $# -ge 2 ]] || die "Missing value for --interval."
        INTERVAL=$2
        shift 2
        ;;
      --at)
        [[ $# -ge 2 ]] || die "Missing value for --at."
        RUN_AT=$2
        shift 2
        ;;
      --cron)
        [[ $# -ge 2 ]] || die "Missing value for --cron."
        CRON_EXPRESSION=$2
        shift 2
        ;;
      --job-name)
        [[ $# -ge 2 ]] || die "Missing value for --job-name."
        JOB_NAME=$2
        shift 2
        ;;
      --log-file)
        [[ $# -ge 2 ]] || die "Missing value for --log-file."
        LOG_FILE=$2
        shift 2
        ;;
      --env-file)
        [[ $# -ge 2 ]] || die "Missing value for --env-file."
        ENV_FILE=$2
        shift 2
        ;;
      --skip-preflight)
        SKIP_PREFLIGHT=1
        shift
        ;;
      --dry-run)
        DRY_RUN=1
        shift
        ;;
      --remove)
        REMOVE=1
        shift
        ;;
      --help|-h)
        usage
        exit 0
        ;;
      *)
        die "Unknown option: $1"
        ;;
    esac
  done

  require_program "$JQ_BIN"
  require_program "$CRONTAB_BIN"
  if [[ -n "$COMPOSE_BIN" ]]; then
    require_program "$COMPOSE_BIN"
  elif (( DOCKER_BIN_OVERRIDDEN == 0 )) && command -v docker-compose >/dev/null 2>&1; then
    :
  else
    require_program "$DOCKER_BIN"
  fi

  [[ -f "$COMPOSE_FILE" ]] || die "Missing compose file: $COMPOSE_FILE"
  [[ -f "$CONFIG_FILE" ]] || die "Missing config file: $CONFIG_FILE"
  "$JQ_BIN" empty "$CONFIG_FILE" >/dev/null 2>&1 || die "Invalid JSON config: $CONFIG_FILE"
  validate_targets
  if (( (${#GUILDS[@]} > 0 || ${#CHANNELS[@]} > 0) && ${#TARGETS[@]} != 1 )); then
    die "--guild and --channel overrides require exactly one --target."
  fi

  local cron_line
  if [[ -n "$CRON_EXPRESSION" ]]; then
    cron_line=$CRON_EXPRESSION
  else
    cron_line=$(cron_from_schedule "$INTERVAL" "$RUN_AT")
  fi

  local begin_marker="# BEGIN ${JOB_NAME}"
  local end_marker="# END ${JOB_NAME}"
  local current_crontab cleaned_crontab compose_command job_line lock_prefix
  current_crontab=$("$CRONTAB_BIN" -l 2>/dev/null || true)
  cleaned_crontab=$(strip_existing_job "$current_crontab" "$begin_marker" "$end_marker")

  if (( REMOVE == 1 )); then
    if (( DRY_RUN == 1 )); then
      printf '%s\n' "$cleaned_crontab"
      exit 0
    fi
    printf '%s\n' "$cleaned_crontab" | "$CRONTAB_BIN" -
    exit 0
  fi

  mkdir -p "$(dirname "$LOG_FILE")"
  ensure_target_directories
  # Preflight runs before any crontab mutation so a broken setup is never scheduled.
  if (( SKIP_PREFLIGHT == 0 )); then
    run_preflight
  fi

  compose_command=$(build_compose_command scrape)
  # Use a non-blocking lock when flock is available so overlapping runs are skipped.
  if command -v flock >/dev/null 2>&1; then
    lock_prefix=$(printf '%q ' "$(command -v flock)" "-n" "/tmp/${JOB_NAME}.lock")
  else
    lock_prefix=""
  fi
  job_line="$cron_line cd $(printf '%q' "$REPO_ROOT") && ${lock_prefix}${compose_command}>> $(printf '%q' "$LOG_FILE") 2>&1"

  local cron_block
  cron_block=$(printf '%s\n%s\n%s\n' "$begin_marker" "$job_line" "$end_marker")
  if (( DRY_RUN == 1 )); then
    printf '%s\n' "$cron_block"
    exit 0
  fi

  {
    if [[ -n "$cleaned_crontab" ]]; then
      printf '%s\n\n' "$cleaned_crontab"
    fi
    printf '%s\n' "$cron_block"
  } | "$CRONTAB_BIN" -
}

main "$@"
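
The script above hands a `--cron` value to the crontab entry as given. A small guard like the following could reject expressions that do not have exactly five fields before anything is installed. `has_five_cron_fields` is a hypothetical helper, not part of `setup-cron.sh`, and it only counts fields; it does not check that each field is a valid cron value.

```shell
# Hypothetical helper: succeed only for a single line with exactly five fields.
has_five_cron_fields() {
  # awk does the splitting, so a literal '*' is never glob-expanded by the shell.
  printf '%s\n' "$1" | awk '
    NF != 5 { bad = 1 }
    END { exit (NR == 1 && !bad) ? 0 : 1 }
  '
}

for expr in "30 2 * * 0" "0 4 * *"; do
  if has_five_cron_fields "$expr"; then
    echo "accepted: $expr"
  else
    echo "rejected: $expr"
  fi
done
```

A check this simple would also reject shorthand such as `@daily`, which matches the help text's description of `--cron` as a five-field expression.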


@ -0,0 +1,25 @@
#!/usr/bin/env bash
set -Eeuo pipefail
REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd -P)
TMP_ENV=$(mktemp "${TMPDIR:-/tmp}/dce-container-smoke.XXXXXX.env")
cleanup() {
  rm -f "$TMP_ENV"
}
trap cleanup EXIT
cat >"$TMP_ENV" <<EOF
DISCORD_TOKEN=dummy
DCE_UID=$(id -u)
DCE_GID=$(id -g)
TZ=UTC
EOF
cd "$REPO_ROOT"
docker compose --env-file "$TMP_ENV" build
docker compose --env-file "$TMP_ENV" run --rm discord-scraper help >/dev/null
docker compose --env-file "$TMP_ENV" run --rm discord-scraper list-targets >/dev/null
echo "container smoke test passed"


@ -0,0 +1,107 @@
#!/usr/bin/env bash
set -Eeuo pipefail
REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd -P)
FIXTURE_DIR="$REPO_ROOT/scripts/tests/test-fixtures"
TMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/dce-run-smoke.XXXXXX")
ARCHIVE_ROOT="$TMP_DIR/archive"
CONFIG_PATH="$TMP_DIR/config.json"
FAKE_CLI="$TMP_DIR/fake-cli.sh"
cleanup() {
  rm -rf "$TMP_DIR"
}
trap cleanup EXIT
cat >"$CONFIG_PATH" <<JSON
{
  "archive_root": "$ARCHIVE_ROOT",
  "defaults": {
    "include_threads": "all",
    "include_voice_channels": false
  },
  "targets": [
    {
      "name": "demo",
      "kind": "guild",
      "output_dir": "$ARCHIVE_ROOT/demo",
      "channel_ids": ["111"],
      "guild_ids": [],
      "guild_name_patterns": []
    }
  ]
}
JSON
# Fake CLI: "export" copies a fixture chosen by FAKE_DCE_MODE to the --output path.
cat >"$FAKE_CLI" <<'EOF'
#!/usr/bin/env bash
set -Eeuo pipefail
mode=${FAKE_DCE_MODE:?}
fixture_dir=${FAKE_DCE_FIXTURE_DIR:?}
subcommand=${1:?}
shift || true
case "$subcommand" in
  export)
    output=""
    while (($#)); do
      case "$1" in
        --output)
          output=$2
          shift 2
          ;;
        --channel|--format|--after)
          shift 2
          ;;
        *)
          shift
          ;;
      esac
    done
    case "$mode" in
      initial) cp "$fixture_dir/append-existing.json" "$output" ;;
      append) cp "$fixture_dir/append-incremental.json" "$output" ;;
      wrong-channel) cp "$fixture_dir/wrong-channel.json" "$output" ;;
      *) echo "unexpected mode: $mode" >&2; exit 1 ;;
    esac
    ;;
  *)
    echo "unexpected subcommand: $subcommand" >&2
    exit 1
    ;;
esac
EOF
chmod +x "$FAKE_CLI"
run_wrapper() {
  DISCORD_TOKEN=dummy \
    DCE_CLI_BIN="$FAKE_CLI" \
    DCE_PRIMARY_CONFIG="$CONFIG_PATH" \
    DCE_FALLBACK_CONFIG="$CONFIG_PATH" \
    FAKE_DCE_FIXTURE_DIR="$FIXTURE_DIR" \
    FAKE_DCE_MODE="$1" \
    "$REPO_ROOT/scripts/run-discord-scrape.sh" scrape --target demo
}
run_wrapper initial
DEST="$ARCHIVE_ROOT/demo/channels/111.json"
[[ -f "$DEST" ]] || { echo "expected destination archive missing" >&2; exit 1; }
[[ "$(jq -r '.messages | length' "$DEST")" == "2" ]] || { echo "expected initial message count of 2" >&2; exit 1; }
run_wrapper append
[[ "$(jq -r '.messages | length' "$DEST")" == "3" ]] || { echo "expected appended message count of 3" >&2; exit 1; }
[[ "$(jq -r '.messages[-1].id' "$DEST")" == "3" ]] || { echo "expected last message id 3 after append" >&2; exit 1; }
before_checksum=$(sha256sum "$DEST" | awk '{print $1}')
if run_wrapper wrong-channel; then
  echo "wrong-channel fixture should have failed" >&2
  exit 1
fi
after_checksum=$(sha256sum "$DEST" | awk '{print $1}')
[[ "$before_checksum" == "$after_checksum" ]] || { echo "destination archive changed after failed wrong-channel run" >&2; exit 1; }
echo "run-discord-scrape smoke test passed"

scripts/tests/setup-cron-smoke.sh Executable file

@ -0,0 +1,100 @@
#!/usr/bin/env bash
set -Eeuo pipefail
REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd -P)
TMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/dce-cron-smoke.XXXXXX")
ARCHIVE_ROOT="$TMP_DIR/archive"
CONFIG_PATH="$TMP_DIR/config.json"
ENV_PATH="$TMP_DIR/scrape.env"
CRONTAB_FILE="$TMP_DIR/crontab.txt"
DOCKER_LOG="$TMP_DIR/docker.log"
FAKE_DOCKER="$TMP_DIR/docker"
FAKE_CRONTAB="$TMP_DIR/crontab"
cleanup() {
  rm -rf "$TMP_DIR"
}
trap cleanup EXIT
cat >"$CONFIG_PATH" <<JSON
{
  "archive_root": "$ARCHIVE_ROOT",
  "defaults": {
    "include_threads": "all",
    "include_voice_channels": false
  },
  "targets": [
    {
      "name": "demo",
      "kind": "guild",
      "output_dir": "$ARCHIVE_ROOT/demo",
      "channel_ids": ["111"],
      "guild_ids": [],
      "guild_name_patterns": []
    }
  ]
}
JSON
cat >"$ENV_PATH" <<EOF
DISCORD_TOKEN=dummy
DCE_UID=1000
DCE_GID=1000
TZ=UTC
EOF
cat >"$FAKE_DOCKER" <<EOF
#!/usr/bin/env bash
set -Eeuo pipefail
printf '%s\n' "\$*" >>"$DOCKER_LOG"
exit 0
EOF
chmod +x "$FAKE_DOCKER"
# Fake crontab: "-l" prints the backing file, "-" replaces it from stdin.
cat >"$FAKE_CRONTAB" <<EOF
#!/usr/bin/env bash
set -Eeuo pipefail
file="$CRONTAB_FILE"
if [[ "\${1:-}" == "-l" ]]; then
  [[ -f "\$file" ]] || exit 1
  cat "\$file"
elif [[ "\${1:-}" == "-" ]]; then
  cat >"\$file"
else
  echo "unexpected crontab args: \$*" >&2
  exit 1
fi
EOF
chmod +x "$FAKE_CRONTAB"
printf 'MAILTO=test@example.com\n' >"$CRONTAB_FILE"
run_setup() {
  DCE_CONFIG_FILE="$CONFIG_PATH" \
    DCE_ENV_FILE="$ENV_PATH" \
    DCE_CRONTAB_BIN="$FAKE_CRONTAB" \
    DCE_DOCKER_BIN="$FAKE_DOCKER" \
    DCE_JQ_BIN="$(command -v jq)" \
    DCE_REPO_ROOT="$REPO_ROOT" \
    DCE_LOG_FILE="$TMP_DIR/logs/discord-scrape.log" \
    "$REPO_ROOT/scripts/setup-cron.sh" --target demo "$@"
}
run_setup
grep -q '^MAILTO=test@example.com$' "$CRONTAB_FILE" || { echo "expected unrelated crontab line to remain" >&2; exit 1; }
[[ "$(grep -c '^# BEGIN discord-scrape$' "$CRONTAB_FILE")" == "1" ]] || { echo "expected exactly one managed cron block after install" >&2; exit 1; }
grep -q 'compose --env-file' "$DOCKER_LOG" || { echo "expected docker preflight to run during install" >&2; exit 1; }
run_setup
[[ "$(grep -c '^# BEGIN discord-scrape$' "$CRONTAB_FILE")" == "1" ]] || { echo "expected exactly one managed cron block after reinstall" >&2; exit 1; }
preview_output=$(run_setup --dry-run)
grep -q '^# BEGIN discord-scrape$' <<<"$preview_output" || { echo "expected dry-run preview to contain managed block" >&2; exit 1; }
[[ "$(grep -c '^# BEGIN discord-scrape$' "$CRONTAB_FILE")" == "1" ]] || { echo "dry-run should not alter crontab state" >&2; exit 1; }
run_setup --remove
grep -q '^MAILTO=test@example.com$' "$CRONTAB_FILE" || { echo "expected unrelated crontab line to remain after remove" >&2; exit 1; }
! grep -q '^# BEGIN discord-scrape$' "$CRONTAB_FILE" || { echo "expected managed cron block to be removed" >&2; exit 1; }
echo "setup-cron smoke test passed"
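
The reinstall and remove assertions above depend on the marker-delimited block being stripped without touching unrelated crontab lines. The snippet below isolates that step so it can be tried without a real crontab. It reuses the awk program from `strip_existing_job` but reads from stdin; `strip_managed_block` and the sample crontab text are illustrative names and data, not part of the repo.

```shell
# Illustrative wrapper around the same awk logic as strip_existing_job.
# Arguments: begin marker line, end marker line. Input: crontab text on stdin.
strip_managed_block() {
  awk -v begin="$1" -v end="$2" '
    $0 == begin { skipping = 1; next }
    $0 == end { skipping = 0; next }
    !skipping { print }
  '
}

sample_crontab='MAILTO=test@example.com
# BEGIN discord-scrape
0 4 1 * * echo placeholder
# END discord-scrape'

# Prints only the unrelated line: MAILTO=test@example.com
printf '%s\n' "$sample_crontab" |
  strip_managed_block '# BEGIN discord-scrape' '# END discord-scrape'
```

Because the markers are compared as whole lines, a job installed under a different `--job-name` is left in place.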


@ -0,0 +1,22 @@
{
  "channel": {
    "id": "111"
  },
  "messages": [
    {
      "id": "1",
      "timestamp": "2026-01-01T00:00:00Z",
      "content": "first"
    },
    {
      "id": "2",
      "timestamp": "2026-01-02T00:00:00Z",
      "content": "second"
    }
  ],
  "dateRange": {
    "after": null,
    "before": null
  },
  "exportedAt": "2026-01-02T00:00:00Z"
}


@ -0,0 +1,22 @@
{
  "channel": {
    "id": "111"
  },
  "messages": [
    {
      "id": "2",
      "timestamp": "2026-01-02T00:00:00Z",
      "content": "second"
    },
    {
      "id": "3",
      "timestamp": "2026-01-03T00:00:00Z",
      "content": "third"
    }
  ],
  "dateRange": {
    "after": "2026-01-02T00:00:00Z",
    "before": null
  },
  "exportedAt": "2026-01-03T00:00:00Z"
}


@ -0,0 +1,17 @@
{
  "channel": {
    "id": "999"
  },
  "messages": [
    {
      "id": "4",
      "timestamp": "2026-01-04T00:00:00Z",
      "content": "wrong"
    }
  ],
  "dateRange": {
    "after": "2026-01-03T00:00:00Z",
    "before": null
  },
  "exportedAt": "2026-01-04T00:00:00Z"
}
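
Taken together, the three fixtures describe the merge behavior the smoke test expects: overlapping message IDs are not duplicated, new IDs are appended, and an export for a different channel is rejected. The function below is one possible way to express that with `jq`. It is a sketch under assumptions, not the logic in `run-discord-scrape.sh`; `merge_archives` is a hypothetical name, and keeping the existing copy of an overlapping message is an assumed reading of the append-only rule.

```shell
# Hypothetical helper: merge an incremental export into an existing archive.
# Arguments: existing archive path, incremental export path.
# Prints the merged JSON on stdout; exits non-zero if the channel IDs differ.
merge_archives() {
  jq -s '
    .[0] as $old | .[1] as $new
    | if $old.channel.id != $new.channel.id
      then error("channel id mismatch")
      else
        ($old.messages | map(.id)) as $seen
        | $old + {
            messages: ($old.messages
              + [$new.messages[] | select(.id as $id | $seen | index($id) | not)])
          }
      end
  ' "$1" "$2"
}

# Example (paths are placeholders):
#   merge_archives existing.json incremental.json >merged.json
```

This sketch leaves `dateRange` and `exportedAt` from the existing archive untouched; whether those fields should be refreshed is a separate decision.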