docs(#142): 1.9.2-beta1 changelog + device_status_log VACUUM maintenance note

Documents the #142 changes and tells operators with an already-bloated device_status_log to reclaim space with a one-time manual VACUUM in a maintenance window (retention now bounds further growth). Explains why auto-VACUUM is not enabled. New doc: docs/maintenance-device-status-log.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 09:23:16 -06:00 · 2026-06-27 19:59:17 -05:00 · ce78d0dde4
parent f206537fed
commit ce78d0dde4
2 changed files with 75 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,36 @@
 # Changelog

+## 1.9.2-beta1 — unreleased
+
+### Fixed — server resilience (#142)
+- **A single flapping device can no longer saturate the event loop.** A new
+  load-aware, per-device reconnect throttle (`lib/reconnect-throttle.js`) gates
+  genuine reconnects *before* the heavy register work (DB writes + playlist build).
+  The verdict is per-device; global event-loop lag only multiplies an
+  already-flagged device's backoff and never throttles a healthy one. A hard ceiling
+  and cold-start warm-up mean a full-fleet reconnect after a deploy is never throttled.
+- **`device_status_log` growth is bounded.** Added
+  `idx_device_status_log_device_ts`, a global retention sweep (`pruneStatusLog`,
+  `STATUS_LOG_RETENTION_DAYS` default 3) covering removed/idle devices and the
+  `offline_timeout` path, and de-duplicated the table's `CREATE TABLE`.
+- **`content-ack` spam de-duplicated.** Repeated identical
+  `(device_id, content_id, status)` reports are suppressed within
+  `CONTENT_ACK_DEDUP_MS` (default 10s).
+- **Provisioning cleanup window corrected.** Unclaimed provisioning devices are now
+  swept after 24h (the code used `365 * 86400` — a year — contradicting its own
+  comment).
+
+### Added — observability (#142)
+- **Event-loop lag telemetry** via `perf_hooks.monitorEventLoopDelay()`. Sampled to
+  a bounded `event_loop_lag` table (indexed + pruned, `LAG_TELEMETRY_RETENTION_DAYS`)
+  and surfaced on `/api/status` as `loop_lag` (mean/p50/p99/max + band).
+
+### Maintenance
+- Operators whose `device_status_log` is already bloated from a pre-1.9.2 deployment
+  should reclaim disk with a **one-time manual `VACUUM`** in a maintenance window;
+  retention now bounds further growth. Auto-VACUUM is intentionally not enabled.
+  See [`docs/maintenance-device-status-log.md`](docs/maintenance-device-status-log.md).
+
 ## 1.9.1-beta3 — unreleased

 ### Fixed — Tizen player
--- a/docs/maintenance-device-status-log.md
+++ b/docs/maintenance-device-status-log.md
@@ -0,0 +1,44 @@
+# Maintenance: `device_status_log` growth & space reclaim (#142)
+
+## What changed in 1.9.2-beta1
+
+`device_status_log` previously grew without an effective bound (the per-device
+insert-time prune missed removed/idle devices and the heartbeat `offline_timeout`
+insert). In one deployment it reached ~1.2M rows / ~119 MB over ~23 days and
+degraded dashboard performance.
+
+1.9.2-beta1 bounds further growth:
+
+- **Index** `idx_device_status_log_device_ts(device_id, timestamp)` — the dashboard
+  uptime query and the prunes now use an index instead of a full scan.
+- **Global retention sweep** (`pruneStatusLog()`), run on startup and on the
+  heartbeat interval, deletes rows older than **`STATUS_LOG_RETENTION_DAYS`**
+  (default **3**) across *all* devices — including removed/idle devices and the
+  `offline_timeout` rows the per-device prune never revisited.
+
+## Reclaiming space on an already-bloated database
+
+> **Operator action — only needed once, only if your `device_status_log` is already
+> bloated from a pre-1.9.2 deployment.**
+
+Retention bounds *future* growth, but SQLite does **not** return freed pages to the
+filesystem on `DELETE` — the file stays at its high-water mark until a `VACUUM`.
+After upgrading (which prunes the old rows), reclaim the disk with a **one-time
+manual `VACUUM` in a maintenance window**:
+
+```sh
+# stop the server (or do this during a low-traffic window — VACUUM takes a global
+# write lock and rewrites the whole DB file; the app cannot write during it)
+sqlite3 /opt/screentinker/server/db/remote_display.db 'VACUUM;'
+```
+
+In the reference incident this took the DB from **119 MB → 39 MB**.
+
+### Why VACUUM is not automatic
+
+`VACUUM` locks the database and rewrites the entire file — unacceptable on the hot
+path. `PRAGMA auto_vacuum=INCREMENTAL` is **not** enabled either: it only takes
+effect on a freshly-created database (set before the first table) or after a
+one-time full `VACUUM` to convert an existing DB, so enabling it would be a no-op on
+existing installs and a silent behavior change on new ones. Space reclaim is left as
+a deliberate operator decision; ongoing growth is already bounded by retention.
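
As a rough companion to the maintenance note above, the sketch below estimates how much space a `VACUUM` could return before a maintenance window is scheduled. It reads SQLite's `freelist_count` and `page_size` pragmas through their table-valued forms (available in SQLite 3.16 and later). The helper name `reclaimable_bytes` is hypothetical and not part of the project. Free pages only appear once the post-upgrade prune has deleted the old rows, so it is meant to be run after the first startup on 1.9.2-beta1.

```sh
# Hypothetical helper (not part of the project): print the number of bytes held
# by free, unused pages in a SQLite database file. This is a lower-bound
# estimate of what a full VACUUM returns, since VACUUM also repacks
# partially filled pages.
reclaimable_bytes() {
  sqlite3 "$1" 'SELECT freelist_count * page_size FROM pragma_freelist_count(), pragma_page_size();'
}
```

For example, `reclaimable_bytes /opt/screentinker/server/db/remote_display.db` prints a byte count; a value near zero suggests the `VACUUM` is not worth a maintenance window. Because `VACUUM` builds a fresh copy of the database, the SQLite documentation advises having free disk space of up to about twice the size of the original file.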